Dataset
BaseDataset
- class openddi.data.BaseDataset.BaseDataset(args: Namespace)
Bases: object
Base dataset class integrating all functional modules.
Provides a unified data processing interface supporting multi-class and multi-label tasks.
- load_data(val_ratio: float = 0.1, test_ratio: float = 0.2)
Main entry point for loading data.
- Parameters:
val_ratio – Validation set ratio.
test_ratio – Test set ratio.
- build_pairs_labels_splits(val_ratio: float = 0.1, test_ratio: float = 0.2, random_seed: int | None = None, return_original_ids: bool = True) → Dict[str, Tuple[ndarray, ndarray]]
Load paired data from args.matrix or matrix_path and perform a random split in the same style as BaseDataset. Returns (pairs, labels) for the train/val/test splits.
For multi-class tasks: labels are int64 of shape [N]
For multi-label tasks: labels are float32 of shape [N, C]
- Parameters:
val_ratio (float) – Validation set ratio.
test_ratio (float) – Test set ratio.
random_seed (int, optional) – Random seed (default from args.seed or 1).
return_original_ids (bool, optional) – Whether to return original ID (string) pairs. If False, returns index pairs.
- Returns:
{'train': (pairs, labels), 'val': (pairs, labels), 'test': (pairs, labels)}
- Return type:
dict
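The return structure and label conventions above can be illustrated with a minimal sketch. It assumes a plain random-permutation split; `random_split` is a hypothetical helper written for this example, not part of openddi, and the library's actual splitting logic may differ.

```python
import numpy as np

def random_split(pairs, labels, val_ratio=0.1, test_ratio=0.2, random_seed=1):
    """Hypothetical helper: shuffle once, then cut into test/val/train."""
    rng = np.random.default_rng(random_seed)
    perm = rng.permutation(len(pairs))
    n_test = int(len(pairs) * test_ratio)
    n_val = int(len(pairs) * val_ratio)
    test_idx = perm[:n_test]
    val_idx = perm[n_test:n_test + n_val]
    train_idx = perm[n_test + n_val:]
    return {
        "train": (pairs[train_idx], labels[train_idx]),
        "val": (pairs[val_idx], labels[val_idx]),
        "test": (pairs[test_idx], labels[test_idx]),
    }

# Pairs of original (string) IDs, as with return_original_ids=True.
pairs = np.array([[f"D{i}", f"D{i + 1}"] for i in range(10)])

# Multi-class: int64 labels of shape [N].
mc_labels = np.arange(10, dtype=np.int64) % 3
mc_splits = random_split(pairs, mc_labels)

# Multi-label: float32 labels of shape [N, C].
ml_labels = np.zeros((10, 4), dtype=np.float32)
ml_splits = random_split(pairs, ml_labels)
```

Each value in the returned dictionary is a (pairs, labels) tuple whose first dimensions match.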
DataLoadingModule
- class openddi.data.BaseDataset.DataLoadingModule
Bases: object
Data loading and preprocessing module.
Responsible for reading embedding files, paired data files, and basic data preprocessing.
- static read_id_embedding_pt(embedding_paths: str | List[str]) → Tuple[Dict[str, ndarray], int]
Read single or multiple modality embedding files and concatenate embedding vectors.
- Parameters:
embedding_paths – Path or list of paths to embedding files.
- Returns:
Tuple containing:
id2vec: Dictionary mapping IDs to concatenated embedding vectors
total_dim: Total dimension of concatenated embedding vectors
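The concatenation step can be sketched on in-memory dictionaries. This is an illustration only: `concat_modalities` is a hypothetical helper, file reading is omitted, and dropping IDs that are missing from any modality is an assumption rather than documented behavior.

```python
import numpy as np

def concat_modalities(per_modality):
    """Hypothetical helper: per_modality is a list of {id: 1-D vector} dicts."""
    # Assumption: keep only IDs that are present in every modality.
    shared_ids = set(per_modality[0])
    for modality in per_modality[1:]:
        shared_ids &= set(modality)
    id2vec = {
        drug_id: np.concatenate([modality[drug_id] for modality in per_modality])
        for drug_id in sorted(shared_ids)
    }
    # The total dimension is the sum of the per-modality vector lengths.
    total_dim = sum(len(next(iter(modality.values()))) for modality in per_modality)
    return id2vec, total_dim

structure = {"DB001": np.ones(3), "DB002": np.zeros(3)}
text = {"DB001": np.full(2, 0.5), "DB002": np.full(2, 0.25)}
id2vec, total_dim = concat_modalities([structure, text])
```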
- static read_id_embedding_pt_split(embedding_paths: List[str]) → Tuple[Dict[int, Dict[str, ndarray]], List[int]]
Read multiple modality embedding files and store embedding vectors separately.
- Parameters:
embedding_paths – List of paths to embedding files.
- Returns:
Tuple containing:
modal2id2vec: Dictionary mapping modality indices to id2vec dictionaries
dims: List of embedding vector dimensions for each modality
- static read_multi_pairs_and_remap(matrix_path: str) → Tuple[DataFrame, int]
Read multi-class paired data and perform label remapping.
- Parameters:
matrix_path – Path to multi-class data file.
- Returns:
Tuple containing:
df: Processed DataFrame containing id1, id2, ddi columns
num_relations: Number of relation types
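Label remapping can be sketched as follows. The example assumes that raw relation labels are mapped to contiguous integers 0..K-1 in sorted order, which the documentation does not state; `remap_labels` is a hypothetical helper that works on plain tuples instead of a DataFrame.

```python
def remap_labels(rows):
    """Hypothetical helper: rows are (id1, id2, ddi) tuples with arbitrary integer labels."""
    # Assumption: labels are remapped to 0..K-1 in sorted order of the raw values.
    raw_labels = sorted({ddi for _, _, ddi in rows})
    mapping = {raw: new for new, raw in enumerate(raw_labels)}
    remapped = [(id1, id2, mapping[ddi]) for id1, id2, ddi in rows]
    return remapped, len(mapping)

rows = [("DB001", "DB002", 47), ("DB001", "DB003", 3), ("DB002", "DB003", 47)]
remapped, num_relations = remap_labels(rows)
```

Contiguous labels let num_relations be used directly as the size of a classifier output layer.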
FeatureProcessingModule
- class openddi.data.BaseDataset.FeatureProcessingModule
Bases: object
Feature processing and noise injection module.
Responsible for feature matrix construction, standardization, noise injection, etc.
- static build_feature_matrix(drug_list: List[str], id2vec: Dict[str, ndarray], emb_dim: int, args: Namespace) → torch.Tensor
Build drug feature matrix.
- Parameters:
drug_list – List of drug IDs.
id2vec – Dictionary mapping IDs to embedding vectors.
emb_dim – Embedding dimension.
args – Configuration parameters.
- Returns:
Standardized feature matrix.
- static normalize_features(feats: ndarray) → ndarray
Standardize features.
- Parameters:
feats – Original feature matrix.
- Returns:
Standardized feature matrix.
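A common way to standardize a feature matrix is per-column z-scoring. The sketch below assumes that interpretation; `zscore` and its `eps` parameter are hypothetical, and normalize_features may use a different formula.

```python
import numpy as np

def zscore(feats, eps=1e-8):
    """Hypothetical helper: shift each column to zero mean and scale to unit variance."""
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True)
    # eps keeps constant columns from causing a division by zero.
    return (feats - mean) / (std + eps)

feats = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
normalized = zscore(feats)
```

The second column is constant, so it maps to zeros instead of NaN.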
DataSplittingModule
- class openddi.data.BaseDataset.DataSplittingModule
Bases: object
Data splitting and sampling module.
Responsible for dataset splitting, label noise injection, sparse sampling, etc.
- static split_data(triples: ndarray, val_ratio: float = 0.1, test_ratio: float = 0.2, random_seed: int = 1) → Tuple[ndarray, ndarray, ndarray]
Split dataset.
- Parameters:
triples – Triple data.
val_ratio – Validation set ratio.
test_ratio – Test set ratio.
random_seed – Random seed.
- Returns:
Tuple containing:
train_data: Training set
val_data: Validation set
test_data: Test set
- static split_data_generalization(triples: ndarray, val_ratio: float = 0.1, test_ratio: float = 0.2, random_seed: int = 1) → Tuple[ndarray, ndarray, ndarray]
Split the dataset by drug entity so that some drugs in the test set do not appear in the training/validation sets.
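One way to obtain such a split is to hold out a subset of drugs and send every triple that touches a held-out drug to the test set. The sketch below assumes that strategy; `entity_split` is a hypothetical helper, and the way it applies the ratios (test_ratio to drugs, val_ratio to the remaining triples) is an illustrative choice, not the documented behavior.

```python
import numpy as np

def entity_split(triples, val_ratio=0.1, test_ratio=0.2, random_seed=1):
    """Hypothetical helper: triples are rows of (drug1, drug2, label) integers."""
    rng = np.random.default_rng(random_seed)
    drugs = np.unique(triples[:, :2])
    # Assumption: test_ratio is applied to the number of drugs, not triples.
    n_held_out = max(1, int(len(drugs) * test_ratio))
    held_out = rng.choice(drugs, size=n_held_out, replace=False)
    touches_held_out = np.isin(triples[:, 0], held_out) | np.isin(triples[:, 1], held_out)
    test = triples[touches_held_out]
    # The remaining triples contain only "seen" drugs; shuffle and cut off a validation part.
    rest = triples[~touches_held_out]
    rest = rest[rng.permutation(len(rest))]
    n_val = int(len(rest) * val_ratio)
    return rest[n_val:], rest[:n_val], test

# All unordered pairs among 10 drugs, with a synthetic label.
triples = np.array([(i, j, (i + j) % 3) for i in range(10) for j in range(i + 1, 10)])
train, val, test = entity_split(triples)
```

With this strategy every test triple involves at least one drug the model never saw during training or validation.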
- static split_multilabel_data(triples: ndarray, labels: ndarray, val_ratio: float = 0.1, test_ratio: float = 0.2, random_seed: int = 1) → Tuple[ndarray, ndarray, ndarray, ndarray, ndarray, ndarray]
Split multi-label dataset.
- Parameters:
triples – Triple data.
labels – Label data.
val_ratio – Validation set ratio.
test_ratio – Test set ratio.
random_seed – Random seed.
- Returns:
Tuple containing:
train_triples, train_labels: Training set triples and labels
val_triples, val_labels: Validation set triples and labels
test_triples, test_labels: Test set triples and labels
- static split_multilabel_data_generalization(triples: ndarray, labels: ndarray, val_ratio: float = 0.1, test_ratio: float = 0.2, random_seed: int = 1) → Tuple[ndarray, ndarray, ndarray, ndarray, ndarray, ndarray]
Split a multi-label dataset by drug entity so that some drugs in the test set do not appear in the training/validation sets.
- static add_label_noise_multiclass(train_data: ndarray, num_classes: int, noise_ratio: float, random_seed: int = 1) → ndarray
Inject label noise for multi-class classification.
- Parameters:
train_data – Training data.
num_classes – Number of classes.
noise_ratio – Noise ratio.
random_seed – Random seed.
- Returns:
Training data with added label noise.
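Label noise for multi-class data is often implemented by replacing the label of a random fraction of training rows with a different class. The sketch below assumes that scheme and a (drug1, drug2, label) row layout; `flip_multiclass_labels` is a hypothetical helper, not the library function.

```python
import numpy as np

def flip_multiclass_labels(train_data, num_classes, noise_ratio, random_seed=1):
    """Hypothetical helper: corrupt the label column of a fraction of the rows."""
    rng = np.random.default_rng(random_seed)
    noisy = train_data.copy()
    n_noisy = int(len(noisy) * noise_ratio)
    idx = rng.choice(len(noisy), size=n_noisy, replace=False)
    # An offset in [1, num_classes - 1] taken modulo num_classes always yields a different class.
    offsets = rng.integers(1, num_classes, size=n_noisy)
    noisy[idx, 2] = (noisy[idx, 2] + offsets) % num_classes
    return noisy

train_data = np.array([[i, i + 1, i % 4] for i in range(20)])
noisy = flip_multiclass_labels(train_data, num_classes=4, noise_ratio=0.25)
```

Only the label column changes; the drug pairs and the input array are left untouched.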
- static add_label_noise_multilabel(labels: ndarray, noise_ratio: float, flip_per_label: int = 50, random_seed: int = 1) → ndarray
Inject label noise for multi-label classification.
- Parameters:
labels – Label matrix.
noise_ratio – Noise ratio.
flip_per_label – Number of flipped labels per sample.
random_seed – Random seed.
- Returns:
Label matrix with added noise.
- static sparse_sampling_multiclass(train_data: ndarray, num_classes: int, sparse_sample_rate: float, random_seed: int = 1) → ndarray
Perform sparse sampling for multi-class classification.
- Parameters:
train_data – Training data.
num_classes – Number of classes.
sparse_sample_rate – Sampling rate.
random_seed – Random seed.
- Returns:
Sampled training data.
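Sparse sampling can be sketched as keeping a fixed fraction of the rows of each class. Whether the library samples per class or globally is not documented here, so the per-class behavior and the "at least one row per class" rule below are assumptions; `sample_per_class` is a hypothetical helper.

```python
import numpy as np

def sample_per_class(train_data, num_classes, sparse_sample_rate, random_seed=1):
    """Hypothetical helper: keep a fraction of the rows of every class."""
    rng = np.random.default_rng(random_seed)
    kept = []
    for c in range(num_classes):
        class_idx = np.flatnonzero(train_data[:, 2] == c)
        if len(class_idx) == 0:
            continue
        # Assumption: every non-empty class keeps at least one row.
        n_keep = max(1, int(len(class_idx) * sparse_sample_rate))
        kept.append(rng.choice(class_idx, size=n_keep, replace=False))
    kept_idx = np.sort(np.concatenate(kept))
    return train_data[kept_idx]

train_data = np.array([[i, i + 1, 0] for i in range(10)] + [[i, i + 2, 1] for i in range(4)])
sampled = sample_per_class(train_data, num_classes=2, sparse_sample_rate=0.5)
```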
GraphConstructionModule
- class openddi.data.BaseDataset.GraphConstructionModule
Bases: object
Graph construction module.
Responsible for building graph structures from training data, supporting multi-relation and single-relation graphs.
- static build_multigraph(train_data: ndarray, network_ratio: float = 1.0, random_seed: int = 1) → Tuple[torch.Tensor, torch.Tensor]
Build multi-relation graph.
- Parameters:
train_data – Training data.
network_ratio – Edge usage ratio.
random_seed – Random seed.
- Returns:
Tuple containing:
edge_index: Edge index tensor
edge_type: Edge type tensor
- static build_single_relation_graph(train_triples: ndarray, network_ratio: float = 1.0, random_seed: int = 1) → Tuple[torch.Tensor, torch.Tensor]
Build single-relation graph.
- Parameters:
train_triples – Training triples.
network_ratio – Edge usage ratio.
random_seed – Random seed.
- Returns:
Tuple containing:
edge_index: Edge index tensor
edge_type: Edge type tensor (all zeros)
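The two graph builders differ only in how edge_type is filled. The sketch below shows the idea with NumPy arrays instead of torch tensors; `build_edges` is a hypothetical helper, the [2, E] layout of edge_index follows the usual PyTorch Geometric convention, and treating edges as directed exactly as listed is an assumption.

```python
import numpy as np

def build_edges(train_data, network_ratio=1.0, random_seed=1, single_relation=False):
    """Hypothetical helper: rows of train_data are (source, target, relation)."""
    rng = np.random.default_rng(random_seed)
    # network_ratio controls which fraction of the training edges enters the graph.
    n_keep = int(len(train_data) * network_ratio)
    keep = np.sort(rng.choice(len(train_data), size=n_keep, replace=False))
    used = train_data[keep]
    edge_index = used[:, :2].T.astype(np.int64)  # shape [2, E]
    if single_relation:
        edge_type = np.zeros(n_keep, dtype=np.int64)  # every edge has relation 0
    else:
        edge_type = used[:, 2].astype(np.int64)  # real relation types
    return edge_index, edge_type

train_data = np.array([[0, 1, 2], [1, 2, 0], [2, 3, 1], [0, 3, 2]])
edge_index, edge_type = build_edges(train_data)
half_index, half_type = build_edges(train_data, network_ratio=0.5, single_relation=True)
```

In a torch-based pipeline the arrays would typically be converted with torch.from_numpy before being passed to a graph layer.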
DataLoaderCreationModule
- class openddi.data.BaseDataset.DataLoaderCreationModule
Bases: object
DataLoader creation module.
Responsible for creating and configuring DataLoaders.
- static create_dataloader_config(args: Namespace) → Dict
Create DataLoader configuration.
- Parameters:
args – Configuration parameters.
- Returns:
DataLoader configuration dictionary.
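A configuration dictionary of this kind might look like the sketch below. It mirrors the DataLoader settings listed for UnifiedDataset (pin_memory=False, persistent_workers=False, prefetch_factor=1 when workers > 0) and reads args.batch and args.workers; the dictionary keys are the keyword names of torch.utils.data.DataLoader, and whether create_dataloader_config returns exactly these keys is an assumption. `make_loader_config` is a hypothetical helper.

```python
from argparse import Namespace

def make_loader_config(args):
    """Hypothetical helper: build keyword arguments for torch.utils.data.DataLoader."""
    workers = getattr(args, "workers", 0)  # workers is optional
    config = {
        "batch_size": args.batch,
        "num_workers": workers,
        "pin_memory": False,
        "persistent_workers": False,
    }
    # prefetch_factor is only meaningful when worker processes are used.
    if workers > 0:
        config["prefetch_factor"] = 1
    return config

config = make_loader_config(Namespace(batch=256, workers=4))
single_process = make_loader_config(Namespace(batch=256))
```

The resulting dictionary can be unpacked into a loader, for example DataLoader(dataset, shuffle=True, **config).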
- static create_multiclass_dataloaders(train_data: ndarray, val_data: ndarray, test_data: ndarray, args: Namespace) → Tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader, torch.utils.data.DataLoader]
Create multi-class DataLoaders.
- Parameters:
train_data – Training data.
val_data – Validation data.
test_data – Test data.
args – Configuration parameters.
- Returns:
Tuple containing:
train_loader: Training DataLoader
val_loader: Validation DataLoader
test_loader: Test DataLoader
- static create_multilabel_dataloaders(train_triples: ndarray, train_labels: ndarray, val_triples: ndarray, val_labels: ndarray, test_triples: ndarray, test_labels: ndarray, args: Namespace) → Tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader, torch.utils.data.DataLoader]
Create multi-label DataLoaders.
- Parameters:
train_triples – Training triples.
train_labels – Training labels.
val_triples – Validation triples.
val_labels – Validation labels.
test_triples – Test triples.
test_labels – Test labels.
args – Configuration parameters.
- Returns:
Tuple containing:
train_loader: Training DataLoader
val_loader: Validation DataLoader
test_loader: Test DataLoader
Base Dataset Classes
MRCGNN_dataset
- class openddi.data.MRCGNN_dataset.MRCGNN_dataset(args: ArgumentParser)
Bases: BaseDataset
Dataset class for the MRCGNN model with support for multi-class and multi-label data.
This class extends BaseDataset to provide MRCGNN-specific data loading logic, including construction of adversarial samples for contrastive learning.
- Parameters:
args (argparse.ArgumentParser) – Command line arguments containing configuration parameters.
UnifiedDataset
- class openddi.data.Unified_dataset.UnifiedDataset(args: Namespace)
Bases: BaseDataset
Unified dataset class that inherits from BaseDataset.
Features:
Unified reading of id->embedding (supports multi-modal concatenation, see args.embedding_path/embedding_dir + --modality)
Multi-class: Uses real relationship types as edge_type (used by RGCN)
Multi-label: Uses single relationship graph (edge_type=0)
Supports feature Gaussian noise (noise_std) and label flip noise (noise_ratio)
DataLoader: pin_memory=False, persistent_workers=False; prefetch_factor=1 when workers>0
Supports sparse sampling (sparse_sample_rate) and sparse dropping (sparse_drop_rate)
- __init__(args: Namespace)
Initialize the Unified dataset.
- Parameters:
args – Namespace object containing the following parameters:
matrix: Data type ('multilabel', 'twosides' or other multi-class types)
embedding_path: Embedding file path
matrix_path: Matrix data file path
batch: Batch size
workers: Number of worker processes (optional)
noise_std: Feature Gaussian noise standard deviation (optional)
noise_ratio: Label noise ratio (optional)
sparse_sample_rate: Sparse sampling rate (optional)
sparse_drop_rate: Sparse drop rate (optional)
network_ratio: Graph edge usage ratio (optional)
flip_per_label: Multi-label flip bits (optional, default 50)
ZeroDDI_dataset
- class openddi.data.ZeroDDI_dataset.ZeroDDI_dataset(args: Namespace)
Bases: BaseDataset
ZeroDDI dataset class with support for multi-modal node features and zero-shot learning.
Features:
Node modalities (supports concatenation of multiple CSV/PT files)
Training set graph construction (DDI graph)
Supports regular/multi-label classification
Feature and label noise injection during loading phase (training set only)
Zero-shot learning protocols (CZSL, GZSL)
GoGNN_dataset
- class openddi.data.GoGNN_dataset.GoGNN_dataset(args: ArgumentParser)
Bases: BaseDataset
GoGNN dataset class for graph-based DDI prediction.
Features:
Molecular graph construction from SMILES
Node feature extraction (element, atomic properties, hybridization)
Edge feature processing (bond types)
Supports multiclass and multilabel classification
MUFFIN_dataset
- class openddi.data.MUFFIN_dataset.MUFFIN_dataset(args: ArgumentParser)
Bases: BaseDataset
MUFFIN dataset class for multi-modal DDI prediction.
Features:
Dual embedding support (entity and structure embeddings)
Pre-trained embedding integration
Supports both multiclass and multilabel classification
MVA_dataset
- class openddi.data.MVA_dataset.MVA_dataset(args: ArgumentParser)
Bases: BaseDataset
MVA dataset class for multi-view attention DDI prediction.
Features:
Molecular graph representation from SMILES
BPE encoding for SMILES sequences
Atom feature extraction with multiple attribute types
Supports both multiclass and multilabel classification
TIGER_dataset
- class openddi.data.TIGER_dataset.TIGER_dataset(args: ArgumentParser)
Bases: BaseDataset
TIGER dataset class for knowledge graph-enhanced DDI prediction.
Features:
Molecular graph representation from SMILES
Knowledge graph subgraph extraction via random walks
Dual graph representation (molecular + knowledge graph)
Supports both multiclass and multilabel classification
dataset_manager
- class openddi.data.dataset_manager.dataset_manager(args: ArgumentParser)
Bases: object
A manager class for handling different dataset types based on the model.
This class maps model names to their corresponding dataset classes and provides functionality to load the appropriate dataset based on the specified model.
- Parameters:
args (argparse.ArgumentParser) – Command line arguments containing model specification and other parameters.
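The mapping from model names to dataset classes can be pictured as a small registry. The sketch below is an illustration of that pattern only: the stub classes, the `DatasetRegistry` name, the `load` method, the `args.model` attribute, and the model-name strings are all hypothetical and are not taken from openddi.

```python
from argparse import Namespace

class BaseDatasetStub:
    """Stand-in for a dataset base class; stores the parsed arguments."""
    def __init__(self, args):
        self.args = args

class MRCGNNStub(BaseDatasetStub):
    pass

class UnifiedStub(BaseDatasetStub):
    pass

class DatasetRegistry:
    """Hypothetical manager: picks a dataset class from the model name."""
    def __init__(self, args):
        self.args = args
        self.model_to_dataset = {"MRCGNN": MRCGNNStub, "Unified": UnifiedStub}

    def load(self):
        try:
            dataset_cls = self.model_to_dataset[self.args.model]
        except KeyError:
            raise ValueError(f"Unknown model: {self.args.model!r}") from None
        return dataset_cls(self.args)

dataset = DatasetRegistry(Namespace(model="MRCGNN")).load()
```

Keeping the mapping in one dictionary means that supporting a new model only requires registering its dataset class.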