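In the AlphaFold3 protein_dataset module, the ProteinDataset class is mainly responsible for building a dataset from structured protein data that a model can use for training or inference. The __init__ method of ProteinDataset initializes a protein dataset object.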
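To make the role of a few constructor arguments concrete, here is a minimal sketch of the filtering that `dataset_folder`, `max_length`, `use_fraction` and `debug` describe. `ToyProteinDataset` and `build_demo_folder` are hypothetical names invented for this illustration, not part of the real module. The sketch assumes each `{biounit_id}.pickle` file holds a dictionary mapping chain ids to dictionaries with a `"seq"` string, and it applies `use_fraction` to entries rather than clusters, which is a simplification of the documented behavior.

```python
import os
import pickle
import tempfile


class ToyProteinDataset:
    """Hypothetical stand-in that mimics a few ProteinDataset arguments."""

    def __init__(self, dataset_folder, max_length=None, use_fraction=1, debug=False):
        # Files are expected to be named {biounit_id}.pickle; sort them alphabetically.
        files = sorted(f for f in os.listdir(dataset_folder) if f.endswith(".pickle"))
        if debug:
            files = files[:1000]  # debug mode: only process 1000 files
        entries = []
        for name in files:
            with open(os.path.join(dataset_folder, name), "rb") as fh:
                chains = pickle.load(fh)  # assumed layout: {chain_id: {"seq": str}}
            total_length = sum(len(chain["seq"]) for chain in chains.values())
            # Skip entries whose total chain length exceeds max_length.
            if max_length is not None and total_length > max_length:
                continue
            entries.append((name[: -len(".pickle")], chains))
        # Keep the first N entries in alphabetical order (simplified: entries, not clusters).
        keep = int(len(entries) * use_fraction)
        self.entries = entries[:keep]

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        return self.entries[idx]


def build_demo_folder(folder):
    """Write three tiny made-up biounits so the sketch can run anywhere."""
    demo = {
        "1abc": {"A": {"seq": "ACDEFG"}},
        "2xyz": {"A": {"seq": "ACD"}, "B": {"seq": "KLMNPQRST"}},
        "3def": {"A": {"seq": "GG"}},
    }
    for biounit_id, chains in demo.items():
        with open(os.path.join(folder, biounit_id + ".pickle"), "wb") as fh:
            pickle.dump(chains, fh)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_folder(tmp)
        dataset = ToyProteinDataset(tmp, max_length=10)
        print(len(dataset), [biounit_id for biounit_id, _ in dataset.entries])
```

With `max_length=10`, the two-chain entry (total length 12) is dropped and the other two are kept. The real class does far more (feature computation, clustering, masking), so this only illustrates the file-selection step.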
Source code:
def __init__(
    self,
    dataset_folder,
    features_folder="./data/tmp/",
    clustering_dict_path=None,
    max_length=None,
    rewrite=False,
    use_fraction=1,
    load_to_ram=False,
    debug=False,
    interpolate="none",
    node_features_type="zeros",
    debug_file_path=None,
    entry_type="biounit",  # biounit, chain, pair
    classes_to_exclude=None,  # heteromers, homomers, single_chains
    shuffle_clusters=True,
    min_cdr_length=None,
    feature_functions=None,
    classes_dict_path=None,
    cut_edges=False,
    mask_residues=True,
    lower_limit=15,
    upper_limit=100,
    mask_frac=None,
    mask_whole_chains=False,
    mask_sequential=False,
    force_binding_sites_frac=0.15,
    mask_all_cdrs=False,
    load_ligands=False,
    pyg_graph=False,
    patch_around_mask=False,
    initial_patch_size=128,
    antigen_patch_size=128,
    require_antigen=False,
    require_light_chain=False,
    require_no_light_chain=False,
    require_heavy_chain=False,
):
    """Initialize the dataset.

    Parameters
    ----------
    dataset_folder : str
        the path to the folder with proteinflow format input files (assumes that files are named {biounit_id}.pickle)
    features_folder : str, default "./data/tmp/"
        the path to the folder where the ProteinMPNN features will be saved
    clustering_dict_path : str, optional
        path to the pickled clustering dictionary (keys are cluster ids, values are (biounit id, chain id) tuples)
    max_length : int, optional
        entries with total length of chains larger than `max_length` will be disregarded
    rewrite : bool, default False
        if `False`, existing feature files are not overwritten
    use_fraction : float, default 1
        the fraction of the clusters to use (first N in alphabetic order)
    load_to_ram : bool, default False
        if `True`, the data will be stored in RAM (use with caution! if RAM isn't big enough the machine might crash)
    debug : bool, default False
        only process 1000 files
    interpolate : {"none", "only_middle", "all"}
        `"none"` for no interpolation, `"only_middle"` for only linear interpolation in the middle, `"all"` for linear interpolation + ends generation
    node_features_type : {"zeros", "dihedral", "sidechain_orientation", "chemical", "secondary_structure" or combinations with "+"}
        the type of node features, e.g. `"dihedral"` or `"sidechain_orientation+chemical"`
    debug_file_path : str, optional
        if not `None`, open this single file instead of loading the dataset
    entry_type : {"biounit", "chain", "pair"}
        the type of entries to generate (`"biounit"` for biounit-level complexes, `"chain"` for chain-level, `"pair"`
        for chain-chain pairs (all pairs that are seen in the same biounit and have intersecting coordinate clouds))
    classes_to_exclude : list of str, optional
        a list of classes to exclude from the dataset (select from `"single_chain"`, `"heteromer"`, `"homomer"`)
    shuffle_clusters : bool, default True
        if `True`, a new representative is randomly selected for each cluster at each epoch (if `clustering_dict_path` is given)
    min_cdr_length : int, optional
        for SAbDab datasets, biounits with CDRs shorter than `min_cdr_length` will be excluded
    feature_functions : dict, optional
        a dictionary of functions to compute additional features (keys are the names of the features, values are the functions)
    classes_dict_path : str, optional
        a path to a pickled dictionary with biounit classes (single chain / heteromer / homomer)
    cut_edges : bool, default False
        if `True`, missing values at the edges of the sequence will be cut off
    mask_residues : bool, default True
        if `True`, the masked residues will be added to the output
    lower_limit : int, default 15
        the lower limit of the number of residues to mask
    upper_limit : int, default 100
        the upper limit of the number of residues to mask
    mask_frac : float, optional
        if given, the number of residues to mask is `mask_frac` times the length of the chain
    mask_whole_chains : bool, default False
        if `True`, the whole chain is masked
    mask_sequential : bool, default False
        if `True`, the masked residues will be neighbors in the sequence; otherwise a geometric
        mask is applied based on the coordinates
    force_binding_sites_frac : float, default 0.15
        if `force_binding_sites_frac` > 0 and `mask_whole_chains` is `False`, in the fraction of cases where a chain
        from a polymer is sampled, the center