Reading data files
pyAscore relies on pyteomics to read XML files of spectra and PSMs, and on pandas to read TSV files. These libraries are powerful general-purpose readers, but pyAscore also provides convenience classes that extract only the information relevant for scoring localizations and return it in a standard format. This functionality can be accessed through the SpectraParser and IdentificationParser classes.
Reading spectra
Reading a spectra file is fairly straightforward: specify the file name and the file type. An optional argument sets the MSn level, but the default MSn level of 2 should be correct for most purposes. Spectra are read in parallel if multiple cores are available and can be returned as a list of dictionaries or a dictionary of dictionaries.
from pyascore import SpectraParser

spectra_file = "spectra.mzML"
spectra_parser = SpectraParser(spectra_file, "mzML")
# To a list of dictionaries:
spectra_list = spectra_parser.to_list()
# To a dictionary of dictionaries:
spectra_dict = spectra_parser.to_dict()
Reading PSMs
PSM files are read in much the same way as spectra files: specify the file name and the file type.
from pyascore import IdentificationParser

psm_file = "psms.pep.xml"
psm_parser = IdentificationParser(psm_file, "pepXML")
# To a list of dictionaries:
psm_list = psm_parser.to_list()
# To a dictionary of dictionaries:
psm_dict = psm_parser.to_dict()
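One reason to prefer the dictionary form is that scan numbers make it easy to pair each PSM with its spectrum. The sketch below uses small hand-written stand-ins rather than real parser output, and the "scan", "peptide", and "mz_values" keys are assumptions made for illustration; inspect an actual entry before relying on them. The helper pair_psms_with_spectra is hypothetical and not part of pyAscore.

```python
# Stand-in for spectra_parser.to_dict(): {scan number: spectrum entry}.
# The "mz_values" key is an assumed field name.
spectra_dict = {
    101: {"mz_values": [200.1, 300.2]},
    102: {"mz_values": [150.0, 250.5, 350.7]},
}

# Stand-in for psm_parser.to_list(); the "scan" and "peptide" keys are assumed.
psm_list = [
    {"scan": 102, "peptide": "PEPTIDEK"},
    {"scan": 999, "peptide": "ASTK"},  # no spectrum with this scan number
]


def pair_psms_with_spectra(psms, spectra_by_scan):
    """Return (psm, spectrum) pairs for PSMs whose scan number has a spectrum."""
    pairs = []
    for psm in psms:
        spectrum = spectra_by_scan.get(psm["scan"])
        if spectrum is not None:
            pairs.append((psm, spectrum))
    return pairs


pairs = pair_psms_with_spectra(psm_list, spectra_dict)
```

PSMs without a matching spectrum are skipped here; whether to skip or raise an error in that case is a design choice that depends on the workflow.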
Likely the most useful extra feature of this module is mass correction, which ensures that the correct modification mass is associated with each residue as PSMs are read in. This is important because many search engines and other programs truncate modification masses. The functionality is supplied by the MassCorrector class, which comes with several modifications built in. If you have a modification that isn’t recognized, supply your own modification dictionary as in the following code.
from pyascore import IdentificationParser, MassCorrector

modifications = {"n": 42.010565,  # N-term acetylation
                 "M": 15.9949,    # Methionine oxidation
                 "S": 79.966331,  # Serine phosphorylation
                 "T": 79.966331,  # Threonine phosphorylation
                 "Y": 79.966331,  # Tyrosine phosphorylation
                 "C": 57.021464}  # Cysteine carbamidomethylation
mass_corrector = MassCorrector(modifications, mz_tol=1.5)

psm_file = "psms.pep.xml"
psm_parser = IdentificationParser(psm_file,
                                  "pepXML",
                                  mass_corrector)
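To make the idea of mass correction concrete, here is a minimal standalone sketch of snapping a rounded or combined mass back to a known modification mass. It is not pyAscore's implementation: the function snap_mod_mass and its return convention are hypothetical, and only the masses and the default tolerance of 1.5 are taken from this page.

```python
# Known modification masses and residue masses, copied from the defaults
# listed in the class reference. Only three residues are included for brevity.
KNOWN_MODS = {"M": 15.9949, "S": 79.966331, "C": 57.021464}
RESIDUE_MASSES = {
    "M": 131.04048491299,
    "S": 87.03202840427001,
    "C": 103.00918478471,
}


def snap_mod_mass(residue, reported_mass, tol=1.5):
    """Hypothetical helper: return the known mod mass for `residue`, or None.

    Handles two situations: the file reports the modification mass alone
    (possibly rounded), or it reports residue mass + modification mass.
    """
    known = KNOWN_MODS.get(residue)
    if known is None:
        return None
    # Case 1: modification mass reported on its own, possibly rounded.
    if abs(reported_mass - known) <= tol:
        return known
    # Case 2: residue and modification masses combined; decouple them first.
    decoupled = reported_mass - RESIDUE_MASSES[residue]
    if abs(decoupled - known) <= tol:
        return known
    return None


rounded = snap_mod_mass("S", 80.0)       # rounded phosphorylation mass
combined = snap_mod_mass("M", 147.04)    # oxidized methionine, combined mass
unknown = snap_mod_mass("S", 42.0)       # too far from any known serine mod
```

A larger tolerance recovers more heavily rounded masses but increases the chance of matching the wrong modification when two known masses for the same residue are close together.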
Class Reference
- class pyascore.SpectraParser(spec_file_name, spec_file_format, ms_level=2, custom_filter=None)
Parser to read spectra from mzML and mzXML files
This class is designed to provide ease of access to spectra from popular file formats. By providing the file of interest and the MSn level, the user can receive spectra in either list form or dictionary form depending on need. Spectral entries from individual file formats are normalized to a single output schema, which has the information necessary for running pyAscore. This output can be further filtered by supplying a custom filter, and future versions will make this option more powerful.
- Parameters:
- spec_file_name : str
Path to spectral file
- spec_file_format : str
The spectra file’s format. One of mzML or mzXML
- ms_level : int
MSn level to be returned to the user
- custom_filter : callable
A callable which takes a spectral object and returns a boolean stating whether the spectrum should be retained. The spectral objects passed to this callable are currently the same as the ones returned to the user
Methods
to_dict(): Return spectra from file in dictionary form
to_list(): Return spectra from file in list form
- to_dict()
Return spectra from file in dictionary form
- Returns:
- dict
Dict of scans from file with schema: {scan number : spectrum}
- to_list()
Return spectra from file in list form
- Returns:
- list
List of scans from file sorted by scan number
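A custom_filter is any callable that accepts one spectral entry and returns True to keep it. The sketch below shows the shape of such a callable. The "scan" and "mz_values" keys are assumed field names used only for illustration, so inspect an entry returned by to_list() to confirm the real schema. The filter is exercised on hand-written entries here; in practice it would be passed as the custom_filter argument of SpectraParser.

```python
def has_enough_peaks(spectrum, min_peaks=3):
    """Keep spectra with at least `min_peaks` peaks.

    "mz_values" is an assumed key, not a documented part of the schema.
    """
    return len(spectrum.get("mz_values", [])) >= min_peaks


# Hand-written stand-ins for spectral entries.
example_spectra = [
    {"scan": 1, "mz_values": [100.0, 200.0]},
    {"scan": 2, "mz_values": [100.0, 200.0, 300.0, 400.0]},
]

kept_scans = [s["scan"] for s in example_spectra if has_enough_peaks(s)]
```

Because the callable receives a single argument, any extra settings such as min_peaks need a default value or should be bound beforehand, for example with functools.partial.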
- class pyascore.IdentificationParser(id_file_name, id_file_format, mass_corrector=<pyascore.parsing.id_parsers.MassCorrector object>, score_string=None, score_threshold=None, score_lower_better=True, score_func=None, static_mods={'C': 57.021464}, spec_file_name=None)
Parser for modification information coming from PSM file formats
This class is designed to provide ease of access to PSMs from popular file formats. By providing the file of interest, the user can receive PSMs in either list form or dictionary form depending on need. PSM entries from individual file formats are normalized to a single output format, which has the information necessary for running pyAscore. Sometimes modifications are not formatted in a straightforward way, and there is usually a good chance that their masses will be truncated. This class will attempt to normalize modifications so that all can be interpreted similarly.
- Parameters:
- id_file_name : str
Path to file containing PSMs
- id_file_format : str
The format of the file containing PSMs. One of mzIdentML, pepXML, percolatorTXT, or mokapotTXT
- mass_corrector : MassCorrector
Corrector class to normalize peptide modifications
- score_string : str
String for score to extract from PSMs
- score_threshold : float
Threshold to filter PSMs based on score
- score_lower_better : bool
Whether a lower score is better than a higher score
- score_func : callable
Transformation for scores
- static_mods : dict
Dictionary of static mods to be used by some extractors.
- spec_file_name : str
Currently not used
Methods
to_dict(): Return modified hits as dict
to_list(): Return modified hits as list of dicts
- to_dict()
Return modified hits as dict
- Returns:
- dict
Dict of PSMs from file with schema: {scan number : PSM}
- to_list()
Return modified hits as list of dicts
- Returns:
- list
List of PSMs from file sorted by scan number
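The score parameters work together: score_func transforms a score, score_threshold sets the cutoff, and score_lower_better sets the direction of the comparison. The standalone sketch below shows one plausible reading of that interaction. It is not pyAscore's implementation: the helper name passes_threshold, applying the transformation before the comparison, and treating the cutoff as inclusive are all assumptions.

```python
import math


def passes_threshold(score, threshold=None, lower_better=True, score_func=None):
    """Hypothetical helper: decide whether a PSM score survives filtering."""
    # Assumption: the transformation is applied before the comparison.
    if score_func is not None:
        score = score_func(score)
    # No threshold means nothing is filtered out.
    if threshold is None:
        return True
    # Assumption: the cutoff itself is kept (inclusive comparison).
    return score <= threshold if lower_better else score >= threshold


# Lower is better, e.g. a q-value cutoff of 0.01.
keep_q = passes_threshold(0.005, threshold=0.01)
drop_q = passes_threshold(0.05, threshold=0.01)

# Higher is better after a -log10 transform of a p-value.
keep_p = passes_threshold(
    1e-3,
    threshold=2.0,
    lower_better=False,
    score_func=lambda s: -math.log10(s),
)
```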
- class pyascore.MassCorrector(mod_mass_dict={'C': 57.021464, 'K': 8.014199, 'M': 15.9949, 'S': 79.966331, 'T': 79.966331, 'Y': 79.966331, 'n': 42.010565}, aa_mass_dict={'A': 71.03711378471, 'C': 103.00918478471, 'D': 115.02694302383001, 'E': 129.04259308796998, 'F': 147.06841391298997, 'G': 57.02146372057, 'H': 137.05891185845002, 'I': 113.08406397713001, 'J': 113.08406397713001, 'K': 128.09496301399997, 'L': 113.08406397713001, 'M': 131.04048491299, 'N': 114.04292744114001, 'O': 237.14772686284996, 'P': 97.05276384885, 'Q': 128.05857750527997, 'R': 156.10111102359997, 'S': 87.03202840427001, 'T': 101.04767846841, 'U': 150.95363508471, 'V': 99.06841391299, 'W': 186.07931294985997, 'Y': 163.06332853254997}, mz_tol=1.5, n_mod_ind=0)
A class to provide modification mass correction
Often the mass of a modification is rounded when it comes from PSM files, or the mass of the modification is combined with the mass of the amino acid and needs to be decoupled. These situations make the mod difficult to work with, so this class corrects rounded and combined modification masses.
- Parameters:
- mod_mass_dict : dict
A dictionary of known modifications to be used for correction
- aa_mass_dict : dict
Correct masses for the amino acids in the peptide sequence
- mz_tol : float
How far a rounded mass can be from a known mod before it is counted as unknown
- n_mod_ind : int
Whether an n-terminal mod is considered to be before the first AA (0) or after the first AA (1)
Methods
correct(res, pos, mass): Perform modification correction
correct_multiple(peptide, positions, masses): Run the correction function on a list of mods for a single peptide
correct_numpy(peptide, positions, masses): Correct masses taking advantage of numpy internals.