Reading data files

pyAscore relies on pyteomics to read XML files of spectra and PSMs, and on pandas to read TSV files. While these libraries are a powerful way to extract information from files, pyAscore also provides convenience classes that extract just the information relevant for scoring localizations from files in a standard format. This functionality can be accessed through the SpectraParser and IdentificationParser classes.

Reading spectra

Reading a spectra file is fairly straightforward and only requires the file name and file type. An optional argument for MSn level is also provided, but the default MSn level of 2 should be correct for most purposes. Spectra are read in parallel if multiple cores are available and can be returned as a list of dictionaries or a dictionary of dictionaries.

from pyascore import SpectraParser

spectra_file = "spectra.mzML"
spectra_parser = SpectraParser(spectra_file, "mzML")

# To a list of dictionaries:
spectra_list = spectra_parser.to_list()

# To a dictionary of dictionaries:
spectra_dict = spectra_parser.to_dict()

Reading PSMs

PSM files are read in much the same way as spectra files: specify the file name and the file type.

from pyascore import IdentificationParser

psm_file = "psms.pep.xml"
psm_parser = IdentificationParser(psm_file, "pepXML")

# To a list of dictionaries:
psm_list = psm_parser.to_list()

# To a dictionary of dictionaries:
psm_dict = psm_parser.to_dict()
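Because both parsers can return dictionaries keyed by scan number, the dictionary form is convenient for pairing each PSM with its spectrum. The following is a minimal sketch using mock data in place of real parser output. The inner field names ("mz_values", "peptide") and the helper pair_by_scan are hypothetical placeholders, not part of pyAscore, and the sketch assumes one PSM per scan.

```python
# Stand-ins for spectra_parser.to_dict() and psm_parser.to_dict().
# The inner field names are hypothetical, not the actual pyAscore schema.
spectra_dict = {
    101: {"mz_values": [100.1, 200.2]},
    102: {"mz_values": [150.5, 250.7]},
    103: {"mz_values": [175.0]},
}
psm_dict = {
    101: {"peptide": "PEPTIDEK"},
    103: {"peptide": "SAMPLER"},
    104: {"peptide": "LEFTALNEK"},  # no matching spectrum in this mock file
}


def pair_by_scan(spectra, psms):
    """Return (scan, spectrum, psm) tuples for scans present in both dicts."""
    shared_scans = sorted(spectra.keys() & psms.keys())
    return [(scan, spectra[scan], psms[scan]) for scan in shared_scans]


paired = pair_by_scan(spectra_dict, psm_dict)
for scan, spectrum, psm in paired:
    print(scan, psm["peptide"], len(spectrum["mz_values"]))
```

Scans that appear in only one of the two dictionaries are dropped, so unmatched PSMs or spectra never reach the scoring step.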

Perhaps the most useful extra feature of this module is mass correction, which ensures that the correct modification mass is associated with each residue as PSMs are read in. This matters because many search engines and other programs truncate modification masses. The functionality is supplied by the MassCorrector class, which comes with several modifications built in. If you have a modification that isn’t recognized, supply your own modification dictionary as in the following code.

from pyascore import IdentificationParser, MassCorrector

modifications = {"n": 42.010565, # N-term acetylation
                 "M": 15.9949,   # Methionine oxidation
                 "S": 79.966331, # Serine Phosphorylation
                 "T": 79.966331, # Threonine Phosphorylation
                 "Y": 79.966331, # Tyrosine Phosphorylation
                 "C": 57.021464} # Cysteine Carbamidomethylation
mass_corrector = MassCorrector(modifications, mz_tol=1.5)

psm_file = "psms.pep.xml"
psm_parser = IdentificationParser(psm_file,
                                  "pepXML",
                                  mass_corrector)
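To make the idea of mass correction concrete, here is a small self-contained sketch of the kind of matching involved. It is not the MassCorrector implementation: correct_mass is a hypothetical helper, and the logic (try the reported mass as-is, then with the residue mass subtracted) is an assumption based on the rounded and combined cases described in the class reference. The masses are a subset of the defaults listed in this document.

```python
# Illustrative subset of the default modification and residue masses.
KNOWN_MODS = {"M": 15.9949, "S": 79.966331, "C": 57.021464}
AA_MASSES = {"M": 131.04048491299, "S": 87.03202840427001, "C": 103.00918478471}


def correct_mass(residue, reported_mass, tol=1.5):
    """Return the known mod mass for `residue`, or None if nothing matches.

    Hypothetical helper: checks the reported mass directly (rounded case),
    then with the residue mass subtracted (combined case).
    """
    known = KNOWN_MODS.get(residue)
    if known is None:
        return None
    candidates = [reported_mass]
    if residue in AA_MASSES:
        candidates.append(reported_mass - AA_MASSES[residue])
    for candidate in candidates:
        if abs(candidate - known) <= tol:
            return known
    return None


print(correct_mass("S", 80.0))    # rounded phosphorylation mass
print(correct_mass("M", 147.04))  # oxidation combined with the methionine mass
print(correct_mass("S", 42.0))    # nothing known for serine within tolerance
```

A tolerance of 1.5 is wide enough to absorb integer rounding of the listed modifications while still keeping them distinguishable from one another on the same residue.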

Class Reference

class pyascore.SpectraParser(spec_file_name, spec_file_format, ms_level=2, custom_filter=None)

Parser to read spectra from mzML and mzXML files

This class is designed to provide ease of access to spectra from popular file formats. By providing the file of interest and the MSn level, the user can receive spectra in either list form or dictionary form depending on need. Spectral entries from individual file formats are normalized to a single output schema, which has the information necessary for running pyAscore. This output can be further filtered by supplying a custom filter, and future versions will make this option more powerful.

Parameters:

spec_file_name (str): Path to spectral file
spec_file_format (str): The spectra file’s format. One of mzML or mzXML
ms_level (int): MSn level to be returned to the user
custom_filter (callable): A callable which takes a spectral object and returns a boolean stating whether the spectrum should be retained. The spectral objects passed to this callable are currently the same as the ones returned to the user
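As an illustration of the custom_filter parameter, the sketch below defines a filter that keeps only spectra with a minimum number of peaks. It assumes the spectral object is dict-like with an "mz_values" entry. That key name is a placeholder chosen for this example, not a documented field, so check the objects returned by to_list() before adapting it.

```python
def has_enough_peaks(spectrum, min_peaks=10):
    """Return True when the spectrum has at least `min_peaks` peaks.

    Assumes a dict-like spectrum with a hypothetical "mz_values" entry.
    """
    return len(spectrum["mz_values"]) >= min_peaks


# Mock spectra standing in for the objects a parser would pass to the filter.
sparse = {"mz_values": [100.0, 200.0]}
dense = {"mz_values": [float(mz) for mz in range(100, 120)]}

kept = [s for s in (sparse, dense) if has_enough_peaks(s)]
print(len(kept))
```

Given the signature above, such a callable would be supplied as SpectraParser(spec_file_name, spec_file_format, custom_filter=has_enough_peaks).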

Methods

`to_dict`()	Return spectra from file in dictionary form
`to_list`()	Return spectra from file in list form

to_dict()

Return spectra from file in dictionary form

Returns:

dict: Dict of scans from file with schema: {scan number : spectrum}

to_list()

Return spectra from file in list form

Returns:

list: List of scans from file sorted by scan number

class pyascore.IdentificationParser(id_file_name, id_file_format, mass_corrector=<pyascore.parsing.id_parsers.MassCorrector object>, score_string=None, score_threshold=None, score_lower_better=True, score_func=None, static_mods={'C': 57.021464}, spec_file_name=None)

Parser for modification information coming from PSM file formats

This class is designed to provide ease of access to PSMs from popular file formats. By providing the file of interest, the user can receive PSMs in either list form or dictionary form depending on need. PSM entries from individual file formats are normalized to a single output format, which has the information necessary for running pyAscore. Sometimes modifications are not formatted in a straightforward way, and there is usually a good chance that their masses will be truncated. This class will attempt to normalize modifications so that all can be interpreted similarly.

Parameters:

id_file_name (str): Path to file containing PSMs
id_file_format (str): The format of the file containing PSMs. One of mzIdentML, pepXML, percolatorTXT, or mokapotTXT
mass_corrector (MassCorrector): Corrector class to normalize peptide modifications
score_string (str): String for score to extract from PSMs
score_threshold (float): Threshold to filter PSMs based on score
score_lower_better (bool): Whether a lower score is better than a higher score
score_func (callable): Transformation for scores
static_mods (dict): Dictionary of static mods to be used by some extractors
spec_file_name (str): Currently not used
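The interplay of score_threshold, score_lower_better, and score_func can be pictured with the sketch below. It shows one plausible reading of these parameters, not the parser's actual logic: passes_threshold and neg_log10 are hypothetical helpers, the inclusive comparison is an assumption, and whether a transformation is applied before or after thresholding is not specified here.

```python
import math


def passes_threshold(score, threshold, lower_better=True):
    """Hypothetical check: is `score` on the good side of `threshold`?

    A threshold of None is treated as "no filtering" (an assumption).
    """
    if threshold is None:
        return True
    return score <= threshold if lower_better else score >= threshold


def neg_log10(score):
    """Example of a callable that could serve as a score transformation."""
    return -math.log10(score)


print(passes_threshold(0.005, 0.01))                     # q-value style score
print(passes_threshold(25.0, 20.0, lower_better=False))  # higher-is-better score
print(neg_log10(0.01))
```

For error-rate style scores such as q-values, lower is better, which matches the default of score_lower_better=True in the signature above.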

Methods

`to_dict`()	Return modified hits as dict
`to_list`()	Return modified hits as list of dicts

to_dict()

Return modified hits as dict

Returns:

dict: Dict of PSMs from file with schema: {scan number : PSM}

to_list()

Return modified hits as list of dicts

Returns:

list: List of PSMs from file sorted by scan number

class pyascore.MassCorrector(mod_mass_dict={'C': 57.021464, 'K': 8.014199, 'M': 15.9949, 'S': 79.966331, 'T': 79.966331, 'Y': 79.966331, 'n': 42.010565}, aa_mass_dict={'A': 71.03711378471, 'C': 103.00918478471, 'D': 115.02694302383001, 'E': 129.04259308796998, 'F': 147.06841391298997, 'G': 57.02146372057, 'H': 137.05891185845002, 'I': 113.08406397713001, 'J': 113.08406397713001, 'K': 128.09496301399997, 'L': 113.08406397713001, 'M': 131.04048491299, 'N': 114.04292744114001, 'O': 237.14772686284996, 'P': 97.05276384885, 'Q': 128.05857750527997, 'R': 156.10111102359997, 'S': 87.03202840427001, 'T': 101.04767846841, 'U': 150.95363508471, 'V': 99.06841391299, 'W': 186.07931294985997, 'Y': 163.06332853254997}, mz_tol=1.5, n_mod_ind=0)

A class to provide modification mass correction

Often the mass of a modification is rounded when it comes from PSM files, or the mass of the modification may be combined with the mass of the amino acid and needs to be decoupled. These situations make the mod difficult to work with, so this class corrects rounded and combined modification masses.

Parameters:

mod_mass_dict (dict): A dictionary of known modifications to be used for correction
aa_mass_dict (dict): Correct masses for the amino acids in the peptide sequence
mz_tol (float): How far a rounded mass can be from a known mod before it is counted as unknown
n_mod_ind (int): Whether an N-terminal mod is considered to be before the first AA (0) or after the first AA (1)

Methods

`correct`(res, pos, mass)	Perform modification correction
`correct_multiple`(peptide, positions, masses)	Run the correction function on a list of mods for a single peptide
`correct_numpy`(peptide, positions, masses)	Correct masses taking advantage of numpy internals.