Reading data files

pyAscore relies on pyteomics to read XML files of spectra and PSMs, and on pandas to read TSV files. While these libraries are a powerful way to extract information from files, pyAscore also provides convenience classes that extract just the information relevant for scoring localizations and return it in a standard format. This functionality can be accessed through the SpectraParser and IdentificationParser classes.

Reading spectra

Reading spectra files is fairly straightforward: specify the file name and the file type. An optional argument for the MSn level is also provided, but the default MSn level of 2 should be correct for most purposes. Spectra are read in parallel if multiple cores are available and can be returned as a list of dictionaries or a dictionary of dictionaries.

from pyascore import SpectraParser

spectra_file = "spectra.mzML"
spectra_parser = SpectraParser(spectra_file, "mzML")

# To a list of dictionaries:
spectra_list = spectra_parser.to_list()

# To a dictionary of dictionaries:
spectra_dict = spectra_parser.to_dict()

Reading PSMs

PSM files are read in much the same way as spectra files: specify the file name and the file type.

from pyascore import IdentificationParser

psm_file = "psms.pep.xml"
psm_parser = IdentificationParser(psm_file, "pepXML")

# To a list of dictionaries:
psm_list = psm_parser.to_list()

# To a dictionary of dictionaries:
psm_dict = psm_parser.to_dict()
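
Because both parsers can return dictionaries keyed by scan number, PSMs and spectra can be paired with a plain dictionary lookup. The sketch below uses small hand-made stand-ins rather than real parser output; only the outer layout (scan number to entry) is taken from this page, and the inner keys `peptide` and `mz_values` as well as the helper `pair_by_scan` are hypothetical placeholders, not the actual pyAscore schema or API.

```python
# Toy stand-ins for spectra_parser.to_dict() and psm_parser.to_dict().
# The inner keys are hypothetical and only serve the illustration.
spectra_dict = {
    101: {"mz_values": [100.1, 200.2]},
    102: {"mz_values": [150.5, 300.3]},
    103: {"mz_values": [175.0]},
}
psm_dict = {
    101: {"peptide": "PEPTIDEK"},
    103: {"peptide": "SAMPLER"},
    104: {"peptide": "PEPTIDER"},  # no matching spectrum in this toy data
}


def pair_by_scan(psms, spectra):
    """Return (scan, psm, spectrum) triples for scans present in both inputs."""
    shared_scans = sorted(psms.keys() & spectra.keys())
    return [(scan, psms[scan], spectra[scan]) for scan in shared_scans]


pairs = pair_by_scan(psm_dict, spectra_dict)
for scan, psm, spectrum in pairs:
    print(scan, psm["peptide"], len(spectrum["mz_values"]))
```

Scans that appear in only one of the two dictionaries are skipped, which is usually the desired behavior when a PSM file was filtered after the search.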

Probably the most useful extra feature of this module is mass correction, which makes sure that the correct modification mass is associated with each residue as PSMs are read in. This is important because many search engines and other programs truncate modification masses. The functionality is supplied by the MassCorrector class. It comes with several modifications built in, but if you have one that isn’t recognized, you can supply your own modification dictionary as in the following code.

from pyascore import IdentificationParser, MassCorrector

modifications = {"n": 42.010565, # N-term acetylation
                 "M": 15.9949,   # Methionine oxidation
                 "S": 79.966331, # Serine Phosphorylation
                 "T": 79.966331, # Threonine Phosphorylation
                 "Y": 79.966331, # Tyrosine Phosphorylation
                 "C": 57.021464} # Cysteine Carbamidomethylation
mass_corrector = MassCorrector(modifications, mz_tol=1.5)

psm_file = "psms.pep.xml"
psm_parser = IdentificationParser(psm_file,
                                  "pepXML",
                                  mass_corrector)
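
To make the idea of correcting a truncated mass concrete, here is a minimal standalone sketch. It is not pyAscore's implementation: the function `snap_to_known_mass` is hypothetical, and returning `None` for an unrecognized modification is an assumption made for illustration only. It simply looks up the known mass for a residue and accepts it when the reported mass falls within the tolerance.

```python
# Known modification masses, copied from the example dictionary above.
KNOWN_MODS = {
    "n": 42.010565,  # N-term acetylation
    "M": 15.9949,    # Methionine oxidation
    "S": 79.966331,  # Serine phosphorylation
    "C": 57.021464,  # Cysteine carbamidomethylation
}


def snap_to_known_mass(residue, reported_mass, known_mods, mz_tol=1.5):
    """Return the known mass for `residue` if `reported_mass` is within
    `mz_tol` of it, else None (an assumption of this sketch)."""
    known = known_mods.get(residue)
    if known is not None and abs(known - reported_mass) <= mz_tol:
        return known
    return None


# A search engine reports phosphoserine as "80" and oxidation as "16".
print(snap_to_known_mass("S", 80.0, KNOWN_MODS))
print(snap_to_known_mass("M", 16.0, KNOWN_MODS))
```

A tolerance as wide as 1.5 is workable here because the lookup is restricted to the modifications registered for the given residue, so a coarse mass only has to be distinguished from a handful of candidates.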

Class Reference

class pyascore.SpectraParser(spec_file_name, spec_file_format, ms_level=2, custom_filter=None)

Parser to read spectra from mzML and mzXML files

This class is designed to provide ease of access to spectra from popular file formats. By providing the file of interest and the MSn level, the user can receive spectra in either list form or dictionary form depending on need. Spectral entries from individual file formats are normalized to a single output schema, which has the information necessary for running pyAscore. This output can be further filtered by supplying a custom filter, and future versions will make this option more powerful.

Parameters:
spec_file_name : str

Path to spectral file

spec_file_format : str

The spectra file’s type. One of mzML or mzXML

ms_level : int

MSn level to be returned to the user

custom_filter : callable

A callable which takes a spectral object and returns a boolean stating whether the spectrum should be retained. The spectral objects passed to this callable are currently the same as the ones returned to the user

Methods

to_dict()

Return spectra from file in dictionary form

to_list()

Return spectra from file in list form

to_dict()

Return spectra from file in dictionary form

Returns:
dict

Dict of scans from file with schema: {scan number : spectra}

to_list()

Return spectra from file in list form

Returns:
list

List of scans from file sorted by scan number
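
The custom_filter parameter accepts any callable that maps a spectral object to a boolean. A minimal sketch of such a callable follows. The `mz_values` key, the `scan` key, and the peak-count cutoff are hypothetical, since the exact spectral schema is not spelled out on this page; adapt the key to whatever the parser actually returns.

```python
def has_enough_peaks(spectrum, min_peaks=10):
    """Keep a spectrum only if it has at least `min_peaks` peaks.

    Assumes (hypothetically) that the spectrum is a dict whose
    "mz_values" entry holds one value per peak.
    """
    return len(spectrum.get("mz_values", ())) >= min_peaks


# Toy spectra, filtered the way a parser might apply the callable.
toy_spectra = [
    {"scan": 1, "mz_values": [100.0 + i for i in range(25)]},
    {"scan": 2, "mz_values": [100.0, 200.0]},
]
retained = [s for s in toy_spectra if has_enough_peaks(s)]
print([s["scan"] for s in retained])

# With pyAscore installed, the callable would be passed as:
# SpectraParser("spectra.mzML", "mzML", custom_filter=has_enough_peaks)
```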

class pyascore.IdentificationParser(id_file_name, id_file_format, mass_corrector=<pyascore.parsing.id_parsers.MassCorrector object>, score_string=None, score_threshold=None, score_lower_better=True, score_func=None, static_mods={'C': 57.021464}, spec_file_name=None)

Parser for modification information coming from PSM file formats

This class is designed to provide ease of access to PSMs from popular file formats. By providing the file of interest, the user can receive PSMs in either list form or dictionary form depending on need. PSM entries from individual file formats are normalized to a single output format, which has the information necessary for running pyAscore. Modifications are sometimes not formatted in a straightforward way, and there is a good chance that their masses will be truncated. This class will attempt to normalize modifications so that all can be interpreted similarly.

Parameters:
id_file_name : str

Path to file containing PSMs

id_file_format : str

The format of the file containing PSMs. One of mzIdentML, pepXML, percolatorTXT, or mokapotTXT

mass_corrector : MassCorrector

Corrector class to normalize peptide modifications

score_string : str

String for score to extract from PSMs

score_threshold : float

Threshold to filter PSMs based on score

score_lower_better : bool

Whether a lower score is better than a higher score

score_func : callable

Transformation for scores

static_mods : dict

Dictionary of static mods to be used by some extractors.

spec_file_name : str

Currently not used

Methods

to_dict()

Return modified hits as dict

to_list()

Return modified hits as list of dicts

to_dict()

Return modified hits as dict

Returns:
dict

Dict of PSMs from file with schema: {scan number : PSM}

to_list()

Return modified hits as list of dicts

Returns:
list

List of PSMs from file sorted by scan number
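
The score_threshold, score_lower_better, and score_func parameters together describe a score filter. The sketch below shows one plausible reading of how they could interact; it is an assumption, not the parser's actual logic. In particular, applying the transformation before the comparison and treating the threshold as inclusive are choices made for this illustration, and `passes_threshold` is a hypothetical helper.

```python
import math


def passes_threshold(score, score_threshold=None,
                     score_lower_better=True, score_func=None):
    """Decide whether a PSM score survives filtering (illustrative only)."""
    if score_func is not None:
        score = score_func(score)  # assumed: transform first
    if score_threshold is None:
        return True                # no threshold means no filtering
    if score_lower_better:
        return score <= score_threshold
    return score >= score_threshold


# q-value style scores: lower is better, keep everything at or below 0.01.
print(passes_threshold(0.005, score_threshold=0.01))
print(passes_threshold(0.05, score_threshold=0.01))

# The same kind of score on a -log10 scale: higher is better.
neg_log10 = lambda p: -math.log10(p)
print(passes_threshold(0.001, score_threshold=2.0,
                       score_lower_better=False, score_func=neg_log10))
```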

class pyascore.MassCorrector(mod_mass_dict={'C': 57.021464, 'K': 8.014199, 'M': 15.9949, 'S': 79.966331, 'T': 79.966331, 'Y': 79.966331, 'n': 42.010565}, aa_mass_dict={'A': 71.03711378471, 'C': 103.00918478471, 'D': 115.02694302383001, 'E': 129.04259308796998, 'F': 147.06841391298997, 'G': 57.02146372057, 'H': 137.05891185845002, 'I': 113.08406397713001, 'J': 113.08406397713001, 'K': 128.09496301399997, 'L': 113.08406397713001, 'M': 131.04048491299, 'N': 114.04292744114001, 'O': 237.14772686284996, 'P': 97.05276384885, 'Q': 128.05857750527997, 'R': 156.10111102359997, 'S': 87.03202840427001, 'T': 101.04767846841, 'U': 150.95363508471, 'V': 99.06841391299, 'W': 186.07931294985997, 'Y': 163.06332853254997}, mz_tol=1.5, n_mod_ind=0)

A class to provide modification mass correction

Often the mass of a modification is rounded when it comes from PSM files, or the mass of the modification may be combined with the mass of the amino acid and needs to be decoupled. These situations make the mod difficult to work with, so this class corrects rounded and combined modification masses.

Parameters:
mod_mass_dict : dict

A dictionary of known modifications to be used for correction

aa_mass_dict : dict

Correct masses for the amino acids in the peptide sequence

mz_tol : float

How far a rounded mass can be from a known mod mass before it is counted as unknown

n_mod_ind : int

Whether an n-terminal mod is considered to be before the first AA (0) or after the first AA (1)

Methods

correct(res, pos, mass)

Perform modification correction

correct_multiple(peptide, positions, masses)

Run the correction function on a list of mods for a single peptide

correct_numpy(peptide, positions, masses)

Correct masses taking advantage of numpy internals.
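
As an illustration of the "combined mass" case described for this class, the following standalone sketch decouples a modification mass that was reported together with its amino acid mass. It is only a guess at the general approach, not the logic of `correct()`: the helper `decouple` is hypothetical, and the order in which the two interpretations are tried and the `None` return for unknown mods are assumptions of this sketch. The masses are taken from the default dictionaries in the signature above.

```python
# Subsets of the default aa_mass_dict and mod_mass_dict shown above.
AA_MASSES = {"S": 87.03202840427001, "M": 131.04048491299}
MOD_MASSES = {"S": 79.966331, "M": 15.9949}


def decouple(residue, reported_mass, mz_tol=1.5):
    """Return the known mod mass whether `reported_mass` is the bare
    modification or residue + modification; None if neither matches."""
    mod = MOD_MASSES.get(residue)
    aa = AA_MASSES.get(residue)
    if mod is None or aa is None:
        return None
    # Try the mass as reported, then with the residue mass removed.
    for candidate in (reported_mass, reported_mass - aa):
        if abs(candidate - mod) <= mz_tol:
            return mod
    return None


print(decouple("S", 167.0))   # serine + phosphate reported as one number
print(decouple("S", 80.0))    # bare, rounded phosphate mass
print(decouple("S", 120.0))   # matches neither interpretation
```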