search_systems API reference¶
-
class
macsypy.search_systems.
Cluster
(systems_to_detect)[source]¶ Stores a set of contiguous hits. Depending on which systems the genes of its hits belong to, a Cluster object can be in one of the following states:
- ineligible: not a cluster to analyze
- clear: a single system is represented in the cluster
- ambiguous: several systems are represented in the cluster => might need a disambiguation
-
__init__
(systems_to_detect)[source]¶ Parameters: systems_to_detect (a list of macsypy.system.System) – the list of systems to be detected in this run
-
__len__
()[source]¶ Returns: the length of the Cluster, i.e., the number of hits stored in it Return type: integer
-
__str__
()[source]¶ Returns a printable representation of the Cluster: the hits it stores, described by component, sequence identifier and position.
-
__weakref__
¶ list of weak references to the object (if defined)
-
add
(hit)[source]¶ Add a Hit to a Cluster. Hits are always added at the end of the cluster (appended to the list of hits). Thus, ‘begin’ and ‘end’ positions of the Cluster are always the position of the 1st and of the last hit respectively.
Parameters: hit (a macsypy.report.Hit) – the Hit to add
Raise: a macsypy.macsypy_error.SystemDetectionError
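The append-only behaviour of add can be pictured with a small stand-in. This is a minimal sketch, not macsypy's implementation: `MiniHit` and `MiniCluster` are hypothetical names, the gene names are illustrative, and the conditions under which the real method raises SystemDetectionError are not modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MiniHit:
    """Hypothetical stand-in for macsypy.report.Hit (only the fields used here)."""
    gene_name: str
    position: int


@dataclass
class MiniCluster:
    """Hypothetical stand-in for Cluster: hits are only ever appended."""
    hits: List[MiniHit] = field(default_factory=list)

    def add(self, hit: MiniHit) -> None:
        # New hits always go at the end of the list of hits.
        self.hits.append(hit)

    def __len__(self) -> int:
        # The length of the cluster is the number of hits stored in it.
        return len(self.hits)

    @property
    def begin(self) -> Optional[int]:
        # Position of the first hit, or None for an empty cluster.
        return self.hits[0].position if self.hits else None

    @property
    def end(self) -> Optional[int]:
        # Position of the last hit, or None for an empty cluster.
        return self.hits[-1].position if self.hits else None


cluster = MiniCluster()
for name, pos in [("gspD", 10), ("gspE", 12), ("gspF", 13)]:
    cluster.add(MiniHit(name, pos))
```

Because hits are never inserted in the middle, `begin` and `end` can be read directly from the first and last elements without sorting.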
-
compatible_systems
¶ Returns: the list of the names of compatible systems represented by the cluster Return type: string
-
putative_system
¶ Returns: the name of the putative system represented by the cluster Return type: string
-
save
(force=False)[source]¶ Check the status of the cluster regarding systems which have hits in it. Update systems represented, and assign a putative system (self._putative_system), which is the system with most hits in the cluster. The systems represented are stored in a dictionary in the self.systems variable. The execution of this function can be forced, even if it has already run for the cluster with the option force=True.
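The bookkeeping done by save can be sketched as follows. The function name `summarize_cluster` and its input (one system name per hit) are hypothetical simplifications, and mapping an empty cluster to the "ineligible" state is an assumption; only the "system with most hits" rule comes from the description above.

```python
from collections import Counter


def summarize_cluster(hit_systems):
    """Return (systems, putative_system, state) for a list of system names.

    hit_systems holds, for each hit of a cluster, the name of the system
    its gene belongs to (a simplification of what save() inspects).
    """
    systems = Counter(hit_systems)
    if not systems:
        # Assumption: a cluster without any usable hit is "ineligible".
        return {}, None, "ineligible"
    # The putative system is the one with the most hits in the cluster.
    putative_system, _ = systems.most_common(1)[0]
    # One system represented: "clear"; several: "ambiguous".
    state = "clear" if len(systems) == 1 else "ambiguous"
    return dict(systems), putative_system, state
```

For example, `summarize_cluster(["T2SS", "T2SS", "T4P"])` reports two systems, selects "T2SS" as the putative one, and flags the cluster as "ambiguous".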
-
state
¶ Returns: the state of the Cluster of hits Return type: string
-
class
macsypy.search_systems.
ClustersHandler
[source]¶ Deals with sets of clusters found in a dataset. Designed to store only clusters from the same replicon.
-
__init__
()[source]¶ Parameters: cfg (macsypy.config.Config) – The configuration object built from default and user parameters.
-
__weakref__
¶ list of weak references to the object (if defined)
-
circularize
(rep_info, end_hits, systems_to_detect)[source]¶ This function takes into account the circularity of the replicon by merging clusters when appropriate (typically at replicon’s ends). It has to be called only if the replicon_topology is set to “circular”.
Parameters: - rep_info (a namedTuple “RepliconInfo” macsypy.database.RepliconInfo) – an entry extracted from the macsypy.database.RepliconDB
- end_hits (a list of macsypy.report.Hit) – a set of hits at the ends of the replicon that were not introduced in clusters, and that might be part of a system overlapping the two “ends” of the replicon
- systems_to_detect (a list of macsypy.system.System) – the set of systems to detect in this run
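The idea of merging clusters across the replicon ends can be illustrated on bare gene positions. This is a sketch under stated assumptions, not the method's actual logic: `circularize_positions`, `replicon_length` and `max_space` are hypothetical names, a single spacing threshold is used, and the handling of `end_hits` is left out.

```python
def circularize_positions(clusters, replicon_length, max_space):
    """Merge the last and first clusters if they are close across the origin.

    clusters: list of lists of gene positions (1-based ranks), sorted along
              the replicon.
    """
    if len(clusters) < 2:
        return clusters
    first, last = clusters[0], clusters[-1]
    # Distance from the last hit to the first one when walking through the origin.
    gap = (replicon_length - last[-1]) + first[0]
    if gap <= max_space:
        # The merged cluster starts near the end of the replicon and wraps around.
        return clusters[1:-1] + [last + first]
    return clusters
```

With a replicon of 100 genes, clusters at positions 99-100 and 1-2 are one step apart through the origin, so they would be merged into a single wrapped cluster.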
-
-
class
macsypy.search_systems.
SystemNameGenerator
[source]¶ Creates and stores the names of detected systems. Ensures the uniqueness of the names.
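One simple way to guarantee unique names is a counter per replicon and system. The sketch below is hypothetical: `MiniNameGenerator`, `new_name` and the `<replicon>_<system>_<rank>` pattern are assumptions made for illustration, not the naming scheme of macsypy.

```python
from collections import defaultdict


class MiniNameGenerator:
    """Hypothetical sketch of a unique-name generator (not macsypy's code)."""

    def __init__(self):
        # One counter per (replicon, system) pair.
        self._counters = defaultdict(int)

    def new_name(self, replicon_name, system_name):
        key = (replicon_name, system_name)
        self._counters[key] += 1
        # Assumption: names are built as <replicon>_<system>_<rank>.
        return f"{replicon_name}_{system_name}_{self._counters[key]}"


gen = MiniNameGenerator()
names = [gen.new_name("repA", "T2SS"), gen.new_name("repA", "T2SS"),
         gen.new_name("repA", "T4P")]
```

Each call increments the rank for its own pair only, so two occurrences of the same system on the same replicon never share a name.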
-
__weakref__
¶ list of weak references to the object (if defined)
-
-
class
macsypy.search_systems.
SystemOccurence
(system)[source]¶ This class is instantiated for a specific system that has been asked for detection. It can be filled step by step with hits. A decision can then be made according to the parameters defined e.g. quorum of genes.
- The SystemOccurence object has a “state” parameter, with the possible following values:
- “empty” if the SystemOccurence has not yet been filled with genes of the decision rule of the system
- “no_decision” if the filling process has started but the decision rule has not yet been applied to this occurrence
- “single_locus”
- “multi_loci”
- “uncomplete”
-
__init__
(system)[source]¶ Parameters: system (macsypy.system.System) – the system to “fill” with hits.
-
__str__
()[source]¶ Returns: Information of the component content of the SystemOccurence. Return type: string
-
__weakref__
¶ list of weak references to the object (if defined)
-
compute_missing_genes_list
(gene_dict)[source]¶ Parameters: gene_dict (dict) – a dictionary with gene names as keys and number of occurrences as values Returns: the list of genes with no occurrence in the gene counter. Return type: list
-
compute_system_length
(rep_info)[source]¶ Returns the length of the system, all loci gathered, in terms of protein number (even those not matching any system gene)
Parameters: rep_info (a namedTuple “RepliconInfo” macsypy.database.RepliconInfo) – an entry extracted from the macsypy.database.RepliconDB
Return type: integer
-
count_genes
(gene_dict)[source]¶ Counts the number of genes with at least one occurrence in a dictionary with a counter of genes.
Parameters: gene_dict (dict) – a dictionary with gene’s names as keys and number of occurrences as values Return type: integer
-
count_genes_tot
(gene_dict)[source]¶ Counts the total number of matches in a dictionary with a counter of genes, independently of the number of distinct genes matched.
Parameters: gene_dict (dict) – a dictionary with gene’s names as keys and number of occurrences as values Return type: integer
-
count_missing_genes
(gene_dict)[source]¶ Counts the number of genes with no occurrence in the gene counter.
Parameters: gene_dict (dict) – a dictionary with gene’s names as keys and number of occurrences as values Return type: integer
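The four gene-counter helpers (compute_missing_genes_list, count_genes, count_genes_tot and count_missing_genes) all read the same kind of dictionary. The standalone functions below are a sketch of the behaviour described for them, written outside the class for brevity; the gene names in the example dictionary are illustrative.

```python
def count_genes(gene_dict):
    """Number of genes with at least one occurrence."""
    return sum(1 for n in gene_dict.values() if n > 0)


def count_genes_tot(gene_dict):
    """Total number of matches, whatever the number of distinct genes."""
    return sum(gene_dict.values())


def compute_missing_genes_list(gene_dict):
    """Names of the genes with no occurrence in the counter."""
    return [gene for gene, n in gene_dict.items() if n == 0]


def count_missing_genes(gene_dict):
    """Number of genes with no occurrence in the counter."""
    return len(compute_missing_genes_list(gene_dict))


# Gene name -> number of occurrences (illustrative values).
mandatory = {"gspD": 1, "gspE": 2, "gspF": 0}
```

Here two distinct genes are present, three matches were recorded in total, and one gene (`gspF`) is missing.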
-
decision_rule
()[source]¶ - This function applies the decision rules for system assessment in terms of quorum:
- the absence of forbidden genes is checked
- the minimal number of mandatory genes is checked (“min_mandatory_genes_required”)
- the minimal number of genes in the system is checked (“min_genes_required”)
When a decision is made, the status (self.status) of the macsypy.search_systems.SystemOccurence is set to one of:
- “single_locus” when a complete system in the form of a single cluster was found
- “multi_loci” when a complete system in the form of several clusters was found
- “uncomplete” when no system was assessed (quorum not reached)
- “empty” when no gene for this system was found
- “exclude” when no system was assessed (at least one forbidden gene was found)
Returns: a printable message of the output decision with this SystemOccurrence Return type: string
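The quorum checks listed above can be sketched as a pure function over gene counters. This is an illustration under assumptions: `apply_decision_rule` and `nb_loci` are hypothetical names, the order in which the checks are applied is assumed, and the real method returns a printable message rather than the state itself.

```python
def apply_decision_rule(mandatory, accessory, forbidden, nb_loci,
                        min_mandatory_genes_required, min_genes_required):
    """Return the state reached by a system occurrence.

    mandatory, accessory, forbidden: dicts mapping gene name -> number of hits.
    nb_loci: number of clusters the hits come from (hypothetical parameter).
    """
    def nb_present(counter):
        # Number of distinct genes with at least one hit.
        return sum(1 for n in counter.values() if n > 0)

    nb_mandatory = nb_present(mandatory)
    nb_accessory = nb_present(accessory)

    # Assumed order: forbidden genes first, then emptiness, then the quorum.
    if nb_present(forbidden) > 0:
        return "exclude"
    if nb_mandatory + nb_accessory == 0:
        return "empty"
    quorum_reached = (nb_mandatory >= min_mandatory_genes_required
                      and nb_mandatory + nb_accessory >= min_genes_required)
    if not quorum_reached:
        return "uncomplete"
    return "single_locus" if nb_loci == 1 else "multi_loci"
```

The state strings are the ones listed in this reference, including the spelling “uncomplete”.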
-
fill_with_cluster
(cluster)[source]¶ Adds hits from a cluster to a system occurrence, and checks their status according to the system definition. Sets the system occurrence state to “no_decision” once called.
Parameters: cluster (macsypy.search_systems.Cluster) – the set of contiguous genes to treat for macsypy.search_systems.SystemOccurence inclusion.
-
fill_with_hits
(hits, include_forbidden)[source]¶ Adds hits to a system occurrence, and checks their status according to the system definition. Sets the system occurrence state to “no_decision” once called.
Note
Forbidden genes will only be included if they do belong to the current system (and not to another specified with “system_ref” in the current system’s definition).
Parameters: hits – a list of Hits to treat for macsypy.search_systems.SystemOccurence inclusion.
-
fill_with_multi_systems_genes
(multi_systems_hits)[source]¶ This function fills the SystemOccurrence with genes putatively coming from other systems (feature “multi_system”). Those genes are used only if the occurrence of the corresponding gene was not yet filled with a gene from a cluster of the system.
Parameters: multi_systems_hits – a list of hits of genes that are “multi_system” which correspond to mandatory or accessory genes from the current system for which to fill a SystemOccurrence
-
get_gene_counter_output
(forbid_exclude=False)[source]¶ Parameters: forbid_exclude (boolean) – exclude the forbidden components if set to True. False by default. Returns: A dictionary ready for printing in the system summary, with the occurrences of genes (mandatory, accessory, and forbidden if specified) in the system occurrence.
-
get_gene_ref
(gene)[source]¶ Parameters: gene (macsypy.gene.Gene, or macsypy.gene.Homolog or macsypy.gene.Analog object) – the gene whose gene reference is requested
Returns: object macsypy.gene.Gene or None
Return type: macsypy.gene.Gene object or None
Raise: KeyError if the system does not contain the given gene.
-
get_summary
(replicon_name, rep_info)[source]¶ Gives a summary of the system occurrence in terms of gene content and localization.
Parameters: - replicon_name (string) – the name of the replicon
- rep_info (a namedTuple “RepliconInfo” macsypy.database.RepliconInfo) – an entry extracted from the macsypy.database.RepliconDB
Returns: a tabulated summary of the macsypy.search_systems.SystemOccurence
Return type: string
-
get_summary_header
()[source]¶ Returns a string with the description of the summary returned by self.get_summary()
Return type: string
-
get_summary_unordered
(replicon_name)[source]¶ Gives a summary of the system occurrence in terms of gene content only (specific of “unordered” datasets).
Parameters: replicon_name (string) – the name of the replicon Returns: a tabulated summary of the macsypy.search_systems.SystemOccurence
Return type: string
-
get_system_name_unordered
(suffix='_putative')[source]¶ Attributes a name to the system occurrence for an “unordered” dataset, generating a generic name based on the system name and the suffix given.
Parameters: suffix (string) – the suffix to be used for generating the SystemOccurence’s name Returns: a name for the macsypy.search_systems.SystemOccurence in an “unordered” dataset
Return type: string
-
get_system_unique_name
(replicon_name)[source]¶ Attributes a unique name to the system occurrence with the class macsypy.search_systems.SystemNameGenerator. Generates the name if not already set.
Parameters: replicon_name (string) – the name of the replicon Returns: the unique name of the macsypy.search_systems.SystemOccurence
Return type: string
-
is_complete
()[source]¶ Test for SystemOccurrence completeness.
Returns: True if the state of the SystemOccurrence is “single_locus” or “multi_loci”, False otherwise. Return type: boolean
-
nb_syst_genes
¶ This value is set after a decision was made on the system in macsypy.search_systems.SystemOccurence:decision_rule()
Returns: the number of mandatory and accessory genes with at least one occurrence (number of different accessory genes) Return type: integer
-
state
¶ Returns: the state of the systemOccurrence. Return type: string
-
macsypy.search_systems.
analyze_clusters_replicon
(clusters, systems, multi_systems_genes)[source]¶ Analyzes sets of contiguous hits (clusters) stored in a ClustersHandler for system detection:
- split clusters if needed
- delete them if they are not relevant
- add possible “multi_system” genes coming from other systems
- check the QUORUM for each system to detect, i.e. mandatory + accessory - forbidden
Only for “ordered” datasets representing a whole replicon. Reports system occurrences.
Parameters: - clusters (macsypy.search_systems.ClustersHandler) – the set of clusters to analyze
- systems (a list of macsypy.system.System) – the set of systems to detect
- multi_systems_genes – a dictionary with genes that could belong to multiple systems (keys are system names)
Returns: a set of system occurrences filled with hits found in clusters
Return type: a list of macsypy.search_systems.SystemOccurence
-
macsypy.search_systems.
build_clusters
(hits, systems_to_detect, rep_info)[source]¶ Gets sets of contiguous hits according to the minimal inter_gene_max_space between two genes. Only for “ordered” datasets.
Parameters: - hits (a list of macsypy.report.Hit) – a list of Hmmer hits to analyze
- systems_to_detect (a list of macsypy.system.System) – the list of systems to detect
- cfg (macsypy.config.Config) – the configuration object built from default and user parameters.
- rep_info (a namedTuple “RepliconInfo” macsypy.database.RepliconInfo) – an entry extracted from the macsypy.database.RepliconDB
Returns: a set of clusters and a dictionary with “multi_system” genes stored in a system-wise way for further utilization.
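Grouping hits into clusters of contiguous genes can be sketched on bare positions. This is a simplified illustration, not the function's implementation: `group_contiguous` is a hypothetical name, a single `inter_gene_max_space` value is used for all genes, and measuring the spacing as the number of genes lying between two hits is an assumption.

```python
def group_contiguous(positions, inter_gene_max_space):
    """Group gene positions into clusters of contiguous hits.

    Two consecutive hits stay in the same cluster when at most
    inter_gene_max_space genes lie between them (assumed convention).
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] - 1 <= inter_gene_max_space:
            # Close enough to the previous hit: extend the current cluster.
            clusters[-1].append(pos)
        else:
            # Too far from the previous hit (or first hit): start a new cluster.
            clusters.append([pos])
    return clusters
```

With a maximal spacing of 2, positions 1, 2 and 4 form one cluster, 20 and 21 a second one, and 40 remains alone.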
-
macsypy.search_systems.
disambiguate_cluster
(cluster)[source]¶ This disambiguation step is used on clusters with hits for multiple systems (when cluster.state is set to “ambiguous”). It returns a “cleansed” list of clusters, ready to use for system occurrence detection (and that are “clear” cases). It:
- splits the cluster in two if it seems that two systems are nearby
- removes single hits that are not forbidden for the “main” system and that are at one end of the current cluster; in this case, it checks that they are not “loners”, because “loners” can be stored.
Parameters: cluster (macsypy.search_systems.Cluster) – the cluster to “disambiguate”
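The splitting step can be pictured as cutting a cluster wherever the system of consecutive hits changes. This sketch covers only that idea: `split_by_system` is a hypothetical helper, hits are reduced to (position, system name) tuples, and the checks on forbidden genes and “loners” described above are not modelled.

```python
from itertools import groupby


def split_by_system(hits):
    """Split a cluster into runs of consecutive hits from the same system.

    hits: list of (position, system_name) tuples, sorted by position.
    """
    # groupby only groups adjacent items, which is what is wanted here.
    return [list(run) for _, run in groupby(hits, key=lambda hit: hit[1])]
```

A cluster whose first two hits belong to one system and last two to another is returned as two sub-clusters, each of which is a “clear” case.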
-
macsypy.search_systems.
get_best_hits
(hits, tosort=False, criterion='score')[source]¶ Returns from a putatively redundant list of hits, a list of best matching hits. Analyzes quorum and co-localization if required for system detection. By default, hits are already sorted by position, and the hit with the best score is kept, then the best i-evalue. Possible criteria are:
- maximal score (criterion=”score”)
- minimal i-evalue (criterion=”i_eval”)
- maximal percentage of the profile covered by the alignment with the query sequence (criterion=”profile_coverage”)
Parameters: - tosort (boolean) – tells if the hits have to be sorted
- criterion (string) – the criterion to base the sorting on
Returns: the list of best matching hits
Return type: list of
macsypy.report.Hit
Raise:
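Selecting one best hit among redundant ones can be sketched as below. Several points are assumptions made for the example: `ScoredHit` and `best_hits` are hypothetical names, hits are treated as redundant when they share the same sequence identifier, ties are not broken by a second criterion, and raising ValueError for an unknown criterion is a choice of this sketch.

```python
from collections import namedtuple

# Hypothetical stand-in for macsypy.report.Hit with only the fields used here.
ScoredHit = namedtuple("ScoredHit", "id gene score i_eval profile_coverage")


def best_hits(hits, criterion="score"):
    """Keep one hit per sequence identifier, chosen according to criterion."""
    if criterion == "score":
        key = lambda h: h.score              # higher is better
    elif criterion == "i_eval":
        key = lambda h: -h.i_eval            # lower is better
    elif criterion == "profile_coverage":
        key = lambda h: h.profile_coverage   # higher is better
    else:
        raise ValueError(f"unknown criterion: {criterion}")

    best = {}
    for hit in hits:
        current = best.get(hit.id)
        if current is None or key(hit) > key(current):
            best[hit.id] = hit
    # Dictionaries keep insertion order, so the input order of ids is preserved.
    return list(best.values())


hits = [
    ScoredHit("p1", "gspD", 50.0, 1e-10, 0.9),
    ScoredHit("p1", "pilQ", 80.0, 1e-20, 0.7),
    ScoredHit("p2", "gspE", 30.0, 1e-5, 0.8),
]
```

The same protein can therefore be assigned to different genes depending on the criterion: `pilQ` wins on score and i-evalue, `gspD` on profile coverage.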
-
macsypy.search_systems.
get_compatible_systems
(systems_list1, systems_list2)[source]¶ Returns the intersection of the two input systems lists.
Parameters: systems_list1, systems_list2 – two lists of systems Returns: a list of systems, or an empty list if no common system Return type: a list of macsypy.system.System
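The intersection of two lists can be written as a one-line comprehension. In this sketch `compatible` is a hypothetical name, plain strings stand in for macsypy.system.System objects, and keeping the order of the first list is an assumption.

```python
def compatible(systems_list1, systems_list2):
    """Return the systems present in both lists, in the order of the first list."""
    return [system for system in systems_list1 if system in systems_list2]
```

For example, `compatible(["T2SS", "T4P", "Tad"], ["Tad", "T2SS"])` keeps "T2SS" and "Tad", and two lists with nothing in common give an empty list.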
-
macsypy.search_systems.
search_systems
(hits, systems, cfg)[source]¶ Runs search of systems from a set of hits. Criteria for system assessment will depend on the kind of input dataset provided:
- analyze quorum and co-localization for “ordered_replicon” and “gembase” datasets.
- analyze quorum only (and in a limited way) for “unordered_replicon” and “unordered” datasets.
Parameters: - hits (list of macsypy.report.Hit) – the list of hits for input systems components
- systems (list of macsypy.system.System) – the list of systems asked for detection
- cfg (macsypy.config.Config) – the configuration object
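The dependence of the assessment criteria on the dataset type can be summarized as a small lookup. This is only an illustration: `criteria_for` and its `dataset_type` argument are hypothetical, and the real function reads the dataset type from the configuration object rather than from an argument.

```python
ORDERED_TYPES = {"ordered_replicon", "gembase"}
UNORDERED_TYPES = {"unordered_replicon", "unordered"}


def criteria_for(dataset_type):
    """Return the criteria analyzed for a given kind of dataset."""
    if dataset_type in ORDERED_TYPES:
        # Gene order is known: both quorum and co-localization are analyzed.
        return ("quorum", "co-localization")
    if dataset_type in UNORDERED_TYPES:
        # Gene order is unknown: only the quorum can be analyzed.
        return ("quorum",)
    raise ValueError(f"unknown dataset type: {dataset_type}")
```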
-
class
macsypy.search_systems.
systemDetectionReportOrdered
(replicon_name, systems_occurrences_list, cfg)[source]¶ - Stores the detected systems to report for each replicon:
- by system name,
- by state of the systems (single vs multi loci)
-
__init__
(replicon_name, systems_occurrences_list, cfg)[source]¶ Parameters: - replicon_name (string) – the name of the replicon
- systems_occurrences_list (list of macsypy.search_systems.SystemOccurence) – the list of system occurrences to consider
-
_match2json
(valid_hit, so)[source]¶ Parameters: - valid_hit (macsypy.search_system.ValidHit object) – the valid hit to transform into json.
- so (macsypy.search_system.SystemOccurence) – the system occurrence the valid hit comes from.
-
counter_output
()[source]¶ Builds a counter of systems per replicon, with different “states” separated (single-locus vs multi-loci systems)
Returns: the counter of systems Return type: Counter
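A counter that keeps single-locus and multi-loci systems apart can be built with collections.Counter. The sketch below makes several assumptions: `count_systems` is a hypothetical helper, occurrences are reduced to (system name, state) pairs, and the `<system>_<state>` key format is chosen for the example only.

```python
from collections import Counter


def count_systems(occurrences):
    """Count detected systems per system name and state.

    occurrences: iterable of (system_name, state) pairs.
    """
    counter = Counter()
    for system_name, state in occurrences:
        # Only complete systems are counted, with the two states kept separate.
        if state in ("single_locus", "multi_loci"):
            counter[f"{system_name}_{state}"] += 1
    return counter


example = count_systems([
    ("T2SS", "single_locus"),
    ("T2SS", "single_locus"),
    ("T4P", "multi_loci"),
    ("Tad", "uncomplete"),
])
```

Incomplete occurrences are skipped, so the example counter holds two single-locus T2SS and one multi-loci T4P.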
-
report_output
(reportfilename, print_header=False)[source]¶ Writes a report of sequences forming the detected systems, with information on their status in the system, their localization on replicons, and statistics on the Hits.
Parameters: - reportfilename (string) – the output file name
- print_header (boolean) – True if the header has to be written. False otherwise
-
summary_output
(reportfilename, rep_info, print_header=False)[source]¶ Writes a report with the summary of systems detected in replicons. For each system, a summary is done including:
- the number of mandatory/accessory genes in the reference system (as defined in XML files)
- the number of mandatory/accessory genes detected
- the number and list of missing genes
- the number of loci encoding the system
Parameters: - rep_info (a namedTuple “RepliconInfo” macsypy.database.RepliconInfo) – an entry extracted from the macsypy.database.RepliconDB
- print_header (boolean) – True if the header has to be written. False otherwise
-
system_2_json
(rep_db)[source]¶ Generates the report in json format
Parameters: - path (string) – the path to a file where to write the report in json format
- rep_db (a macsypy.database.RepliconDB object) – the replicon database
-
tabulated_output
(system_occurrence_states, system_names, reportfilename, print_header=False)[source]¶ Write a tabulated output with number of detected systems for each replicon.
Parameters: - system_occurrence_states (list of string) – the different forms of detected systems to consider
- reportfilename (string) – the output file name
- print_header (boolean) – True if the header has to be written. False otherwise
Return type: string
-
class
macsypy.search_systems.
systemDetectionReportUnordered
(systems_occurrences_list, cfg)[source]¶ - Stores a report for putative detected systems gathering all hits from a search in an unordered dataset:
- by system.
Mandatory and accessory genes only are reported in the “json” and “report” output, but all hits matching a system component are reported in the “summary”.
-
__init__
(systems_occurrences_list, cfg)[source]¶ Parameters: systems_occurrences_list (list of macsypy.search_systems.SystemOccurence) – the list of system occurrences to consider
-
json_output
(json_path)[source]¶ Generates the report in json format
Parameters: path (string) – the path to a file where to write the report in json format
-
report_output
(reportfilename, print_header=False)[source]¶ Writes a report of sequences forming the detected systems, with information on their status in the system, their localization on replicons, and statistics on the Hits.
Parameters: - reportfilename (string) – the output file name
- print_header (boolean) – True if the header has to be written. False otherwise
-
summary_output
(reportfilename, print_header=False)[source]¶ Writes a report with the summary for putative systems in an unordered dataset. For each system, a summary is done including:
- the number of mandatory/accessory genes in the reference system (as defined in XML files)
- the number of mandatory/accessory genes detected
Parameters: - reportfilename (string) – the output file name
- print_header (boolean) – True if the header has to be written. False otherwise
-
class
macsypy.search_systems.
validSystemHit
(hit, detected_system, gene_status)[source]¶ Encapsulates a macsypy.report.Hit. This class stores a Hit that has been attributed to a detected system. Thus, it also stores:
- the system,
- the status of the gene in this system,
It also aims at storing information for results extraction:
- system extraction (e.g. genomic positions)
- sequence extraction
-
__init__
(hit, detected_system, gene_status)[source]¶ Parameters: - hit (macsypy.report.Hit) – a hit to base the validSystemHit on
- detected_system (string) – the name of the predicted System
- gene_status (string) – the “role” of the gene in the predicted system
-
__weakref__
¶ list of weak references to the object (if defined)