Bifrost
Represent a Compacted de Bruijn graph.
Public Types
typedef unitigIterator< U, G, false > iterator
An iterator for the unitigs of the graph.
typedef unitigIterator< U, G, true > const_iterator
A constant iterator for the unitigs of the graph.
Public Member Functions
CompactedDBG (const int kmer_length=31, const int minimizer_length=-1)
Constructor (set up an empty compacted dBG).
CompactedDBG (const CompactedDBG< U, G > &o)
Copy constructor (copy a compacted de Bruijn graph).
CompactedDBG (CompactedDBG< U, G > &&o)
Move constructor (move a compacted de Bruijn graph).
virtual ~CompactedDBG ()
Destructor.
CompactedDBG< U, G > & operator= (const CompactedDBG< U, G > &o)
Copy assignment operator (copy a compacted de Bruijn graph).
CompactedDBG< U, G > & operator= (CompactedDBG< U, G > &&o)
Move assignment operator (move a compacted de Bruijn graph).
CompactedDBG< U, G > & operator+= (const CompactedDBG< U, G > &o)
Addition assignment operator (merge a compacted de Bruijn graph).
bool operator== (const CompactedDBG< U, G > &o) const
Equality operator.
bool operator!= (const CompactedDBG< U, G > &o) const
Inequality operator.
void clear ()
Clear the graph: empty the graph and reset its parameters.
bool build (CDBG_Build_opt &opt)
Build the Compacted de Bruijn graph.
bool simplify (const bool delete_short_isolated_unitigs=true, const bool clip_short_tips=true, const bool verbose=false)
Simplify the Compacted de Bruijn graph: clip short (< 2k length) tips and/or delete short (< 2k length) isolated unitigs.
bool write (const string &output_fn, const size_t nb_threads=1, const bool GFA_output=true, const bool FASTA_output=false, const bool BFG_output=false, const bool write_index_file=true, const bool compressed_output=false, const bool verbose=false) const
Write the Compacted de Bruijn graph to disk (GFA1, FASTA or BFG format).
bool read (const string &input_graph_fn, const size_t nb_threads=1, const bool verbose=false)
Load a Compacted de Bruijn graph from disk (GFA1 or FASTA format).
bool read (const string &input_graph_fn, const string &input_index_fn, const size_t nb_threads=1, const bool verbose=false)
Read a Compacted de Bruijn graph from disk (GFA1, FASTA or BFG format) using an index file (BFI format).
UnitigMap< U, G > find (const Kmer &km, const bool extremities_only=false)
Find the unitig containing the queried k-mer in the Compacted de Bruijn graph.
const_UnitigMap< U, G > find (const Kmer &km, const bool extremities_only=false) const
Find the unitig containing the queried k-mer in the Compacted de Bruijn graph.
UnitigMap< U, G > findUnitig (const char *s, const size_t pos, const size_t len)
Find the unitig containing the k-mer starting at a given position in a query sequence and extend the mapping (if the k-mer is found, the function extends the mapping from the k-mer for as long as the query sequence and the unitig match).
const_UnitigMap< U, G > findUnitig (const char *s, const size_t pos, const size_t len) const
Find the unitig containing the k-mer starting at a given position in a query sequence and extend the mapping (if the k-mer is found, the function extends the mapping from the k-mer for as long as the query sequence and the unitig match).
vector< pair< size_t, UnitigMap< U, G > > > searchSequence (const string &s, const bool exact, const bool insertion, const bool deletion, const bool substitution, const bool or_exclusive_match=false)
Perform an exact and/or inexact search of the k-mers of a query sequence in the Compacted de Bruijn graph.
vector< pair< size_t, const_UnitigMap< U, G > > > searchSequence (const string &s, const bool exact, const bool insertion, const bool deletion, const bool substitution, const bool or_exclusive_match=false) const
Perform an exact and/or inexact search of the k-mers of a query sequence in the Compacted de Bruijn graph.
bool add (const string &seq, const bool verbose=false)
Add a sequence to the Compacted de Bruijn graph.
bool remove (const const_UnitigMap< U, G > &um, const bool verbose=false)
Remove a unitig from the Compacted de Bruijn graph.
bool merge (const CompactedDBG &o, const size_t nb_threads=1, const bool verbose=false)
Merge a compacted de Bruijn graph.
bool merge (const vector< CompactedDBG > &v, const size_t nb_threads=1, const bool verbose=false)
Merge multiple compacted de Bruijn graphs.
iterator begin ()
Create an iterator to the first unitig of the Compacted de Bruijn graph (unitigs are NOT sorted lexicographically).
const_iterator begin () const
Create a constant iterator to the first unitig of the Compacted de Bruijn graph (unitigs are NOT sorted lexicographically).
iterator end ()
Create an iterator to the "past-the-last" unitig of the Compacted de Bruijn graph (unitigs are NOT sorted lexicographically).
const_iterator end () const
Create a constant iterator to the "past-the-last" unitig of the Compacted de Bruijn graph (unitigs are NOT sorted lexicographically).
size_t length () const
Return the sum of the unitig lengths.
size_t nbKmers () const
Return the number of k-mers in the graph.
bool isInvalid () const
Return a boolean indicating if the graph is invalid (wrong input parameters/files, error occurring during a method, etc.).
int getK () const
Return the length of k-mers of the graph.
int getG () const
Return the length of minimizers of the graph.
size_t size () const
Return the number of unitigs in the graph.
G * getData ()
Return a pointer to the graph data.
const G * getData () const
Return a constant pointer to the graph data.
Represent a Compacted de Bruijn graph.
The two template parameters of this class correspond to the type of data to associate with the unitigs of the graph (unitig data) and the type of data to associate with the graph (graph data). If no template parameters are specified or if the types are void, no data are associated with the unitigs or the graph, and no memory is allocated for such data.
If data are to be associated with the unitigs, these data must be wrapped in a class that inherits from the abstract class CDBG_Data_t, for example a class MyUnitigData deriving from CDBG_Data_t<MyUnitigData>.
Because CDBG_Data_t is an abstract class, all the methods of the base class (CDBG_Data_t) should be implemented in your wrapper (the derived class, MyUnitigData in this example). IMPORTANT: Any method that you do not implement in your class falls back to a default one that has no effect.
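The fallback behaviour described above can be illustrated without the library. The following is a minimal sketch, assuming the base class follows the curiously recurring template pattern suggested by the notation CDBG_Data_t<MyUnitigData> used elsewhere on this page. `DataBase`, `PassiveData`, `clear`, `callClear` and `selfCheck` are hypothetical names chosen for illustration; they are not the real CDBG_Data_t interface.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the library base class (NOT the real CDBG_Data_t).
// It supplies a default hook that does nothing.
template<typename Derived>
class DataBase {
public:
    // Default hook with no effect; `clear` is a hypothetical method name.
    void clear() {}

    // Calls the derived version if one exists, otherwise the default above.
    void callClear() { static_cast<Derived*>(this)->clear(); }
};

// Wrapper that implements the hook itself.
class MyUnitigData : public DataBase<MyUnitigData> {
public:
    void clear() { count = 0; }
    std::size_t count = 5;
};

// Wrapper that implements nothing and therefore gets the no-op default.
class PassiveData : public DataBase<PassiveData> {
public:
    std::size_t count = 5;
};

// Small self-check exercising both wrappers.
inline void selfCheck() {
    MyUnitigData a;
    a.callClear();
    assert(a.count == 0);  // the derived hook ran

    PassiveData b;
    b.callClear();
    assert(b.count == 5);  // the default hook changed nothing
}
```

The point of the sketch is the silent failure mode: a wrapper that forgets a method still compiles, but its data are left untouched.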
typedef unitigIterator<U, G, true> CompactedDBG< Unitig_data_t, Graph_data_t >::const_iterator
A constant iterator for the unitigs of the graph.
No specific order is assumed.
typedef unitigIterator<U, G, false> CompactedDBG< Unitig_data_t, Graph_data_t >::iterator
An iterator for the unitigs of the graph.
No specific order is assumed.
CompactedDBG< Unitig_data_t, Graph_data_t >::CompactedDBG (const int kmer_length = 31, const int minimizer_length = -1)
Constructor (set up an empty compacted dBG).
kmer_length | is the length k of k-mers used in the graph (each unitig is of length at least k). |
minimizer_length | is the length g of minimizers (g < k) used in the graph. |
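To make the relation between k and g concrete, here is a minimal sketch that picks a minimizer of length g inside a k-mer. It assumes a plain lexicographic ordering of the g-mers; the ordering that the library actually uses is not stated on this page and may differ. `lexMinimizer` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return the smallest substring of length g in `kmer`, using lexicographic
// order (an assumed ordering, for illustration only).
std::string lexMinimizer(const std::string& kmer, std::size_t g) {
    assert(g > 0 && g < kmer.size());  // g < k, as required for the graph
    std::string best = kmer.substr(0, g);
    for (std::size_t i = 1; i + g <= kmer.size(); ++i) {
        // Compare the g-mer starting at position i with the current best.
        if (kmer.compare(i, g, best) < 0) best = kmer.substr(i, g);
    }
    return best;
}
```

For instance, the 3-mers of GATTACA are GAT, ATT, TTA, TAC and ACA, so this sketch selects ACA.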
CompactedDBG< Unitig_data_t, Graph_data_t >::CompactedDBG (const CompactedDBG< U, G > & o)
Copy constructor (copy a compacted de Bruijn graph).
This function is expensive in terms of time and memory as the content of a compacted de Bruijn graph is copied. After the call to this function, the same graph exists twice in memory.
o | is a constant reference to the compacted de Bruijn graph to copy. |
CompactedDBG< Unitig_data_t, Graph_data_t >::CompactedDBG (CompactedDBG< U, G > && o)
Move constructor (move a compacted de Bruijn graph).
The content of o is moved ("transferred") to a new compacted de Bruijn graph. The compacted de Bruijn graph referenced by o will be empty after the call to this constructor.
o | is an rvalue reference to the compacted de Bruijn graph to move. |
bool CompactedDBG< Unitig_data_t, Graph_data_t >::add (const string & seq, const bool verbose = false)
Add a sequence to the Compacted de Bruijn graph.
Non-{A,C,G,T} characters such as Ns are discarded. The function automatically breaks the sequence into unitig(s). Those unitigs can be stored as the reverse-complement of the input sequence.
seq | is a string containing the sequence to insert. |
verbose | is a boolean indicating if information messages must be printed during the function execution. |
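As an illustration of how non-{A,C,G,T} characters limit what can enter the graph, the sketch below cuts an input sequence at every such character and keeps only the pieces long enough to hold at least one k-mer. Treating an invalid character as a break point is an assumption made for this example; the page only states that such characters are discarded. `isDNA` and `validRuns` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// True for the four upper-case nucleotides only (lower case is not handled).
static bool isDNA(char c) {
    return c == 'A' || c == 'C' || c == 'G' || c == 'T';
}

// Split `seq` at every non-ACGT character and return the runs of length >= k.
std::vector<std::string> validRuns(const std::string& seq, std::size_t k) {
    assert(k > 0);
    std::vector<std::string> runs;
    std::string cur;
    for (char c : seq) {
        if (isDNA(c)) {
            cur.push_back(c);
        } else {
            if (cur.size() >= k) runs.push_back(cur);  // keep usable runs only
            cur.clear();
        }
    }
    if (cur.size() >= k) runs.push_back(cur);  // flush the last run
    return runs;
}
```

With k = 4, the input ACGTNNACG yields the single run ACGT, because the trailing ACG is too short to contain a k-mer.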
iterator CompactedDBG< Unitig_data_t, Graph_data_t >::begin ()
Create an iterator to the first unitig of the Compacted de Bruijn graph (unitigs are NOT sorted lexicographically).
const_iterator CompactedDBG< Unitig_data_t, Graph_data_t >::begin () const
Create a constant iterator to the first unitig of the Compacted de Bruijn graph (unitigs are NOT sorted lexicographically).
bool CompactedDBG< Unitig_data_t, Graph_data_t >::build (CDBG_Build_opt & opt)
Build the Compacted de Bruijn graph.
opt | is a structure whose members are the parameters of this function. See CDBG_Build_opt. |
iterator CompactedDBG< Unitig_data_t, Graph_data_t >::end ()
Create an iterator to the "past-the-last" unitig of the Compacted de Bruijn graph (unitigs are NOT sorted lexicographically).
const_iterator CompactedDBG< Unitig_data_t, Graph_data_t >::end () const
Create a constant iterator to the "past-the-last" unitig of the Compacted de Bruijn graph (unitigs are NOT sorted lexicographically).
UnitigMap<U, G> CompactedDBG< Unitig_data_t, Graph_data_t >::find (const Kmer & km, const bool extremities_only = false)
Find the unitig containing the queried k-mer in the Compacted de Bruijn graph.
km | is the queried k-mer (see Kmer class). It does not need to be a canonical k-mer. |
extremities_only | is a boolean indicating if the k-mer must be searched only in the unitig heads and tails (extremities_only = true). By default, the k-mer is searched everywhere (extremities_only = false), which is slightly slower than looking only in the unitig heads and tails. |
const_UnitigMap<U, G> CompactedDBG< Unitig_data_t, Graph_data_t >::find (const Kmer & km, const bool extremities_only = false) const
Find the unitig containing the queried k-mer in the Compacted de Bruijn graph.
km | is the queried k-mer (see Kmer class). It does not need to be a canonical k-mer. |
extremities_only | is a boolean indicating if the k-mer must be searched only in the unitig heads and tails (extremities_only = true). By default, the k-mer is searched everywhere (extremities_only = false), which is slightly slower than looking only in the unitig heads and tails. |
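The remark that km need not be canonical can be illustrated as follows. This sketch assumes the common convention in which the canonical form of a k-mer is the lexicographically smaller of the k-mer and its reverse complement; how the Kmer class computes it is not described on this page. `reverseComplement` and `canonical` are hypothetical helpers working on plain strings.

```cpp
#include <cassert>
#include <string>

// Reverse complement of an upper-case DNA string.
std::string reverseComplement(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());  // reverse first
    for (char& c : rc) {                   // then complement each base
        switch (c) {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default: assert(false && "expected A, C, G or T");
        }
    }
    return rc;
}

// Canonical form under the assumed convention: the smaller of the two strands.
std::string canonical(const std::string& kmer) {
    const std::string rc = reverseComplement(kmer);
    return rc < kmer ? rc : kmer;
}
```

Under that convention a k-mer and its reverse complement share one canonical form, which is why a query on either strand can designate the same k-mer of the graph.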
UnitigMap<U, G> CompactedDBG< Unitig_data_t, Graph_data_t >::findUnitig (const char * s, const size_t pos, const size_t len)
Find the unitig containing the k-mer starting at a given position in a query sequence and extend the mapping (if the k-mer is found, the function extends the mapping from the k-mer for as long as the query sequence and the unitig match).
s | is a pointer to an array of characters containing the sequence to query. |
pos | is the position of the first k-mer to find in the sequence to query. |
len | is the length of s. |
const_UnitigMap<U, G> CompactedDBG< Unitig_data_t, Graph_data_t >::findUnitig (const char * s, const size_t pos, const size_t len) const
Find the unitig containing the k-mer starting at a given position in a query sequence and extend the mapping (if the k-mer is found, the function extends the mapping from the k-mer for as long as the query sequence and the unitig match).
s | is a pointer to an array of characters containing the sequence to query. |
pos | is the position of the first k-mer to find in the sequence to query. |
len | is the length of s. |
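The extension step can be sketched on plain strings. The example assumes that the position of the first k-mer in the unitig is already known and that both sequences are read on the same strand; the library presumably locates the k-mer through its index and also handles the reverse strand, which this sketch does not attempt. `extendMatch` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count how many consecutive k-mers match when the k-mer at query[q_pos]
// is aligned with the k-mer at unitig[u_pos]. Returns 0 if the first
// k-mer itself does not match or does not fit in either sequence.
std::size_t extendMatch(const std::string& unitig, std::size_t u_pos,
                        const std::string& query, std::size_t q_pos,
                        std::size_t k) {
    assert(k > 0);
    if (u_pos + k > unitig.size() || q_pos + k > query.size()) return 0;
    if (unitig.compare(u_pos, k, query, q_pos, k) != 0) return 0;

    std::size_t matched = k;  // number of matching characters so far
    while (u_pos + matched < unitig.size() &&
           q_pos + matched < query.size() &&
           unitig[u_pos + matched] == query[q_pos + matched]) {
        ++matched;  // each extra matching character adds one k-mer
    }
    return matched - k + 1;
}
```

With k = 4, aligning position 2 of the query TTACGTACCA with position 0 of the unitig ACGTACGG matches ACGTAC before the first mismatch, that is three consecutive k-mers.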
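G * CompactedDBG< Unitig_data_t, Graph_data_t >::getData () [inline]
Return a pointer to the graph data.
The pointer is nullptr if the type of graph data is void.
const G * CompactedDBG< Unitig_data_t, Graph_data_t >::getData () const [inline]
Return a constant pointer to the graph data.
The pointer is nullptr if the type of graph data is void.
int CompactedDBG< Unitig_data_t, Graph_data_t >::getG () const [inline]
Return the length of minimizers of the graph.
int CompactedDBG< Unitig_data_t, Graph_data_t >::getK () const [inline]
Return the length of k-mers of the graph.
bool CompactedDBG< Unitig_data_t, Graph_data_t >::isInvalid () const [inline]
Return a boolean indicating if the graph is invalid (wrong input parameters/files, error occurring during a method, etc.).
size_t CompactedDBG< Unitig_data_t, Graph_data_t >::length () const
Return the sum of the unitig lengths.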
bool CompactedDBG< Unitig_data_t, Graph_data_t >::merge (const CompactedDBG< Unitig_data_t, Graph_data_t > & o, const size_t nb_threads = 1, const bool verbose = false)
Merge a compacted de Bruijn graph.
After merging, all unitigs of o have been added to and compacted with the current compacted de Bruijn graph (this). If the unitigs of o had data of type "MyUnitigData" associated, they have been added to the current compacted de Bruijn graph using the functions of the class MyUnitigData which are also present in its base class CDBG_Data_t<MyUnitigData>. Note that if multiple compacted de Bruijn graphs have to be merged, it is more efficient to call CompactedDBG::merge with a vector of CompactedDBG as input.
o | is a constant reference to the compacted de Bruijn graph to merge. |
nb_threads | is an integer indicating how many threads can be used during the merging. |
verbose | is a boolean indicating if information messages must be printed during the execution of the function. |
bool CompactedDBG< Unitig_data_t, Graph_data_t >::merge (const vector< CompactedDBG< Unitig_data_t, Graph_data_t > > & v, const size_t nb_threads = 1, const bool verbose = false)
Merge multiple compacted de Bruijn graphs.
After merging, all unitigs of the compacted de Bruijn graphs have been added to and compacted with the current compacted de Bruijn graph (this). If the unitigs had data of type "MyUnitigData" associated, they have been added to the current compacted de Bruijn graph using the functions of the class MyUnitigData which are also present in its base class CDBG_Data_t<MyUnitigData>.
v | is a constant reference to a vector of compacted de Bruijn graphs to merge. |
nb_threads | is an integer indicating how many threads can be used during the merging. |
verbose | is a boolean indicating if information messages must be printed during the execution of the function. |
size_t CompactedDBG< Unitig_data_t, Graph_data_t >::nbKmers () const
Return the number of k-mers in the graph.
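The values returned by length() and nbKmers() are linked by a simple count: a sequence of length L contains L - k + 1 overlapping k-mers. The sketch below computes both totals for a list of unitig strings, assuming every unitig has length at least k (as stated for the constructor) and that each unitig is counted once. `GraphTotals` and `totals` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Totals computed over a set of unitig sequences.
struct GraphTotals {
    std::size_t length;    // sum of the unitig lengths
    std::size_t nb_kmers;  // sum of the k-mer counts
};

GraphTotals totals(const std::vector<std::string>& unitigs, std::size_t k) {
    GraphTotals t{0, 0};
    for (const std::string& u : unitigs) {
        assert(u.size() >= k);           // a unitig holds at least one k-mer
        t.length += u.size();
        t.nb_kmers += u.size() - k + 1;  // overlapping k-mers in this unitig
    }
    return t;
}
```

For two unitigs ACGTA and ACGT with k = 4, the total length is 9 and the number of k-mers is 2 + 1 = 3.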
bool CompactedDBG< Unitig_data_t, Graph_data_t >::operator!= (const CompactedDBG< U, G > & o) const [inline]
Inequality operator.
CompactedDBG<U, G>& CompactedDBG< Unitig_data_t, Graph_data_t >::operator+= (const CompactedDBG< U, G > & o)
Addition assignment operator (merge a compacted de Bruijn graph).
After merging, all unitigs of o have been added to and compacted with the current compacted de Bruijn graph (this). If the unitigs of o had data of type "MyUnitigData" associated, they have been added to the current compacted de Bruijn graph using the functions of the class MyUnitigData which are also present in its base class CDBG_Data_t<MyUnitigData>. This function is similar to CompactedDBG::merge except that it uses only one thread, while CompactedDBG::merge can work with multiple threads (number of threads provided as a parameter). Note that if multiple compacted de Bruijn graphs have to be merged, it is more efficient to call CompactedDBG::merge with a vector of CompactedDBG as input.
o | is a constant reference to the compacted de Bruijn graph to merge. |
CompactedDBG<U, G>& CompactedDBG< Unitig_data_t, Graph_data_t >::operator= (CompactedDBG< U, G > && o)
Move assignment operator (move a compacted de Bruijn graph).
The content of o is moved ("transferred") to the current compacted de Bruijn graph (this). The compacted de Bruijn graph referenced by o will be empty after the call to this operator.
o | is an rvalue reference to the compacted de Bruijn graph to move. |
CompactedDBG<U, G>& CompactedDBG< Unitig_data_t, Graph_data_t >::operator= (const CompactedDBG< U, G > & o)
Copy assignment operator (copy a compacted de Bruijn graph).
This function is expensive in terms of time and memory as the content of a compacted de Bruijn graph is copied. After the call to this function, the same graph exists twice in memory.
o | is a constant reference to the compacted de Bruijn graph to copy. |
bool CompactedDBG< Unitig_data_t, Graph_data_t >::operator== (const CompactedDBG< U, G > & o) const
Equality operator.
bool CompactedDBG< Unitig_data_t, Graph_data_t >::read (const string & input_graph_fn, const size_t nb_threads = 1, const bool verbose = false)
Load a Compacted de Bruijn graph from disk (GFA1 or FASTA format).
This function detects whether an index file (BFI format) with the same prefix as the graph file exists and, if so, uses it to load the graph. Otherwise, loading is slower than calling read() with an index file. If the input GFA file was not built by Bifrost or if the input is in FASTA format, it is your responsibility to make sure that the graph is correctly compacted and to correctly set the parameters of the graph (such as the k-mer length) before calling this function.
input_graph_fn | is a string containing the name of the graph file to read. |
nb_threads | is a number indicating how many threads can be used to read the graph from disk. |
verbose | is a boolean indicating if information messages must be printed during the function execution. |
bool CompactedDBG< Unitig_data_t, Graph_data_t >::read (const string & input_graph_fn, const string & input_index_fn, const size_t nb_threads = 1, const bool verbose = false)
Read a Compacted de Bruijn graph from disk (GFA1, FASTA or BFG format) using an index file (BFI format).
Index files make loading much faster than reading the graph without one. If the input GFA file was not built by Bifrost or if the input is in FASTA format, it is your responsibility to make sure that the graph is correctly compacted and to correctly set the parameters of the graph (k-mer length and minimizer length) before calling this function.
input_graph_fn | is a string containing the name of the graph file to read. |
input_index_fn | is a string containing the name of the index file to read. |
nb_threads | is a number indicating how many threads can be used to read the graph from disk. |
verbose | is a boolean indicating if information messages must be printed during the function execution. |
bool CompactedDBG< Unitig_data_t, Graph_data_t >::remove (const const_UnitigMap< U, G > & um, const bool verbose = false)
Remove a unitig from the Compacted de Bruijn graph.
um | is a UnitigMap object containing the information of the unitig to remove from the graph. |
verbose | is a boolean indicating if information messages must be printed during the execution of the function. |
vector<pair<size_t, UnitigMap<U, G> > > CompactedDBG< Unitig_data_t, Graph_data_t >::searchSequence (const string & s, const bool exact, const bool insertion, const bool deletion, const bool substitution, const bool or_exclusive_match = false)
Perform an exact and/or inexact search of the k-mers of a query sequence in the Compacted de Bruijn graph.
s | is a string representing the sequence to be searched (the query). |
exact | is a boolean indicating if the exact k-mers of string s must be searched. |
insertion | is a boolean indicating if the inexact k-mers of string s, with one insertion, must be searched. |
deletion | is a boolean indicating if the inexact k-mers of string s, with one deletion, must be searched. |
substitution | is a boolean indicating if the inexact k-mers of string s, with one substitution, must be searched. |
or_exclusive_match | is a boolean indicating that the inexact k-mers at a given position in s must NOT be searched if the corresponding exact k-mer at that position is found in the graph. This option can substantially decrease the running time. |
vector<pair<size_t, const_UnitigMap<U, G> > > CompactedDBG< Unitig_data_t, Graph_data_t >::searchSequence (const string & s, const bool exact, const bool insertion, const bool deletion, const bool substitution, const bool or_exclusive_match = false) const
Perform an exact and/or inexact search of the k-mers of a query sequence in the Compacted de Bruijn graph.
s | is a string representing the sequence to be searched (the query). |
exact | is a boolean indicating if the exact k-mers of string s must be searched. |
insertion | is a boolean indicating if the inexact k-mers of string s, with one insertion, must be searched. |
deletion | is a boolean indicating if the inexact k-mers of string s, with one deletion, must be searched. |
substitution | is a boolean indicating if the inexact k-mers of string s, with one substitution, must be searched. |
or_exclusive_match | is a boolean indicating that the inexact k-mers at a given position in s must NOT be searched if the corresponding exact k-mer at that position is found in the graph. This option can substantially decrease the running time. |
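To give a sense of the cost of an inexact search, the sketch below enumerates every k-mer that differs from a given k-mer by exactly one substitution. A k-mer has 3k such variants, which suggests why skipping them when the exact k-mer is already found (or_exclusive_match) can save time. How the library enumerates its inexact k-mers is not described on this page, so this is only an illustration; `substitutionNeighbors` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// All strings obtained from `kmer` by replacing exactly one base with a
// different base of the alphabet {A, C, G, T}.
std::vector<std::string> substitutionNeighbors(const std::string& kmer) {
    static const char alphabet[] = {'A', 'C', 'G', 'T'};
    std::vector<std::string> out;
    out.reserve(3 * kmer.size());
    for (std::size_t i = 0; i < kmer.size(); ++i) {
        for (char c : alphabet) {
            if (c == kmer[i]) continue;  // skip the unchanged base
            std::string variant = kmer;
            variant[i] = c;
            out.push_back(variant);
        }
    }
    return out;
}
```

For the default k = 31, this already means 93 candidate k-mers per position for substitutions alone.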
bool CompactedDBG< Unitig_data_t, Graph_data_t >::simplify (const bool delete_short_isolated_unitigs = true, const bool clip_short_tips = true, const bool verbose = false)
Simplify the Compacted de Bruijn graph: clip short (< 2k length) tips and/or delete short (< 2k length) isolated unitigs.
delete_short_isolated_unitigs | is a boolean indicating if short isolated unitigs must be removed. |
clip_short_tips | is a boolean indicating if short tips must be clipped. |
verbose | is a boolean indicating if information messages must be printed during the function execution. |
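The two simplification rules can be sketched as a classification of unitigs. The example assumes the usual definitions: an isolated unitig has no neighbour on either side, and a tip is connected on exactly one side. The exact criteria applied by the library beyond the 2k length threshold are not given on this page. `Action`, `UnitigInfo` and `classify` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

enum class Action { Keep, ClipTip, DeleteIsolated };

// Minimal description of a unitig for this sketch.
struct UnitigInfo {
    std::size_t length;   // unitig length in bases
    std::size_t nb_pred;  // number of predecessors in the graph
    std::size_t nb_succ;  // number of successors in the graph
};

Action classify(const UnitigInfo& u, std::size_t k,
                bool delete_short_isolated_unitigs = true,
                bool clip_short_tips = true) {
    assert(k > 0);
    if (u.length >= 2 * k) return Action::Keep;  // only short unitigs qualify

    const bool no_pred = (u.nb_pred == 0);
    const bool no_succ = (u.nb_succ == 0);
    if (no_pred && no_succ) {  // isolated: no neighbour at all
        return delete_short_isolated_unitigs ? Action::DeleteIsolated
                                             : Action::Keep;
    }
    if (no_pred != no_succ) {  // tip: connected on exactly one side
        return clip_short_tips ? Action::ClipTip : Action::Keep;
    }
    return Action::Keep;       // connected on both sides
}
```

With k = 31, a unitig of length 40 is short (40 < 62), so it is deleted if isolated and clipped if it is a tip, while a unitig of length 62 is always kept.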
size_t CompactedDBG< Unitig_data_t, Graph_data_t >::size () const [inline]
Return the number of unitigs in the graph.
bool CompactedDBG< Unitig_data_t, Graph_data_t >::write (const string & output_fn, const size_t nb_threads = 1, const bool GFA_output = true, const bool FASTA_output = false, const bool BFG_output = false, const bool write_index_file = true, const bool compressed_output = false, const bool verbose = false) const
Write the Compacted de Bruijn graph to disk (GFA1, FASTA or BFG format).
output_fn | is a string containing the name of the file in which the graph will be written. |
nb_threads | is a number indicating how many threads can be used to write the graph to disk. |
GFA_output | indicates if the graph will be output in GFA format. |
FASTA_output | indicates if the graph will be output in FASTA format. |
BFG_output | indicates if the graph will be output in BFG/BFI format. |
write_index_file | indicates if an index file is written to disk. Index files enable faster graph loading. This parameter is discarded if BFG format output is selected (index output is required then). |
compressed_output | indicates if the output file is compressed. |
verbose | is a boolean indicating if information messages must be printed during the function execution. |