This example uses real public data from three Shigella sonnei genomes, two of which were sequenced twice and so allow for reproducibility testing. As Shigella sonnei is a sublineage of E. coli, we must use the E. coli MLST scheme. We will also use the resistance genes database provided with srst2, resistance.fasta (modified from ResFinder). 1. Download the files for the E. coli MLST scheme, using the script provided: getmlst.py --species "Escherichia coli#1" For SRST2, remember to check what separator is being used in this allele database Looks like --mlst_delimiter '-' >adk-1 --> --> ('adk', '-', '1') Suggested srst2 command for use with this MLST database: srst2 --output test --input_pe *.fastq.gz --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --mlst_delimiter '-' Note, this is correctly guessing that we should use the default --mlst_delimiter '-' with this database. The log file will tell you exactly what files were downloaded. Check that the allele sequences have been downloaded and compiled into a single fasta file: Escherichia_coli#1.fasta Check that the ST definitions have been downloaded: ecoli.txt 2. Download Illumina paired end read sets from the ENA. Note if you have limited download capacity or time to spare, you could download the first two files only (total ~30MB) for basic testing of srst2. # strain 20081885 from ERP000182 (14M, 13M; 51M, 44M) wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024070/ERR024070_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024070/ERR024070_2.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028678/ERR028678_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028678/ERR028678_2.fastq.gz # strain 20031275 from ERP000182 (89M, 94M; 88M, 74M) wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024082/ERR024082_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024082/ERR024082_2.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028690/ERR028690_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028690/ERR028690_2.fastq.gz # strain IB694 from ERP000182 (80M, 64M) wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024619/ERR024619_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024619/ERR024619_2.fastq.gz 3. Try running MLST and gene detection against a tiny read set with low coverage (~3x): srst2 --input_pe ERR024070*.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta Note that, as the databases have not been indexed for bowtie2 yet, that srst2 has to run bowtie2-build which spits out a lot of messages to stdout. This is not a problem! Bowtie2 also reports its mapping stats, it should indicate that there were only 400,882 reads in this data set, of which 0.02% mapped to our MLST/resistance loci. This probably won't be enough to get good quality information. Check the log file (shigella1.log) to see what happened. Outputs are printed to: shigella1__mlst__Escherichia_coli#1__results.txt - mlst results only shigella1__fullgenes__ARGannot__results.txt - resistance results, one line per gene shigella1__genes__ARGannot__results.txt - resistance results only, tabulated shigella1__compiledResults.txt - MLST and resistance genes, tabulated shigella1__ERR024070.Escherichia_coli#1.scores - full score & alignment info for all MLST alleles shigella1__ERR024070.ARGannot.scores - full score & alignment info for resistance genes with >90% coverage Because we ran with the --save_scores flag, we have also got a scores file for each database, which details the score and mapping information for every allele in that database. The ST was not called correctly because depth was too low (average read depth 2.7; this is printed in the MLST result table. 2 resistance genes were detected with >90% coverage (the default threshold), which look like variants of ampC and strB (marked with * to indicate they were not precise matches, and ? to indicate uncertainty because some bases weren't covered). This shows us that this level of data (which was considered a failed sequencing run) isn't enough to get good results. Luckily this genome was resequenced; see what happens when you analyse that data set. 4. Try running MLST and gene detection against a proper read set, and compile together with the results from the poor read set for comparison. srst2 --input_pe ERR028678*.fastq.gz --output shigella2 --log --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta --prev_output shigella1__compiledResults.txt Check the log file to see what happened. Outputs are printed to: shigella2__mlst__Escherichia_coli#1__results.txt - mlst results only shigella2__fullgenes__ARGannot__results.txt - resistance results, one line per gene shigella2__genes__ARGannot__results.txt - resistance results only, tabulated shigella2__compiledResults.txt - MLST and resistance genes, tabulated This time there is an ST called, which should look like this: Sample ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth maxMAF ERR028678 152? 11? 63 7? 1 14 7 7 0 adk-11/edge2.0;gyrB-7/edge2.0 13.9595714286 0.125 Note that the ST has a question mark (?) after it, indicating that we have some uncertainty about the call. The table shows this uncertainty comes from two loci, adk and gyrB, which are themselves labelled with question marks. The 'uncertainty' column indicates that the problem is a drop-off in depth at the edge of these loci, down to 2x, which is the default minimum to report a potential uncertainty (--min_edge_depth 2). There are also lots of resistance genes identified. Because we supplied a prior results file, shigella2__compiledResults.txt contains the new results as well as the old results. 5. Now try running on the other three read sets: srst2 --input_pe ERR024082*.fastq.gz ERR028690*.fastq.gz ERR024619*.fastq.gz --output shigella3 --log --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta 6. Now compile the results from all 5 read sets: srst2 --output all --prev_output shigella2__compiledResults.txt shigella3__compiledResults.txt This outputs a file, all__compiledResults.txt, containing the compilation of all results for MLST and resistance genes across the 5 strains.