Sequence Versions

In several cases, we have made incremental improvements to the genome sequences. Our convention has been to append a letter or number to the sequence id. For example, H37RvLPb is an updated version of H37RvLP, H37RvAE3 is an updated version of H37RvAE2, and so on. Please look for the most up-to-date version of your sequence. The versions of the Beijing strains described in our BMC Genomics paper and deposited in GenBank are: HN878d, X122g (R220 cluster representative), and R1207g (R86 cluster representative).

For the 6 genomes published in our J. Bact paper, the versions described (and deposited in GenBank as contigs) correspond to:

Level Of Analysis

We distinguish 3 levels of analysis (levels 1, 2, and 3), based on how much effort has been put into determining the genome sequence. The level has important implications for what you can assume about a sequence's quality and reliability. The levels express a tradeoff: level 1 is quick but least accurate (mainly for identifying simple SNPs), while level 3 takes weeks of work but produces the most accurate genome sequence we can.

Most strains are analyzed with level_of_analysis=2, which combines SNP detection with contig-building to detect small indels (1-100 bp). In this case, most SNPs can be called confidently, but not all indels can be corrected relative to the reference strain, especially in low-coverage regions. To check the confidence of a SNP, look at the coverage and purity on the Coverage Statistics page. If coverage is low (less than 5) or purity is low (less than 70%), you might want to treat the call with some skepticism; in fact, such a site might be caused by a local indel that has not been fixed yet.
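The coverage and purity check described above can be sketched in a few lines of Python. This is only an illustration: the SiteStats record, its field names, and the is_low_confidence function are hypothetical and not part of any tool on this site, and purity is assumed here to be stored as a fraction between 0 and 1 (so 70% is 0.70). The thresholds are the ones given in the text.

```python
from dataclasses import dataclass

# Thresholds taken from the text: coverage below 5 or purity below 70%.
MIN_COVERAGE = 5
MIN_PURITY = 0.70


@dataclass
class SiteStats:
    """Hypothetical record for one site, copied from the Coverage Statistics page."""
    position: int
    coverage: int    # number of reads covering the site
    purity: float    # assumed to be a fraction in the range 0.0-1.0


def is_low_confidence(site: SiteStats,
                      min_coverage: int = MIN_COVERAGE,
                      min_purity: float = MIN_PURITY) -> bool:
    """Return True if a SNP call at this site deserves some skepticism."""
    return site.coverage < min_coverage or site.purity < min_purity


# Made-up example values, for illustration only.
sites = [
    SiteStats(position=1000, coverage=42, purity=0.98),
    SiteStats(position=2000, coverage=3, purity=1.00),
    SiteStats(position=3000, coverage=30, purity=0.55),
]
suspect_positions = [s.position for s in sites if is_low_confidence(s)]
```

With the made-up values above, the sites at positions 2000 (low coverage) and 3000 (low purity) would be flagged. A flagged site is not necessarily wrong; it is simply a candidate for closer inspection, for example for an unresolved local indel.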

Even with this level of editing, there are limits to the sequence accuracy that can be achieved with short reads. In particular, the copy number of tandem repeats like MIRU sequences in VNTR regions cannot be accurately determined, so don't count on these for genotyping. Similarly, we do not systematically search for large-scale insertions and deletions (including IS6110 elements) with level_of_analysis=2, so you should be aware that they might be present in the sequenced strain but not built into the genome.

Strains with level_of_analysis=1 have been analyzed only by mapping reads against the reference sequence and calling SNPs; no contig-building has been applied, and thus no indels have been edited.

For strains with level_of_analysis=3, a thorough effort has been applied to identifying large-scale insertions and deletions (>100 bp), including transposition of IS6110 sequences.
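As a quick reference, the three levels can be summarized as a small lookup table. The sketch below is a hypothetical illustration, not code used by this site: the feature names and the is_resolved function are invented for the example, and it assumes the levels are cumulative (that is, level 3 also includes the SNP and small-indel editing of level 2).

```python
# Which kinds of variation have been addressed at each level_of_analysis.
# Feature names are illustrative labels, not identifiers from any real tool.
LEVEL_FEATURES = {
    1: {"snps"},                                   # read mapping and SNP calling only
    2: {"snps", "small_indels"},                   # plus contig-building for 1-100 bp indels
    3: {"snps", "small_indels",                    # plus a thorough search for >100 bp
        "large_indels", "is6110_transpositions"},  # indels and IS6110 transpositions
}


def is_resolved(level_of_analysis: int, feature: str) -> bool:
    """Return True if the given kind of variation is addressed at this level."""
    if level_of_analysis not in LEVEL_FEATURES:
        raise ValueError(f"unknown level_of_analysis: {level_of_analysis}")
    return feature in LEVEL_FEATURES[level_of_analysis]
```

For example, is_resolved(2, "large_indels") is False, which matches the caveat above that large insertions and deletions may be present in a level 2 strain but not built into its genome. Note that tandem-repeat copy number is deliberately absent from every level, since it cannot be determined accurately from short reads.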