Instructions for using the Mycobacterial Genome Database

The site is a database of genomes of mycobacterial strains sequenced in the lab of James C. Sacchettini at Texas A&M University. It provides access to sequence data (including coverage statistics) and supports comparison of polymorphisms among various strains of tuberculosis, with a focus on drug resistance. The sequencing is done on an Illumina Genome Analyzer II (short reads). The data were analyzed using customized sequence-assembly methods written by Tom Ioerger and his group at Texas A&M. The data provided on this site are intended for research and collaboration purposes only. The site was originally developed by Krishna Ganesula and is currently maintained by Tom Ioerger. Any questions may be emailed to

1. Browsing the Genome Database

If you already have a login, you will be redirected to the Genome Database main page once you have logged in. The main page displays a summary of the strains sequenced in the Sacchettini lab. All the information relevant to the analysis of a strain is shown in a table with one row per strain and one column per information field.

i) Overview for each field

A unique name/identifier for the strain.

Applicable if the same strain has been sequenced multiple times. The default version number is 1.

A rough placement of the strain into the closest strain family by spoligotype, based on SpolDB4 (ref), or into a SNP group cluster (ref - Minimal SNPs, Alland).

Some information for the strain available prior to sequencing.

Location (country/city) where the strain was obtained.

Contributor for the strain.

Species the strain originated from, e.g. Mycobacterium tuberculosis.

Determined "virtually" by analysis of sequencing data (matching reads to oligos for the 43 spacers in the direct-repeat region), and written in octal format.

a) Reference - A strain whose sequence is known a priori, so it can be used as a reference genome for sequencing new strains.
b) Lab - Generally based on a reference strain, with some modifications. These are also used for analysis of derived strains.
c) Isogenic - A strain originating or derived from a certain parent.
d) Clinical Isolate

Applicable for Isogenic strains only. This field specifies the parent strain.

Lists the drugs to which the strain has been verified as resistant.

Lists the drugs to which the strain has been verified as sensitive.

Date when the sequence was analysed.

Author of the final sequence.

The strain that was actually used for sequence determination (comparative mapping). Even if the strain belongs to a known family, there might not be a closely related reference strain whose genome sequence is available. This is especially true for clinical isolates. For isogenic strains, the reference strain will usually be the parental strain.

Refers to the internal database management of pairwise alignments between strains. Each strain must be aligned to one other in the database. Often this is the same as the reference strain used to build the sequence, but not necessarily.

Average number of reads covering a location in the genome.

Proportion of sites in the genome with a coverage greater than 0.

Approximate count of nucleotide substitutions with respect to the reference genome. Note: This is an overestimate, as it includes polymorphisms in PGRS and PPE genes.

Approximate count of insertions/deletions with respect to the reference genome. Note: This is an overestimate, as it includes polymorphisms in PGRS and PPE genes.

Source from which the entire sequence or the set of reads was obtained, e.g. NCBI for reference strains, solexa for derived strains.

Applicable only to Reference strains like those obtained from NCBI.

Run on which the strain was sequenced.

Lane the strain was placed on.

Unique tag for demultiplexing.

Supporting documents, if any, for this strain.

1=Partial Analysis. Single Nucleotide Polymorphisms (SNPs) only; no insertions/deletions (indels) identified
2=Contig Building. SNPs and short indels are identified
3=Exhaustive Analysis. Large-scale indels are also identified
Level 1: Be cautious, SNPs may not be real; often used when coverage is low; aligns exactly with the reference sequence (biased)
Level 2: Reasonable confidence in SNPs and indels, but large-scale polymorphisms have not been detected; genome sequence is not "complete" but is useful for understanding mutations in most regions; paired-end data helps disambiguate reads in repetitive regions
Level 3: Represents a best effort at generating a complete new sequence, including finding insertions of novel DNA and movement of transposons; relies heavily on paired-end data

Type of sequencing data: reads can be single-end or paired-end.

Length of reads used for generating the sequence.
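
As an illustration of two of the computed fields above, the following Python sketch shows how a spacer presence/absence pattern might be written in octal format and how the two coverage figures could be derived from per-site read depths. It is not the site's own code: the function names, the '1'/'0' string input and the list of depths are assumptions made for this example. The octal conversion follows the usual spoligotype convention, in which the first 42 spacers are read as 14 triplets and the 43rd spacer gives one final digit.

```python
def spoligotype_octal(pattern):
    """Convert a 43-character presence/absence string to a 15-digit octal code.

    pattern: '1' where a spacer is present, '0' where it is absent
    (hypothetical input format chosen for this sketch).
    """
    if len(pattern) != 43 or set(pattern) - {"0", "1"}:
        raise ValueError("expected 43 characters, each '0' or '1'")
    # Spacers 1-42: each group of three binary digits becomes one octal digit.
    digits = [str(int(pattern[i:i + 3], 2)) for i in range(0, 42, 3)]
    # Spacer 43 stands alone and is written as 0 or 1.
    digits.append(pattern[42])
    return "".join(digits)


def coverage_summary(depths):
    """Return (mean depth, proportion of sites with depth > 0).

    depths: one read count per genome position (assumed input).
    """
    if not depths:
        raise ValueError("no sites given")
    mean_depth = sum(depths) / len(depths)
    covered = sum(1 for d in depths if d > 0) / len(depths)
    return mean_depth, covered


if __name__ == "__main__":
    # Illustrative pattern: all spacers present except 20-21 and 33-36.
    bits = ["1"] * 43
    for spacer in (20, 21, 33, 34, 35, 36):
        bits[spacer - 1] = "0"
    print(spoligotype_octal("".join(bits)))
    print(coverage_summary([0, 10, 20, 30]))
```

Keeping the spacer pattern as a plain string makes it easy to compare against published binary patterns before converting to octal.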

ii) Features for browsing

a) Strains can be sorted by any desired field by clicking on the header of the corresponding column.

b) To display the values for all fields, click on "Display All" (located bottom right).

c) To select a specific set of fields, choose "Customize Columns", mark the required fields and click Submit.

d) To compare the strains based on drug resistance, choose "Add Drugs", mark the required set of drugs and click Submit. A column will be added for each selected drug. 'R' denotes resistant, 'S' denotes sensitive, and '-' indicates that the strain has not been tested against this drug.

e) Clicking on a strain ID takes you to a detailed page for this strain. Here you can view or download the sequence of this strain. You can also see how it aligns with its parent sequence.

To align the sequences of multiple strains, mark the desired rows and run 'Align Selected'. This takes you to the multiple alignment page.
Note: You might be prompted to rerun this step if the parent strain of any selected strain is not included in your selection. In this case, reselect the strains, include the parent strain, and submit again.

2. The Multiple Alignment Page

This step potentially aligns the sequence of each strain to all the other strains in the selection. The aligned sequence of each strain is displayed horizontally next to its strain ID. The gene names and their corresponding boundaries are also annotated in the multiple alignment. The numbering of each row refers to the position in the multiple alignment, not to a position in any particular strain. Each page displays a range of 5000 bp.

i) Navigation

a) You can navigate to the next and previous 5000 bases by clicking on "Next" and "Prev" respectively.
b) You can also enter a location at 'Jump to Location/Gene' and click Submit. This takes you directly to that location in the multiple alignment.
c) Instead of a location, you can also give the gene name or the Rv name to locate the alignment of the strains at this gene.

ii) Mutations

a) To navigate through the list of mutations on an alignment page, use tabs.
b) The color coding on the table is as follows:
   i) SNPs/Indels which have already been identified are highlighted in yellow, with text in red.
   ii) Regions with 0 coverage hint at possible deletions. These regions are highlighted in gray, with text in black.
c) Clicking on a mutation gives more information about the exact nucleotide/frame numbers where the mutation occurs and the corresponding amino acid translation.
d) Click on "Coverage Statistics" to see more data (in vertical format) on coverage and confidence for each site. Confidence is the percentage of residues at a site that agree with the majority base call. Typically, this value is greater than 90%; a value below 70% signals heterogeneity, which may be a problem. '*' marks lines where there is a polymorphism.
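
The confidence figure described in d) can be sketched in a few lines of Python. This is only an illustration of the definition given above (share of reads supporting the majority base), not the code behind the site. The function names and the choice of passing the read bases at a site as a string are assumptions made for this example.

```python
from collections import Counter


def site_statistics(bases):
    """Return (coverage, majority base, confidence in percent) for one site.

    bases: the base observed in each read covering the site, e.g. "AAAG"
    (hypothetical input format for this sketch).
    """
    counts = Counter(bases)
    coverage = sum(counts.values())
    if coverage == 0:
        # No reads cover the site, so there is no base call to score.
        return 0, None, None
    majority_base, majority_count = counts.most_common(1)[0]
    confidence = 100.0 * majority_count / coverage
    return coverage, majority_base, confidence


def is_heterogeneous(confidence, warn_below=70.0):
    """Flag sites whose confidence falls below the 70% mark mentioned above."""
    return confidence is not None and confidence < warn_below


if __name__ == "__main__":
    # 18 reads with A and 2 reads with G at the same site.
    coverage, base, confidence = site_statistics("A" * 18 + "G" * 2)
    print(coverage, base, confidence, is_heterogeneous(confidence))
```

A site with zero coverage is reported without a confidence value, which matches the page's treatment of such regions as possible deletions rather than base calls.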

iii) Loci

Loci represent regions of interest in the multiple alignment. Suppose we wish to find the regions of the multiple alignment that satisfy all of these conditions: strain 1 differs from strain 2, strain 1 differs from strain 4, and strain 3 differs from strain 4. Additionally, to ensure that these are all valid mutations, we require a minimum coverage of 10 and a confidence of at least 75% at these locations.

This can be done using the 'Get Loci' section (at the top of the multiple alignment page). First, construct the loci rule as 1.2,1.4,3.4 (which translates to: 1 is different from 2, 1 is different from 4, 3 is different from 4). Enter this value in 'Specify Rule', set 'Specify Coverage Threshold' to 10 and 'Specify Confidence Threshold' to 75, and click 'Get Loci'. This returns the entire list of loci satisfying the rule. From here you can find the exact locus in the multiple alignment by clicking on it, or you can browse the loci file directly.

Note: This also works if you wish to find loci that satisfy at least one of the listed conditions. For this, choose 'Any' instead of 'All' in the dropdown adjacent to the rule.
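
To make the rule syntax concrete, here is a minimal Python sketch of how a rule such as 1.2,1.4,3.4 could be evaluated against an alignment. It is an illustration under stated assumptions, not the site's implementation: the `Call` record, the function names, the numbering of strains by their row order, and the decision to apply both thresholds to each strain in a pair are all hypothetical choices made for this example.

```python
from typing import List, NamedTuple, Tuple


class Call(NamedTuple):
    """One strain's base call at one alignment column (assumed structure)."""
    base: str
    coverage: int
    confidence: float  # percent


def parse_rule(rule: str) -> List[Tuple[int, int]]:
    """Turn '1.2,1.4,3.4' into [(1, 2), (1, 4), (3, 4)]."""
    pairs = []
    for term in rule.split(","):
        first, second = term.strip().split(".")
        pairs.append((int(first), int(second)))
    return pairs


def pair_differs(column, pair, min_cov, min_conf):
    """True if both strains pass the thresholds and their bases differ."""
    a, b = column[pair[0] - 1], column[pair[1] - 1]  # strains are 1-based
    reliable = all(c.coverage >= min_cov and c.confidence >= min_conf
                   for c in (a, b))
    return reliable and a.base != b.base


def find_loci(columns, rule, min_cov=10, min_conf=75.0, mode="all"):
    """Return 1-based alignment positions that satisfy the rule.

    mode="all" requires every condition; mode="any" requires at least one.
    """
    pairs = parse_rule(rule)
    combine = all if mode == "all" else any
    return [pos for pos, column in enumerate(columns, start=1)
            if combine(pair_differs(column, p, min_cov, min_conf)
                       for p in pairs)]


if __name__ == "__main__":
    good = lambda base: Call(base, 30, 95.0)
    columns = [
        [good("A"), good("G"), good("A"), good("G")],         # all three differ
        [good("A"), good("G"), good("G"), good("G")],         # 3 equals 4
        [Call("A", 5, 95.0), good("G"), good("A"), good("G")],  # strain 1 low coverage
    ]
    print(find_loci(columns, "1.2,1.4,3.4", mode="all"))
    print(find_loci(columns, "1.2,1.4,3.4", mode="any"))
```

Treating each condition as an independent pair test makes the 'All' and 'Any' options a one-line difference, which mirrors the dropdown described in the note.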