Introduction

The software available here is a python implementation of the Hidden Markov Model referenced above. It utilizes read information obtained from sequencing libraries of transposon mutants, to determine the essentiality of genes. Using a HMM framework, observed read-counts are modeled through the Geometric distribution and the state sequence which best explains the observations is determined through the Viterbi algorithm. The HMM is implemented with 4 states, representing categories of essentiality with increasing read-counts (i.e. essential, growth-defect, non-essential, growth-advantage).

Source Code

Source code is written in Python, and is available online at http://saclab.tamu.edu/essentiality/Tn-HMM/

Requirements:

Python 2.7.1+ www.python.org
Scipy 0.6.0+ www.scipy.org/Download
Numpy 1.2.1+ www.scipy.org/Download

Instructions

Before running the python script, please make sure you have installed all the necessary prerequisites listed above, and have downloaded the latest version of the script. Once the prerequisite software and libraries are installed, to run the script simply type the following command:

python tn_hmm.py -f test.wig > output_file.txt

Where "-f" is a flag for the input file, and "test.wig" is an example file name. Below the input file formats are described, followed by description of the input flags available, and finally a description of the output format for the results.

Input File Format

Read data must be contained in a WIG formatted file, with two columns representing the position and the read count observed at all TA sites in the genome (i.e. with or without insertions), as shown below. TA sites without insertions must be included in the file, and their read counts are represented with zeros. Below is an example of an input file in this format:

variableStep
60 0
72 0
102 3
188 42
246 0
333 25
360 3
426 0
448 75
471 65
483 24
494 0
504 23
514 1
525 183
534 98
601 0
653 0
670 0
706 0
741 0
784 0
794 0
843 0
989 0
1092 0
1104 0
1267 0
1278 0

Example files are provided below to test the execution of the script and help verify that input files are in the approrpiate format:

Flags

The table below outlines the flags accepted by the python script. All but the file flag, "-f", are optional. Default values are described below.

Flag	Value	Definition
-f	[String]	Path to WIG formatted file containing the reads mapped to the genome.
-gff	[String]	Optional. Path to GFF3 formatted file containing the coordinates, ID and Name for the genes. Output will contain an indicator line separating gene boundaries.
-p	[Float String]	Optional. String representation of the parameter values for the model. Specified as comma-separated float values in the interval (0,1). e.g. -p "0.9,0.1,0.01,0.02". If not specified, the parameters will be calculated as described in the publication.
-a	[Float String]	Optional. String representation of the Transition Probability Matrix. Speciefied as semi-colon separated rows, with comma-separated cells. Values in the interval (0,1). e.g. -a "[0.9,0.05,0.05; 0.05,0.9,0.05; 0.05,0.05,0.9]". If not specified, the values will be calculated as described in the publication.
-n	[Integer]	Optional. Integer value specifying the number of states. If this flag is specified, the "-a" and "-p" flags must also be specified to define custom values for the parameters and transition probabilities.

Output Format

Results are printed to screen in tab-separated format:

# Tn-HMM
# Command Used: python tn-hmm_1.00/hmm_geom.py -f example_reads1.wig
# 
# Mean: 49.27
# Median:   25.00
# pins (obs):   0.379310
# pins (est):   0.687500
# Run length (r):   4
# Self-Transition Prob: -6.0204e-07
# State Emission Parameters (theta):
#    ES: 0.9900   GD: 0.4239   NE: 0.0279   GA: 0.0056
# State Distributions:
#    ES: 44.83%   GD: 0.00%    NE: 55.17%    GA: 0.00%

60      0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   NE  
72      0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   NE  
102     3       9.90e-07    8.10e-02    2.56e-02    5.48e-03   NE  
188     42      9.90e-85    3.70e-11    8.50e-03    4.41e-03   NE  
246     0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   NE  
333     25      9.90e-51    4.36e-07    1.37e-02    4.84e-03   NE  
360     3       9.90e-07    8.10e-02    2.56e-02    5.48e-03   NE  
426     0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   NE  
448     75      9.90e-151   4.61e-19    3.35e-03    3.66e-03   NE  
471     65      9.90e-131   1.15e-16    4.44e-03    3.87e-03   NE  
483     24      9.90e-49    7.57e-07    1.41e-02    4.87e-03   NE  
494     0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   NE  
504     23      9.90e-47    1.31e-06    1.45e-02    4.90e-03   NE  
514     1       9.90e-03    2.44e-01    2.71e-02    5.54e-03   NE  
525     183     0.00e+00    6.27e-45    1.58e-04    2.00e-03   NE  
534     98      9.90e-197   1.43e-24    1.75e-03    3.22e-03   NE  
601     0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   E   
653     0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   E   
670     0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   E   
706     0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   E   
741     0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   E   
784     0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   E   
794     0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   E   
843     0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   E   
989     0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   E   
1092    0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   E   
1104    0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   E   
1267    0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   E   
1278    0       9.90e-01    4.24e-01    2.79e-02    5.57e-03   E

Header/Comments begin with a "#" tag and contain information about the parameters of the model as well as statistics of the resulting state sequence.

Column #	Column Definition
1	Coordinate of TA site
2	Observed Read Count
3	Gamma for ES state
4	Gamma for GD state
5	Gamma for NE state
6	Gamma for GA state
7	State Classification (ES = Essential, GD = Growth Defect, NE = Non-Essential, GA = Growth-Defect)

Post-Processing

Two post-processing scripts are included in the archive, to process the output of the HMM. These can be used to obtain classifications of essentiality in a gene by gene basis, and to obtain a list of regions (i.e. consecutive TA sites with the same state classification), for a given state:

Post-Processing Genes

To get a list of the genes and their respective classifications, you must first run the HMM using the -gff flag to specify the gene boundaries. Next, using the output produced by the HMM you can produce calls of essentiality for the individual genes:

python tn-hmm.py -f example.wig -gff genome.gff3 > hmm_output_file.txt

python process_genes.py -f hmm_output_file.txt > gene_calls_output.txt

The post-processing script will classify those genes containing statistically significant runs of ES states as essential, as described in the manuscript. If you wish to prevent this behavior, you can specify the "-nd" flag. In addition, you can specify that intergenic regions be included in the output by including the "-nc" flag.

Post-Processing Segments

To get a list of the segments or regions of a given state, you must first run the HMM using the -gff flag to specify the gene boundaries. Next, using the output produced by the HMM you can produce a list of segments belonging to the state specified. The list will include the orf ids and gene names of the genes that fall within the regions. Non-coding regions will be labeled as "non_coding_xx" where xx is a number that increments as necessary:

python tn-hmm.py -f example.wig -gff genome.gff3 > hmm_output_file.txt

python process_segments.py -f hmm_output_file.txt -s "ES" > segment_output.txt

Running Time

The execution of the software on the representative dataset takes less than 2 minutes while running on a linux server. In general, this will depend on the number of TA sites in the genome being analyzed.

Creative Commons License