A Hidden Markov Model for identifying essential and growth-defect regions in bacterial genomes from transposon insertion sequencing data

Background: Knowledge of which genes are essential to the survival of an organism is critical to understanding the function of genes, and for the identification of potential drug targets for antimicrobial treatment. Previous statistical methods for assessing essentiality based on sequencing of transposon libraries have usually limited their assessment to strict 'essential' or 'non-essential' categories. However, this binary view of essentiality does not accurately represent the more nuanced ways the growth of an organism might be affected by the disruption of its genes. In addition, these methods often limit their analysis to open reading frames. We propose a novel method for analyzing sequence data from transposon mutant libraries using a Hidden Markov Model (HMM), along with formulas to adapt the parameters of the model to different datasets for robustness. This approach allows for the clustering of insertion sites into distinct regions of essentiality across the entire genome in a statistically rigorous manner, while also allowing for the detection of growth-defect and growth-advantage regions.

Results: We evaluate the performance of a 4-state HMM on a sequence dataset of M. tuberculosis transposon mutants. We also test the HMM on several synthetic datasets representing different levels of transposon insertion density and sequence coverage. We show that the HMM produces results that are highly correlated with previous assignments of essentiality for this organism. We also show that it detects growth-defect and growth-advantage genes previously shown to impair or enhance growth when disrupted.

Conclusions: A 4-state HMM provides an improved way of analyzing Tn-seq data and assessing different levels of essentiality that enables not only the characterization of essential and non-essential genes, but also genes whose disruption leads to impairment (or enhancement) of growth.

DeJesus, M.A., Ioerger, T.R. A Hidden Markov Model for identifying essential and growth-defect regions in bacterial genomes from transposon insertion sequencing data. BMC Bioinformatics. 2013. 14:303.

Contact Information

If you have any questions, contact us at: ioerger@cs.tamu.edu.

Introduction

The software available here is a Python implementation of the Hidden Markov Model referenced above. It uses read counts obtained by sequencing libraries of transposon mutants to determine the essentiality of genes. Within the HMM framework, observed read counts are modeled with the geometric distribution, and the state sequence that best explains the observations is determined with the Viterbi algorithm. The HMM is implemented with 4 states, representing categories of essentiality in order of increasing read counts (i.e., essential, growth-defect, non-essential, growth-advantage).
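As a rough illustration of the idea described above, the following is a minimal sketch of Viterbi decoding over 4 states with geometric emissions. It is not the released implementation: the geometric parameters in GEOM_P, the uniform starting distribution, the single "stay" transition probability, and the shift of each count by one (so that sites with zero reads fall inside the support of the distribution) are all assumptions chosen for illustration. The released script adapts its parameters to each dataset.

```python
import math

# Hypothetical values for illustration only; not the parameters used by the tool.
STATES = ["essential", "growth-defect", "non-essential", "growth-advantage"]
GEOM_P = [0.9, 0.2, 0.02, 0.004]  # larger p means smaller expected read count


def log_geometric(k, p):
    """Log-probability of k under a Geometric(p) distribution on 1, 2, 3, ..."""
    return (k - 1) * math.log(1.0 - p) + math.log(p)


def viterbi(counts, stay=0.999):
    """Return the most likely state index for each insertion site."""
    n = len(STATES)
    switch = (1.0 - stay) / (n - 1)
    log_trans = [[math.log(stay if i == j else switch) for j in range(n)]
                 for i in range(n)]
    # Assumption: shift counts by one so a site with zero reads has k = 1.
    obs = [c + 1 for c in counts]

    # Initialise with a uniform prior over states.
    score = [math.log(1.0 / n) + log_geometric(obs[0], GEOM_P[s])
             for s in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1

    for k in obs[1:]:
        prev = score
        pointers = []
        score = []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + log_trans[i][j])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][j]
                         + log_geometric(k, GEOM_P[j]))
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy input: 20 sites with no reads followed by 20 sites with 50 reads each.
labels = [STATES[s] for s in viterbi([0] * 20 + [50] * 20)]
print(labels[0], labels[-1])
```

Working in log space avoids numerical underflow on long sequences of sites, and the high "stay" probability is what encourages neighboring sites to be grouped into contiguous regions rather than labeled independently.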

Source Code

Source code is written in Python, and comes with a README document containing instructions.

Version History

Version 1.03 [Download]

Change-log: Additional functionality for the post-processing script, process_genes.py. See README file.

Version 1.02 [Download]

Change-log: Added flags to specify parameters, transition probabilities, and number of states. See README file.

Version 1.01 [Download]

Change-log: Removed dependency on utils module. Suppressed extraneous divide-by-zero errors.

Requirements:

Python 2.7.1+ www.python.org
Scipy 0.6.0+ www.scipy.org/Download
Numpy 1.2.1+ www.scipy.org/Download

Source code can be extracted by using the following command:

tar -xvzf tn_hmm_1.00.tar.gz

Data

Example files are provided below to test the execution of the script and help verify that input files are in the appropriate format:

File #1 (Example of data in WIG format.)
Glycerol TraSH Data (Counts of reads at TA sites in the H37Rv genome, from the glycerol dataset, table S1, of Griffin et al. (2011) mapped using SOAP, in WIG format.)
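To help check that an input file is in the expected shape, here is a minimal sketch of reading WIG data into (position, count) pairs. It assumes the variableStep flavor of WIG, with one "position count" pair per line after the declaration line. The function name read_wig, the chromosome name, and the example values are hypothetical, and the released script may parse its input differently.

```python
def read_wig(lines):
    """Parse variableStep WIG lines into a list of (position, count) pairs."""
    sites = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and declaration lines.
        if not line or line.startswith(("#", "track", "browser", "variableStep")):
            continue
        fields = line.split()
        # Counts may be written as "14" or "14.0"; store them as integers.
        sites.append((int(fields[0]), int(float(fields[1]))))
    return sites


# Hypothetical three-site example; real files list every TA site in the genome.
example = """variableStep chrom=example_chrom
60 0
72 0
102 14
""".splitlines()

print(read_wig(example))  # [(60, 0), (72, 0), (102, 14)]
```

Because the function accepts any iterable of lines, an open file handle can be passed to it directly in place of the example list.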

Copyright Information

The method and implementation provided on this website were created by Michael A. DeJesus and Thomas R. Ioerger and are licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.

If you wish to use this source code, please provide attribution by using the following citation:

DeJesus, M.A., Ioerger, T.R. A Hidden Markov Model for identifying essential and growth-defect regions in bacterial genomes from transposon insertion sequencing data. BMC Bioinformatics. 2013. 14:303.
