Introduction
The software available here is a python implementation of the Bayesian analysis method referenced above. It utilizes read information obtained from sequencing libraries of transposon mutants, to determine the essentiality of genes. Using a Bayesian framework, essentiality is modeled through the Extreme Value (Gumbel) distribution, which characterizes the maximum run of non-insertions (i.e. number of consecutive TA sites lacking insertion in a row). Genes with significantly larger runs of non-insertion thant statistically expected have a higher likelihood of essentiality. A Metropolis-Hastings sampling procedure is utilized to sample from conditional densities of essentiality for all genes, and posterior estimates of the probability of being essential are estimated for all genes.
Source Code
Source code is written in Python, and is available online at http://saclab.tamu.edu/essentiality/
Before running the python script, please make sure you have installed all the necessary prerequisites listed above, and have downloaded the latest version of the script. Once the prerequisite software and libraries are installed, to run the script simply type the following command:
python gumbelMH.py -f test.igv.
Where "-f" is a flag for the input file, and "test.igv" is an example file name. Below the input file formats are described, followed by description of the input flags available, and finally a description of the output format for the results.
Read data must be contained in a tab-delimited text-file according to the IGV format, with two required tracks (i.e. Reads, TAs) representing the read count information of all TA sites in the genome (i.e. with or without insertions), as shown below. The first 5 columns are part of the standard IGV format for representing features in the Integrated Genomic Viewer application. The last two columns represent necessary tracks used by the program to determine the number of reads found at each TA site. TA sites without insertions must be included in the file, and their read counts are represented with zeros. Below is an example of an input file in this format:
#type=COPY_NUMBER
Chromosome Start End Feature Reads TAs
NC_000962.2 59 61 Rv0001 0 1
NC_000962.2 71 73 Rv0001 2 1
NC_000962.2 101 103 Rv0001 36 1
The table below describes each of the columns necessary for the input file:
Column Header | Column Definition | |
Chromosome: | NCBI Chromosome ID. |
Start: | Start Position of the TA dinucleotides within the genome. |
End: | End Position of the TA dinucleotides within the genome. |
Feature: | ID of the ORF containing the TA dinucleotide. |
Reads: | Read count observed at this particular TA site. Empty sites are represented by zero (0). |
TAs | A constant value of "1" for all TA sites. Helpful for to identifying TA sites within the IGV Genomics Viewer. |
Example files are provided below to test the execution of the script and help verify that input files are in the approrpiate format:
The table below outlines the flags accepted by the python script. All but the file flag, "-f", are optional. Default values are described below.
Flag | Value | Definition |
-f | [String] | Path to IGV formatted file containing the reads mapped to the genome. |
-s | [Integer] | Desired sample size for the Metropolis Hastings sampling procedure. Default: -s 10000 |
-m | [Integer] | Minimum read count to be considered as an "insertion". Default: -m 2 |
-p | [Float] | Starting value of the phi parameter (i.e. probability of non-insertion at non-essential genes). Default: -p 0.5 |
-b | [Integer] | Number of iterations to be discarded as a Burn-In period. Default: -b 100 |
-t | [Integer] | Only every t-th sample is kept. Helps decrease correlation between samples. Default: -t 1 |
-v | N/A | Final Sample of essentiality assignments and phi parameter is printed along side results. |
Results are printed to screen in tab-separated format:
# Copyright 2013. Michael A. DeJesus & Thomas R. Ioerger
# Version 1.21; http://saclab.tamu.edu/essentiality/
#
#Command used: python gumbelMH.py -f example_reads1.igv -s 100 -b 1 -t 1
#FDR Corrected thresholds: 0.0 1.0
#MH Acceptance-Rate: 99.00%
#Total Iterations Performed: 100
#Sample Size: 100
#phi estimate: 0.318559
#Orf k n r s zbar Call Sample
Rv0001 1 32 31 1365 1.000000 E 1,1,1,1,1,1,1,1,1,1
Rv0002 4 8 4 70 0.000000 NE 0,0,0,0,0,0,0,0,0,0
Rv0003 0 8 8 316 0.990000 E 1,1,1,1,1,1,1,1,1,1
Header/Comments are proceded with "#" tags, and are followed by a row for each ORF found in the input file. Below, each of the columns in is defined:
Column Header | Column Definition | |
ORF: | ORF ID found in input file. |
k: | Number of Transposon Insertions Observed within the ORF. |
n: | Total Number of TA dinucleotides within the ORF. |
r: | Length of the Maximum Run of Non-Insertions observed. |
s: | Span of nucleotidies for the Maximum Run of Non-Insertions. |
zbar | Posterior Probability of Essentiality. |
sample | Comma-Separated list of sample values. Only printed when "-v" flag was printed. |
The execution of the software on the representative dataset takes several hours on running on a linux server. In general, this will depend on the number of genes in the genome being analyzed, and the number of of samples desired. The number of samples to be taken can be controlled using the flag "-s". Furthermore, thinning the sample through the "-t" flag will also affect running time as this requires more iterations of the algorithm to achieve the desired effective sample size after thinning. Similarly, the burn-in period (set with the "-b" flag) will also increase the number of iterations required.
Creative Commons License
Attribution-NonCommercial 3.0 Unported
© Copyright 2012 -
.
Michael A. DeJesus & Thomas R. Ioerger.