There are many ways to pre-process Tn-Seq datasets, and the details can depend on the protocol used for Tn-Seq, the conventions used by the sequencing center, etc. However, TPP is written to accommodate the most common situation among our collaborating labs. In particular, it is oriented toward the Tn-Seq protocol developed in the Sassetti lab and described in (Long et al, 2015), which uses a barcoding system to uniquely identify reads from distinct transposon-junction DNA fragments. This allows raw read counts to be reduced to unique template counts, eliminating effects of PCR bias. The sequencing must be done in paired-end (PE) mode (with a minimum read-length of around 50 bp). The transposon terminus appears in the prefix of read1 reads, and barcodes are embedded in read2 reads.
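To illustrate how barcodes allow read counts to be reduced to template counts, here is a minimal sketch. It assumes each mapped read has already been summarized as a (TA site coordinate, barcode) pair; that input layout and the function name `reduce_to_templates` are hypothetical, and this is not TPP's actual code.

```python
from collections import defaultdict

def reduce_to_templates(mapped_reads):
    """Count reads and unique templates per site.

    mapped_reads: iterable of (site_coordinate, barcode) tuples
    (a hypothetical layout chosen for this illustration).
    """
    read_counts = defaultdict(int)
    barcodes_seen = defaultdict(set)
    for site, barcode in mapped_reads:
        read_counts[site] += 1            # every read counts once
        barcodes_seen[site].add(barcode)  # duplicate barcodes collapse
    # A template is one distinct barcode observed at a site.
    template_counts = {site: len(bcs) for site, bcs in barcodes_seen.items()}
    return dict(read_counts), template_counts

# Two reads at site 60 share a barcode, so they count as one template.
reads = [(60, "ACGTAC"), (60, "ACGTAC"), (60, "TTGACA"), (72, "GGCATC")]
read_counts, template_counts = reduce_to_templates(reads)
print(read_counts)      # {60: 3, 72: 1}
print(template_counts)  # {60: 2, 72: 1}
```

Reads that share both a site and a barcode are treated as PCR copies of the same original fragment, which is why the template count at site 60 is lower than the read count.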
The suffixes of read1 and read2 contain nucleotides from the genomic region adjacent to the transposon insertion. These subsequences must be mapped into the genome. TPP uses BWA (Burrows-Wheeler Aligner) to do this mapping. It is a widely-used tool, but you will have to install it on your system. Mapping large datasets takes time, on the order of 15 minutes (depending on many factors), so you will have to be patient.
Subsequent to the BWA mapping step, TPP performs several post-processing steps. Primarily, it tabulates raw read counts at each TA site in the reference genome, reduces them to template counts, and writes these out in .wig format (as input for essentiality analysis in TRANSIT). It also calculates and reports some statistics on the dataset which are useful for diagnostic purposes. These are saved in a local file called ".tn_stats". The GUI automatically reads all the .tn_stats files from previously processed datasets in a directory and displays them in a table.
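The tabulation step can be pictured with the following sketch, which locates every TA dinucleotide in a sequence and emits one "coordinate count" line per site. The function names, the toy genome, and the `variableStep` header (borrowed from the general wiggle convention) are assumptions for illustration; the exact header and layout that TPP writes may differ.

```python
def find_ta_sites(genome):
    """Return 1-based coordinates of every TA dinucleotide in the sequence."""
    genome = genome.upper()
    return [i + 1 for i in range(len(genome) - 1) if genome[i:i + 2] == "TA"]

def wig_lines(genome, counts, chrom="example"):
    """Build wig-style lines: a header, then 'coordinate count' per TA site.

    counts: dict mapping TA site coordinate -> count (hypothetical layout).
    Sites with no insertions are still listed, with a count of 0.
    """
    lines = ["variableStep chrom=%s" % chrom]
    for site in find_ta_sites(genome):
        lines.append("%d %d" % (site, counts.get(site, 0)))
    return lines

genome = "GGTACCTATAAG"          # toy sequence with TA at positions 3, 7, 9
template_counts = {3: 5, 9: 2}   # made-up counts for illustration
for line in wig_lines(genome, template_counts):
    print(line)
```

Listing sites with zero counts matters because downstream analysis needs to know which TA sites had no insertions, not just which ones did.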
The GUI is set up basically as a graphical front-end that allows you to specify input files and parameters to get a job started. Once you press START, the graphical window goes away, and the pre-processing begins, printing out status messages in the original terminal window. You can also run TPP directly from the command-line without the GUI, by providing all the inputs via command-line arguments.
TPP has a few optional parameters in the interface. We intend to add other options in the future, so if you have suggestions, let us know. In particular, if you have datasets that require special processing (for example, if different primer sequences were used for PCR amplification, or a different barcoding system, or different contaminant sequences to search for), we might be able to add some options to deal with this.
Requirements:
python PATH/src/tpp.py

where PATH is the path to the TRANSIT installation directory. This should pop up the GUI window, looking like this...
Note: TPP can process paired-end reads as well as single-end datasets (just leave the filename for read2 blank).
The main fields to fill out in the GUI are...
TPP uses a local config file called "tpp.cfg" to remember parameter settings from run to run. This is convenient because you don't have to type in things like the path to the BWA executable or reference genome over and over again. You just have to do it once, and TPP will remember.
Command-line mode: TPP may be run on a dataset directly from the command-line without invoking the user interface (GUI) by providing it filenames and parameters as command-line arguments.
For a list of possible command line arguments, type: python tpp.py -help

usage: python PATH/src/tpp.py -bwa PATH_TO_EXECUTABLE -ref REF_SEQ -reads1 PATH_TO_FASTQ_OR_FASTA_FILE [-reads2 PATH_TO_FASTQ_OR_FASTA_FILE] -prefix OUTPUT_BASE_FILENAME [-maxreads N]

The input arguments and file types are as follows:
-bwa | path to the BWA executable | |
-ref | reference genome sequence | FASTA file |
-reads1 | file of read 1 of paired reads | FASTA or FASTQ format (or gzipped) |
-reads2 | file of read 2 of paired reads (optional for single-end reads) | FASTA or FASTQ format (or gzipped) |
-prefix | base filename to use for output files | |
-maxreads | subset of reads to process (optional); if blank, use all | |
-mismatches | number of mismatches to allow when searching reads for sequence patterns | |
(Note: if you have already run TPP once, then you can leave out the specification of the path for BWA, and it will automatically take the path stored in the config file, tpp.cfg. Same for ref, if you always use the same reference sequence.)
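If you want to launch TPP from a script (for example, to process several datasets in a row), the command line can be assembled as in the sketch below. The flags are the ones listed in the table above; the helper name `build_tpp_command` and all file names are placeholders chosen for illustration. The sketch only builds and prints the command and does not run it.

```python
def build_tpp_command(tpp_script, reads1, prefix, bwa=None, ref=None,
                      reads2=None, maxreads=None, mismatches=None):
    """Assemble a TPP command line as a list of arguments.

    Arguments left as None are omitted (e.g. -bwa and -ref can be left
    out if TPP already has them stored in tpp.cfg from a previous run).
    """
    cmd = ["python", tpp_script]
    flags = [("-bwa", bwa), ("-ref", ref), ("-reads1", reads1),
             ("-reads2", reads2), ("-prefix", prefix),
             ("-maxreads", maxreads), ("-mismatches", mismatches)]
    for flag, value in flags:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd

# Placeholder paths; substitute your own installation and data files.
cmd = build_tpp_command("PATH/src/tpp.py",
                        reads1="sample_R1.fastq",
                        reads2="sample_R2.fastq",
                        prefix="sample",
                        maxreads=100000)
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list (rather than one long string) avoids shell-quoting problems when file paths contain spaces.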
Statistics
Here is an explanation of the statistics that are saved in the .tn_stats file and displayed in the table in the GUI. For convenience, all the statistics are also written out on one tab-separated line at the end of the .tn_stats file, to make it easy to add them as a row in a spreadsheet, as some people like to do to track multiple datasets.
total_reads | total number of reads in the original .fastq/.fasta files |
truncated_reads | reads representing DNA fragments shorter than the read length; adapter appears at end of read 1 and is stripped for mapping |
TGTTA_reads | number of reads with a proper transposon prefix (ending in TGTTA in read1) |
reads1_mapped | number of R1 mapped into genome (independent of R2) |
reads2_mapped | number of R2 mapped into genome (independent of R1) |
mapped_reads | number of reads which mapped into the genome (requiring both read1 and read2 to map) |
read_count | total reads mapping to TA sites (mapped reads excluding those mapping to non-TA sites) |
template_count | reduction of mapped reads to unique templates using barcodes |
template_ratio | read_count / template_count |
TA_sites | total number of TA dinucleotides in the genome |
TAs_hit | number of TA sites with at least 1 insertion |
insertion_density | TAs_hit / TA_sites |
max_count | the maximum number of templates observed at any TA site |
max_site | the coordinate of the site where the max count occurs |
NZ_mean | mean template count over non-zero TA sites |
FR_corr | correlation between template counts on Fwd strand versus Rev strand |
BC_corr | correlation between read counts and template counts over non-zero sites |
primer_matches | how many reads match the primer sequence (primer-dimer problem in sample prep) |
vector_matches | how many reads match the phage sequence (transposon vector) used in Tn mutant library construction |
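Several of the statistics above are simple functions of the per-site counts. The sketch below shows how they relate, following the definitions in the table; the function name `summarize`, the dictionary-based input layout, and the toy numbers are assumptions for illustration, not TPP's internal code.

```python
def summarize(sites, read_counts, template_counts):
    """Compute summary statistics from per-site counts.

    sites: coordinates of all TA sites in the genome.
    read_counts, template_counts: dicts mapping site -> count
    (sites missing from a dict are treated as having count 0).
    """
    hit = [s for s in sites if template_counts.get(s, 0) > 0]
    total_reads = sum(read_counts.get(s, 0) for s in sites)
    total_templates = sum(template_counts.get(s, 0) for s in sites)
    max_site = max(sites, key=lambda s: template_counts.get(s, 0))
    return {
        "read_count": total_reads,
        "template_count": total_templates,
        "template_ratio": total_reads / total_templates if total_templates else 0.0,
        "TA_sites": len(sites),
        "TAs_hit": len(hit),
        "insertion_density": len(hit) / len(sites),
        "max_count": template_counts.get(max_site, 0),
        "max_site": max_site,
        "NZ_mean": total_templates / len(hit) if hit else 0.0,
    }

# Toy data: 4 TA sites, 2 of which have insertions.
sites = [3, 7, 9, 15]
read_counts = {3: 9, 9: 3}
template_counts = {3: 5, 9: 1}
stats = summarize(sites, read_counts, template_counts)
print(stats["template_ratio"], stats["insertion_density"], stats["NZ_mean"])
# 2.0 0.5 3.0
```

As a check against real numbers, a read_count of 400069 and a template_count of 211128 give a template_ratio of about 1.89, and 21072 TAs hit out of 74605 TA sites give an insertion density of about 0.282.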
Here is an example of a .tn_stats file:
# title: Tn-Seq Pre-Processor
# date: 02/18/2015 09:36:04
# command: python /pacific/home/ioerger/transit/tpp.py
# read1: /pacific/HomeFrozen/ioerger/Run58/analysis/Tn-glyceryl-trioleate-2_1.fastq
# read2: /pacific/HomeFrozen/ioerger/Run58/analysis/Tn-glyceryl-trioleate-2_3.fastq
# ref_genome: /pacific/home/ioerger/transit/genomes/H37Rv.fna
# total_reads 1301968 (read pairs)
# truncated_reads 26000 (fragments shorter than the read length; ADAP2 appears in read1)
# TGTTA_reads 1090333 (reads with valid Tn prefix, and insert size>20bp)
# reads1_mapped 1016860
# reads2_mapped 427481
# mapped_reads 413251 (both R1 and R2 map into genome)
# read_count 400069 (TA sites only)
# template_count 211128
# template_ratio 1.89 (reads per template)
# TA_sites 74605
# TAs_hit 21072
# density 0.282
# max_count 2306 (among templates)
# max_site 212278 (coordinate)
# NZ_mean 10.0 (among templates)
# FR_corr 0.917 (Fwd templates vs. Rev templates)
# BC_corr 0.965 (reads vs. templates, summed over both strands)
# primer_matches: 78 reads contain CTAGAGGGCCCAATTCGCCCTATAGTGAGT
# vector_matches: 2 reads contain CTAGACCGTCCAGTCTGGCAGGCCGGAAAC
/pacific/HomeFrozen/ioerger/Run58/analysis/Tn-glyceryl-trioleate-2_1.fastq	/pacific/HomeFrozen/ioerger/Run58/analysis/Tn-glyceryl-trioleate-2_3.fastq	1301968	1090333	1016860	427481	413251	1016860	427481	400069	211128	1.89491209124	74605	21072	2306	212278	10.0193621868	0.917104229568	0.96542310842	78	2

Interpretation: To assess the quality of a dataset, I would recommend starting by looking at 3 primary statistics: