dcurtis

Getting Started with Tangram

Tangram is a C/C++ command-line toolbox for structural variation (SV) detection based on MOSAIK alignments. It takes advantage of both read-pair (RP) and split-read (SR) algorithms to improve detection efficiency, specificity and breakpoint resolution. To reduce the memory footprint and increase throughput, Tangram was designed to call MEI events chromosome by chromosome, or at even finer granularity. To further speed up the program, multi-threading was introduced at the bottleneck step of the software, split-read mapping. Powered by the BamTools API, Tangram can call SV events on multiple BAM files (population-scale data) simultaneously without any merging process. Currently, Tangram focuses on MEI detection and is widely applied to data from the 1000 Genomes Project.

Tangram is now available at: https://github.com/jiantao/Tangram. It runs on either Linux or Mac systems. To get a copy of Tangram, simply open the “Terminal” program and type the following command (git is required):

git clone git://github.com/jiantao/Tangram.git

After downloading the source files, compile Tangram with the following commands (gcc and g++ 4.0 or above are required):

cd Tangram

cd src

make

The executable binary files can be found in the “bin” directory under the root of the Tangram download directory. In total, there should be five binary files:

tangram_index

tangram_scan

tangram_merge

tangram_detect

tangram_filter

In this blog post, we will introduce the preprocessing part, which involves the first three programs. The function of “tangram_index” is to index the reference file (in FASTA format) and convert it into a special format that will be used in the detection step (tangram_detect). For MEI detection, there should be two input FASTA files: 1) the normal human genome reference file, which can be downloaded from the 1000 Genomes Project ftp site (ftp://ftp.1000genomes.ebi.ac.uk); 2) the mobile element sequence file, which can be downloaded from RepBase (http://www.girinst.org/repbase/). Under the “data” directory, we have already included a sample MEI reference file (moblist_19Feb2010_sequence_length60.fa) that contains 4 Alu, 17 L1, 1 SVA and 1 HERV sequences. These sequences should be sufficient for common use. To index the reference files, use the following command:

tangram_index -ref $normal_ref_file -sp $mei_ref_file -out $output

Besides indexing the reference file, there is another important preprocessing step in Tangram: calculating the fragment length distribution. The fragment length distribution plays an important role in SV detection with paired-end sequencing data. For better accuracy, Tangram calculates the empirical fragment length distribution for every read group in the BAM files. This function is implemented through tangram_scan and tangram_merge. The required input, “-in”, of tangram_scan is a text file that contains a list of BAM files. The content of this text file should look like:

/path/to/the_first.bam

/path/to/the_second.bam

/path/to/the_third.bam
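Conceptually, the scan step groups fragment lengths by read group and accumulates a histogram for each one. A minimal Python sketch of that idea (using synthetic data; the real tangram_scan extracts these values from the BAM records themselves, and the function name here is ours):

```python
from collections import Counter, defaultdict

def build_histograms(pairs):
    """Accumulate an empirical fragment-length histogram per read group.

    `pairs` is an iterable of (read_group, fragment_length) tuples,
    standing in for the read pairs tangram_scan would pull from the BAMs.
    """
    hists = defaultdict(Counter)
    for read_group, frag_len in pairs:
        hists[read_group][frag_len] += 1
    return hists

# Synthetic read pairs from two read groups.
pairs = [("rg1", 300), ("rg1", 310), ("rg1", 300), ("rg2", 500)]
hists = build_histograms(pairs)
```

Keeping one histogram per read group matters because different libraries can have very different insert sizes, so a single pooled distribution would blur the cutoffs.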

The output of “tangram_scan” is a path to a directory. This directory must be either empty or non-existent (tangram_scan will create it for you). Two data files should be found in this directory after running: “lib_table.dat” and “hist.dat”. These two files will serve as input for tangram_detect.

Three optional arguments can be set for tangram_scan. “-cf” controls the percentage cutoff at both tails of the fragment length distribution, in order to distinguish read pairs with normal fragment lengths from those with abnormal fragment lengths. For example, if “-cf” is set to 0.02, then the 1st percentile (0.02/2) and the 99th percentile (1 - 0.02/2) of the fragment length distribution become the two boundaries separating normal and abnormal read pairs. Any read pair with a fragment length between the 1st and 99th percentiles will be treated as a normal read pair; otherwise it will be treated as either a short pair (< 1st percentile) or a long pair (> 99th percentile). Setting a higher cutoff will result in more SV candidates and better sensitivity; however, the FDR will also increase accordingly. The “-tr” option controls the trim rate applied before setting the fragment length cutoff. If “-tr” is set to 0.01, then 1% of the data at each tail of the fragment length distribution will be removed before the boundaries are set. “-mq” controls the minimum mapping quality of a normal read pair. Only read pairs with mapping quality higher than this threshold will be used to build the fragment length distribution.
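The interplay of “-tr” and “-cf” can be illustrated with a short Python sketch. This is a toy model of the cutoff logic described above, not Tangram’s actual implementation, and the function names are ours; any “-mq” filtering would happen before the lengths reach this point:

```python
def fragment_length_boundaries(frag_lens, tr=0.01, cf=0.02):
    """Return (low, high) fragment-length cutoffs.

    First trim a fraction `tr` of the data at each tail (the "-tr" trim
    rate), then take the cf/2 and 1 - cf/2 quantiles of what remains
    (the "-cf" cutoff) as the boundaries between normal and abnormal
    read pairs.
    """
    xs = sorted(frag_lens)
    n_trim = int(len(xs) * tr)
    if n_trim:
        xs = xs[n_trim:-n_trim]
    low = xs[int(len(xs) * (cf / 2))]
    high = xs[min(int(len(xs) * (1 - cf / 2)), len(xs) - 1)]
    return low, high

def classify(frag_len, low, high):
    """Label a read pair by its fragment length."""
    if frag_len < low:
        return "short"
    if frag_len > high:
        return "long"
    return "normal"

# A toy uniform distribution of fragment lengths from 1 to 1000.
low, high = fragment_length_boundaries(list(range(1, 1001)))
```

With the defaults above, 10 values are first trimmed from each tail, and the boundaries then fall near the 1st and 99th percentiles of the remaining data, so pairs far into either tail are flagged as short or long.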

When you have many BAM files and a cluster that can run multiple jobs at the same time, it is better to run “tangram_scan” on each BAM file separately, in order to increase throughput and save time. This will generate multiple “lib_table.dat” and “hist.dat” files. However, at the detection step we want to use all of these BAM files at the same time, and this requires combining all of the fragment length distribution files into one. “tangram_merge” can be used to accomplish this job. The required input of this program, “-dir”, is the path to the directory that contains all of the fragment length distribution files. For example, after running “tangram_scan” for three BAM files, you get the following directories, each containing fragment length distribution files:

/home/first_bam

/home/second_bam

/home/third_bam

To combine these three directories, use the following command:

tangram_merge -dir /home

After running, you will find an additional directory under “/home”, called “merged”. You can find the merged fragment length distribution files in that directory.
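Conceptually, the merge just pools the per-read-group counts from each scan output; since every read group appears in only one BAM file, the histograms can be combined without conflict. A toy sketch of that pooling (representing histograms as Python Counters rather than Tangram’s actual .dat format; the function name is ours):

```python
from collections import Counter

def merge_histograms(scans):
    """Pool per-read-group fragment-length histograms from several scans.

    `scans` is a list of dicts mapping read group -> Counter of fragment
    lengths, one dict per tangram_scan output directory.
    """
    merged = {}
    for scan in scans:
        for read_group, hist in scan.items():
            merged.setdefault(read_group, Counter()).update(hist)
    return merged

# Two scan outputs sharing one read group.
scan_a = {"rg1": Counter({300: 2, 310: 1})}
scan_b = {"rg1": Counter({300: 1}), "rg2": Counter({500: 4})}
merged = merge_histograms([scan_a, scan_b])
```

The merged result is a single distribution table covering every read group, which is what tangram_detect expects as input.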
