ABSTRACT
Eukaryotic genomes contain large intronic and intergenic regions in which repetitive sequences are abundant. These repetitive sequences make it difficult to assign short reads generated by next generation sequencing to genomic positions, so they are often excluded from analysis, at the cost of valuable genomic information. Here we present TRA (Tandem Repeat Assembler), a method that assembles repetitive sequences by constructing contigs directly from paired-end reads. Using an experimentally acquired data set for human chromosome 14, tandem repeats longer than 200 bp were assembled. Alignment of the contigs to the human genome reference (GRCh38) showed that 84.3% of tandem repetitive regions were correctly covered. For tandem repeats, this method outperformed state-of-the-art assemblers, producing contigs with a correct N50 of up to 512 bp.
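The N50 figure quoted above is a standard summary of contig lengths: the length L such that contigs of length L or longer account for at least half of the total assembled bases. The sketch below computes the plain N50 from a list of contig lengths. It is an illustration only, not TRA's evaluation code; the function name `n50` and the example lengths are hypothetical, and the "correct" N50 reported in the abstract presumably also accounts for alignment correctness against the reference, which this sketch does not model.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    N50 is the largest length L such that contigs of length >= L
    together contain at least half of all assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")

    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example_lengths))
```

Sorting in descending order and stopping at the first contig that pushes the running total past half of the assembly keeps the computation to a single pass after the sort.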
- TRA: tandem repeat assembler for next generation sequences