Abstract: The massive volumes of data generated by the rapid development of RNA-seq technology pose serious challenges to the efficiency of existing read mapping algorithms. Three contributions are proposed: a MapReduce-based spaced seed indexing algorithm that does not consider splice sites (PSeqMap), a spaced seed indexing algorithm that considers splice sites (PJuncSeqMap), and a load-balancing solution. The MapReduce framework is employed to parallelize the spaced seed indexing algorithms. Experimental results on Arabidopsis gene datasets show that the proposed algorithms make full use of the storage and computing power of the cluster and process massive genetic data efficiently.