Page 53 - Fister jr., Iztok, and Andrej Brodnik (eds.). StuCoSReC. Proceedings of the 2015 2nd Student Computer Science Research Conference. Koper: University of Primorska Press, 2015
P. 53
cessing of next-generation sequencing (NGS) data with
MapReduce and HADOOP cluster

∗ †

Nándor Póka Miklós Krész, PhD.

Applications of Informatics Department Applications of Informatics Department
University of Szeged University of Szeged
Faculty of Education Faculty of Education
Szeged, Hungary Szeged, Hungary

nandor.poka@gmail.com kresz@jgypk.u-szeged.hu

ABSTRACT Categories and Subject Descriptors

Modern biological and medical research is heavily centered J.3 [Life and Medical Sciences]: Biology and Genetics—
around molecular biology and genomics, that is a discipline Genomics and Bioinformatics; D.1.3 [Programming tech-
in genetics that applies recombinant DNA, DNA sequenc- niques]: Concurrent Programming—MapReduce, HADOOP ;
ing methods, and bioinformatics to sequence, assemble, and D.1.1 [Programming techniques]: Applicative (Functional)
analyze the function and structure of the complete genetic Programming
material of an organism. See references [1] and [2] for more
information. Researchers came to the point where they have General Terms
to deal with several (or even hundreds of) gigabytes of data.
The rapid growth of cloud computing and local clustering Algorithm design
of common machines present great opportunity, thus high-
performance, distributed computing is no longer the privi- Keywords
lege of large high-tech companies. Since manually orches-
trating parallel computing is circumstantial, therefore some Bioinformatics, MapReduce, HADOOP
sort of framework is used to handle parallelization and data
management. One of such frameworks is MapReduce and 1. INTRODUCTION TO BIOLOGICAL CON-
its open source implementation Hadoop. De-novo assembly CEPTS
is a data intensive problem that can benefit from the use of
clusters and distributed computing. Current solutions like The DNA - deoxyribonucleic acid - is the genetic material of
Velvet assembler (ref [3]), are multi-threaded at best. Our all living organisms. It is made of four different nucleobases,
aim was to create an algorithm for de novo genome assem- namely: adenine (A), cytosine (C), guanine (G) and thymine
bly, using MapReduce and Hadoop. Similar has been done (T). These nucleobases are able to form chemical bonds with
by Yu-Jung Chang et al. (ref [4]), their key idea is called each other by the following strict rules 1:
aˆA˘Y¨edge alignmentaˆA˘ Z´, but their solution can only handle
a narrow, predefined set of data. In this work we present a • Pairing only occurs between purine (A and G) - pyrim-
de Bruijn graph based algorithm that is able to completely idine (C and T) pairs (no intragroup pairing)
reconstruct an artificially generated random sequence and
assemble large segments of real life test sequence. In the fol- • Adenine can only pair with thymine
lowing chapters we will overview key biology, bioinformatics
and informatics concepts necessary for understanding this • Cytosine can only pair with guanine
work, followed by the compact overview algorithm.
Also, owing to the chemical properties of the sugar-phosphate
∗Is also PhD candidate at Department of Medical Biology, backbone of the nucleobases, DNA can form virtually infi-
Faculty of Medicine, University of Szeged, author carried nite linear chains. These long linear DNA molecules are
out algorithm design, development and testing called DNA strands. The DNA strands have directionality -
†Supervisor and Head of Department, to whom correspon- the 5’ (five-prime) end is where the phosphate group is free
dence should be sent and the 3’ (three-prime) end is where the hydroxyl-group is
free on the backbone. Naturally DNA forms a double-helix
structure, in which two strands of DNA are paired up by
their bases - the two strands of a double-helix are fully com-
plementary to each other. By convention, a single strand of
DNA is strictly written in 5’ to 3’ direction. The size / length
of a given DNA is measured in bases or base-pairs with stan-
dard prefixes for indicating thousand (kilo-), million (mega-
), and billion (giga-) - kilobases, megabases, gigabases. For
further details see references [5], [6], [7]. Biological functions

1This is generally referred as base pairing

StuCoSReC Proceedings of the 2015 2nd Student Computer Science Research Conference 53
Ljubljana, Slovenia, 6 October
   48   49   50   51   52   53   54   55   56   57   58