2. Assembly

Once these paired reads are obtained, researchers want to assemble them into one long sequence of letters, as in the original DNA strand. This is done incrementally, by finding overlapping regions between reads and joining them, as shown in the diagram below.

Figure 10. Incremental fragment assembly. By finding overlapping regions and piecing them together one by one, researchers obtain one long nucleotide sequence.
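The incremental idea can be illustrated with a minimal sketch. It assumes error-free reads from a single strand and uses a simple greedy rule (always merge the pair with the longest suffix-prefix overlap); the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the toy reads are hypothetical and chosen only for illustration. This is not how a production assembler works.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` letters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the two reads with the longest overlap.

    Stops when one sequence remains or no overlaps are left.
    """
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge
        # Join the two reads, writing the shared region only once.
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads


# Toy reads (made up for illustration) that tile a 12-letter sequence.
contigs = greedy_assemble(["ATGCGT", "CGTACC", "ACCTTG"])
print(contigs)
```

With these toy reads, the first merge joins "ATGCGT" and "CGTACC" through the shared "CGT", and the second attaches "ACCTTG" through the shared "ACC".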

However, this strategy has a problem: DNA contains many repeats, so reads might overlap even though they do not come from the same part of the genome. In fact, about 50% of the human genome consists of repeats, and some regions might appear as many as a million times. The problem diagrammed below is therefore quite common.

Figure 11. Let A, B, C, D, R be different sequence fragments. Since R is a repeated region which occurs twice, during incremental assembly, there would be uncertainty because both Rs would match perfectly. In a simplified view, if the reads are AR, RB, CR and RD, it is impossible to know which Rs to piece together: whether the two initial fragments are ARB and CRD, or ARD and CRB.
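The ambiguity can be checked directly with a small sketch. The letter strings assigned to A, B, C, D and R are made up for illustration, and the helper names `suffix_prefix_overlap` and `explains` are hypothetical. The sketch shows that each read ending in R overlaps equally well with each read starting with R, and that both candidate reconstructions contain all four reads.

```python
# Made-up fragment contents; only the structure (a shared repeat R) matters.
A, B, C, D = "AACCA", "GGTTG", "CTCTC", "TGTGA"
R = "ACGTTGCA"

reads = [A + R, R + B, C + R, R + D]


def suffix_prefix_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def explains(candidates, reads) -> bool:
    """True if every read appears inside at least one candidate sequence."""
    return all(any(r in seq for seq in candidates) for r in reads)


# Both reads ending in R overlap both reads starting with R by exactly len(R).
overlaps = {
    (left, right): suffix_prefix_overlap(left, right)
    for left in (A + R, C + R)
    for right in (R + B, R + D)
}

option_1 = [A + R + B, C + R + D]  # ARB and CRD
option_2 = [A + R + D, C + R + B]  # ARD and CRB

print(overlaps)
print(explains(option_1, reads), explains(option_2, reads))
```

Since all four overlaps have the same length and both options account for every read, the overlap information alone cannot distinguish ARB and CRD from ARD and CRB.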

However, because the reads are actually paired (they are obtained from the two ends of each fragment), this problem becomes easier.

