Figure 12. The problem from Figure 11 is solved using paired reads. Since AB and CD are paired, it now becomes clear that the two fragments are ARB and CRD.
Paired reads help researchers avoid the ambiguity from Figure 11, which makes them crucial in genome assembly. Assembling an entire genome usually takes 10-15 algorithms executed in a row, so it remains a very difficult process even after all the data has been collected. Since data can be collected in different forms, there are many different assembly algorithms. None of this would be possible without computer science, which provides the means to combine the small reads into one coherent genome sequence.
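The way pairing removes the ambiguity can be illustrated with a toy sketch. It is not an actual assembly algorithm: the function name consistent_assemblies and the representation of flanks and the repeat R as single letters are assumptions made purely for illustration. The sketch enumerates every way of joining the left flanks to the right flanks through the shared repeat and keeps only the joinings in which each paired read ends up on one fragment.

```python
from itertools import permutations

def consistent_assemblies(left_flanks, right_flanks, repeat, mate_pairs):
    """Join each left flank to a right flank through a shared repeat.

    Returns every joining in which all mate pairs (left, right) land on
    the same reconstructed fragment. Hypothetical helper for illustration.
    """
    results = []
    for perm in permutations(right_flanks):
        pairing = list(zip(left_flanks, perm))
        # A joining is acceptable only if it respects every known pair.
        if all(pair in pairing for pair in mate_pairs):
            results.append([left + repeat + right for left, right in pairing])
    return results

# Without pairing information, both reconstructions are possible.
unpaired = consistent_assemblies(["A", "C"], ["B", "D"], "R", [])
print(unpaired)  # [['ARB', 'CRD'], ['ARD', 'CRB']]

# Knowing that A pairs with B and C pairs with D leaves one answer.
paired = consistent_assemblies(["A", "C"], ["B", "D"], "R", [("A", "B"), ("C", "D")])
print(paired)  # [['ARB', 'CRD']]
```

Brute-force enumeration is only workable for an example this small; it is used here because it makes the effect of the pairing constraint easy to see.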
It is interesting to note that the limit of 700 nucleotides that can be read from the ends of the fragments keeps increasing: currently, it is about 1,000 nucleotides. There is also a push to develop cheaper technologies, with the ultimate goal of being able to sequence a human genome for about $1,000. At that price, the technology would be readily available to the general public, and the information obtained could be used, for example, for personalized medicine.
3. Complete genomes today
About 300 genomes of animals, fungi, plants, and other organisms have been completed to date. However, the technology is very expensive. Obtaining the mouse genome cost $300 million, the human genome cost more than $1 billion, and each additional mammal currently costs about $13 million. Still, roughly 20 mammals are currently being sequenced. Since it is so expensive, the natural question is: why do it?
There are two answers. The first one is scientific. We are still sequencing new organisms not so much to understand each species in isolation, but rather to study evolution and, by comparing different species, to better understand the biology of each one.
The second answer is simply practical. After the Human Genome Project was completed, many sequencing centers were left with nothing more to do. Rather than shutting those centers down, it was easier to keep sequencing other genomes, now that the technology had been developed and the workers had been trained.
B. Gene Finding
Finding genes in an entire genome is like finding a needle in a haystack. In humans, there are 22,000 genes, comprising roughly 1.5% of the genome. In other words, only about 1.5% of the DNA is eventually translated into proteins, and, even more surprisingly, roughly 95% of the human genome is actually “junk” (random nucleotide mutations that occur there have no impact at all on the organism).
However, genes are composed of subsequences with well-defined boundaries at the sequence level, which makes it theoretically possible to find them using well-designed algorithms.
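As a minimal sketch of what "well-defined boundaries" can mean at the sequence level, the example below scans for open reading frames: stretches that begin with the start codon ATG and end at the first in-frame stop codon (TAA, TAG, or TGA). This is only the simplest such signal, not a real gene finder; it ignores the reverse strand and splicing. The function name find_orfs and the min_codons parameter are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop stretch.

    Only the forward strand is scanned, in all three reading frames.
    min_codons counts the start and stop codons as well. The indices
    are 0-based, and end is exclusive. Hypothetical helper.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, dna[start:end]))
                start = None  # look for the next start codon
    return orfs

print(find_orfs("CCATGAAATAGGG"))  # [(2, 11, 'ATGAAATAG')]
```

In this toy sequence the only open reading frame lies in the third reading frame, starting at position 2. Real gene finders combine many more signals, but the principle of searching for recognizable boundary patterns is the same.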