The myth of the complete genome is not widely appreciated, even among active observers of genomic technologies. (By 'active observer' I mean someone with any degree of background in the biological sciences; the term is in no way an aspersion.) The 'first draft' of the human genome was announced at a Clinton-era press conference on June 26, 2000, as an agreement between two famously competitive individuals, Francis Collins and Craig Venter, representing the public (NIH and DOE) effort and the private one (Celera). This first draft was exactly that – about 90% complete – and the 'completed' version was declared in 2003. This is not to discount the seminal first publications of the draft: when those papers appeared in 2001 (in Nature and Science respectively), the largest genome previously sequenced was 1/25th the size. In other words, the human genome represented a 25-fold leap in size and complexity over anything done to date.
But what is considered complete? There are regions that are notoriously difficult to sequence, being both highly repetitive and very long, exceeding the limits of the technology. Some of these regions are found on every chromosome: the centromeres, remember, organize each set of sister chromatids, stretch to megabases in size, and consist of a sequence of about 170 bases repeated over and over. Other difficult regions include duplicated sequence (pseudogenes, copy-number variation of genes, gene families) and inversions (where one section is 'flipped' into the opposite orientation).
Thus it falls to the computational biologist (also known by the somewhat awkward term bioinformaticist) to wrestle with these problems. It is also an opportunity for the life-science vendors to produce better sequencing technology, such as very-long-insert approaches like a usable mate-pair sequencing strategy (inserts of 1kb, 10kb or more; I have heard that Complete Genomics will soon be publishing a Long Fragment Read technology with 40kb inserts, and making it available as a service offering in early 2013). And if single-molecule sequencing could live up to its long-awaited promise, it would be a shortcut through many of these difficulties.
I once asked a Hopkins researcher what their ideal read length would be. The answer? 40kb. Since Ion Torrent is a recapitulation of sequencing-by-synthesis (detecting hydrogen-ion release rather than pyrophosphate release), it stands to reason that 700bp read lengths will eventually be achieved, as the Roche / 454 FLX+ technology can achieve today. But 700 bases is a far cry from 40kb.
Given where we are with next-generation sequencing technology, researchers can easily obtain (okay, 'easily' is a qualified term) a set of paired-end, 2x100bp reads with a 300bp gap between the reads (a common scenario for a HiSeq 2000 run). But there will be errors, as no sequencing technology is perfect: for Illumina, Roche / 454 and SOLiD / 5500 the native error rate is on the order of 1%, while for PacBio it is on the order of 15%. These errors affect how mappable an individual read is against the reference genome, and limit the kind of discovery that can take place.
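A back-of-the-envelope sketch can make the error-rate comparison concrete. Assuming errors are independent per base (a simplification; real errors cluster and true variants add further mismatches), a binomial model shows what fraction of 100bp reads would exceed a typical four-mismatch alignment cutoff at each platform's error rate:

```python
from math import comb

def frac_reads_over_cutoff(read_len, err_rate, max_mm):
    """Fraction of reads with more than max_mm error-induced mismatches,
    assuming independent per-base errors (a simple binomial model)."""
    p_within = sum(comb(read_len, k) * err_rate**k * (1 - err_rate)**(read_len - k)
                   for k in range(max_mm + 1))
    return 1 - p_within

# ~1% error rate (Illumina-like): about 1 expected error per 100bp read,
# so only a small fraction of reads blow past a 4-mismatch cutoff.
print(frac_reads_over_cutoff(100, 0.01, 4))

# ~15% error rate (PacBio-like raw reads): ~15 expected errors per read,
# so essentially every read exceeds the cutoff without error-tolerant tools.
print(frac_reads_over_cutoff(100, 0.15, 4))
```

The function name and the binomial simplification are mine, not from any particular aligner; the point is only that a 15x difference in error rate is not a 15x difference in usable reads under a fixed mismatch budget, but closer to all-or-nothing.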
To elaborate: say there is a 100bp read with four mismatches, and the alignment algorithm in use has a parameter allowing up to four mismatches along any given read. One of the four mismatches is a sequencing error (the 1% error rate at work), and two are true single-nucleotide variants. The last one poses a problem: it is actually an insertion of a single base, which shifts every base after it out of register against the reference. Instead of one mismatch the aligner sees many, the read blows past the four-mismatch limit, and the algorithm throws out the entire read as 'un-aligned'.
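A toy example, with made-up sequences, of why a single inserted base is so destructive to a mismatch-counting (ungapped) comparison:

```python
def ungapped_mismatches(ref, read):
    """Count mismatches position-by-position, the way a purely
    mismatch-budget aligner sees a read (no indel awareness)."""
    return sum(1 for r, q in zip(ref, read) if r != q)

ref  = "ACGTACGTACGTACGTACGT"          # 20bp of reference
read = "ACGTACGT" + "G" + "ACGTACGTACG"  # same sequence with one 'G' inserted

# The single inserted base shifts every downstream base out of register,
# so the ungapped comparison reports a pile of mismatches, not one indel:
print(ungapped_mismatches(ref, read))  # → 12 mismatches from one insertion
```

A gap-aware (indel-tolerant) aligner would instead report one insertion and zero downstream mismatches, which is why the choice of tool and parameters decides whether this read survives.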
In this example a growing pile of unaligned reads accumulates, a mixture of low-quality sequence (the dross) and high-quality sequence of biological significance (the gold). It is up to the computational biologist to understand the characteristics of the platform and the kinds of biological variation that can throw off a given alignment tool, and to set parameters accordingly – to use the correct tool for the task in the correct way.
Now, that 100-base read could be in the opposite orientation, part of a complex inversion of a gene with great relevance to a disease. Without getting lucky (i.e. having a paired-end read not only span that inversion, but also an informatics person astutely detecting it by local de novo assembly of the unmatched read-pairs), this sequence 'gold' gets thrown out with the trash. You can imagine other kinds of rearrangements, copy-number variations, large deletions etc. that occur in human genomics but that the kind of sequencing used simply does not detect. For several years Francis Collins would include a cartoon of a drunk person looking for their keys under a lamppost, just because the light was better – an illustration that the limitations of our tools only allow us to see what they can show us. The same applies to the kind of sequencing performed (a paired-end dataset with a 300bp insert, vs. the much less frequently done 1kb or 10kb mate-pair dataset, vs. the rare 40kb fosmid sequencing and pooling as in this haplotype phasing work done recently by the Max Planck Institute): each allows one to see only certain kinds of variation. The same applies to informatics tools: one tool with a given set of parameters will reveal a certain kind of variation, while other tools with other parameters will be more appropriate for others. The tools are many, the parameters are many, the permutations of both are a very large number, and new bioinformatic tools are introduced in the journals all the time. Thus computational biologists have a heavy burden to bear: to keep up with the latest tool development, to understand the tools they use and the best way to use them, and to do the right kinds of experiments (and these are true experiments) in silico.
As a side note, this is why Galaxy (produced by the good folks at Penn State) is so popular. For those not familiar with it, Galaxy provides an easy way to string different pipelines together and run them over a given dataset with different parameters, with an easy-to-use front-end and both cloud and local implementations. I understand that both Partek and BioMatters are working on implementing similar functionality, which will only widen the popularity of this approach.
On top of the computational approach determining what a researcher will see, each sequencing platform has its own differences in bias and coverage. Bias refers to sequences that are relatively easier to sequence and thus get better coverage, versus others that are more difficult and get less coverage, and of course still others that get no coverage at all; each platform has its own flavor of bias. When I first started at Life Technologies in early 2010, Michael Rhodes, one of the veterans of the Human Genome Project, reviewed in a training session the 2009 paper that used SOLiD technology to sequence a Yoruba HapMap individual, and I asked how this sequence and its variant calls could be compared to the same HapMap sample sequenced several months earlier on the Illumina platform. It is the same sample; shouldn't the two platforms determine the same variants?
The short answer was that there are too many variables, too many differences, too many parameters to make any kind of comparison. This was recapitulated recently in a presentation shared with me by Gholson Lyon, of unpublished work producing ugly Venn diagrams of variant overlap between technologies; it isn't pretty to look at. (It also reminds me of the striking Genome Research paper by Maggie Cam, which produced similarly unimpressive Venn diagrams of microarray expression data back in 2003 and was a motivation for the MAQC project; a similar SEQC project, spearheaded by the same Leming Shi of the FDA, has not been getting much attention.)
Two other items come to mind. A researcher at Emory University told me in 2009 how the error profile at the beginning of the read on the Illumina platform made a large difference to the assumptions made by the alignment tool, so they wrote a new aligner that gave much better results (and a higher yield per run). This was a full two years after the launch of the platform, and it was not something incorporated into ELAND (Illumina's standard alignment tool) at the time. And so the vendors depend on the research community to work with the raw data and develop new tools, to the benefit of everyone.
And BGI published a paper showing that about 5 million bases of sequence are missing from the human reference. These researchers expect between 19 and 40 million bases to eventually appear, showing how much more work remains to be done regarding the 'complete' human genome.
Nota bene – If you are interested in reading further, the prominent genomics researcher Evan Eichler wrote this Nature Methods paper using short reads to do de novo assembly of a Han Chinese individual and a Yoruba individual with two different assembly methods, reporting 2,377 missing exons and 99.1% of validated copy-number variants missing from the assemblies. This is important for new organisms where no closely related reference exists, and illustrates the need for hybrid assembly methods; it is not an indictment of NGS as it is used in human genomics research.