This week a remarkable paper was published in Nature, called “Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells”. What makes it remarkable is the ability of this method to obtain rare variant phase information by changing the library preparation method. Until now, obtaining completely phased individual genomes has required a fair amount of laboratory manipulation.
For those not familiar with the importance of phase information, way back in 2002 the International HapMap project set out to catalog common variation. A key concept to understand is the idea of a haplotype block, which is a string of variants (common and rare) that reside together on the same copy of a chromosome. By defining these blocks of varying lengths throughout the genome, researchers could use the common population variants (SNPs) as markers for the entire block, carrying with them many more rare variants.
Using whole genome genotyping microarrays by Affymetrix and then Illumina, as well as high-density arrays by Perlegen, the HapMap project released datasets in three phases, in 2005, 2007 and 2009. With these maps genome-wide association studies could then take off in earnest.
But this applies to GWAS, which concerns populations and not individuals. Relating all of an individual’s variants to the haplotype map, that is, phasing an individual’s haplotypes, is a problem, and one that makes the computational biologist’s work difficult and complex. With billions of bases of data, perhaps a billion read-pairs, loads of computational resources needed, and a large and growing set of tools at their disposal (each tool having a large number of parameters that must be tuned to the goals in mind rather than working in a one-size-fits-all fashion), the bioinformatics challenge is a large one.
Attempts to obtain complete haplotype information from short-read sequencing have been made, with varying degrees of limitation and a lot of computational power. One solution would be to obtain as long a read as possible, so there is great interest in single-molecule approaches to simplify the task at hand. But as mentioned before, Pacific Biosciences has its challenges with accuracy and throughput (in addition to system robustness, cost per megabase, and ease of use) and Oxford Nanopore (and Genia) do not have any preliminary test data out for analysis or early access (as of mid-2012). So the informatics people do what they can with the datasets they have.
Another solution would be to get longer fixed-insert information from a given DNA library. Paired-end sequencing typically has an insert size of 300-500 bases. (Each read maps to the genome with a given insert size distance, plus or minus 10%.) Another approach, called mate-pair, increases the insert size to between 1kb and 10kb. For a mate-pair experiment, typically several micrograms (and it used to be many tens of micrograms) of high molecular weight DNA are fragmented into 5kb or 10kb lengths, enzymatically treated to modify those ends, diluted into a relatively large volume of buffer, and allowed to form loops of circular DNA. The point where each loop closes is the region of interest to be sequenced; through additional enzymatic manipulation, that junction is selected away from the rest of the 5kb or 10kb insert, and a library is made from that material.
Life Technologies SOLiD had an advantage with the mate-pair protocol, as the Illumina ‘Long Insert Paired End’ protocol had many difficulties when it first launched. (I understand that now it has been improved, but I do not know of anyone who has used the improved version recently. If anyone could comment on this that would be appreciated.) The problem with the Illumina protocol was the number of redundant mate-pairs: looking at the start points of each read, if they start at the exact same base, you know that that additional read-pair isn’t giving you any additional data. Customers have told me a 50% read-pair redundancy was common with the Illumina protocol. The Life Technologies one worked much better, with the number of redundant reads in the low single digits. And now with Ion Torrent the protocol has been improved yet again, with lower input requirement and an easier workflow. It isn’t one of the easier protocols to do, nor has it been automated, but it does perform well.
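The redundancy check described above, counting read-pairs whose start points coincide exactly, can be sketched in a few lines. This is only an illustration under simplifying assumptions: the function name and the (chromosome, start of read 1, start of read 2) tuple layout are hypothetical, and real duplicate-marking tools also take strand and mapping details into account.

```python
from collections import Counter


def redundancy_fraction(pairs):
    """Fraction of read-pairs that repeat the start coordinates of another pair.

    `pairs` is an iterable of (chrom, start_read1, start_read2) tuples.
    This tuple layout is a hypothetical simplification for illustration.
    """
    counts = Counter(pairs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    unique = len(counts)
    # Every copy beyond the first at a given pair of start points adds no new data.
    return (total - unique) / total


example_pairs = [
    ("chr1", 1000, 6000),
    ("chr1", 1000, 6000),  # same start points as the pair above: redundant
    ("chr1", 1500, 6400),
    ("chr2", 200, 5300),
]
print(redundancy_fraction(example_pairs))  # 0.25
```

On this measure, the 50% figure quoted above would mean half of the sequenced pairs repeat start points already seen.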
Yet even with 10kb inserts, phasing of all the rare variants is not possible. 40kb or 100kb inserts would be a major advance. (Thus the interest in Oxford Nanopore’s claim of 100kb reads.) The Complete Genomics approach is called ‘Long Fragment Read’, and while it is not stated explicitly, the authors imply an insert length of 100kb.
In the realm of clinical genomics, it is rare mutations (‘rare’ in the population genetics sense) that have large impact, particularly when it comes to a particular disease (such as cancer), and it is important to determine if these de novo mutations are in cis or trans relative to other genes and regulatory regions.
Four individual genomes have been published to date with complete haplotyping / phasing information, and they were produced by three different methods.
The first was Sanger sequencing (the traditional method) of Craig Venter’s genome, and the title of the paper reflects this: The diploid genome sequence of an individual human. This is not scalable (the estimated cost to sequence it via capillary electrophoresis is in the several million dollar range), but the J. Craig Venter Institute did a great service in obtaining a ‘gold-standard quality human genome’ as a reference sequence. Speaking with the folks at JCVI, they were justifiably proud of having determined a diploid sequence, and the title emphasizes this point.
The second method uses flow cytometry to sort individual chromosomes and separate the daughter strands, and then sequences just the haploid alleles. Until a few years ago (2007) there were only two facilities in the world that could sort human chromosomes, but since then more researchers have been able to do this kind of work and publish methods around it. As a method it is limiting, due to issues of scale, but a flow cytometer is a relatively common, albeit expensive, piece of equipment. (Expensive is relative; here, as a piece of sample preparation equipment, it is on the order of $150K-$200K.)
The third method goes back to fosmid generation, which is how the original Human Genome Project was done (and how the Venter genome was handled as well). Applying short-read sequencing, along with some clever pooling, resulted in a few recent publications (and I would note here, one using SOLiD sequencing technology).
Complete Genomics gets around these three methods and introduces a fourth, which is mapping and sequencing simultaneously. This was first proposed in 1989 and called HAPPY mapping (HAPloid DNA samples using the PolYmerase chain reaction), and updated in 2009 to propose its incorporation into NGS for simultaneous mapping and sequencing. (Credit to Keith Robison’s blog post on this topic for giving the needed background.) High molecular weight DNA (the fragment size is not mentioned in the paper, but is on the order of 100kb) is diluted into aliquots, and each aliquot is barcode-tagged and sequenced individually, so two alleles found in the same well can then be associated together. It is highly parallel linkage mapping.
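To see why dilution makes same-well association informative, here is a rough back-of-the-envelope sketch. It assumes each cell contributes one fragment per parental copy at a given locus, and that fragments land in wells independently and uniformly. The 384-well figure is an assumption made for illustration and is not taken from this post; the function names are hypothetical as well.

```python
import random


def mixed_well_probability(n_cells, n_wells):
    """Chance that a fragment shares its well with at least one fragment
    from the other parental copy covering the same locus."""
    return 1.0 - (1.0 - 1.0 / n_wells) ** n_cells


def simulate_mixed_wells(n_cells, n_wells, trials, seed=0):
    """Monte Carlo check of the formula above, using a seeded generator."""
    rng = random.Random(seed)
    mixed = 0
    for _ in range(trials):
        my_well = rng.randrange(n_wells)
        # Wells receiving the n_cells fragments from the other parental copy.
        other_wells = {rng.randrange(n_wells) for _ in range(n_cells)}
        if my_well in other_wells:
            mixed += 1
    return mixed / trials


N_CELLS = 10   # low end of the "10 to 20 human cells" in the paper title
N_WELLS = 384  # assumption for illustration: plate size is not given in this post
analytic = mixed_well_probability(N_CELLS, N_WELLS)
simulated = simulate_mixed_wells(N_CELLS, N_WELLS, trials=20000)
print(f"analytic: {analytic:.4f}, simulated: {simulated:.4f}")
```

Under these assumptions only a few percent of fragments share a well with the opposite parental copy, so reads carrying the same well barcode at nearby positions almost always come from one haplotype.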
One key technical advance is the use of an optimized Multiple Displacement Amplification (MDA) technology, which is a whole genome amplification (WGA) technology. Complete has solved a problem that has been limiting this technology’s use for low-input next-generation sequencing, which is preferential amplification and relatively low genome coverage. (For example, this publication used MDA in front of NGS and obtained 6% genomic coverage from sequencing single nuclei.) Others refer to this as the ‘pileup’ problem, where the preferential amplification of particular regions overwhelms the reaction.
There are companies that offer commercial WGA products: namely QIAGEN’s Repli-g, Sigma-Aldrich’s GenomePlex, and GE Healthcare’s GenomiPhi. Yet if you look at these products, they are marketed for SNP genotyping or CNV analysis from small quantities of cells, not for next generation sequencing. One other company is Rubicon Genomics, with their PicoPlex, and again they only show data for microarray work.
Thus this ability by Complete Genomics to get around these problems is significant. And an additional per-sample cost of $100 in reagents to do this entire protocol is also worth noting.
Through the advantages of their sample preparation and some sophisticated algorithms laid out in the paper, parental haplotypes can be assigned. These algorithms appear to be built on a heterozygote connectivity graph a few levels of abstraction deep. (One of the two co-first-authors, Bahram Kermani, is a talented mathematician and a friend whom I know from my Illumina days.)
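As a toy illustration of what a connectivity graph over heterozygous sites could look like (and explicitly not the algorithm from the paper), the sketch below scores each pair of SNPs by how often their alleles co-occur in the same well, then walks the graph to assign one haplotype. The data layout, the function name, and the assumption that every well shows a single allele per SNP are all hypothetical simplifications.

```python
from collections import defaultdict, deque
from itertools import combinations


def phase_by_well_cooccurrence(wells):
    """Toy phasing. `wells` is a list of dicts mapping SNP id -> allele
    (0 = reference, 1 = alternate) observed in that well.
    Returns, for each connected SNP, the allele on one of the two haplotypes."""
    # Edge weight: +1 when two SNPs show the same allele code in a well, -1 otherwise.
    weight = defaultdict(int)
    for observed in wells:
        for a, b in combinations(sorted(observed), 2):
            weight[(a, b)] += 1 if observed[a] == observed[b] else -1

    graph = defaultdict(list)
    for (a, b), w in weight.items():
        if w != 0:  # tied evidence carries no phase information
            graph[a].append((b, w))
            graph[b].append((a, w))

    haplotype = {}
    for start in sorted(graph):
        if start in haplotype:
            continue
        haplotype[start] = 0  # anchor each component on the reference allele
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, w in graph[u]:
                if v not in haplotype:
                    # Positive weight: same allele code; negative: opposite code.
                    haplotype[v] = haplotype[u] if w > 0 else 1 - haplotype[u]
                    queue.append(v)
    return haplotype


# Hypothetical wells drawn from haplotypes (s1=0, s2=1, s3=0) and (s1=1, s2=0, s3=1).
example_wells = [
    {"s1": 0, "s2": 1},
    {"s2": 1, "s3": 0},
    {"s1": 1, "s2": 0},
    {"s2": 0, "s3": 1},
    {"s1": 1, "s3": 1},
]
print(phase_by_well_cooccurrence(example_wells))  # {'s1': 0, 's2': 1, 's3': 0}
```

A breadth-first walk like this takes the first edge it meets and does not reconcile conflicting evidence, which is presumably where the additional levels of abstraction in a real implementation would come in.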
One other item is an advance in accuracy, which is notable at 99.99999%, achieved by using the phase information to eliminate errors introduced by the MDA enzymatic process. One thing that Life Technologies learned with marketing SOLiD was that the research market does value accuracy in NGS, but it is not the primary buying criterion, only one of several, and it is often trumped by cost, throughput/turnaround time, and ease of use.
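To put the 99.99999% figure in more familiar terms, the arithmetic below converts it to an error rate and to an expected number of errors per genome. The 3 billion base haploid genome size is a round-number assumption, and exact fractions are used to avoid floating-point noise at this many nines.

```python
from fractions import Fraction

accuracy = Fraction("99.99999") / 100   # accuracy figure quoted in the text
error_rate = 1 - accuracy               # errors per base
bases_per_error = 1 / error_rate

GENOME_SIZE = 3_000_000_000             # assumption: approximate haploid human genome
expected_errors = error_rate * GENOME_SIZE

print(f"one error per {int(bases_per_error):,} bases")      # one error per 10,000,000 bases
print(f"about {int(expected_errors)} errors per genome")     # about 300 errors per genome
```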
This technology already has two patents applied for (noted in the journal article, which is a somewhat puzzling thing to point out), and while it is not specific to Complete Genomics’ sequencing methodology, it is an attractive method to obtain phasing from small sample input. I wouldn’t be surprised to see a wave of papers using this method or a modification of it across several NGS platforms.