A few days ago I briefly reviewed the history of Helicos Biosciences (HCLS), a company that held out the promise of single molecule sequencing but failed to deliver to the next-generation sequencing market on several fronts. (These included accuracy and throughput per dollar; ease of use / reliability was yet another factor.)
But why is single molecule sequencing so attractive in the first place? What can it do that other technologies cannot?
First proposed in the late 1980’s, single molecule sequencing offers the ability to read long stretches of DNA by detecting the molecule directly, via electrical signals, as it passes through a protein pore. (The first publication described reading sequences via laser beam.) A lot of effort has been expended on this approach over many years. (I was told in 2005 that in the late 1990’s / early 2000’s Los Alamos National Laboratory was working on micro-cantilevers made via MEMS micromachining, and the problem was that the DNA passed through the pores so quickly that the cantilevers could not ‘feel’ the individual nucleotides. I don’t know whether this actually was the case, but on the molecular level I could easily imagine it to be true.)
Helicos had 24-70 base-pair read lengths due to some technical limitations of their chemistry, but the Pacific Biosciences RS instrument (PACB) (with their new C2 chemistry) now has an average read length of 2500 bases, with 100’s of reads on any given run that are 10’s of thousands of bases long. And at AGBT, Oxford Nanopore Technologies (ONT, privately held) claimed to have sequenced lambda phage (a 48kb genome) in a single read, with the average read length expected to be on the order of 100kb. You can say that Helicos was the first company to offer single-molecule sequencing, which is why they marketed their offering as tSMS™, ‘True Single Molecule Sequencing™’. PacBio would differentiate its own offering as Single Molecule sequencing in Real Time (SMRT). (If you haven’t seen my prior take on PacBio, you can find it here.)
These lengths offer enormous value for the genetics and genomics community. Going back to the array genotyping technology from about 9 years ago (starting with the first Affymetrix (AFFX) 10K SNP arrays in 2003, and a few years later the Illumina GoldenGate genotyping arrays), a given variant at a given position in the genome could be called accurately, but it was simply not known which chromosome that variant sat on relative to the next variant some distance away. In other words, say you have an A/T SNP at a certain location of the genome, and 200 bases away you have a G/T SNP; you do not know whether the paternal allele is A (200 base gap between SNPs) and then G, and the maternal allele is T (200 base gap) and then T, or some other combination. All you know is that there’s an A/T SNP in one location and 200 bases away there’s a G/T. (By the way, on average a variant appears in the genome about every 1000 bases; individual genomes have a reported 2.7 to 4.1M variants depending on the technique and coverage used, and dividing a 3.2Gb genome by those counts is where the 1/1000 number comes from.)
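To make the ambiguity concrete, here is a minimal Python sketch (my own illustration, not taken from any genotyping software) that lists every haplotype pair consistent with a set of unphased genotype calls, and redoes the variant-density arithmetic from the figures quoted above. The function names `possible_phasings` and `bases_per_variant` are hypothetical.

```python
from itertools import product


def possible_phasings(genotypes):
    """List the distinct haplotype pairs consistent with unphased genotypes.

    genotypes: list of (allele_1, allele_2) tuples, one per variant site.
    Returns a set of (haplotype, haplotype) tuples, sorted within each pair
    so that swapping the two chromosomes is not counted twice.
    """
    phasings = set()
    # At each site, pick which of the two alleles sits on the first haplotype.
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = "".join(g[c] for g, c in zip(genotypes, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        phasings.add(tuple(sorted((hap1, hap2))))
    return phasings


def bases_per_variant(genome_size, variant_count):
    """Average spacing between variants, in bases."""
    return genome_size / variant_count


# The A/T SNP and the G/T SNP 200 bases away, as an array would report them.
snps = [("A", "T"), ("G", "T")]
for pair in sorted(possible_phasings(snps)):
    print(pair)

# Variant density for a 3.2Gb genome at the low and high reported counts.
print(round(bases_per_variant(3.2e9, 2.7e6)))
print(round(bases_per_variant(3.2e9, 4.1e6)))
```

Two unphased heterozygous sites leave two candidate haplotype pairs, and each additional heterozygous site doubles that number, which is why genotype calls alone cannot settle the question.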
In next generation sequencing, this phasing problem doesn’t really get much better, because the reads are too short to span neighboring variants. One can use terribly complicated algorithms to sort it out, but they are simply not very good at it.
One great benefit of the HuRef-1 genome sequence (what the J. Craig Venter Institute “humbly” called Craig Venter’s genome in their 2007 paper, which used CE sequencing at a cost on the order of several million dollars) was that it was done using long Sanger reads. Thus they were able to obtain complete phasing information (that is, assigning Single Nucleotide Variants, or SNV’s, to individual alleles) for the entire genome, and call their paper ‘the diploid genome sequence of an individual human’.
The Human Genome Project (HGP) was actually a mixture of many samples (for the government effort it was two male and two female samples from an original collection of 40 total; for the Celera effort it was 5 samples from an original collection of 21), and thus what we collectively refer to as ‘reference’ is actually not one individual but a mosaic of many individuals.
So in this context, the many dozens of published whole genome sequences (from individuals in Korea, Yoruba from the Ibadan region of Nigeria, Japanese, Northern European descent, etc.) do not have this phasing information due to the short-read platforms used. (In the case of James Watson’s genome on 454, the read lengths from the technology in 2007 only reached about 250 bases.) Several techniques have sprung up to solve this phasing problem, from chromosome micro-dissection (it does work, but whether this method actually scales is an open question), to fosmid library construction and pooling followed by NGS (hearkening back to the ‘good old days’ of the HGP where fosmid libraries were the norm), to using flow-cytometry to sort human chromosomes (more of an art than a science; for many years there were only two laboratories in the world that could do it routinely, the NCI here in Bethesda MD and one in Cambridge U.K.).
Thus we come to the promise of single-molecule sequencing. If the individual reads were 10’s of thousands of bases long (instead of 10’s of thousands of bases separating two 100 bp reads as with the fosmid pooling example mentioned earlier), phasing the variants becomes trivial (in the memorable words of computational biologists the world over).
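As a rough illustration of why long reads make this ‘trivial’, here is a toy Python sketch using the A/T and G/T SNPs 200 bases apart from the earlier example. It assumes error-free reads with no insertions or deletions, which real single-molecule data certainly does not have, and the names `make_haplotype` and `phase_from_reads` are hypothetical helpers of my own.

```python
from collections import Counter


def make_haplotype(length, alleles):
    """Build a toy chromosome: filler 'C' bases with alleles at given positions."""
    seq = ["C"] * length
    for pos, base in alleles.items():
        seq[pos] = base
    return "".join(seq)


def phase_from_reads(reads, site_positions):
    """Count the haplotypes directly observed by reads spanning every site.

    reads: list of (start, sequence) tuples, start is 0-based on the reference.
    site_positions: reference coordinates of the heterozygous sites.
    Assumes error-free, gap-free reads (a simplifying assumption).
    """
    counts = Counter()
    for start, seq in reads:
        end = start + len(seq)
        # Only a read covering all sites tells us which alleles travel together.
        if all(start <= pos < end for pos in site_positions):
            counts["".join(seq[pos - start] for pos in site_positions)] += 1
    return counts


sites = [5, 205]  # two SNPs 200 bases apart
paternal = make_haplotype(300, {5: "A", 205: "G"})
maternal = make_haplotype(300, {5: "T", 205: "T"})

# 100-base reads each see only one of the two sites.
short_reads = [
    (0, paternal[0:100]), (150, paternal[150:250]),
    (0, maternal[0:100]), (150, maternal[150:250]),
]
# 300-base reads cover both sites at once.
long_reads = [(0, paternal), (0, maternal)]

print(phase_from_reads(short_reads, sites))
print(phase_from_reads(long_reads, sites))
```

The short reads yield no spanning observations at all, while each long read reports a whole haplotype in one go. With a real error rate in the teens one would need several spanning reads per haplotype and some form of consensus, which this sketch leaves out.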
Currently, for exome sequencing at the translational level (I’m careful with my language here: Life Technologies’ products are strictly For Research Use Only), a company like Ambry Genetics or GeneDx will sequence a trio, the affected individual and their parents, in order to assist with the variant calling effort as well as to suppress false-positive basecalls. (That is, a particular variant call can be an impossible proposition for that individual’s genome: if the paternal genotype is A/T, the maternal is A/G, and the individual’s is C/T, that ‘C’ basecall is an impossible one.)
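The trio logic in that parenthetical can be sketched in a few lines of Python. This is my own simplified illustration, not how any particular lab’s pipeline works: it looks at a single site, treats every genotype call as exact, and ignores genuine de novo mutations, which would also show up as inconsistent. The function name `is_mendelian_consistent` is hypothetical.

```python
def is_mendelian_consistent(child, father, mother):
    """Check whether a child's genotype can be inherited from the parents.

    Each genotype is a pair of alleles, e.g. ("A", "T").
    One child allele must come from the father and the other from the mother.
    Simplifying assumption: de novo mutations are not considered.
    """
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


father = ("A", "T")
mother = ("A", "G")

# The example above: a 'C' cannot have come from either parent.
print(is_mendelian_consistent(("C", "T"), father, mother))
# A plausible call: T from the father, G from the mother.
print(is_mendelian_consistent(("T", "G"), father, mother))
```

A call that fails this check is flagged as a likely false positive rather than silently discarded, since the rare real de novo variant fails it too.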
Therefore (getting back to single-molecule sequencing) even at 15-16% error, the Pacific Biosciences RS platform is very useful for the research market, and while it won’t give 15 minute whole human genomes like Stephen Turner promised in that memorable AGBT talk in 2008, its long-read capability will give some nice data for phasing variants.
Lex Nederbragt of the Norwegian Sequencing Center recently reported using PacBio’s C2 chemistry (personal communication), obtaining an average read length on the order of 2.5kb to 3.5kb in their first six runs, with the longest reads ranging from 17kb to 20kb. The error was reported to be 13-15%. They will be publishing this work soon in Nature Biotechnology, which should be something to keep an eye out for.
A customer / friend told me at AGBT, ‘if Oxford Nanopore can deliver half of what was promised in twice as long as they claim they’ll be able to do it in, I’ll be very happy’. In other words, if ONT can deliver a 50kb read length at an 8% error by the end of 2013, many customers will line up to buy a lot of the upcoming GridION and MinION systems. Of course time will tell whether they will be able to launch both platforms by the end of 2012, since as of late June 2012 there is no news at all of any test dataset release or early-access ‘send us a sample and we’ll send you data back’ trials, which is the next logical step for them to take on the way to commercialization.
For further reading, I came across a paper from the October 2008 issue of Nature Biotechnology, “The Potential and Challenges of Nanopore Sequencing”, which doesn’t require a Nature subscription to access. It explores all the types of single-molecule sequencing that are being worked on and gives a clear biophysical description of the challenges involved.