A thoroughly enjoyable surprise at the Advances in Genome Biology and Technology (AGBT) conference last week was hearing Gene Myers of the Max Planck Institute of Molecular Cell Biology and Genetics in Dresden, Germany. It wasn’t because the talk was about assembling a de novo human genome at 54x coverage from a Pacific Biosciences RSII dataset in a fraction (some 1/36th) of the time, and it wasn’t because of the elegance of the presentation. It was because this was the first time in 10 years that Gene Myers had attended AGBT, and for a simple reason: he did not consider short-read sequencing ‘intellectually satisfying’.
You know you are hearing from an unusually talented person when they talk about things that are intellectually satisfying.
For a bit of history and context, Gene Myers was on the original team that wrote BLAST and published the paper describing it in 1990. Published in the Journal of Molecular Biology, it was entitled simply “Basic local alignment search tool”, has received almost 50,000 citations to date, and was the most-cited paper of the 1990s. BLAST and its iterations are still in wide use today, almost 25 years later, and due to its provenance from the NCBI (NIH), it remains ‘one of the most widely used bioinformatic programs’.
A few years later, as a vice-president of Celera Genomics he was deeply involved with Craig Venter on the sequencing of the Human Genome, and as a proof-of-principle sequenced the Drosophila genome in 2000. The Celera Assembler is currently open-source, maintained by the University of Maryland here.
After I moved to the East Coast in 2005, I heard that Myers (who had moved from Celera to UC Berkeley in the interim) had moved to the new HHMI Janelia Farms campus in northern Virginia. (For those not familiar with the Howard Hughes Medical Institute, it is the second largest philanthropic organization in the US, and the country’s largest private supporter of academic biomedical research, with some $16.9B in endowments; in 2013 they disbursed some $727M for U.S. biomedical research.) Throughout this time, his work centered around developmental neuroscience, as evidenced by his publications there.
It was with interest that I heard his talk, “A de novo Whole Genome Shotgun Assembler for Noisy Long Read Data”, and not from the ‘reads and feeds’ kind of instrument-platform perspective. It was from the perspective of a person who has the ability, funding and freedom to choose the problems they want to solve, rather than the ability and funding to solve problems handed to them. Gene Myers is an example of someone who chose to work on genome assembly with PacBio data, because it was an interesting problem to him.
Not that the reads-and-feeds were ignored – they were just a footnote in the story he told about his new assembler Dazzler. Using the last-generation P4/C2 chemistry, there was an 11% insertion error, a 3% deletion error, and a 1% substitution error; while the high error rate of the PacBio RSII system is well-documented, this does represent a marked improvement over time. What was notable to Gene however, was that the errors and read sampling were almost perfectly random, which meant a much easier time for the informatic analysis, as a Poisson distribution could precisely model both.
His comment about random read sampling deserves a bit of explanation for those not familiar with the day-to-day reality of sequencing. Not all DNA is created equal: a stretch of DNA can vary a lot in base composition, in particular the ratio of G:C to A:T base pairs, otherwise known as ‘GC ratio’ (or GC content). Regions that are very high in GC content sequence poorly. ‘Traditional’ next-generation sequencing involves amplification during library preparation, amplification during template preparation, and the sequencing itself, which uses modified polymerases and nucleotides for light detection (native polymerases and nucleotides for Ion Torrent); at every stage of this process particular GC- or AT-rich stretches can and do simply fail. For those interested in a cross-platform comparison, this Quail et al. 2012 paper in BMC Genomics (direct PDF link here) looks at some very AT-rich (Plasmodium falciparum is about 83% AT) and some GC-rich (Bordetella pertussis is 67% GC) organisms.
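To make the idea of GC content concrete, here is a minimal Python sketch that computes it for a sequence and for fixed-size windows along it. The function names and the toy sequence are made up for illustration; they are not taken from the talk or from any of the tools mentioned here.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of the unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return gc / total


def windowed_gc(seq: str, window: int) -> list[float]:
    """GC fraction of consecutive, non-overlapping windows (a trailing partial window is dropped)."""
    return [
        gc_fraction(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, window)
    ]


if __name__ == "__main__":
    toy = "ATATATATGCGCGCGC"  # hypothetical sequence: AT-rich half, then GC-rich half
    print(gc_fraction(toy))     # overall GC fraction
    print(windowed_gc(toy, 8))  # per-window GC fraction
```

Plotting read depth against the per-window GC fraction is one common way to see the kind of bias described above; a platform without GC bias would show a flat line.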
I remember a poster from the Broad, presented perhaps two years ago at AGBT, showing the read distributions across GC-poor, GC-average and GC-rich organisms, comparing several platforms (including Ion Torrent and Pacific Biosciences) and showing absolutely flat distributions for the PacBio. This very likely has to do with their method of single-molecule sequencing, which obviates the need for template amplification; however, library preparation steps are still needed to attach SMRTbell adapters to each molecule before sequencing. (I wrote up my thoughts on Pacific Biosciences a few years ago here if you are interested in additional information.) If a particular molecule doesn’t make it all the way through the sequencing process, it goes missing from the final dataset, and you no longer have a random representation of molecules.
The implication of these last two aspects (random error and random read start-points) is that accuracy will converge geometrically with depth. Thus covering 100% of the genome is just a matter of sequencing to sufficient depth, rather than facing diminishing returns from a dataset that stays incomplete however much you sequence. (An ongoing battle is to determine when ‘good enough’ is ‘good enough’ for an assembly: an acceptable number of gaps and of reads that can’t be placed anywhere due to an incomplete assembly. There is the temptation to sequence just a bit more, and then some more, but there comes a point where someone decides that the quality of the work is sufficient. Just like with any other large project, it’s hard sometimes to define what the finish-line goal actually is when the final product remains incomplete in some ways.)
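A rough way to see why random sampling makes coverage a matter of depth is the textbook Poisson model of shotgun sequencing. The sketch below is my own illustration, not Myers’ calculation: it assumes reads start uniformly at random, ignores read-length edge effects and minimum-overlap requirements, and uses a hypothetical genome size of 3.2 billion bases.

```python
import math


def fraction_uncovered(depth: float) -> float:
    """Poisson probability that a given base is covered by zero reads at mean depth `depth`."""
    return math.exp(-depth)


def prob_depth_below(depth: float, k: int) -> float:
    """Poisson probability that a given base is covered by fewer than k reads."""
    return sum(math.exp(-depth) * depth**i / math.factorial(i) for i in range(k))


if __name__ == "__main__":
    genome_size = 3.2e9  # assumed genome size in bases, for illustration only
    for depth in (5, 10, 20, 54):
        expected_missing = genome_size * fraction_uncovered(depth)
        print(f"{depth}x: expected uncovered bases ~ {expected_missing:.3g}, "
              f"P(depth < 5) = {prob_depth_below(depth, 5):.3g}")
```

Under this model every extra 1x of depth multiplies the uncovered fraction by 1/e, which is the geometric convergence described above. With biased sampling no such guarantee holds, because some regions are missed no matter how deep you go.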
Back to the talk: Myers shared data indicating that after consensus correction at 10x coverage the error dropped to 0.5%, and that his assembler Dazzler was 36 times faster than a tool called BLASR on the same data; more information about the project, the cell-line used, and how it was done is on PacBio’s blog here. Myers had diagrams laying out his approach (input FASTAs only, overlap, scrub, error-correct, overlap again, scrub, then assemble with input from quality scores via a Quiver datafile), and specifications on how he did it, with 512 cores running for 5 days and only 16GB of RAM, with a Distributed File System. (“It will take you 100x longer if you don’t use DFS, and shame on you if you don’t.”)
This tweet came from Jason Chin (@infoecho on Twitter):
The 405K core-hours on the Google Cloud (the technical details about this process linked to in the tweet can be found here) compare with only about 11K core-hours (just 21h on Myers’ 512-core system) for Dazzler in the initial BLASR-to-Dazzler ‘apples-apples’ comparison, thus the 36x calculation (edited for clarity, see Gene’s comment below for details). Taking a guesstimate of the ‘retail price’ of a core-hour (way too many variables, but nonetheless), even at $0.50 per core-hour you can see how expensive (and computationally intensive) such an exercise is with the BLASR algorithm. With this new method, a human genome can be assembled with PacBio reads on ‘a MacBook laptop in 5 days’, a remarkable promise.
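For those who want to check the back-of-the-envelope numbers, here is the arithmetic as a short Python sketch. The inputs are the figures quoted above; the $0.50 per core-hour price is the guesstimate from this post, not a quoted cloud rate.

```python
BLASR_CORE_HOURS = 405_000   # reported for the BLASR-based run on Google Cloud
SPEEDUP = 36                 # Dazzler vs. BLASR, as stated in the talk
CORES = 512                  # size of Myers' cluster
PRICE_PER_CORE_HOUR = 0.50   # guesstimated 'retail' price in USD (an assumption)

dazzler_core_hours = BLASR_CORE_HOURS / SPEEDUP
wall_clock_hours = dazzler_core_hours / CORES
blasr_cost = BLASR_CORE_HOURS * PRICE_PER_CORE_HOUR
dazzler_cost = dazzler_core_hours * PRICE_PER_CORE_HOUR

print(f"Dazzler core-hours: {dazzler_core_hours:,.0f}")
print(f"Wall-clock on {CORES} cores: {wall_clock_hours:.1f} h")
print(f"BLASR cost at ${PRICE_PER_CORE_HOUR:.2f}/core-hour: ${blasr_cost:,.0f}")
print(f"Dazzler cost at ${PRICE_PER_CORE_HOUR:.2f}/core-hour: ${dazzler_cost:,.0f}")
```

Dividing 405K by 36 gives 11,250 core-hours, or roughly 21 to 22 hours on 512 cores depending on how you round, and a compute bill of about $202,500 for the BLASR route at the guesstimated price.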
And thus Gene Myers comes back into the genomics realm, finding a problem of his choosing to solve. And solve it he did, with further refinements to come – he promised the correction needed some more work and that in about a month he would make it available.
Lastly, some interesting questions from those in attendance, including whether he’d look at assembling transcriptomes. Myers answered that while Dazzler is a ‘pure’ strategy (i.e. only long reads of a single type), a hybrid strategy would be a good thing as there may not be enough PacBio reads for a given transcript.
If you have to ask how much a 54x human genome on PacBio costs, well, could you afford it? I was told it was on the order of $40,000 USD. Yet it comes down to affordability of both the runs and the instrument; while the tool exists for a few users who can afford the approximately $800,000 for it, the RSII is still priced out of the budgets of many research and commercial laboratories.
One last interesting point is that this genome is a lot more complete than the reference-guided assembly of the same sample, which came to 2.83 Gb with a contig N50 of 144 kb using Illumina short reads and BAC clones; the PacBio Google-cloud de novo assembly ended up at 3.25 Gb with a contig N50 of 4.38 Mb, a marked improvement.
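For readers unfamiliar with the contig N50 statistic quoted above: it is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal Python sketch, with a made-up function name and toy contig lengths:

```python
def n50(lengths: list[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable: the running sum always reaches the total")


if __name__ == "__main__":
    toy_contigs = [2, 3, 4, 5, 6]  # hypothetical contig lengths
    print(n50(toy_contigs))        # prints 5: contigs of 6 and 5 hold 11 of 20 bases
```

A jump in N50 from 144 kb to 4.38 Mb therefore means that half of the assembly now sits in contigs roughly thirty times longer than before.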