Moleculo and Haplotype Phasing

A picture of a robot toy (sorry, my kids do not have a toy that looks like Conan O’Brien’s ‘Moleculo Man’ from 2001…)

A few weeks ago at the J.P. Morgan Healthcare Conference in San Francisco, Illumina announced that they had acquired a startup company called Moleculo, which provides virtual long reads of 8 to 10kb. Single molecule sequencing will provide long reads (Pacific Biosciences’ RS will now go out to 5kb, although the platform is hampered by poor accuracy), and I’ve written before about last summer’s accomplishment by Complete Genomics in publishing their Long Fragment Read technology, with phased reads on the order of 100kb. (And do take a look there to see why haplotype phasing is important.)

This news is significant on a few dimensions. First is that there is more than one way to get long ‘virtual’ reads using a short read sequencer, and this is a demonstration of that. Another is the manner in which this startup began.

From what I’ve been able to determine from the recent Plant and Animal Genome XXI conference held in San Diego, the Moleculo technology comprises two components: an upfront sample prep and downstream bioinformatics. The sample prep involves fragmenting the sample into pieces of about 10kb, then a limiting dilution step that may (or may not) involve some inexpensive plastic microfluidics. (This is conjecture, but a reasonable one given that the technology comes from the laboratory of Steve Quake, who co-founded Fluidigm.) So now you have a single 10kb piece of DNA, a haploid fragment, which might be amplified at this point. Amplification, even at low levels, would be necessary to avoid ‘dropouts’ in the next step. The DNA can then be randomly tagged (perhaps with an Epicentre transposase-mediated insertion, which Illumina calls Nextera™) with a barcode unique to that molecule in that compartment.

Why microfluidics? You would need at least six hundred thousand single molecules of 10kb apiece to cover six million kb (or 6 billion bases) of diploid human sequence, and that assumes perfect efficiency with perfectly adjacent 10kb pieces. In order to know which particular read is on the same allele as its adjacent partner, you would also need some overlap, which pushes the number higher still. Six hundred thousand single molecules would take an impossible number of 384-well plates to handle for a single sample. In a small microfluidic cartridge, however, you could easily partition the diluted molecules into many chambers and add the downstream reagents en masse.
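To make the plate problem concrete, here is a back-of-the-envelope sketch of the arithmetic. The one-molecule-per-well assumption and the function names are mine, purely for illustration; nothing here comes from Moleculo.

```python
DIPLOID_GENOME_BASES = 6_000_000_000  # ~6 billion bases, the figure used above
FRAGMENT_BASES = 10_000               # 10kb fragments
WELLS_PER_PLATE = 384


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)


def min_fragments(genome_bases, fragment_bases):
    """Fragments needed if they tile the genome end to end with no overlap."""
    return ceil_div(genome_bases, fragment_bases)


def plates_needed(n_partitions, wells_per_plate=WELLS_PER_PLATE):
    """Plates needed if each well holds exactly one molecule (an assumption)."""
    return ceil_div(n_partitions, wells_per_plate)


fragments = min_fragments(DIPLOID_GENOME_BASES, FRAGMENT_BASES)
plates = plates_needed(fragments)
print(fragments)  # 600000
print(plates)     # 1563
```

More than fifteen hundred plates for one sample, before allowing for any overlap between fragments, which is why a partitioning device looks so attractive.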

The partitioning step is the bit of wizardry about which little information can be found; what I heard at PAG was that there was a simple microfluidic device involved, and since Moleculo counts Quake as one of the founders, it does all add up.

And in this scheme all the reagents should be additive; any purification steps are undesirable, as they unnecessarily complicate the handling. It would work like an Ambion® Cells to Ct™ kit, which goes from cell lysis to real-time PCR in five additive steps. A new Fluidigm instrument, the C1, operates in the same manner, using microfluidics.

After the single molecules are amplified (presumably) and tagged for their particular chamber, the entire collection of barcoded template molecules has standard library adapters placed onto it. (The tagging is another aspect where microfluidics would really help, but it would also need to be worked out, as each individual partition requires its own barcode.)

So the sequencer in the middle still produces short reads of 110 bases (as is the case for the HiSeq), but for a given barcode, the subset of reads carrying that barcode would then be assembled de novo into a contig. Moleculo’s technology yields virtual long reads of 8 to 10kb (one internet user wanted to nickname them “haplotigs”, believe it or not), since all the reads with the same barcode came from the same single 10kb strand of DNA in the original sample.
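Here is a toy sketch of that idea: bin the short reads by barcode, then assemble each bin separately. The barcode names, the read sequences, and the naive greedy overlap merge are all my own illustration. This is not Moleculo's assembler, and a real one has to cope with sequencing errors, repeats, and reads contained inside other reads.

```python
from collections import defaultdict


def group_reads_by_barcode(tagged_reads):
    """Bin (barcode, sequence) pairs so each bin holds reads from one molecule."""
    bins = defaultdict(list)
    for barcode, seq in tagged_reads:
        bins[barcode].append(seq)
    return dict(bins)


def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap (min_len >= 1)."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left that overlaps
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Two molecules from the same locus, one per allele, differing at a single base.
tagged_reads = [
    ("BC01", "ACGTTGCA"), ("BC02", "TGCTTGGA"),
    ("BC01", "TGCATGGA"), ("BC02", "ACGTTGCT"),
]
contigs_by_barcode = {
    barcode: greedy_assemble(reads)
    for barcode, reads in group_reads_by_barcode(tagged_reads).items()
}
print(contigs_by_barcode)
# {'BC01': ['ACGTTGCATGGA'], 'BC02': ['ACGTTGCTTGGA']}
```

Without the barcodes, the four reads would be pooled and the two alleles would be tangled together; with them, each allele comes out as its own contig.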

The bioinformatics is a vital piece, and the Moleculo group had Jared Simpson, now at the Sanger in the UK (and of ABySS fame from his time at BCCA), write the assembler. All the reads from the same barcode are from the same original 10kb single DNA fragment from one allele, with perfect variant phasing all along that 10kb stretch. Then the entire collection of 10kb fragments would have to be aligned by their overlapping variants; this is where the informatics expertise would need to be applied. 10kb is long enough for a particular variant (whether a copy number change, an insertion, or a deletion) on a single allele to be aligned to another 10kb piece that is offset.
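As a rough sketch of how overlapping variants could stitch the 10kb pieces together, consider each assembled fragment as a small map from heterozygous site position to allele (0 for reference, 1 for alternate). The data layout, function names, and greedy strategy are my assumptions for illustration only; I do not know how Moleculo's pipeline actually does this.

```python
def compare(hap, frag, min_shared=1):
    """Classify a fragment against a partial haplotype using the sites they share."""
    shared = [pos for pos in frag if pos in hap]
    if len(shared) < max(min_shared, 1):
        return "unlinked"
    matches = sum(hap[pos] == frag[pos] for pos in shared)
    if matches == len(shared):
        return "same"      # agrees everywhere: same haplotype
    if matches == 0:
        return "opposite"  # disagrees everywhere: the other haplotype
    return "conflict"      # mixed signal: an error or a wrong join somewhere


def phase(fragments, min_shared=1):
    """Greedily extend one haplotype; the other is its complement at every site."""
    hap = dict(fragments[0])
    pending = list(fragments[1:])
    progress = True
    while pending and progress:
        progress = False
        for frag in list(pending):
            verdict = compare(hap, frag, min_shared)
            if verdict == "same":
                hap.update(frag)
            elif verdict == "opposite":
                hap.update({pos: 1 - allele for pos, allele in frag.items()})
            else:
                continue  # try again after the haplotype has grown
            pending.remove(frag)
            progress = True
    return hap, pending


# Keys are positions of heterozygous sites; values are the alleles seen on that fragment.
fragments = [
    {1000: 0, 4000: 1, 9000: 1},
    {9000: 0, 15000: 1},   # disagrees at 9000, so it is from the other haplotype
    {15000: 0, 22000: 0},  # links through 15000 once the second fragment is placed
    {50000: 1},            # shares no site with the others
]
haplotype, unplaced = phase(fragments)
print(haplotype)  # {1000: 0, 4000: 1, 9000: 1, 15000: 0, 22000: 0}
print(unplaced)   # [{50000: 1}]
```

The last fragment stays unplaced because nothing overlaps it, which is the intuition behind needing both overlap and enough coverage to get long phased blocks.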

This is a complicated problem that cannot be solved by shotgun short-read sequencing alone, and this technology solves it nicely. Repetitive elements and CNVs are particularly tricky, and from what early-access customers have said, being able to obtain all this information offers a lot of value.

On top of all this, one aspect of the informatics that sped the process up was to upload sequence to the cloud to start the analysis before the sequencing run was finished.

What I heard at PAG XXI was that Moleculo’s approach requires about 10x coverage to get a reasonably high amount of phasing; so for a human genome, beyond a typical 30x coverage of 90G bases of sequence, another 30G bases of sequence would be required. Before the acquisition Moleculo had a fee-for-service model, where they would accept genomic DNA, prepare their libraries, send these libraries out for sequencing, then analyze the file in the cloud for delivery to their customers.
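The coverage figures above work out as follows. This is just the arithmetic, assuming the haploid human genome size of roughly 3 billion bases that the 30x = 90G figure implies; the function name is mine.

```python
HAPLOID_GENOME_GB = 3  # ~3 billion bases, implied by 30x = 90G bases


def sequence_needed_gb(coverage, genome_gb=HAPLOID_GENOME_GB):
    """Gigabases of raw sequence needed for a given fold coverage."""
    return coverage * genome_gb


standard_gb = sequence_needed_gb(30)  # typical whole-genome run
phasing_gb = sequence_needed_gb(10)   # extra sequence for phasing
total_gb = standard_gb + phasing_gb
print(standard_gb, phasing_gb, total_gb)  # 90 30 120
```

In other words, phasing adds about a third more sequencing on top of a standard 30x genome.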

Another dimension of this company that is of interest is how small it is, and how recent it is. Their homepage was quaint (it has now been changed, and unfortunately the Wayback Machine didn’t crawl its prior iteration). It simply said, ‘we are offering our service to a limited number of early-access partners’, and provided a contact link and a login link. It turns out that Moleculo was basically three people, was founded only one year ago (January 2012), and was on the brink of closing a Series A funding round when Illumina made an offer that was ‘fair to everyone’.

So with only three people (and a fourth one recently brought on), this company could swing to a profit within six months, without the capital expense of buying an NGS system or expensive computing resources – they just paid for what they needed.

There is a lesson here: this may be the best time in history for starting a company, due to the much lower barrier to entry. Find a big problem to solve, combine a few key technologies (in this case some microfluidics, molecular biology barcoding, and some bioinformatics), obtain the needed seed funding, and leverage the ability to outsource.

Edit added 1/25/13 at 4PM ET:
Thanks to a sharp-eyed reader who pointed out that my math on the number of fragments was off by three orders of magnitude, or 1000-fold! Six hundred thousand individual reactions are a lot harder to handle than 600. (Facepalm)

Next Generation Sequencing, Marketing, and the Genomic Revolution
