Virtual Research from a Million Datasets


Way back in 1997, a company launched one of the first commercial gene expression microarrays. The company was Affymetrix, the technology was photolithography, and the excitement around this new approach was palpable in the days before the Human Genome Project had even been completed.

At that time there was much discussion and speculation about the final number of genes in the human genome, with estimates ranging from the low tens of thousands to over one or even two hundred thousand. We now put the number at about 23,000 genes, although the definition of a gene has changed considerably since the 1990s.

A difficult definition

Indeed, what is a gene? A pseudogene is a copy of a gene that has been duplicated and has accumulated mutations over time, yet remains conserved and difficult to study: because of its sequence homology to the parent gene, it cross-hybridizes in the microarray environment and even confounds the specificity of PCR, so dedicated assays must be developed to study it. A recent PNAS paper from Arul Chinnaiyan's group at the University of Michigan demonstrated the expression of these pseudogenes in different cancers, and their role in competing for endogenous siRNA and miRNA binding sites. Is a gene a non-coding stretch of RNA, such as a miRNA, that regulates other genes? Or is a gene only the mRNA that encodes a protein? And what about alternative splice variants of a single gene recombining into myriad protein products? Since those proteins are distinct, is it fair to call them the product of a single gene?

And then there’s the perception of ‘junk’ DNA, as if every one of the 10 trillion cells in the human body would carry 98% ‘junk’ merely as an artifact of evolutionary history. The ENCODE project (ENCyclopedia Of DNA Elements), sponsored by the NHGRI starting in 2003, published the first of several key findings papers in the summer of 2007, and the surprising finding was that over 90% of the genome is transcribed. The exact function of all this RNA is not clear, although it is presumed that all this transcription serves a purpose.

Some historical context

Getting back to the launch of commercial gene expression arrays: before high-density microarrays, the methods of choice were primer extension and RT-PCR (reverse transcription PCR), laborious single-gene approaches. (As a nod to my own research past, I’ve done more RT-PCRs and primer extensions than I care to recount.) The advent of microarrays (first homebrew ‘spotted’ arrays, made with equipment from now-defunct companies such as Cartesian or Genomic Solutions, and then commercial microarrays from Affymetrix) made it possible to take a sample of RNA, label it with fluorescence, and measure the relative abundance of thousands of genes at once, giving meaning to the concept of genome-wide gene expression. This was so attractive to users that the effort and time involved in printing your own microarray (either with cDNA clones laboriously amplified by PCR, or with long 70- or 100-mer synthetic oligonucleotides from an oligo house) gave way, over the course of several years, to commercial expression arrays.

Although Affymetrix had the first-mover advantage, several companies sprang up in its wake: Agilent, with an inkjet printing method for manufacturing microarrays, and NimbleGen, which used Digital Light Processing (DLP, the same micromirror technology found in projection television systems) to create photomicrolithographic masks for on-substrate oligo synthesis. CombiMatrix repurposed 4K and 16K Dynamic Random Access Memory chips (DRAM, the same chips found in late-1980s desktop PCs when 640K of memory was the norm), assigning each memory location as a site where oligonucleotides could be synthesized on the chip itself. And Illumina was a relative latecomer, introducing a whole-genome gene expression array in 2004.

All these companies still use their microarrays in one form or another today, with the sole exception of CombiMatrix. Affymetrix still sells a wide variety of gene expression and genotyping arrays. Agilent sells (and in 2012 is making a strong push with) array-based Comparative Genomic Hybridization (aCGH) to the clinical genetics market (as a substitute for labor-intensive methods such as Fluorescence In-Situ Hybridization, or FISH), in addition to using array-synthesized oligos for its SureSelect sequence capture technology. NimbleGen (now part of Roche), although recently announcing its exit from the microarray business, still uses its microarray technology for its SeqCap EZ whole-exome and targeted enrichment products. And Illumina still sells its microarrays.

A million samples

After about 15 years of publicly funded research, close to a million samples’ worth of expression data are publicly available, ready to analyze. (And that is only what is public; many millions more samples’ worth reside behind the firewalls of Big Pharma and other biotechnology companies.) Is there only one ‘best’ way to design an experiment, to analyze genome-wide expression data, or to look at multiple different parameters? Of course not – every experiment faces limitations on the number of samples, on budget, and on the number of biological replicates, among others.

For example, regarding biological replicates: back in my old Illumina days there was a very informative slide we used to show customers. It compared a technical replicate (that is, the same reference RNA, split, labeled separately, and hybridized onto separate microarrays) and its r² value between replicates against three individual mouse samples, where the mice were treated identically and the RNA from each animal was isolated, labeled, and hybridized to its own microarray.

The variation between technical replicates was slight; the variation between biological replicates was huge in comparison. The lesson: biological variation is much greater than the variation between arrays or between labeling procedures, and you need samples at least in triplicate to get meaningful, reproducible results.

Of course, if you had equivalent numbers in the tens or hundreds, that would be better still, in that the signal could be much more easily dissected from the noise.
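The technical-versus-biological replicate comparison above can be sketched with simulated data. This is purely an illustration, not the actual Illumina slide data: the noise levels and gene count are made-up values chosen only to show why technical r² runs much higher than biological r².

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" log-expression values for 10,000 genes
true_expr = rng.normal(loc=8.0, scale=2.0, size=10_000)

def technical_noise():
    """Array-to-array and labeling noise: comparatively small."""
    return rng.normal(scale=0.1, size=true_expr.size)

def biological_noise():
    """Animal-to-animal biological variation: much larger."""
    return rng.normal(scale=0.8, size=true_expr.size)

# Technical replicates: the same reference RNA, split and labeled separately
tech_a = true_expr + technical_noise()
tech_b = true_expr + technical_noise()

# Biological replicates: different animals, so biological variation
# is added on top of the technical noise
bio_a = true_expr + biological_noise() + technical_noise()
bio_b = true_expr + biological_noise() + technical_noise()

def r_squared(x, y):
    """Square of the Pearson correlation between two expression profiles."""
    return np.corrcoef(x, y)[0, 1] ** 2

print(f"technical replicate r^2:  {r_squared(tech_a, tech_b):.3f}")
print(f"biological replicate r^2: {r_squared(bio_a, bio_b):.3f}")
```

With these assumed noise levels, the technical r² lands near 1 while the biological r² drops noticeably, which is the shape of the result the slide conveyed.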

Which is where we are today: with a million samples’ worth of data, all it takes is the effort to dig into these samples and determine which datasets to use and what questions to ask. And one pioneer turning this into a methodology – taking advantage of all this data and looking at it in novel ways – is Atul Butte, Chief of Systems Medicine at Stanford. In his TEDx talk presented in Washington D.C. in April 2012 (viewable online here), he explains his approach to a non-scientist audience: mine existing datasets and, as a corollary, outsource the generation of new data. What he does not mention is his productivity, which in terms of academic papers runs at one every 14 days. (That’s 26 papers per year.) Nor does he mention the six companies he has founded, with another on the way.

The future of research – big data

This is the future of research, analysis of big data by those with the inclination to dig for diamonds of discovery still hidden in its depths. It is inevitable in the context of constrained budgets.

Will the same types of research be performed with NGS data? A case can be made that it costs less to generate a new dataset (given the breakneck pace of decline in data-generation costs) than to store that dataset in the first place. A single run on a HiSeq 2000 generates 600 gigabases of sequence data, at a cost of about $25K in reagents. In bytes, that can easily grow to 20x the base count once you include trimmed reads, per-base quality scores, and alignment files against the reference genome. Using that 20x factor, a $25K run consumes 12 terabytes on a storage area network; at roughly $3 per gigabyte to purchase storage outright, 12 TB = 12,000 GB = $36K. (The 20x factor varies depending on how ‘raw’ the data is kept, so individual dataset sizes will vary.)
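The back-of-the-envelope arithmetic above can be written out explicitly. All the figures come from the text (HiSeq 2000 era, circa 2012); only the variable names are mine.

```python
# Sequencing-vs-storage cost comparison, using the figures in the text.
run_cost_usd = 25_000      # approximate reagent cost per HiSeq 2000 run
bases_per_run_gb = 600     # ~600 gigabases of sequence per run
bytes_per_base = 20        # expansion factor: reads, qualities, alignments
storage_usd_per_gb = 3.0   # ~$3 per gigabyte to purchase SAN storage

storage_gb = bases_per_run_gb * bytes_per_base      # 12,000 GB, i.e. 12 TB
storage_cost_usd = storage_gb * storage_usd_per_gb  # $36,000

print(f"data footprint: {storage_gb / 1000:.0f} TB")
print(f"storage cost:   ${storage_cost_usd:,.0f} "
      f"vs. ${run_cost_usd:,} to generate the data")
```

The point of the exercise: with these numbers, storing a run costs more than sequencing it, and any further drop in sequencing cost widens the gap.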

But in this simple example, it cost $25K to generate the data and $36K to store it. And with the next decline in sequencing costs (whether the Ion Torrent Proton in early 2013 or something else in development), that gap is only going to widen. Microarrays never saw the kind of competitive price pressure that NGS has, which is why re-analyzing archived microarray data made sense; for NGS, the incentive to re-analyze is not there, and it is easier to design and run a new experiment than to dig up one-year-old (or three-year-old) data. That could change if storage costs become more favorable (and if the rate of price decline in NGS levels off). I wouldn’t count on it, though.


Baker, M. “Gene data to hit milestone.” Nature, 18 July 2012.

About Dale Yuzuki

A sales and marketing professional in the life-sciences research-tools field, Dale is currently employed by Olink as their Americas Field Marketing Director. For additional biographical information, see his LinkedIn profile, and find him on Twitter @DaleYuzuki.
