“Let a hundred flowers bloom, let a hundred schools of thought contend”, Mao once said during the Revolutionary Days of 1957. Now, in the middle of a genomics revolution, it feels that way in the market for software to analyze next-generation sequencing data. New companies are being formed, large software and hardware firms are expanding into the life sciences, and others are offering cloud-based services in addition to software. But will customers really spend scarce funds on commercial software for analysis? And can the market absorb all these new offerings? Software from the vendors of the equipment that generates the data is not complete: talk to anyone with a 454, Illumina or Life Technologies system. Even for Sanger sequencing, the established “gold standard” that has been in routine use for almost 30 years, other software tools supplement the vendor-supplied software in order to meet a particular scientist’s requirements.
For example, in next-generation sequencing all three major vendors provide software that takes the massive amount of data (from millions to billions of reads) and aligns it against a reference sequence, producing a file in BAM format. (BAM is the binary representation of a Sequence Alignment/Map file, a standard developed by the 1000 Genomes Project.) This is known historically as secondary analysis (‘primary analysis’ is what the instrument does in producing the data, say turning voltage signals from an Ion Torrent 316 chip into base calls with an assigned quality score).
Functions beyond alignment to a reference vary with the vendor. (This can be deemed tertiary analysis.) A common vendor-supplied function is to call variants after the sequence reads have been aligned to a reference (the standard output is a VCF file, again from the 1000 Genomes Project). Another function could be de novo assembly. And while this tertiary analysis sounds straightforward (calling variants against a reference or assembling a small genome from scratch), it is not a ‘one-size-fits-all’ proposition.
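To make the VCF output a little more concrete, here is a minimal sketch in Python of reading variant records and keeping those above a quality cutoff. The records are made up for illustration, the names `Variant`, `parse_vcf` and `EXAMPLE_VCF` are hypothetical, and the only thing taken from the VCF format itself is the order of the first fixed, tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL). The sketch assumes every record has a numeric QUAL; real files may use `.` for a missing value.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float


# Made-up records for illustration only; these are not real calls.
EXAMPLE_VCF = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t10177\t.\tA\tC\t48.0\tPASS\tDP=35
chr1\t10352\t.\tT\tTA\t12.5\tLowQual\tDP=8
chr2\t20441\t.\tG\tA\t99.0\tPASS\tDP=60
"""


def parse_vcf(text: str) -> List[Variant]:
    """Parse the fixed leading columns of each VCF data line."""
    variants = []
    for line in text.splitlines():
        # Skip blank lines, '##' meta lines and the '#CHROM' header line.
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt, qual = line.split("\t")[:6]
        # Assumption: QUAL is numeric in every record.
        variants.append(Variant(chrom, int(pos), ref, alt, float(qual)))
    return variants


if __name__ == "__main__":
    min_qual = 30.0  # arbitrary cutoff chosen for the example
    for v in parse_vcf(EXAMPLE_VCF):
        if v.qual >= min_qual:
            print(v.chrom, v.pos, v.ref, ">", v.alt, v.qual)
```

Even in a toy like this, the choice of `min_qual` is a judgement call, which is part of why one vendor's variant caller rarely suits every researcher.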
Given the complexity of the data and the varied needs of individual researchers, this is a reasonable business opportunity: to specialize in the data analysis segment of the next-generation sequencing market.
There are a few unique aspects to providing software (or software-as-a-service in the cloud, known as SaaS) to the research market. First and foremost, within any research group with genomic data experience (dating back to whole-genome gene expression and before) there will be one or several experienced computational biologists, who can assemble data analysis pipelines and are accustomed to working with large datasets. Second, there is a large (and growing) set of free-to-use tools published by research groups in prominent journals, with new tools being published very frequently. (Ask any bioinformatics specialist for their current list of ‘go-to’ tools and how it has changed over the past 12 or 18 months, and be prepared to hear an inventory of software that has remained useful over time and another of software that was useful at one point and proved less so later on, when a better tool appeared.) Third, and perhaps most importantly, funds for a given project are set aside for reagents and perhaps some equipment including computational hardware (especially if the project is large), but funds are not set aside for purchasing software or software services for analysis. Funds for software often have to be considered as part of a department- or institute-wide site license, which complicates matters substantially (for both the researcher and the vendor trying to make the sale).
It is hard to compete with free. Of course, implementing any software, even software provided free of charge, is not free once you count the resources needed to implement it, both in time and in computational hardware capacity. On top of saving funds on the initial outlay, another strong argument for the ‘do it yourself’ approach to bioinformatics is that a solution put together from hand-picked tools means each tool is understood on its own, with its particular strengths and weaknesses known to the user (or discovered in due course). For purchased software, while technical support is offered, the amount of transparency about how the software is constructed varies between vendors; in other words, the software ‘secret sauce’ can be a black box to the customer, with surprising (usually negative) effects discovered empirically.
Against this background there are many software vendors. A friend put together this list of genomics software and service providers, currently with 90 listed. Some are long-established companies dating back to before the Human Genome Project was completed, the whole-genome expression days, or whatever you want to call the late 1990s (Ingenuity, Accelrys, GenomeQuest); others are well-established firms that have been developing software since the early days of NGS (circa 2005, such as CLC bio, SoftGenetics, Knome); and others have been formed more recently and are gaining some market share and awareness (Geneious, DNAnexus, BioTeam). Still others are major recognized multinational firms (Microsoft, Samsung, GE Healthcare). Each company is offering a better mousetrap: a hand-curated protein-interaction database from the primary literature (Ingenuity), data analysis as a cloud-based service (Appistry, DNAnexus), analysis of multiple samples across whole-genome or whole-exome sequence data (Knome). The list goes on and on, with each company’s unique ‘value proposition’ amounting to the simple “this is what we offer that is better than anything else”.
Whatever the true current number of bioinformatics companies might be, we are in a phase of the market that will tolerate a lot of creative approaches before the inevitable pruning begins. In the early days of both the automobile and aviation industries there was no standardization of function, and little agreement on what controls were needed. Eventually both the functions that were required and the companies that produced an effective product were sorted out by the marketplace, and an inevitable contraction took place. For bioinformatics software (or software as a service), the market is still in the explosive growth phase. Revolutions tend to begin in a messy way.
Given the incomplete nature of what a genome is to begin with (a prior post was written up here), there is no one ‘best way’ to analyse something so complex, with so many judgement calls to be made along the entire process. The imperfect quality of the data results from a typical NGS error rate of 1%. Any filtering will suppress false positive results at the expense of an elevated false negative rate, and any variant not identified is simply not discovered. It is the false positive result (a putative variant that turns out not to be real) that limits career advancement and gets individuals in trouble (not to mention wasting limited resources). Variation that goes undetected does not get individuals in trouble; however, it may mean the difference between a major discovery and a negative result, or between a definitive diagnosis for a patient with an unknown condition and a negative result from whole-exome or whole-genome sequencing.
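The filtering trade-off can be shown with a toy calculation. The sketch below, in Python, uses ten made-up candidate variant calls, each with a quality score and a flag saying whether the variant is real; the data and the names `CALLS` and `errors_at` are invented for illustration and do not come from any real pipeline. Raising the quality cutoff removes false positives but starts to discard real variants.

```python
from typing import List, Tuple

# Made-up candidate calls: (quality score, is the variant actually real?).
# Illustrative values only; not taken from any real dataset.
CALLS: List[Tuple[int, bool]] = [
    (10, False), (15, False), (18, True), (22, False), (25, True),
    (30, True), (35, False), (40, True), (50, True), (60, True),
]


def errors_at(calls: List[Tuple[int, bool]], min_qual: int) -> Tuple[int, int]:
    """Return (false positives, false negatives) for a quality cutoff.

    A call is reported if its quality is at least min_qual.
    False positive: reported but not real.
    False negative: real but filtered out.
    """
    false_pos = sum(1 for q, real in calls if q >= min_qual and not real)
    false_neg = sum(1 for q, real in calls if q < min_qual and real)
    return false_pos, false_neg


if __name__ == "__main__":
    for cutoff in (0, 20, 38):
        fp, fn = errors_at(CALLS, cutoff)
        print(f"cutoff {cutoff:>2}: {fp} false positives, {fn} false negatives")
```

No cutoff in this toy set gives zero errors of both kinds, which is the judgement call every analysis pipeline has to make somewhere.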
Is there room for 90 companies to produce software or provide software as a service for genomics? Eventually, definitely not. In the near term, as the revolution continues, a hundred flowers bloom.