Combining genomics with proteomics opens a new era of discovery


Twenty years after the completion of the Human Genome Project, combining human genetic variation datasets with circulating protein biomarker information heralds a new era of biological discovery with direct impact on therapeutics, diagnostics and prevention.

That summer day in 2000 was a memorable one: President Bill Clinton, with Prime Minister Tony Blair, then-NHGRI Director Francis Collins, and then-CEO of Celera Craig Venter, announced the draft map of the Human Genome Project.

To quote a transcript of the event:

We are here to celebrate the completion of the first survey of the entire human genome. Without a doubt, this is the most important, most wondrous map ever produced by humankind.

President Bill Clinton, 2000

The optimism of this event (along with the sequence draft analyses published by the competing NHGRI and Celera groups simultaneously in Nature and Science in 2001) is captured with a second quote from the same 2000 event:

Genome science will have a real impact on all our lives — and even more, on the lives of our children. It will revolutionize the diagnosis, prevention and treatment of most, if not all, human diseases.

ibid

Today, over 20 years later, where has this genomic revolution taken us? Three areas of medicine clearly stand out where genomics has transformed the practice of healthcare: first, personalized cancer therapy according to the type of mutation that an individual’s tumor carries; second, a blood test for fetal trisomy that expectant mothers of average risk can now take instead of an invasive amniocentesis (with its risk of harming the developing fetus); and lastly, rare monogenic disorders, where there are approximately 7,000 rare diseases affecting an estimated 31M individuals in the US. For these diseases, about 2,800 genes with phenotype-causing mutations have been identified and catalogued.

This figure of 2,800 is estimated to cover one-third to one-half of all Mendelian disorders. There’s still plenty more work to do, as many of the remaining disorders involve the combined effects of multiple genes (polygenic inheritance).

There is evidence that it takes some 20 to 40 years for technologies to mature. A book excerpt called “Techno-optimism and the rule-of-threes: why the world will soon enter an era of mass flourishing” points to three major spheres of technology (information, machines and materials) that in the early 20th century kicked off another 80 years of “mass flourishing”. For information, telephone and radio are the obvious examples, but spectroscopy and X-ray crystallography were also new information sources in the early part of the 20th century. Machines include greater precision in mass manufacturing, completely new methods of transport (automobiles and airplanes), and new methods for generating power (from hydroelectric to nuclear and now solar and wind). And the early 20th century saw a veritable explosion in the kinds of materials that were developed and put into wide use, from polymers and pharmaceuticals to high-strength concrete and new structural metal alloys. Interestingly, a car uses at least one-third of the elements of the periodic table, while computer and communications equipment uses a full two-thirds.

What kinds of information are maturing now in the realm of biological research? With the Human Genome Project, an immense cataloging of genomic variation has been underway, first with Genome-Wide Association Studies (GWAS) using microarrays (from about 2005 onward), and then Whole-Exome Sequencing and Whole-Genome Sequencing with the advent of next-generation sequencing (from about 2007 onward).

The Human Genome Project and what we have learned

Although the first Genome-Wide Association Study, conducted in 2002, examined the genetic susceptibility to myocardial infarction (Ozaki K, Ohnishi Y, Iida A, et al. Functional SNPs in the lymphotoxin-alpha gene that are associated with susceptibility to myocardial infarction. Nat Genet (2002) 32:650–654 doi:10.1038/ng1047), a landmark GWAS of the most common form of blindness (age-related macular degeneration) was published in 2005 using Affymetrix SNP arrays (Klein RJ, Zeiss C, Chew EY, et al. Complement factor H polymorphism in age-related macular degeneration. Science (2005) 308:385–389 doi:10.1126/science.1109557).

In 2001 the HapMap project was proposed and organized, with the goal of reducing the approximately 10 million then-known variants to approximately 500,000 tagging SNPs. The HapMap database was later superseded by that of the 1000 Genomes Project, begun in 2008. With the advent of ever-less-expensive whole-exome and whole-genome sequencing, in 2015 the 1000 Genomes Project Consortium published the completion of their goal: “A global reference for human genetic variation”, the genomes of 2,504 individuals from 26 populations, containing 88M variants (84.7M SNPs, 3.6M indels, and 60K structural variants), all phased into haplotypes and capturing the great majority of SNPs with a minor allele frequency (MAF, how often the less common allele appears in these populations) of at least 1%.

Of course, the sequencing of whole exomes and whole genomes continues: first with the Exome Aggregation Consortium (ExAC), led by Daniel MacArthur at the Broad Institute, which in 2016 catalogued protein-coding variation in 60.7K individuals; it was in turn superseded in 2020 by another database from the same group at the Broad, called gnomAD (Genome Aggregation Database). At that point there were 125,748 exome sequences and 15,708 genome sequences from specific disease studies as well as population genetics studies. Initial analysis published in 2020 revealed 443,769 high-confidence pLoF (predicted loss-of-function) variants.

These variations can occur at many places along the coding sequence of a gene as well as in non-coding regions (think of promoter regions and intron splice sites). They can also be frameshift mutations, the introduction of a stop codon, or the removal of the canonical stop site. Coding variants can affect the level of that particular protein as it appears in circulation, or yield an abnormal protein: too short, too long, folded differently, or changed in some other way that affects its structure or function.

Remember, humans are diploid

One major point in biology that is easy to overlook is that humans are diploid organisms, carrying two copies of each autosomal locus. When we discuss human genetic variation, the genotype at a particular SNP locus could be a change from C → T (that is, a cytosine DNA base to a thymine) on one copy, while the other copy remains unchanged (a C base in this case). This is called a heterozygous SNP at that particular location, labeled generically as “AB” (A for one allele, B for the alternative allele).

In a different individual who inherited a C base on one copy and a C base on the other, we call that an “AA” homozygote. A third individual, inheriting a T base on both copies at that same SNP location, is called a “BB” homozygote.
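To make the labeling concrete, here is a minimal sketch of how the two inherited bases map onto the “AA”, “AB” and “BB” labels. The function name `genotype_label` and its `ref`/`alt` parameters are hypothetical names chosen for illustration, and the sketch assumes a simple biallelic SNP (C versus T, as in the example above).

```python
def genotype_label(allele_1: str, allele_2: str, ref: str = "C", alt: str = "T") -> str:
    """Return 'AA', 'AB' or 'BB' for a biallelic SNP, given the two inherited bases.

    'A' stands for the reference allele (C here) and 'B' for the alternative
    allele (T here). The order in which the two copies are given does not matter.
    """
    alleles = (allele_1, allele_2)
    if any(base not in (ref, alt) for base in alleles):
        raise ValueError(f"unexpected allele in {alleles}; expected {ref} or {alt}")
    alt_count = alleles.count(alt)  # 0, 1 or 2 copies of the alternative allele
    return ("AA", "AB", "BB")[alt_count]


for pair in [("C", "C"), ("C", "T"), ("T", "C"), ("T", "T")]:
    print(pair, "->", genotype_label(*pair))
```

The count of alternative alleles (0, 1 or 2) is often called the allele dosage, and it is the number that association analyses typically work with.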

Now imagine the “AA” genotype produces a normally functioning protein, and the “BB” genotype a misfolded one due to a change in the amino acid composition of the protein, with direct effects on the levels of that protein in the bloodstream of the individual. The “AA” (“wild-type”, normally functioning protein) person will have high levels of that protein, while the “BB” (“mutated”, abnormally functioning protein) person will have low levels of that protein.

Then there is the case of the “AB”: a third individual, with one good copy and one mutated copy of the same gene. You would expect an intermediate level. And this is exactly what is observed experimentally.

Borrowed from Figure 1d of an excellent review paper, Suhre K, McCarthy MI, Schwenk JM. Genetics meets proteomics: perspectives for large population-based studies. Nat Rev Genet. (2021) 22(1):19-37. doi:10.1038/s41576-020-0268-2
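The stepwise pattern described here (high in “AA”, intermediate in “AB”, low in “BB”) is what a simple additive model captures. The sketch below fits a straight line of protein level against the number of mutated copies. The nine protein values are made up purely for illustration, and `additive_fit` is a hypothetical helper, not a method taken from the review.

```python
from statistics import mean

# Hypothetical data: number of "B" alleles per person (AA=0, AB=1, BB=2) and a
# made-up circulating protein level (arbitrary units) for nine individuals.
dosage = [0, 0, 0, 1, 1, 1, 2, 2, 2]
level = [10.2, 9.8, 10.0, 7.6, 7.4, 7.5, 5.1, 4.9, 5.0]


def additive_fit(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


slope, intercept = additive_fit(dosage, level)

# Mean protein level within each genotype group.
group_means = {
    g: mean([lv for d, lv in zip(dosage, level) if d == g]) for g in (0, 1, 2)
}

print(f"per-allele effect (slope): {slope:.2f}")
print(f"expected level for AA (intercept): {intercept:.2f}")
print("group means:", group_means)
```

A negative slope means each additional mutated copy lowers the protein level by roughly the same amount, which is the intermediate “AB” behavior described above.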

The variant in the example above is directly in the coding region (or in close proximity to it, in the aforementioned case of splice variants). Such variants are called cis-acting Protein Quantitative Trait Loci, or cis-pQTLs for short.

You can extend the example to mutations elsewhere in the genome that sit in other genes (a transcription factor, a companion protein required for function, etc.) yet still affect the circulating level of the protein of interest. Such a polymorphism, perhaps on a different chromosome entirely, affects the level of the protein in the same way as in the figure above; it is just not directly connected, and this is where new biology and discovery lie as well. Since it is far away, this SNP is a trans-acting Protein Quantitative Trait Locus, or trans-pQTL for short.
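In practice, the cis/trans distinction usually comes down to a distance rule. The sketch below is one way to express it, under stated assumptions: the 1 Mb window is an assumed cutoff (studies choose their own), and the `Locus` class, the `classify_pqtl` function and all coordinates are hypothetical.

```python
from dataclasses import dataclass

CIS_WINDOW_BP = 1_000_000  # assumed cutoff for illustration; not taken from the article


@dataclass(frozen=True)
class Locus:
    """Position of the gene that encodes the measured protein."""
    chrom: str
    start: int
    end: int


def classify_pqtl(snp_chrom: str, snp_pos: int, gene: Locus,
                  window: int = CIS_WINDOW_BP) -> str:
    """Label a SNP-protein association as 'cis' or 'trans'."""
    if snp_chrom != gene.chrom:
        return "trans"  # different chromosome entirely
    if gene.start - window <= snp_pos <= gene.end + window:
        return "cis"  # inside or close to the encoding gene
    return "trans"  # same chromosome, but far away


# Hypothetical gene and SNP coordinates.
gene = Locus(chrom="1", start=5_000_000, end=5_050_000)
print(classify_pqtl("1", 5_020_000, gene))   # SNP inside the gene
print(classify_pqtl("1", 80_000_000, gene))  # same chromosome, far away
print(classify_pqtl("12", 5_020_000, gene))  # different chromosome
```

Note that a SNP on the same chromosome can still be trans if it falls outside the window, so the window size is a real analysis choice rather than a biological constant.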

This information is very useful in drug discovery. A companion paper in Nature Medicine 2020 titled “The effect of LRRK2 loss-of-function variants in humans” looks at a gene whose gain-of-kinase function leads to a significant risk of developing Parkinson’s Disease, suggesting that lowering this kinase function would be a useful way to treat Parkinson’s. By analyzing pLoF variants across 141.5K individuals in gnomAD, an additional 49.9K individuals from the UK Biobank, and an additional 4M genotyped individuals in the 23andMe consumer genomics database, the authors found 1,455 individuals with pLoF variants in the LRRK2 gene. They confirmed that 82.5% of these individuals had reduced levels of LRRK2 protein, and that carrying these variants was not strongly associated with any adverse disease outcome.

This is a ‘natural experiment’: a damaged LRRK2 gene means lower levels of activity in these individuals, and thus it is much more likely that therapeutics designed to lower the level of LRRK2 will avoid unwanted adverse effects.

Why you should care about Protein Quantitative Trait Loci (pQTLs)

After three decades of work, first to map the human genome and then to catalogue 88 million variants and sequence more than a hundred and twenty thousand individuals, the scientific community is now ready to finally bring about much more of the promise of the original Human Genome Project: to “revolutionize the diagnosis, prevention and treatment of most, if not all, human diseases.” With 443K pLoF variants that disrupt protein-coding sequence, in addition to the 88M variants phased into haplotypes, now characterized and catalogued, one more piece is missing: the ability to measure thousands of proteins or metabolites at scale across statistically meaningful numbers of affected and normal individuals (usually hundreds or thousands of people).

Circulating proteins and metabolites are the real-time, real-world readout of health and disease. The vast majority of drug targets are proteins, as proteins carry out all the functions of a cell. (The exceptions are specialized drugs, such as DNA analogues that get incorporated during cell division for chemotherapy, or RNAi approaches that knock down specific messenger RNAs.)

As many of you know, I work for one of two companies (Olink Proteomics) that offer products and services to analyze thousands of protein levels from biofluids (primarily serum or plasma, but also cerebrospinal fluid [CSF], cell lysates, ocular fluid, etc.). A competitor to Olink Proteomics, SomaLogic Incorporated, published the first genome-wide pQTL paper using this approach in 2017, measuring 1,124 proteins across 1,000 individuals of Central European descent (whose genotypes were of course known) and replicating the results in another 338 individuals of Arab and Asian ethnicities.

This paper (Suhre K, Arnold M, Bhagwat AM, et al. Connecting genetic risk to disease end points through the human blood plasma proteome. Nat Commun. (2017) 8:14357. Published 2017 Feb 27. doi:10.1038/ncomms14357) discovered 539 associations between protein levels and gene variants (pQTLs) in the German cohort, with half of them replicating. The authors tested 509,946 autosomal SNPs for association with 1,124 protein levels across 1,000 individuals. This is the first large-scale proteomics GWAS on blood plasma proteins in humans, and their analysis goes deep into expression QTLs (eQTLs), methylation QTLs (meQTLs), regulatory elements from ENCODE, glycosylation pathways, as well as the metabolome.

From Figure 1 of Suhre K et al. Nat Commun. (2017) 8:14357. doi:10.1038/ncomms14357
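To get a feel for the scale of such an analysis, the SNP and protein counts quoted above can be multiplied out. The sketch below assumes every SNP is tested against every protein and applies a conservative Bonferroni correction at a family-wise alpha of 0.05; both are assumptions made for illustration, and the variable names are hypothetical.

```python
# Counts quoted in the text for the 2017 plasma proteome GWAS.
n_snps = 509_946
n_proteins = 1_124

# Assumption: every SNP is tested against every protein level.
n_tests = n_snps * n_proteins

# Assumption: Bonferroni correction at a family-wise alpha of 0.05.
alpha = 0.05
threshold = alpha / n_tests

print(f"{n_tests:,} SNP-protein tests")
print(f"Bonferroni-corrected significance threshold: p < {threshold:.2e}")
```

With more than half a billion tests, an association has to be very strong to clear the bar, which is why cohort size matters so much for pQTL discovery.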

For additional visualization of this dataset, a “pGWAS Server” has been set up to display the data via an ‘ideogram’, a ‘list of loci’ (such as illustrated above), and a ‘network view’; it is available online here, along with hyperlinks to the paper and a non-paywalled preprint.

We are living in the age of proteogenomics, combining the wealth of genomic variant information with circulating protein levels to discover Protein Quantitative Trait Loci, and so determine not only the risk for disease but also its cause. It was the stated goal all along, to connect genetic risk to disease end points through the plasma proteome, and now this is all within reach of the scientific community worldwide.


About Dale Yuzuki

A sales and marketing professional in the life sciences research-tools area, Dale currently is employed by Olink as their Americas Field Marketing Director. https://olink.com For additional biographical information, please see my LinkedIn profile here: http://www.linkedin.com/in/daleyuzuki and also find me on Twitter @DaleYuzuki.
