This page describes the process of annotating genomic data for clinical interpretation.
Genome Annotation for Clinical Interpretation
The human genome (i.e., the set of all genetic material in an individual) contains over 3 billion nucleotides, any of which can be mutated in a particular individual. In fact, every person’s genome contains many thousands of mutations, most of which have no impact on the health of the individual. However, some do, and identifying those impactful mutations consumes nearly all of the time and energy of modern geneticists.
To detect mutations that impact human health, genome scientists created a valuable tool called the human reference genome. This genome is not that of any one person; rather, it is assembled by comparing the genomes of many individuals and recording, for each location in the genome, what nucleotide occurs there most frequently. By comparing the genome of a specific individual to this reference, scientists and clinicians can easily detect nucleotides that are mutated.
These mutations are called ‘genomic variants’ by geneticists. Genomic variants come in a variety of forms. The easiest to detect are called single nucleotide polymorphisms (SNPs), which change a single nucleotide (e.g. adenine) to another (e.g. cytosine). Mutations can also take the form of insertions of sequence, deletions, and even large structural changes that cause whole arms of chromosomes to be rearranged in complex ways.
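As a rough illustration of the variant forms described above, the sketch below classifies a variant from its reference and alternate alleles alone. The function name and the category labels are assumptions made for this example; large structural rearrangements cannot be recognized this way and are not covered.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a small variant by comparing reference and alternate alleles.

    Illustrative only: the labels are hypothetical, and structural
    rearrangements need other detection methods.
    """
    if ref == alt:
        return "no change"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"  # one nucleotide replaced by another
    if len(alt) > len(ref):
        return "insertion"  # sequence added relative to the reference
    if len(alt) < len(ref):
        return "deletion"  # sequence removed relative to the reference
    return "multi-nucleotide substitution"


# Example alleles (made up for illustration)
for ref, alt in [("A", "C"), ("A", "ATT"), ("ATG", "A")]:
    print(ref, alt, classify_variant(ref, alt))
```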
Once variants have been detected, the next step towards interpreting their impact on human health is to annotate them. Annotation is a big topic, but in short we use that term to mean “match a variant with the information that is currently known about it”.
One of the first and most important goals of annotation is to assign a population frequency to each genomic variant. The population frequency of a variant tells you in what fraction of a particular population (e.g. individuals of European, African or Asian ancestry) a variant occurs. Typically, if a variant is seen in a large number of individuals, it is not causative of rare diseases (although there are exceptions). Variant frequencies are found simply by sequencing large numbers of people (which we have done already as part of ongoing scientific research studies), and tracking in how many individuals a given variant occurs. Then, when a variant that has previously been seen in other samples appears in the sample you are analyzing, you can assign it the known population frequency.
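The calculation described above can be sketched in a few lines. The population names and counts here are invented for illustration, and the function name is an assumption; real resources typically report allele counts rather than counts of individuals, which this simplified version ignores.

```python
def population_frequency(carriers: int, sequenced: int) -> float:
    """Fraction of sequenced individuals in a population who carry a variant."""
    if sequenced <= 0:
        raise ValueError("at least one individual must have been sequenced")
    return carriers / sequenced


# Hypothetical counts for one variant: (carriers, individuals sequenced)
counts = {
    "European": (120, 60000),
    "African": (3, 12000),
}

frequencies = {
    population: population_frequency(carriers, sequenced)
    for population, (carriers, sequenced) in counts.items()
}
print(frequencies)
```

When the same variant later appears in a new sample, it can be looked up in a table like `frequencies` rather than recomputed.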
Many large studies have contributed data over the years to databases that record variant frequencies. Currently, the most comprehensive such sources are made available through the Exome Aggregation Consortium (ExAC) and the Genome Aggregation Database (gnomAD). gnomAD adds whole-genome data to exome data (the details of those differences are beyond the scope of this article) and can generally be considered the most comprehensive data source going forward.
In general, genomic variants that appear in more than 5% of individuals can be removed from consideration when attempting to interpret a genome. There are several exceptions to this rule (for example the well-known hemochromatosis allele, c.845G>A), and these special cases need to be handled carefully to avoid missing important findings. This cutoff can be considerably more strict in many situations, and ClinGen, a large group of scientists working to create an open, centralized genomic interpretation resource, will issue gene-specific variant frequency cutoff thresholds in coming years.
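A minimal sketch of such a frequency filter follows, assuming a simple in-memory exception list. The constant names, the function name, and the choice to keep variants with no recorded frequency are assumptions for this example, not a description of any particular pipeline.

```python
COMMON_CUTOFF = 0.05  # the 5% rule of thumb from the text

# Variants kept even when common; the identifier is the one named in the text.
KEEP_REGARDLESS = {"c.845G>A"}


def passes_frequency_filter(variant_id, frequency, cutoff=COMMON_CUTOFF):
    """Return True if a variant should stay in consideration.

    `frequency` is None when the variant has never been observed in the
    frequency databases (assumed here to mean "keep it").
    """
    if variant_id in KEEP_REGARDLESS:
        return True
    if frequency is None:
        return True
    return frequency <= cutoff


print(passes_frequency_filter("c.845G>A", 0.06))     # exception: kept
print(passes_frequency_filter("c.100A>T", 0.20))     # common: removed
print(passes_frequency_filter("c.100A>T", 0.001))    # rare: kept
```

Passing a smaller `cutoff` models the stricter, gene-specific thresholds mentioned above.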
Once a variant frequency is assigned, the next step is to interpret the consequence of the variant on a protein product. Most genes have an effect on an organism after being translated into a protein in a cell. These proteins carry out a wide range of functions for the cell, ranging from providing structural support to catalyzing enzymatic reactions. Variants that disrupt a protein’s ability to carry out its function are more likely to have an impact on health. There are a number of tools (VEP, ANNOVAR, SnpEff) that will predict the effect of a variant on the protein, and they can disagree. Even more importantly, scientists do not completely agree on precisely where within the human genome genes are located, which has given rise to several different ‘gene models’, or representations of the gene’s genomic coordinates (e.g. Ensembl, RefSeq, UCSC). This is further complicated by the fact that a gene can have multiple ‘transcripts’, or versions of messenger RNA that are created from the same DNA. Because of these factors, genome annotation typically is run using multiple gene models, with tools for resolving differences.
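One common way to resolve differing predictions is to keep the most severe consequence reported across all transcripts and gene models. The sketch below illustrates that idea under stated assumptions: the severity ordering, the consequence labels, and the transcript identifiers are all hypothetical, and this is not how any of the tools named above is claimed to work internally.

```python
# Hypothetical ordering, least to most severe.
SEVERITY = ["synonymous", "missense", "splice_site", "frameshift", "stopgain"]
RANK = {name: index for index, name in enumerate(SEVERITY)}


def most_severe(consequences_by_transcript: dict) -> tuple:
    """Return the (transcript, consequence) pair with the highest severity.

    Consequences missing from SEVERITY rank below every known one.
    """
    if not consequences_by_transcript:
        raise ValueError("no transcript predictions supplied")
    return max(
        consequences_by_transcript.items(),
        key=lambda item: RANK.get(item[1], -1),
    )


# Made-up predictions for one variant under two gene models
predictions = {
    "transcript_from_model_A": "missense",
    "transcript_from_model_B": "stopgain",
}
print(most_severe(predictions))
```

Keeping the transcript alongside the consequence makes it possible to review which gene model drove the call.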
Of special interest to clinical geneticists are genomic variants that cause ‘loss of function’, that is, mutations that are likely to completely destroy the resulting protein product. Two important classes of loss-of-function variants are stopgain and frameshift variants. In a stopgain variant, the variant creates a new ‘codon’ that tells the ribosome, which is responsible for creating a protein from messenger RNA, to stop creating that protein prematurely. Then, through a process called nonsense-mediated decay, the messenger RNA carrying the premature stop codon is typically destroyed, causing that protein to be partially or completely absent from the cell, with a variety of ensuing, but often dire, consequences. A frameshift variant, on the other hand, causes the translation mechanism to create a completely different protein sequence from what was encoded in the DNA. This can have a variety of effects, including creating a premature stop codon and triggering nonsense-mediated decay, but the resulting protein product can also be longer than coded for. In either case, the protein product is generally non-functional.
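Both classes can be recognized with simple checks on a coding sequence, as the sketch below suggests. It assumes the sequence is already in frame and starts at the first codon; the function names and example sequences are made up for illustration, and real annotation tools handle many cases that this ignores.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons in the standard genetic code


def first_stop_index(cds: str):
    """Index of the first in-frame stop codon, or None if there is none."""
    for start in range(0, len(cds) - 2, 3):
        if cds[start:start + 3] in STOP_CODONS:
            return start // 3
    return None


def is_stopgain(ref_cds: str, alt_cds: str) -> bool:
    """True if the altered sequence stops earlier than the reference does."""
    ref_stop = first_stop_index(ref_cds)
    alt_stop = first_stop_index(alt_cds)
    return alt_stop is not None and (ref_stop is None or alt_stop < ref_stop)


def is_frameshift(ref: str, alt: str) -> bool:
    """True if an insertion or deletion changes length by a non-multiple of 3."""
    return (len(alt) - len(ref)) % 3 != 0


reference = "ATGCAAGGATAA"  # ATG CAA GGA TAA
mutated = "ATGTAAGGATAA"    # ATG TAA ...  (C>T creates an early stop)
print(is_stopgain(reference, mutated))
print(is_frameshift("A", "AT"), is_frameshift("A", "ATGC"))
```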
Not all important variants occur in locations that impact proteins. Two important classes of genetic variation that occur outside of protein-coding regions are ‘transcription factor binding site’ variants and ‘splicing’ variants. Both of these types of variants impact the process by which DNA is transcribed into messenger RNA (an intermediate step in the creation of a protein). In an extraordinarily complex set of processes, a cell ‘expresses’ different sets of genes based on environmental and internal signals. To express a gene, ‘transcription factors’ must first bind to DNA at special locations, called ‘transcription factor binding sites’. The transcription factor attracts a large protein complex called RNA polymerase, which ‘reads’ DNA and produces a matching RNA ‘transcript’. This transcript RNA must then be ‘spliced’, that is, edited by a massive enzyme complex called the spliceosome, before it can be translated into a protein. The spliceosome removes stretches of RNA that are not translated, called introns, to create the final messenger RNA. Variants that disrupt either the splice site or the transcription factor binding site can alter the amount of correctly produced protein that is available to a cell.
After the effect on the protein product is assessed, many variants can be filtered using knowledge about which types of variants cause disease under which circumstances. For example, variants that have little to no impact on the protein product rarely impact human health.
Next, other information can be assigned to variants to help the interpretation process. There may be functional studies of this variant in scientific literature that further describe the effect the variant has on the protein, cell or organism. There may be studies that show how a particular variant was inherited within a family of individuals impacted by a particular disease of interest. Some variants appear in disease databases as having been associated with individuals with a given disease previously, and many variants have been previously interpreted and submitted to a central database called ClinVar. There are a large number of computational algorithms that attempt to assess the deleteriousness of a variant, although the accuracy of these is not currently high enough for them to be used in a clinical context without being accompanied by a wide variety of supporting information. Finally, there may be much known about the gene itself, which can be helpful in interpreting the variant.
The American College of Medical Genetics and Genomics has issued a growing set of guidelines for interpreting genomes, and all of the evidence above can be combined in a framework for assigning variant pathogenicity. Typically it takes multiple (3 or more) pieces of supporting evidence that point towards pathogenicity, as well as the absence of evidence that a variant is benign, for it to be considered for inclusion on a clinical report. Variants that are on the borderline but lack definitive evidence, or for which there are conflicting pieces of evidence, are assigned the category ‘variant of uncertain significance’ (VUS) and will need to be re-interpreted as more data is gathered. VUS are a large and growing problem for the field of clinical genomics.
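The rule of thumb in the paragraph above can be written down as a toy triage function. This is a deliberately simplified sketch and not the guideline scoring rules themselves, which weight evidence by strength; the function name, the evidence labels, and the returned category strings are assumptions for this example.

```python
def triage(pathogenic_evidence, benign_evidence, min_support=3):
    """Very rough triage based on counts of evidence items.

    Each argument is a list of evidence descriptions. Real guidelines
    weight evidence by strength; this sketch only counts items.
    """
    if len(pathogenic_evidence) >= min_support and not benign_evidence:
        return "candidate for clinical report"
    if benign_evidence and not pathogenic_evidence:
        return "benign evidence only"
    # Borderline, conflicting, or missing evidence
    return "VUS"


# Hypothetical evidence labels
print(triage(["rare in population", "stopgain", "segregates in family"], []))
print(triage(["rare in population"], []))
print(triage(["stopgain", "functional study"], ["common in one population"]))
```

Anything returned as `"VUS"` would be queued for re-interpretation as more data is gathered.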
Codified’s system takes in raw genomic information, automatically analyzes it, and assigns a wide range of annotations. Once loaded, these annotations are presented in a compact format that makes all of the relevant data quickly available to clinical geneticists who are performing the interpretation. In addition, Codified provides a number of automated tools that help prioritize variants for review and assign pathogenicity.
Clinical genetics is a field that moves extremely rapidly, which means that once assigned, annotations become outdated quickly. Because of this, old cases must be frequently “re-analyzed”. Codified Genomics handles this by providing a means for automatically updating variant annotations (while retaining the previous data in case it needs to be revisited). New annotations can be reviewed in a dedicated interface that is designed to prioritize data that is likely to have a large effect on the interpretation of an existing variant.
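The general idea of updating annotations while retaining earlier versions can be sketched as follows. This is a generic illustration, not a description of how Codified's system is built; the class, field names, and example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedVariant:
    """A variant with its current annotations and all earlier versions."""
    variant_id: str
    current: dict
    history: list = field(default_factory=list)

    def update(self, new: dict) -> dict:
        """Apply new annotations, keep the old snapshot, report what changed.

        Returns {field: (old_value, new_value)} for every field that differs.
        """
        changed = {
            key: (self.current.get(key), value)
            for key, value in new.items()
            if self.current.get(key) != value
        }
        if changed:
            self.history.append(dict(self.current))  # retain previous data
            self.current = {**self.current, **new}
        return changed


# Made-up example values
variant = AnnotatedVariant(
    "example-variant-1",
    {"database_assertion": "uncertain significance", "frequency": 0.0001},
)
changes = variant.update(
    {"database_assertion": "likely pathogenic", "frequency": 0.0001}
)
print(changes)
```

The returned dictionary of changes is what a review interface could sort to surface the updates most likely to alter an interpretation.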
For any questions or comments, email firstname.lastname@example.org.