An Exploration of Information Loss in Transformer Embedding Spaces for Enhancing Predictive AI in Genomics

Daniel Hintz
2024-06-06

Thesis Question

How easily can information be extracted from GenSLM embeddings while maintaining its original integrity, and what is the quality of the produced vector embeddings?

Outline

  • Thesis Question
  • Background
    • DNA
    • Motivation
    • GenSLM Embedding Algorithm
  • Data and Processing
    • Source and Cleaning
  • Methodology
    • Intrinsic and Extrinsic Evaluation
  • Results
  • Conclusion

Background

DNA

  • Deoxyribonucleic acid (DNA) is a molecule that contains the genetic code that provides the instructions for building and maintaining life.
  • The structure of DNA can be thought of as rungs on a ladder (known as base pairs) involving the pairing of four nucleotides: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T).

Figure 1: DNA Base Pairs; [1]

DNA Sequencing Technology

  • Someone gets tested for Covid; PCR is run to detect if the assay is indeed positive.
  • Positive tests are sequenced using a machine that is most likely either an Illumina or an Oxford Nanopore platform.
  • The sequencer takes in the Covid sample and arrives at a digital copy of its DNA sequence.

Figure 2: Simplified Schema of Sequencing Process for SARS-CoV-2

SARS-CoV-2 and Proteins

  • SARS-CoV-2 has 29 proteins.
  • Different proteins have different functions.
  • Proteins are encoded from different sites of DNA.
  • In studying embeddings, tasks have been either whole-genome oriented or protein-specific.

Figure 3: SARS-CoV-2 Proteins; Adapted from [2], pg. 99

Mutations

  • A mutation in the context of DNA refers to a change in the nucleotide sequence of the genetic material of an organism.
  • Mutations can be random or induced by external factors.
  • For context, SARS-CoV-2 can mutate to become more infectious, increasing its chances of survival.
  • There are many different types of mutations, but only substitutions are illustrated in this talk.

Figure 4: Substitution Mutations

Motivation

  • This project came about from connecting with Nick Chia, a biophysicist at Argonne National Laboratory.
  • Nick and his collaborators in 2020 (then at the Mayo Clinic) made an algorithm that predicted the trajectory of colorectal cancer mutations.
  • However, this algorithm was slow and expensive to train.
  • The embeddings used in this algorithm were one of the biggest choke points.
  • New and improved ways are needed to embed large genomes into lower dimensions.
    • For example, a one-hot encoding of the human genome is dimensionally \(n \times 3 \text{ billion} \times 4\), where n is the number of patients, 3 billion is the number of rows (one per nucleotide), and the 4 columns represent the four possible bases (A, T, G, C); a minimal sketch follows this list.
  • This is the Big P, little N problem; hence dimension reduction via an embedding is extremely advantageous!
  • Further work is needed in studying the quality of embeddings to help bring about more efficient real-world implementations like precision medicine for colorectal cancer.
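To make the dimensionality problem concrete, here is a minimal R sketch of one-hot encoding a DNA sequence (the helper function and example sequence are hypothetical, for illustration only):

# Minimal sketch: one-hot encode a short DNA sequence.
one_hot_dna <- function(seq) {
  bases <- c("A", "C", "G", "T")
  chars <- strsplit(seq, "")[[1]]
  # One row per nucleotide position, one column per possible base.
  mat <- t(sapply(chars, function(ch) as.integer(bases == ch)))
  colnames(mat) <- bases
  mat
}

one_hot_dna("ACGT")  # 4 x 4 matrix; a whole human genome would be ~3 billion x 4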

Dimension Reduction

In the UW Statistics program, what methods do we learn for applying dimension reduction?

  • Principal component analysis (PCA)!
  • PCA, like embedding algorithms, transforms the coordinate system of our data.
  • The difference for this project is that dimension reduction is being applied to DNA.

Embeddings

  • An embedding is the name for the dense numeric vectors produced by embedding algorithms, such as GenSLM.

  • “Embedding” generally refers to the embedding vector.

  • Embeddings are representations of data in a lower-dimensional space that preserve some aspects of the original structure of the data.

  • But why would you ever embed something?

    • Embedding vectors are generally more computationally efficient.

Sequence Embeddings

  • Neural network embedding algorithms for genomic data include GenSLM (the focus of this presentation), DNABERT2, and HyenaDNA.
  • These embeddings are lossy encodings, meaning that some information is lost in the transformation.
  • They can also distort structural relationships represented in the data.

GenSLM

  • Overall, the GenSLM algorithm first tokenizes, i.e., breaks up sequences into chunks of three nucleotides (codons), as sketched below.
  • Then, the input sequences are vectorized, creating a \(1 \times 512\) vector for each input sequence.
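As a toy illustration (my own sketch, not GenSLM’s actual tokenizer), codon-level tokenization can be written in R as:

# Break a sequence into non-overlapping chunks of three nucleotides (codons).
tokenize_codons <- function(seq) {
  n <- floor(nchar(seq) / 3)
  substring(seq, first = 3 * (0:(n - 1)) + 1, last = 3 * (1:n))
}

tokenize_codons("ATGGCGTTTACA")  # "ATG" "GCG" "TTT" "ACA"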

Figure 5: Generic Embedding Workflow
  • First the tokenized input sequence is passed to the transformer encoder.
  • The transformer encoder converts 1,024 bp slices into numeric vectors.
  • These are run recursively through a diffusion model to learn a condensed distribution of the whole sequence.
  • The transformers in GenSLM are used to capture local interactions within a genomic sequence.

Figure 6: GenSLM Transformer Architecture; [3]
  • Diffusion and Transformer models work in tandem in GenSLM.
  • The transformer captures local interactions, while the diffusion model learns a representation of the transformer vectors to represent the whole genome.

Downstream

  • Consider our colorectal cancer example: in the context of precision medicine, you may want to classify aggressive and non-aggressive cancers.

  • Or in the context of SARS-CoV-2, you may want to be able to monitor changes in variants and how they differ from previous variants.

  • Assuming embeddings are used in both examples, the tasks are considered downstream of the process of creating the embeddings from the original data.

Raw DNA Versus Embedded Data

Project Workflow

Figure 7: Project Workflow

Data and Processing

Data Description

  • All Covid sequences were downloaded with permission from the Global Initiative on Sharing All Influenza Data (GISAID).
  • No geographical or temporal restrictions were placed on the sequences extracted.

Data Cleaning

  • GISAID’s exclusion criteria were used to remove sequences of poor quality.
  • Additional exclusion criteria were implemented to remove all sequences with ambiguous nucleotides (the NAs of the genomic world) and large gap sizes post-alignment.

Multiple Sequence Alignment

  • A multiple sequence alignment was performed to be able to slice the exact location of proteins across different sequences.

Figure 18: Unaligned Sequences

Figure 19: Aligned Sequences

Exploratory Data Analysis

Figure 11: Hamming Distance of Aligned Proteins to Wuhan Reference Sequence
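For reference, a minimal R sketch of the Hamming distance underlying Figure 11 (the two example sequences are hypothetical):

# Hamming distance: the number of positions at which two aligned,
# equal-length sequences differ.
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

hamming("ACGTACGT", "ACGAACGA")  # 2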

Methodology

Evaluating Embeddings

  • Broadly speaking, the quality of an embedding is assessed based upon its information richness and the degree of non-redundancy.

    • There are two main methodologies for evaluating embeddings:

    • When downstream learning tasks are evaluated, it’s referred to as extrinsic evaluation [4].
    • When the qualities of the embedding matrix itself are assessed it is referred to as intrinsic evaluation [4].

Intrinsic Evaluation

  • There are three main qualities to assess when studying the quality of an embedding: redundancy, separability, and preservation of semantic distance.
    • Redundancy reveals the efficiency of an embedding’s encoding; more efficient encodings tend to perform better and use fewer computational resources.
    • Separability gives practical insight into whether or not the embedding output can be separated into meaningful genomic groups (i.e., variants).
    • Preservation of semantic distance is important for determining if the structural representation of information remains valid for subsequent tasks.
  • Sub-methods used for intrinsic evaluation include Singular Value Decomposition (SVD), distance matrices, radial dendrograms, and Principal Component Analysis (PCA).

Extrinsic Evaluation

  • Extrinsic evaluation pursues practical benchmarks to assess the performance of an embedding algorithm, via its associated embedding matrix, across varying levels of task complexity.
  • For GenSLM, the subset of possible tasks chosen were variant and protein classification.
  • A Classification and Regression Tree (CART) was used to perform the classification, as sketched below.
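A minimal sketch of such a classifier with the rpart package (emb, an n x 512 embedding matrix, and variant, a factor of variant labels, are hypothetical placeholders):

library(rpart)

# Combine embedding features and labels into one data frame.
df <- data.frame(variant = variant, emb)

# Fit a classification tree and compute (training) accuracy;
# a held-out test set should be used for an honest estimate.
fit <- rpart(variant ~ ., data = df, method = "class")
pred <- predict(fit, df, type = "class")
mean(pred == df$variant)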

Results

Intrinsic Evaluation Results

Redundancy

Singular Value Decomposition

Figure 13: SVD CPE Plot
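Assuming CPE denotes the cumulative proportion explained, a minimal R sketch of how such a curve can be computed from a hypothetical embedding matrix emb:

# Singular values of the column-centered embedding matrix.
X <- scale(emb, center = TRUE, scale = FALSE)
d <- svd(X)$d

# Cumulative proportion of variance explained by the leading dimensions.
cpe <- cumsum(d^2) / sum(d^2)
plot(cpe, type = "b", xlab = "Singular value index",
     ylab = "Cumulative proportion explained")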

Preservation of Semantic Distance

Distance Matrices

  • There is a lot of heterogeneity in the differences among sequences at fine scales.
  • Notice the L-shaped striations in the Alpha sequences compared to other variants.
  • Comparing the sequence distance matrix to the embedding distance matrix, we see that finer differences among sequences are lost in the embedding transformation.
  • The absolute difference matrix is the absolute difference between the sequence distance matrix and the embedding distance matrix.
  • The difference distance matrix is the sequence distance matrix minus the embedding distance matrix; both are sketched below.
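A minimal R sketch of these comparison matrices (seq_dist, a pairwise sequence distance matrix, and emb, the embedding matrix, are hypothetical placeholders; both matrices are rescaled to [0, 1] so they are comparable):

# Euclidean distances between rows of the embedding matrix.
emb_dist <- as.matrix(dist(emb))

# Rescale both distance matrices to [0, 1].
norm01 <- function(m) (m - min(m)) / (max(m) - min(m))
s <- norm01(seq_dist)
e <- norm01(emb_dist)

abs_diff  <- abs(s - e)  # absolute difference matrix
diff_dist <- s - e       # difference distance matrix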

Radial Dendrograms (1)

  • Radial dendrograms show how easily sequences cluster by their variant (a minimal clustering sketch follows this list).
  • Figures 12 and 13 on the next two slides show:
    • Radial dendrograms using the sequence data do not perfectly cluster sequences by variant.
    • Radial dendrograms using the embedded data perfectly cluster sequences by variant.
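A minimal sketch of the clustering behind such a dendrogram (emb_dist is the hypothetical embedding distance matrix from the previous slides; the fan layout assumes the ape package is installed):

# Average-linkage hierarchical clustering on embedding distances.
hc <- hclust(as.dist(emb_dist), method = "average")

# Draw the tree with a radial (fan) layout.
library(ape)
plot(as.phylo(hc), type = "fan")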

Radial Dendrograms (2)

Separability

Principal Component Analysis
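A minimal R sketch of a PCA separability check (emb and variant are hypothetical placeholders):

# Project embeddings onto their first two principal components,
# coloring points by variant to inspect separability.
pc <- prcomp(emb, center = TRUE)
plot(pc$x[, 1], pc$x[, 2], col = as.integer(factor(variant)),
     xlab = "PC1", ylab = "PC2")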

Extrinsic Evaluation Results

Variant Classification

Variant Embedding Classification (1)

  • Variant classification with a CART learner achieved a 97.71% success rate on aligned sequence embeddings.

Figure 17: Variant Misclassifications

Variant Embedding Classification (2)

Figure 17: Variant Misclassifications
  • Imagine if we could classify aggressive and non-aggressive cancer with this kind of accuracy.

Variant One-Hot-Encoding Classification

  • Variant classification with a CART learner achieved only a 10.28% success rate on one-hot-encoded sequences.

Thesis Question

How easily can information be extracted from GenSLM embeddings while maintaining its original integrity, and what is the quality of the produced vector embeddings?

Conclusion

  • The study assessed GenSLM’s embeddings for quality using intrinsic and extrinsic evaluations, focusing on redundancy, separability, and information preservation.
  • While Covid DNA can be highly related, with as little as 0.067% differentiating some variants, the GenSLM embedding indicated strong performance.
    • GenSLM demonstrated strong performance in classification tasks, with a 97.71% success rate for variant classification and 100% for proteins using the CART learner, significantly outperforming one-hot-encoded matrices.
    • GenSLM also excelled in separability.
  • However, GenSLM also displayed suboptimal features:
    • GenSLM was not effective in preserving fine-scale genetic differences and distorted the genetic distances between certain viral variants.
  • Overall, GenSLM has room to improve considering its dimensional redundancy and only moderate success in preserving semantic distances.
  • Future research should compare GenSLM with other neural network embedding algorithms and explore its utility in more diverse genomic analysis tasks to establish broader applicability.

Acknowledgements

  • A big thank you to my committee: Tim Robinson, Shaun Wulff, Sasha Skiba, and Nick Chia.
  • An extra special thank you to Nick for sticking by me!
  • Robert Petit for introducing me to bioinformatics!
  • Liudmila Mainzer for all your support, training, and guidance!
  • My cohort: Allie Midkiff, Oisín O’Gailin, Austin Watson, Sandra Biller, Daiven Francis, and Joe Crane.
  • My partner Hana for all your support and patience!
  • And another big thank you to Tim!

Thank you!!

Appendix

Word Embeddings

  • Before we introduce GenSLM, let’s look at a simpler application in natural language.
    • If we consider English text, how can we measure the similarity between words?
    • For example, what is the semantic difference between the words “King” and “Woman”?
    • Let’s demonstrate the semantic relationships of “King”, “Queen”, “Man”, “Woman”.

Word2vec Example

## Pre-run code ##
library(word2vec)
# Using the British National Corpus model from http://vectors.nlpl.eu/repository/
model <- read.word2vec("British_National_Corpus_Vector_size_300_skipgram", normalize = TRUE)
embedding <- predict(model, newdata = c("king_NOUN", "man_NOUN", "woman_NOUN", "queen_NOUN"),
                     type = "embedding")

Plotting Word Embeddings

  • We can plot the vectors shown on the previous slide to demonstrate how word2vec produces embeddings that measure semantic similarity between words.

Figure 4: Word Embedding Vectors
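One way such a plot can be produced is by projecting the four 300-dimensional vectors onto their first two principal components (a sketch, not necessarily the method used for the figure):

# `embedding` comes from the word2vec slide; its row names are the words.
pc <- prcomp(embedding)
plot(pc$x[, 1], pc$x[, 2], xlab = "PC1", ylab = "PC2")
text(pc$x[, 1], pc$x[, 2], labels = rownames(embedding), pos = 3)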

Kullback–Leibler Divergence (KLD)

Figure 15: CKL vs DKL
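For reference, the standard definitions of the KL divergence for discrete distributions \(P, Q\) and continuous densities \(p, q\):

\[
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)},
\qquad
D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx
\]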

References

1. National Human Genome Research Institute (2024) Base pair
2. Kandwal S, Fayne D (2023) Genetic conservation across SARS-CoV-2 non-structural proteins – insights into possible targets for treatment of future viral outbreaks. Virology
3. Zvyagin M, Brace A, Hippe K, et al (2023) GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. The International Journal of High Performance Computing Applications 37(6):683–705
4. Lavrač N, Podpečan V, Robnik-Šikonja M (2021) Representation learning: Propositionalization and embeddings. Springer