An Exploration of Information Loss in Transformer Embedding Spaces for Enhancing Predictive AI in Genomics

Daniel Hintz
2024-06-06

Thesis Question

How easily can information be extracted from GenSLM embeddings while maintaining its original integrity, and what is the quality of the produced vector embeddings?

Outline

  • Thesis Question
  • Background
    • DNA
    • Motivation
    • GenSLM Embedding Algorithm
  • Data and Processing
    • Source and Cleaning
  • Methodology
    • Intrinsic and Extrinsic Evaluation
  • Results
  • Conclusion

Background

DNA

  • Deoxyribonucleic acid (DNA) is a molecule that contains the genetic code that provides the instructions for building and maintaining life.
  • The structure of DNA can be thought of as rungs on a ladder (known as base pairs) involving the pairing of four nucleotides: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T).

Figure 1: DNA Base Pairs; [1]

DNA Sequencing Technology

  • Someone gets tested for Covid; PCR is run to detect if the assay is indeed positive.
  • Positive tests are sequenced using a machine that is most likely either an Illumina or an Oxford Nanopore platform.
  • The sequencer takes in the Covid sample and arrives at a digital copy of its DNA sequence.

Figure 2: Simplified Schema of Sequencing Process for SARS-CoV-2

SARS-CoV-2 and Proteins

  • SARS-CoV-2 has 29 proteins.
  • Different proteins have different functions.
  • Proteins are encoded from different sites of DNA.
  • In studying embeddings, tasks have been either whole-genome oriented or protein-specific.

Figure 3: SARS-CoV-2 Proteins; Adapted from [2], pg. 99

Mutations

  • A mutation in the context of DNA refers to a change in the nucleotide sequence of the genetic material of an organism.
  • Mutations can be random or induced by external factors.
  • For context, SARS-CoV-2 can mutate to become more infectious, increasing its chances of survival.
  • There are many different types of mutations, but only substitutions are illustrated in this talk.

Figure 4: Substitution Mutations

Motivation

  • This project came about from connecting with Nick Chia, a biophysicist at Argonne National Laboratory.
  • Nick and his collaborators in 2020 (then at the Mayo Clinic) made an algorithm that predicted the trajectory of colorectal cancer mutations.
  • However, this algorithm was slow and expensive to train.
  • The embeddings used in this algorithm were one of the biggest choke points.
  • New and improved ways are needed to embed large genomes into lower dimensions.
    • For example, a one-hot encoding of the human genome is dimensionally \(n \times 3 \text{ billion} \times 4\), where n is the number of patients, 3 billion is the number of rows (one per nucleotide), and the 4 columns represent the four possible bases (A, T, G, C); a minimal sketch follows this list.
  • This is the Big P, little N problem; hence dimension reduction via an embedding is extremely advantageous!
  • Further work is needed in studying the quality of embeddings to help bring about more efficient real-world implementations like precision medicine for colorectal cancer.
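To make the dimensionality problem concrete, here is a minimal R sketch of one-hot encoding a DNA sequence (the helper function and example sequence are hypothetical, for illustration only):

# Minimal sketch: one-hot encode a short DNA sequence.
one_hot_dna <- function(seq) {
  bases <- c("A", "C", "G", "T")
  chars <- strsplit(seq, "")[[1]]
  # One row per nucleotide position, one column per possible base.
  mat <- t(sapply(chars, function(ch) as.integer(bases == ch)))
  colnames(mat) <- bases
  mat
}

one_hot_dna("ACGT")  # 4 x 4 matrix; a whole human genome would be ~3 billion x 4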

Dimension Reduction

In the UW Statistics program, what methods do we learn for applying dimension reduction?

  • Principal component analysis (PCA)!
  • PCA, like embedding algorithms, transforms the coordinate system of our data.
  • The difference for this project is that dimension reduction is being applied to DNA.

Embeddings

  • An embedding is the name for the dense numeric vectors produced by embedding algorithms, such as GenSLM.

  • “Embedding” generally refers to the embedding vector.

  • Embeddings are representations of data in a lower-dimensional space that preserve some aspects of the original structure of the data.

  • But why would you ever embed something?

    • Embedding vectors are generally more computationally efficient.

Sequence Embeddings

  • Neural network embedding algorithms for genomic data include GenSLM (the focus of this presentation), DNABERT2, and HyenaDNA.
  • These embeddings are lossy encodings, meaning that some information is lost in the transformation.
  • They can also distort structural relationships represented in the data.

GenSLM

  • Overall, the GenSLM algorithm first tokenizes, i.e., breaks up sequences into chunks of three nucleotides (codons), as sketched below.
  • Then, the input sequences are vectorized, creating a \(1 \times 512\) vector for each input sequence.
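As a toy illustration (my own sketch, not GenSLM’s actual tokenizer), codon-level tokenization can be written in R as:

# Break a sequence into non-overlapping chunks of three nucleotides (codons).
tokenize_codons <- function(seq) {
  n <- floor(nchar(seq) / 3)
  substring(seq, first = 3 * (0:(n - 1)) + 1, last = 3 * (1:n))
}

tokenize_codons("ATGGCGTTTACA")  # "ATG" "GCG" "TTT" "ACA"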

Figure 5: Generic Embedding Workflow
  • First the tokenized input sequence is passed to the transformer encoder.
  • The transformer encoder converts 1,024 bp slices into numeric vectors.
  • These are run recursively through a diffusion model to learn a condensed distribution of the whole sequence.
  • The transformers in GenSLM are used to capture local interactions within a genomic sequence.

Figure 6: GenSLM Transformer Architecture; [3]
  • Diffusion and Transformer models work in tandem in GenSLM.
  • The transformer captures local interactions, while the diffusion model learns a representation of the transformer vectors to represent the whole genome.

Downstream

  • Consider our colorectal cancer example: in the context of precision medicine, you may want to classify aggressive and non-aggressive cancers.

  • Or in the context of SARS-CoV-2, you may want to be able to monitor changes in variants and how they differ from previous variants.

  • Assuming embeddings are used in both examples, the tasks are considered downstream of the process of creating the embeddings from the original data.

Raw DNA Versus Embedded Data

Project Workflow

Figure 7: Project Workflow

Data and Processing

Data Description

  • All Covid sequences were downloaded with permission from the Global Initiative on Sharing All Influenza Data (GISAID).
  • No geographical or temporal restrictions were placed on the sequences extracted.

Data Cleaning

  • GISAID’s exclusion criteria were used to remove sequences of poor quality.
  • Additional exclusion criteria were implemented to remove all sequences with ambiguous nucleotides (the NAs of the genomic world) and large gap sizes post-alignment.

Multiple Sequence Alignment

  • A multiple sequence alignment was performed to be able to slice the exact location of proteins across different sequences.

Figure 18: Unaligned Sequences

Figure 19: Aligned Sequences

Exploratory Data Analysis

Figure 11: Hamming Distance of Aligned Proteins to Wuhan Reference Sequence
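For reference, a minimal R sketch of the Hamming distance underlying Figure 11 (the two example sequences are hypothetical):

# Hamming distance: the number of positions at which two aligned,
# equal-length sequences differ.
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

hamming("ACGTACGT", "ACGAACGA")  # 2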

Methodology

Evaluating Embeddings

  • Broadly speaking, the quality of an embedding is assessed based upon its information richness and the degree of non-redundancy.

    • There are two main methodologies for evaluating embeddings:

    • When downstream learning tasks are evaluated, it’s referred to as extrinsic evaluation [4].
    • When the qualities of the embedding matrix itself are assessed it is referred to as intrinsic evaluation [4].

Intrinsic Evaluation

  • There are three main qualities to assess when studying the quality of an embedding: redundancy, separability, and preservation of semantic distance.
    • Redundancy reveals the efficiency of an embedding’s encoding; more efficient encodings tend to perform better and use fewer computational resources.
    • Separability gives practical insight into whether or not the embedding output can be separated into meaningful genomic groups (i.e., variants).
    • Preservation of semantic distance is important for determining if the structural representation of information remains valid for subsequent tasks.
  • Sub-methods used for intrinsic evaluation include Singular Value Decomposition (SVD), distance matrices, radial dendrograms, and Principal Component Analysis (PCA).

Extrinsic Evaluation

  • Extrinsic evaluation pursues practical benchmarks to assess the performance of an embedding algorithm, via its associated embedding matrix, across varying levels of task complexity.
  • For GenSLM, the subset of possible tasks chosen were variant and protein classification.
  • A Classification and Regression Tree (CART) was used to perform the classification, as sketched below.
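A minimal sketch of such a classifier with the rpart package (emb, an n x 512 embedding matrix, and variant, a factor of variant labels, are hypothetical placeholders):

library(rpart)

# Combine embedding features and labels into one data frame.
df <- data.frame(variant = variant, emb)

# Fit a classification tree and compute (training) accuracy;
# a held-out test set should be used for an honest estimate.
fit <- rpart(variant ~ ., data = df, method = "class")
pred <- predict(fit, df, type = "class")
mean(pred == df$variant)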

Results

Intrinsic Evaluation Results

Redundancy

Singular Value Decomposition

Figure 13: SVD CPE Plot
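Assuming CPE denotes the cumulative proportion explained, a minimal R sketch of how such a curve can be computed from a hypothetical embedding matrix emb:

# Singular values of the column-centered embedding matrix.
X <- scale(emb, center = TRUE, scale = FALSE)
d <- svd(X)$d

# Cumulative proportion of variance explained by the leading dimensions.
cpe <- cumsum(d^2) / sum(d^2)
plot(cpe, type = "b", xlab = "Singular value index",
     ylab = "Cumulative proportion explained")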

Preservation of Semantic Distance

Distance Matrices

  • There is a lot of heterogeneity in the differences among sequences at fine scales.
  • Notice the L-shaped striations in the Alpha sequences compared to other variants.
  • Comparing the sequence distance matrix to the embedding distance matrix, we see that finer differences among sequences are lost in the embedding transformation.
  • The absolute difference matrix is the absolute difference between the sequence distance matrix and the embedding distance matrix.
  • The difference distance matrix is the sequence distance matrix minus the embedding distance matrix; both are sketched below.
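A minimal R sketch of these comparison matrices (seq_dist, a pairwise sequence distance matrix, and emb, the embedding matrix, are hypothetical placeholders; both matrices are rescaled to [0, 1] so they are comparable):

# Euclidean distances between rows of the embedding matrix.
emb_dist <- as.matrix(dist(emb))

# Rescale both distance matrices to [0, 1].
norm01 <- function(m) (m - min(m)) / (max(m) - min(m))
s <- norm01(seq_dist)
e <- norm01(emb_dist)

abs_diff  <- abs(s - e)  # absolute difference matrix
diff_dist <- s - e       # difference distance matrix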

Radial Dendrograms (1)

  • Radial dendrograms show how easily sequences cluster by their variant (a minimal clustering sketch follows this list).
  • Figures 12 and 13 on the next two slides show:
    • Radial dendrograms using the sequence data do not perfectly cluster sequences by variant.
    • Radial dendrograms using the embedded data perfectly cluster sequences by variant.
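A minimal sketch of the clustering behind such a dendrogram (emb_dist is the hypothetical embedding distance matrix from the previous slides; the fan layout assumes the ape package is installed):

# Average-linkage hierarchical clustering on embedding distances.
hc <- hclust(as.dist(emb_dist), method = "average")

# Draw the tree with a radial (fan) layout.
library(ape)
plot(as.phylo(hc), type = "fan")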

Radial Dendrograms (2)

Separability

Principal Component Analysis
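A minimal R sketch of a PCA separability check (emb and variant are hypothetical placeholders):

# Project embeddings onto their first two principal components,
# coloring points by variant to inspect separability.
pc <- prcomp(emb, center = TRUE)
plot(pc$x[, 1], pc$x[, 2], col = as.integer(factor(variant)),
     xlab = "PC1", ylab = "PC2")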

Extrinsic Evaluation Results

Variant Classification

Variant Embedding Classification (1)

  • Variant classification with a CART learner achieved a 97.71% success rate on aligned sequence embeddings.

Figure 17: Variant Misclassifications

Variant Embedding Classification (2)

Figure 17: Variant Misclassifications
  • Imagine if we could classify aggressive and non-aggressive cancer with this kind of accuracy.

Variant One-Hot-Encoding Classification

  • Variant classification with a CART learner achieved only a 10.28% success rate on one-hot-encoded sequences.

Thesis Question

How easily can information be extracted from GenSLM embeddings while maintaining its original integrity, and what is the quality of the produced vector embeddings?

Conclusion

  • The study assessed GenSLM’s embeddings for quality using intrinsic and extrinsic evaluations, focusing on redundancy, separability, and information preservation.
  • While Covid DNA can be highly related, with as little as 0.067% differentiating some variants, the GenSLM embedding indicated strong performance.
    • GenSLM demonstrated strong performance in classification tasks, with a 97.71% success rate for variant classification and 100% for proteins using the CART learner, significantly outperforming one-hot-encoded matrices.
    • GenSLM also excelled in separability.
  • However, GenSLM also displayed suboptimal features:
    • GenSLM was not effective in preserving fine-scale genetic differences and distorted the genetic distances between certain viral variants.
  • Overall, GenSLM has room to improve considering its dimensional redundancy and only moderate success in preserving semantic distances.
  • Future research should compare GenSLM with other neural network embedding algorithms and explore its utility in more diverse genomic analysis tasks to establish broader applicability.

Acknowledgements

  • A big thank you to my committee: Tim Robinson, Shaun Wulff, Sasha Skiba, and Nick Chia.
  • An extra special thank you to Nick for sticking by me!
  • Robert Petit for introducing me to bioinformatics!
  • Liudmila Mainzer for all your support, training, and guidance!
  • My cohort: Allie Midkiff, Oisín O’Gailin, Austin Watson, Sandra Biller, Daiven Francis, and Joe Crane.
  • My partner Hana for all your support and patience!
  • And another big thank you to Tim!

Thank you!!

Appendix

Word Embeddings

  • Before we introduce GenSLM, let’s look at a simpler application in natural language.
    • If we consider English text, how can we measure the similarity between words?
    • For example, what is the semantic difference between the words “King” and “Woman”?
    • Let’s demonstrate the semantic relationships of “King”, “Queen”, “Man”, “Woman”.

Word2vec Example

## Pre-run code ##
library(word2vec)
# Using the British National Corpus model from http://vectors.nlpl.eu/repository/
model <- read.word2vec("British_National_Corpus_Vector_size_300_skipgram", normalize = TRUE)
embedding <- predict(model, newdata = c("king_NOUN", "man_NOUN", "woman_NOUN", "queen_NOUN"),
                     type = "embedding")

Plotting Word Embeddings

  • We can plot the vectors shown on the previous slide to demonstrate how word2vec produces embeddings that measure semantic similarity between words.

Figure 4: Word Embedding Vectors
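One way such a plot can be produced is by projecting the four 300-dimensional vectors onto their first two principal components (a sketch, not necessarily the method used for the figure):

# `embedding` comes from the word2vec slide; its row names are the words.
pc <- prcomp(embedding)
plot(pc$x[, 1], pc$x[, 2], xlab = "PC1", ylab = "PC2")
text(pc$x[, 1], pc$x[, 2], labels = rownames(embedding), pos = 3)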

Kullback–Leibler Divergence (KLD)

Figure 15: CKL vs DKL
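For reference, the standard definitions of the KL divergence for discrete distributions \(P, Q\) and continuous densities \(p, q\):

\[
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)},
\qquad
D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx
\]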

References

1. National Human Genome Research Institute (2024) Base pair
2. Kandwal S, Fayne D (2023) Genetic conservation across SARS-CoV-2 non-structural proteins – insights into possible targets for treatment of future viral outbreaks. Virology
3. Zvyagin M, Brace A, Hippe K, et al (2023) GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. The International Journal of High Performance Computing Applications 37(6):683–705
4. Lavrač N, Podpečan V, Robnik-Šikonja M (2021) Representation learning: Propositionalization and embeddings. Springer