How easily can information be extracted from GenSLM embeddings while maintaining its original integrity, and what is the quality of the produced vector embeddings?
In the UW Statistics program, what methods do we learn for applying dimension reduction?
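As a concrete point of reference, here is a minimal sketch of one standard dimension-reduction method, PCA, using base R's `prcomp()`. The choice of the built-in `iris` data is purely illustrative and not from the original notes.

```r
# Minimal sketch: PCA with base R's prcomp(), a standard dimension-reduction method.
# The iris dataset is used purely for illustration.
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Each 4-dimensional observation is mapped to a lower-dimensional representation;
# keeping the first two principal components gives a 2-D "embedding" of the data.
reduced <- pca$x[, 1:2]
dim(reduced)  # 150 rows, 2 columns
```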
An embedding is a dense numeric vector produced by an embedding algorithm, such as GenSLM. "Embedding" generally refers to the embedding vector itself.
Embeddings are representations of data in a lower-dimensional space that preserve some aspects of the original structure of the data.
But why would you ever embed something?
Consider our colorectal cancer example: in the context of precision medicine, you may want to classify cancers as aggressive or non-aggressive.
Or in the context of SARS-CoV-2, you may want to be able to monitor changes in variants and how they differ from previous variants.
In both examples, these tasks are considered downstream of the process of creating the embeddings from the original data.
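To make "downstream" concrete, here is a hedged, self-contained sketch in which embedding vectors serve as features for a simple classifier. The embeddings and labels below are synthetic stand-ins invented for illustration; they are not GenSLM output.

```r
# Toy sketch of a downstream task: classify samples from their embeddings.
# The embedding matrix and labels here are synthetic, for illustration only.
set.seed(1)
emb <- matrix(rnorm(200 * 8), nrow = 200)       # 200 samples, 8-dim "embeddings"
colnames(emb) <- paste0("e", 1:8)
label <- rbinom(200, 1, plogis(2 * emb[, 1]))   # synthetic aggressive/non-aggressive label

# Logistic regression on the embedding coordinates: the "downstream" model.
dat <- data.frame(emb, label = label)
fit <- glm(label ~ ., data = dat, family = binomial)
acc <- mean((predict(fit, type = "response") > 0.5) == label)  # training accuracy
```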
Broadly speaking, the quality of an embedding is assessed based upon its information richness and its degree of non-redundancy.
There are two main methodologies for evaluating embeddings:

- How easily can information be extracted from the embeddings while maintaining its original integrity?
- What is the quality of the produced vector embeddings themselves?
## Pre-Run Code ##
library(word2vec)

# Pretrained skip-gram model from the British National Corpus
# http://vectors.nlpl.eu/repository/
model <- read.word2vec("British_National_Corpus_Vector_size_300_skipgram", normalize = TRUE)

# Look up the embedding vectors for four part-of-speech-tagged words
embedding <- predict(model,
                     newdata = c("king_NOUN", "man_NOUN", "woman_NOUN", "queen_NOUN"),
                     type = "embedding")
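The embeddings retrieved above can be compared with cosine similarity, e.g. for the classic king − man + woman ≈ queen analogy. Since the pretrained BNC model file may not be at hand, the sketch below defines cosine similarity and uses hypothetical toy vectors; with the real model you would index `embedding` by row name instead.

```r
# Cosine similarity between dense numeric vectors: the basic operation
# used to compare embeddings. Toy vectors stand in for real ones here.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical low-dimensional stand-ins for the four word vectors above.
king  <- c(0.9, 0.8, 0.1, 0.3)
man   <- c(0.5, 0.9, 0.2, 0.1)
woman <- c(0.5, 0.1, 0.9, 0.1)
queen <- c(0.9, 0.1, 0.8, 0.3)

# The analogy: king - man + woman should be closer to queen than to man.
target <- king - man + woman
cosine(target, queen) > cosine(target, man)
```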