org.graphframes

package embeddings


Type Members

  1. class Hash2Vec extends Serializable

    Implementation of Hash2Vec, an efficient word embedding technique using feature hashing.

    Based on: Argerich, Luis, Joaquín Torré Zaffaroni, and Matías J. Cano. "Hash2vec, feature hashing for word embeddings." arXiv preprint arXiv:1608.08940 (2016).

    Produces embeddings for elements in sequences using a hash-based approach to avoid storing a vocabulary. Uses MurmurHash3 for hashing elements to embedding indices and signs.

    Output DataFrame has columns "id" (element identifier, same type as sequence elements) and "vector" (dense vector of doubles, summed across all occurrences).

    Tradeoffs: Higher numPartitions reduces local state and memory per partition but increases aggregation and merging overhead across partitions. Larger embeddingsDim provides richer representations but consumes more memory. Seeds control hashing for reproducibility.
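The core trick described above (hashing an element to an embedding index and a sign, then summing occurrences, so no vocabulary is ever stored) can be sketched in a few lines of plain Scala. This is a minimal illustration, not the class's actual implementation; the names `hashedUpdate`, `indexSeed`, and `signSeed` are hypothetical, and only `MurmurHash3` matches the hash function the documentation names.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical sketch of the feature-hashing update behind Hash2Vec:
// hash each context element to an index in [0, dim) and a sign in {-1, +1},
// then accumulate into a dense vector. No vocabulary is stored.
def hashedUpdate(vec: Array[Double], element: String, dim: Int,
                 indexSeed: Int = 42, signSeed: Int = 7,
                 weight: Double = 1.0): Unit = {
  // floorMod keeps the index non-negative even for negative hash values
  val idx = Math.floorMod(MurmurHash3.stringHash(element, indexSeed), dim)
  val sign = if (MurmurHash3.stringHash(element, signSeed) % 2 == 0) 1.0 else -1.0
  vec(idx) += sign * weight
}

// Accumulate a vector for one target from its context elements in a sequence;
// repeated occurrences ("b" here) simply add to the same hashed coordinate.
val dim = 8
val vec = new Array[Double](dim)
Seq("b", "c", "b").foreach(e => hashedUpdate(vec, e, dim))
```

Because both the index and the sign come from seeded hashes, the same seeds always reproduce the same vectors, which is what makes the seeds a reproducibility control.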

  2. class RandomWalkEmbeddings extends Serializable with Logging with WithIntermediateStorageLevel

    RandomWalkEmbeddings is a class for generating node embeddings in a graph using random walks and sequence-to-vector models.

    This implementation supports two types of embedding models: Word2Vec and Hash2Vec, each with different performance characteristics.

    Word2Vec is based on the skip-gram model, which typically provides higher-quality embeddings because it captures semantic relationships through gradient-descent optimization. However, it is computationally expensive, requires more memory, and scales only to roughly 20 million vertices, since it must build an explicit vocabulary from the generated sequences.

    Hash2Vec uses random projection hashing, making it much faster and more memory-efficient, with excellent horizontal scaling properties. Its drawbacks include the need for wider embedding dimensions (typically 512 or more, depending on graph size) and generally lower quality due to its sparse nature.

    Additionally, this class supports optional neighbor aggregation, where embeddings from sampled neighbors are aggregated (using average) and concatenated with the node's own embedding. This technique leverages min-hash sampling and has been shown to improve predictive power by over 20% in synthetic tests. It is particularly effective for Hash2Vec, as Word2Vec already incorporates neighborhood information through random walks and skip-gram learning.
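The aggregation step described above (average the sampled neighbors' vectors, then concatenate with the node's own vector) can be sketched in plain Scala. This is an illustrative sketch only; `aggregateWithNeighbors` is a hypothetical name, not a method of this class.

```scala
// Hypothetical sketch of neighbor aggregation: average the embeddings of
// sampled neighbors and concatenate the result with the node's own embedding.
// The returned vector therefore has twice the original dimension.
def aggregateWithNeighbors(own: Array[Double],
                           neighbors: Seq[Array[Double]]): Array[Double] = {
  val dim = own.length
  val avg = new Array[Double](dim)
  if (neighbors.nonEmpty) {
    for (n <- neighbors; i <- 0 until dim) avg(i) += n(i) / neighbors.size
  }
  own ++ avg
}

// A node with embedding [1.0, 2.0] and two sampled neighbors:
val combined = aggregateWithNeighbors(
  Array(1.0, 2.0),
  Seq(Array(0.0, 4.0), Array(2.0, 0.0)))
// combined = [1.0, 2.0, 1.0, 2.0]
```

Concatenation (rather than summing into the node's own vector) preserves the distinction between a node's position in walk space and the average position of its neighborhood, at the cost of doubling the output dimension.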

    This class also provides a way to run only the embedding model (the sequence2vec step) on top of cached random walks: users can provide a path to walks cached in Parquet format.

    To use this class, instantiate it with a GraphFrame, set the random walk generator, choose the sequence model (Word2Vec or Hash2Vec), and optionally configure parameters such as the seed, edge-direction usage, neighbor aggregation, and the maximum number of neighbors to sample.
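As a rough illustration of that workflow, a hypothetical call sequence might look like the pseudocode below. The method names are illustrative only and do not reflect the actual GraphFrames API; consult the class's members for the real setters.

```
// Pseudocode only -- method names are illustrative, not the real API.
val gf       = GraphFrame(vertices, edges)
val embedder = new RandomWalkEmbeddings(gf)
  .setRandomWalkGenerator(walkGenerator)  // how walks are produced
  .setSequenceModel("hash2vec")           // or "word2vec"
  .setSeed(42L)                           // reproducibility
  .setUseEdgeDirection(true)              // follow edge direction in walks
  .setAggregateNeighbors(true)            // min-hash neighbor aggregation
  .setMaxNeighbors(10)                    // sampled neighbors per node
val embeddings = embedder.run()           // DataFrame with id and vector
```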

Value Members

  1. object Hash2Vec extends Serializable
  2. object RandomWalkEmbeddings extends Serializable

    Companion object for RandomWalkEmbeddings.
