class RandomWalkEmbeddings extends Serializable with Logging with WithIntermediateStorageLevel
RandomWalkEmbeddings is a class for generating node embeddings in a graph using random walks and sequence-to-vector models. This implementation supports two types of embedding models: Word2Vec and Hash2Vec, each with different performance characteristics.
Word2Vec is based on the skip-gram model, which typically provides higher quality embeddings due to its ability to capture semantic relationships through gradient descent optimization. However, it is computationally expensive, requires more memory, and scales only to approximately 20 million vertices in a graph, because it depends on transforming sequences into a vocabulary.
Hash2Vec uses random projection hashing, making it much faster and more memory-efficient, with excellent horizontal scaling properties. Its drawbacks include the need for wider embedding dimensions (typically 512 or more, depending on graph size) and generally lower quality due to its sparse nature.
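The hashing idea can be illustrated with a small plain-Scala sketch. This is a generic feature-hashing illustration, not the GraphFrames implementation: the object name, the `embed` signature, the hash seeds, and the unweighted context window are all assumptions made for the example. Each context node seen near a target node is hashed to one of `dim` buckets, and a second hash chooses the sign of the update.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Hypothetical illustration only; not the library's Hash2Vec.
object Hash2VecSketch {
  /** Builds one dim-wide vector per node from walk sequences by hashing context nodes. */
  def embed(walks: Seq[Seq[String]], dim: Int, window: Int): Map[String, Array[Double]] = {
    val acc = mutable.Map.empty[String, Array[Double]]
    for (walk <- walks; i <- walk.indices) {
      val vec = acc.getOrElseUpdate(walk(i), new Array[Double](dim))
      val lo = math.max(0, i - window)
      val hi = math.min(walk.length - 1, i + window)
      for (j <- lo to hi if j != i) {
        // First hash picks the bucket, second hash picks the sign (seeds are arbitrary).
        val bucket = java.lang.Math.floorMod(MurmurHash3.stringHash(walk(j), 17), dim)
        val sign = if ((MurmurHash3.stringHash(walk(j), 31) & 1) == 0) 1.0 else -1.0
        vec(bucket) += sign
      }
    }
    acc.toMap
  }
}
```

Because each context touches a single bucket, the vectors stay sparse unless `dim` is small relative to the number of distinct contexts, which is consistent with the wider dimensions recommended above.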
Additionally, this class supports optional neighbor aggregation, where embeddings from sampled neighbors are aggregated (using average) and concatenated with the node's own embedding. This technique leverages min-hash sampling and has been shown to improve predictive power by over 20% in synthetic tests. It is particularly efficient for Hash2Vec, as Word2Vec already incorporates neighborhood information through random walks and skip-gram learning.
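The average-then-concatenate step can be sketched in plain Scala as follows. The object and method names are hypothetical, and returning a zero vector for a node with no sampled neighbors is an assumption of this sketch rather than documented behavior.

```scala
// Hypothetical illustration of neighbor aggregation; not library code.
object NeighborAggregationSketch {
  /** Element-wise mean of equally sized vectors; a zero vector when there are none (assumption). */
  def average(vectors: Seq[Array[Double]], dim: Int): Array[Double] = {
    val sum = new Array[Double](dim)
    for (v <- vectors; i <- 0 until dim) sum(i) += v(i)
    if (vectors.isEmpty) sum else sum.map(_ / vectors.size)
  }

  /** Concatenates the node's own embedding with the averaged neighbor embeddings. */
  def aggregate(own: Array[Double], sampledNeighbors: Seq[Array[Double]]): Array[Double] =
    own ++ average(sampledNeighbors, own.length)
}
```

Under this scheme the resulting vector is twice as wide as the embedding produced by the sequence model alone.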
This class also provides a way to run only the embedding model (the sequence2vec model) on top of cached random walks. Users can provide a path to cached walks in Parquet format.
To use this class, instantiate with a GraphFrame, set the random walk generator, choose the sequence model (Word2Vec or Hash2Vec), and optionally configure other parameters like seed, edge direction usage, neighbor aggregation, and maximum neighbors for sampling.
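A minimal usage sketch of the steps above, using only the setters listed on this page. The package paths in the imports and the public constructor taking a GraphFrame are assumptions (they are not shown here), and the random walk generator and Hash2Vec instances are passed in as parameters because their configuration is documented elsewhere.

```scala
import org.apache.spark.sql.DataFrame
import org.graphframes.GraphFrame
// ASSUMPTION: package paths are not given on this page; adjust to your GraphFrames version.
import org.graphframes.embeddings.{Hash2Vec, RandomWalkEmbeddings}
import org.graphframes.rw.RandomWalkBase

// Hypothetical helper wrapping the documented builder-style setters.
object EmbeddingPipelineSketch {
  def embedWithHash2Vec(graph: GraphFrame, walks: RandomWalkBase, model: Hash2Vec): DataFrame =
    new RandomWalkEmbeddings(graph)   // assumes a public constructor taking a GraphFrame
      .setRandomWalks(walks)          // required: no default
      .setSequenceModel(Right(model)) // Right = Hash2Vec, Left = Word2Vec
      .setSeed(42L)                   // documented default, shown for clarity
      .setUseEdgeDirections(false)    // treat the graph as undirected
      .setAggregateNeighbors(true)    // concatenate averaged neighbor embeddings
      .setMaxNbrs(50)                 // cap on sampled neighbors
      .run()
}
```

The returned DataFrame carries the original vertex columns plus the embedding column described under run().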
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- def +(other: String): String
- Implicit
- This member is added by an implicit conversion from RandomWalkEmbeddings to any2stringadd[RandomWalkEmbeddings] performed by method any2stringadd in scala.Predef.
- Definition Classes
- any2stringadd
- def ->[B](y: B): (RandomWalkEmbeddings, B)
- Implicit
- This member is added by an implicit conversion from RandomWalkEmbeddings to ArrowAssoc[RandomWalkEmbeddings] performed by method ArrowAssoc in scala.Predef.
- Definition Classes
- ArrowAssoc
- Annotations
- @inline()
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
- def ensuring(cond: (RandomWalkEmbeddings) => Boolean, msg: => Any): RandomWalkEmbeddings
- Implicit
- This member is added by an implicit conversion from RandomWalkEmbeddings to Ensuring[RandomWalkEmbeddings] performed by method Ensuring in scala.Predef.
- Definition Classes
- Ensuring
- def ensuring(cond: (RandomWalkEmbeddings) => Boolean): RandomWalkEmbeddings
- Implicit
- This member is added by an implicit conversion from RandomWalkEmbeddings to Ensuring[RandomWalkEmbeddings] performed by method Ensuring in scala.Predef.
- Definition Classes
- Ensuring
- def ensuring(cond: Boolean, msg: => Any): RandomWalkEmbeddings
- Implicit
- This member is added by an implicit conversion from RandomWalkEmbeddings to Ensuring[RandomWalkEmbeddings] performed by method Ensuring in scala.Predef.
- Definition Classes
- Ensuring
- def ensuring(cond: Boolean): RandomWalkEmbeddings
- Implicit
- This member is added by an implicit conversion from RandomWalkEmbeddings to Ensuring[RandomWalkEmbeddings] performed by method Ensuring in scala.Predef.
- Definition Classes
- Ensuring
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @IntrinsicCandidate() @native()
- def getIntermediateStorageLevel: StorageLevel
Gets storage level for intermediate datasets that require multiple passes.
- Definition Classes
- WithIntermediateStorageLevel
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @IntrinsicCandidate() @native()
- val intermediateStorageLevel: StorageLevel
- Attributes
- protected
- Definition Classes
- WithIntermediateStorageLevel
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- def logDebug(s: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logInfo(s: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logTrace(s: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logWarn(s: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @IntrinsicCandidate() @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @IntrinsicCandidate() @native()
- def resultIsPersistent(): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def run(): DataFrame
Executes the random walk embedding generation process. Requires that sequenceModel and randomWalks are set. The input GraphFrame must have valid vertex and edge DataFrames, with vertices containing an ID column.
The process generates random walks, applies the chosen sequence model to produce initial embeddings, and optionally aggregates neighbor embeddings if aggregateNeighbors is enabled.
- returns
A DataFrame containing the original vertex columns plus an additional "embedding" column (as defined by RandomWalkEmbeddings.embeddingColName) of type Vector containing the node embeddings. If aggregateNeighbors is true, the embedding will be a concatenation of the node's embedding and the averaged embeddings of sampled neighbors.
- def setAggregateNeighbors(value: Boolean): RandomWalkEmbeddings.this.type
Sets whether to aggregate neighbor embeddings via min-hash sampling, concatenating the aggregated vector with the node's own embedding. This improves predictive power (e.g., +20% in tests) and is more efficient for Hash2Vec. For Word2Vec, this adds redundant information since it already learns neighborhood relations. Default: true.
- value
Boolean flag for neighbor aggregation.
- returns
This instance for method chaining.
- def setCleanUpAfterRun(value: Boolean): RandomWalkEmbeddings.this.type
Sets whether to clean up temporary random walk files after generating embeddings. Default: false.
- value
Boolean flag for clean-up.
- returns
This instance for method chaining.
- def setIntermediateStorageLevel(value: StorageLevel): RandomWalkEmbeddings.this.type
Sets storage level for intermediate datasets that require multiple passes (default: MEMORY_AND_DISK).
- Definition Classes
- WithIntermediateStorageLevel
- def setMaxNbrs(value: Int): RandomWalkEmbeddings.this.type
Sets the maximum number of neighbors to sample for aggregation. Used only if aggregateNeighbors is true. Default: 50.
- value
Maximum neighbors to sample.
- returns
This instance for method chaining.
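One way to picture a hash-based neighbor cap is a bottom-k sample: hash every neighbor ID and keep the `maxNbrs` smallest. The sketch below is a generic plain-Scala illustration of that idea, not the library's sampling code; the object name, the string IDs, and the use of MurmurHash3 are assumptions.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical illustration of hash-based neighbor sampling; not library code.
object MinHashSamplingSketch {
  /** Keeps at most maxNbrs neighbors: those with the smallest seeded hash values. */
  def sampleNeighbors(neighborIds: Seq[String], maxNbrs: Int, seed: Int): Seq[String] =
    neighborIds.distinct
      // Ties on the hash are broken by ID so the result does not depend on input order.
      .sortBy(id => (MurmurHash3.stringHash(id, seed), id))
      .take(maxNbrs)
}
```

For a fixed seed the sample is deterministic, so high-degree nodes contribute a bounded and repeatable set of neighbors to the average.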
- def setRandomWalks(value: RandomWalkBase): RandomWalkEmbeddings.this.type
Sets the random walk generator to use. No default; this must be set before running.
- value
The random walk generator instance.
- returns
This instance for method chaining.
- def setSeed(value: Long): RandomWalkEmbeddings.this.type
Sets the random seed for reproducibility. Default: 42L.
- value
The random seed.
- returns
This instance for method chaining.
- def setSequenceModel(value: Either[Word2Vec, Hash2Vec]): RandomWalkEmbeddings.this.type
Sets the sequence model to use for generating embeddings. This can be either a Word2Vec model (Left(Word2Vec)) or a Hash2Vec model (Right(Hash2Vec)). No default; this must be set before running.
- value
The sequence model to use.
- returns
This instance for method chaining.
- def setUseEdgeDirections(value: Boolean): RandomWalkEmbeddings.this.type
Sets whether to use edge directions in random walks and neighbor aggregation. If true, considers directed edges; otherwise, treats the graph as undirected. Default: false.
- value
Boolean flag for using edge directions.
- returns
This instance for method chaining.
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- AnyRef → Any
- def useCachedRandomWalks(path: String): RandomWalkEmbeddings.this.type
Sets the path to existing cached random walks, for running only the embedding model and skipping the sequence generation step.
- path
Path to the cached walks in Parquet format.
- returns
This instance for method chaining.
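A hedged sketch of reusing cached walks, under the same assumptions as any direct use of this class: the import paths and the public constructor are not confirmed on this page. This page also does not state whether setRandomWalks may be omitted when cached walks are supplied; the sketch omits it and would need adjusting if the library requires it. The `walksPath` argument is a placeholder for wherever the Parquet walks were written.

```scala
import org.apache.spark.sql.DataFrame
import org.graphframes.GraphFrame
// ASSUMPTION: package path is not given on this page; adjust to your GraphFrames version.
import org.graphframes.embeddings.{Hash2Vec, RandomWalkEmbeddings}

// Hypothetical helper: run only the sequence model on previously generated walks.
object CachedWalksSketch {
  def embedFromCachedWalks(graph: GraphFrame, model: Hash2Vec, walksPath: String): DataFrame =
    new RandomWalkEmbeddings(graph)      // assumes a public constructor taking a GraphFrame
      .setSequenceModel(Right(model))    // Right = Hash2Vec
      .useCachedRandomWalks(walksPath)   // skip walk generation, read Parquet walks instead
      .run()
}
```

This pattern is useful when comparing sequence models or embedding dimensions, since the walk generation step is paid for only once.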
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
Deprecated Value Members
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable]) @Deprecated
- Deprecated
(Since version 9)
- def formatted(fmtstr: String): String
- Implicit
- This member is added by an implicit conversion from RandomWalkEmbeddings to StringFormat[RandomWalkEmbeddings] performed by method StringFormat in scala.Predef.
- Definition Classes
- StringFormat
- Annotations
- @deprecated @inline()
- Deprecated
(Since version 2.12.16) Use formatString.format(value) instead of value.formatted(formatString), or use the f"" string interpolator. In Java 15 and later, formatted resolves to the new method in String which has reversed parameters.
- def →[B](y: B): (RandomWalkEmbeddings, B)
- Implicit
- This member is added by an implicit conversion from RandomWalkEmbeddings to ArrowAssoc[RandomWalkEmbeddings] performed by method ArrowAssoc in scala.Predef.
- Definition Classes
- ArrowAssoc
- Annotations
- @deprecated
- Deprecated
(Since version 2.13.0) Use -> instead. If you still wish to display it as one character, consider using a font with programming ligatures such as Fira Code.