class RandomWalkEmbeddings extends Serializable with Logging with WithIntermediateStorageLevel

RandomWalkEmbeddings is a class for generating node embeddings in a graph using random walks and sequence-to-vector models. This implementation supports two types of embedding models: Word2Vec and Hash2Vec, each with different performance characteristics.

Word2Vec is based on the skip-gram model, which typically provides higher quality embeddings due to its ability to capture semantic relationships through gradient descent optimization. However, it is computationally expensive, requires more memory, and scales only to graphs of approximately 20 million vertices, because it depends on building a vocabulary from the walk sequences.

Hash2Vec uses random projection hashing, making it much faster and more memory-efficient, with excellent horizontal scaling properties. Its drawbacks include the need for wider embedding dimensions (typically 512 or more, depending on graph size) and generally lower quality due to its sparse nature.

Additionally, this class supports optional neighbor aggregation, where embeddings from sampled neighbors are averaged and concatenated with the node's own embedding. This technique relies on min-hash sampling and has been shown to improve predictive power by over 20% in synthetic tests. It is particularly efficient for Hash2Vec, as Word2Vec already incorporates neighborhood information through random walks and skip-gram learning.
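
The average-and-concatenate step described above can be sketched in plain Scala. This is only an illustration of the arithmetic, not the class's actual implementation: the object name, the `Vec` alias, and the zero-vector fallback for nodes without sampled neighbors are all assumptions made for this example.

```scala
object NeighborAggregationSketch {
  type Vec = Array[Double]

  /** Element-wise mean of equally sized vectors; a zero vector if there are none (assumed fallback). */
  def average(vectors: Seq[Vec], dim: Int): Vec = {
    val sum = new Array[Double](dim)
    for (v <- vectors; i <- 0 until dim) sum(i) += v(i)
    if (vectors.isEmpty) sum else sum.map(_ / vectors.size)
  }

  /** The node's own embedding followed by the averaged neighbor embedding (length 2 * dim). */
  def aggregate(own: Vec, sampledNeighbors: Seq[Vec]): Vec =
    own ++ average(sampledNeighbors, own.length)
}
```

Under these assumptions the aggregated vector is twice as wide as the model's own embedding, which is worth keeping in mind when sizing downstream features.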

This class also provides a way to run only the embedding (sequence-to-vector) model on top of cached random walks. Users can provide a path to cached walks in Parquet format.

To use this class, instantiate with a GraphFrame, set the random walk generator, choose the sequence model (Word2Vec or Hash2Vec), and optionally configure other parameters like seed, edge direction usage, neighbor aggregation, and maximum neighbors for sampling.

Linear Supertypes
WithIntermediateStorageLevel, Logging, Serializable, AnyRef, Any

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##: Int
    Definition Classes
    AnyRef → Any
  3. def +(other: String): String
    Implicit
    This member is added by an implicit conversion from RandomWalkEmbeddings to any2stringadd[RandomWalkEmbeddings] performed by method any2stringadd in scala.Predef.
    Definition Classes
    any2stringadd
  4. def ->[B](y: B): (RandomWalkEmbeddings, B)
    Implicit
    This member is added by an implicit conversion from RandomWalkEmbeddings to ArrowAssoc[RandomWalkEmbeddings] performed by method ArrowAssoc in scala.Predef.
    Definition Classes
    ArrowAssoc
    Annotations
    @inline()
  5. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  6. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  7. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
  8. def ensuring(cond: (RandomWalkEmbeddings) => Boolean, msg: => Any): RandomWalkEmbeddings
    Implicit
    This member is added by an implicit conversion from RandomWalkEmbeddings to Ensuring[RandomWalkEmbeddings] performed by method Ensuring in scala.Predef.
    Definition Classes
    Ensuring
  9. def ensuring(cond: (RandomWalkEmbeddings) => Boolean): RandomWalkEmbeddings
    Implicit
    This member is added by an implicit conversion from RandomWalkEmbeddings to Ensuring[RandomWalkEmbeddings] performed by method Ensuring in scala.Predef.
    Definition Classes
    Ensuring
  10. def ensuring(cond: Boolean, msg: => Any): RandomWalkEmbeddings
    Implicit
    This member is added by an implicit conversion from RandomWalkEmbeddings to Ensuring[RandomWalkEmbeddings] performed by method Ensuring in scala.Predef.
    Definition Classes
    Ensuring
  11. def ensuring(cond: Boolean): RandomWalkEmbeddings
    Implicit
    This member is added by an implicit conversion from RandomWalkEmbeddings to Ensuring[RandomWalkEmbeddings] performed by method Ensuring in scala.Predef.
    Definition Classes
    Ensuring
  12. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  13. def equals(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef → Any
  14. final def getClass(): Class[_ <: AnyRef]
    Definition Classes
    AnyRef → Any
    Annotations
    @IntrinsicCandidate() @native()
  15. def getIntermediateStorageLevel: StorageLevel

    Gets storage level for intermediate datasets that require multiple passes.

    Definition Classes
    WithIntermediateStorageLevel
  16. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @IntrinsicCandidate() @native()
  17. val intermediateStorageLevel: StorageLevel
    Attributes
    protected
    Definition Classes
    WithIntermediateStorageLevel
  18. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  19. def logDebug(s: => String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  20. def logInfo(s: => String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  21. def logTrace(s: => String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  22. def logWarn(s: => String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  23. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  24. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @IntrinsicCandidate() @native()
  25. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @IntrinsicCandidate() @native()
  26. def resultIsPersistent(): Unit
    Attributes
    protected
    Definition Classes
    Logging
  27. def run(): DataFrame

    Executes the random walk embedding generation process. Requires that sequenceModel and randomWalks are set. The input GraphFrame must have valid vertex and edge DataFrames, with vertices containing an ID column.

    The process generates random walks, applies the chosen sequence model to produce initial embeddings, and optionally aggregates neighbor embeddings if aggregateNeighbors is enabled.

    returns

    A DataFrame containing the original vertex columns plus an additional "embedding" column (as defined by RandomWalkEmbeddings.embeddingColName) of type Vector containing the node embeddings. If aggregateNeighbors is true, the embedding will be a concatenation of the node's embedding and the averaged embeddings of sampled neighbors.

  28. def setAggregateNeighbors(value: Boolean): RandomWalkEmbeddings.this.type

    Sets whether to aggregate neighbor embeddings via min-hash sampling, concatenating the aggregated vector with the node's own embedding. This improves predictive power (e.g., by over 20% in synthetic tests) and is more efficient for Hash2Vec. For Word2Vec, it adds redundant information, since Word2Vec already learns neighborhood relations. Default: true.

    value

    Boolean flag for neighbor aggregation.

    returns

    This instance for method chaining.

  29. def setCleanUpAfterRun(value: Boolean): RandomWalkEmbeddings.this.type

    Sets whether to clean up temporary random walk files after generating embeddings. Default: false.

    value

    Boolean flag for clean-up.

    returns

    This instance for method chaining.

  30. def setIntermediateStorageLevel(value: StorageLevel): RandomWalkEmbeddings.this.type

    Sets storage level for intermediate datasets that require multiple passes (default: MEMORY_AND_DISK).

    Definition Classes
    WithIntermediateStorageLevel
  31. def setMaxNbrs(value: Int): RandomWalkEmbeddings.this.type

    Sets the maximum number of neighbors to sample for aggregation. Used only if aggregateNeighbors is true. Default: 50.

    value

    Maximum neighbors to sample.

    returns

    This instance for method chaining.
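
    The idea of capping sampled neighbors with a min-hash style rule can be sketched as follows. This is a generic bottom-k sample (keep the neighbors with the smallest seeded hash values) and is not claimed to be the scheme this class uses internally; the object name, the `Int` seed, and the use of `MurmurHash3` are assumptions made for illustration.

```scala
import scala.util.hashing.MurmurHash3

object MinHashSamplingSketch {
  /**
   * Keeps at most maxNbrs distinct neighbors: those with the smallest seeded hash values.
   * Ties on the hash are broken by the id itself, so the result does not depend on input order.
   */
  def sampleNeighbors(neighbors: Seq[Long], maxNbrs: Int, seed: Int): Seq[Long] =
    neighbors.distinct
      .sortBy(id => (MurmurHash3.stringHash(id.toString, seed), id))
      .take(maxNbrs)
}
```

    Because the selection depends only on the seed and the neighbor ids, repeated runs with the same seed pick the same neighbors, which is consistent with the reproducibility goal of setSeed.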

  32. def setRandomWalks(value: RandomWalkBase): RandomWalkEmbeddings.this.type

    Sets the random walk generator to use. No default; this must be set before running.

    value

    The random walk generator instance.

    returns

    This instance for method chaining.

  33. def setSeed(value: Long): RandomWalkEmbeddings.this.type

    Sets the random seed for reproducibility. Default: 42L.

    value

    The random seed.

    returns

    This instance for method chaining.

  34. def setSequenceModel(value: Either[Word2Vec, Hash2Vec]): RandomWalkEmbeddings.this.type

    Sets the sequence model to use for generating embeddings. This can be either a Word2Vec model (Left(Word2Vec)) or a Hash2Vec model (Right(Hash2Vec)). No default; this must be set before running.

    value

    The sequence model to use.

    returns

    This instance for method chaining.
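
    The Left/Right convention above is the standard Scala `Either` encoding of a two-way choice. The sketch below shows how such a value is built and inspected; `Word2VecStub` and `Hash2VecStub` are hypothetical stand-ins that keep the example self-contained and are not the real model classes or their parameters.

```scala
object SequenceModelChoiceSketch {
  // Hypothetical stand-ins for the real model classes (assumed field names).
  final case class Word2VecStub(vectorSize: Int)
  final case class Hash2VecStub(embeddingsDim: Int)

  /** Left carries the Word2Vec-style choice, Right the Hash2Vec-style one. */
  def describe(model: Either[Word2VecStub, Hash2VecStub]): String = model match {
    case Left(w)  => s"Word2Vec with ${w.vectorSize} dimensions"
    case Right(h) => s"Hash2Vec with ${h.embeddingsDim} dimensions"
  }
}
```

    With the real classes, the same wrapping applies: pass `Left(...)` around a configured Word2Vec instance or `Right(...)` around a configured Hash2Vec instance to setSequenceModel.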

  35. def setUseEdgeDirections(value: Boolean): RandomWalkEmbeddings.this.type

    Sets whether to use edge directions in random walks and neighbor aggregation. If true, considers directed edges; otherwise, treats the graph as undirected. Default: false.

    value

    Boolean flag for using edge directions.

    returns

    This instance for method chaining.

  36. final def synchronized[T0](arg0: => T0): T0
    Definition Classes
    AnyRef
  37. def toString(): String
    Definition Classes
    AnyRef → Any
  38. def useCachedRandomWalks(path: String): RandomWalkEmbeddings.this.type

    Sets the path to existing cached random walks. Use this to run only the embedding model and skip the sequence generation step.

    path

    Path to the cached walks, in Parquet format.

    returns

    This instance for method chaining.

  39. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  40. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException]) @native()
  41. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])

Deprecated Value Members

  1. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.Throwable]) @Deprecated
    Deprecated

    (Since version 9)

  2. def formatted(fmtstr: String): String
    Implicit
    This member is added by an implicit conversion from RandomWalkEmbeddings to StringFormat[RandomWalkEmbeddings] performed by method StringFormat in scala.Predef.
    Definition Classes
    StringFormat
    Annotations
    @deprecated @inline()
    Deprecated

    (Since version 2.12.16) Use formatString.format(value) instead of value.formatted(formatString), or use the f"" string interpolator. In Java 15 and later, formatted resolves to the new method in String which has reversed parameters.

  3. def →[B](y: B): (RandomWalkEmbeddings, B)
    Implicit
    This member is added by an implicit conversion from RandomWalkEmbeddings to ArrowAssoc[RandomWalkEmbeddings] performed by method ArrowAssoc in scala.Predef.
    Definition Classes
    ArrowAssoc
    Annotations
    @deprecated
    Deprecated

    (Since version 2.13.0) Use -> instead. If you still wish to display it as one character, consider using a font with programming ligatures such as Fira Code.
