class Hash2Vec extends Serializable

Implementation of Hash2Vec, an efficient word embedding technique using feature hashing. Based on: Argerich, Luis, Joaquín Torré Zaffaroni, and Matías J. Cano. "Hash2vec, feature hashing for word embeddings." arXiv preprint arXiv:1608.08940 (2016).

Produces embeddings for elements in sequences using a hash-based approach to avoid storing a vocabulary. Uses MurmurHash3 for hashing elements to embedding indices and signs.

Output DataFrame has columns "id" (element identifier, same type as sequence elements) and "vector" (dense vector of doubles, summed across all occurrences).

Tradeoffs: Higher numPartitions reduces local state and memory per partition but increases aggregation and merging overhead across partitions. Larger embeddingsDim provides richer representations but consumes more memory. Seeds control hashing for reproducibility.
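
The hashing step described above can be sketched with the standard library's MurmurHash3. This is a minimal illustration, not the library's confirmed internals: the object and helper names (`HashingSketch`, `bucket`, `sign`, `accumulate`), the use of `stringHash`, and the low-bit sign rule are all assumptions. The seed defaults mirror the documented defaults of setHashingSeed (42) and setSignHashSeed (18).

```scala
import scala.util.hashing.MurmurHash3

object HashingSketch {
  // Map an element to a slot in [0, dim) using a seeded hash (assumed scheme).
  def bucket(element: String, dim: Int, seed: Int = 42): Int =
    Math.floorMod(MurmurHash3.stringHash(element, seed), dim)

  // Derive +1.0 or -1.0 from an independently seeded hash (assumed scheme).
  def sign(element: String, seed: Int = 18): Double =
    if ((MurmurHash3.stringHash(element, seed) & 1) == 0) 1.0 else -1.0

  // Add one weighted context occurrence to the vector of a target element.
  def accumulate(vector: Array[Double], context: String, weight: Double): Unit =
    vector(bucket(context, vector.length)) += sign(context) * weight
}
```

Because only the hash of each element is needed, no vocabulary has to be stored; two elements that collide on the same slot simply share it.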

Instance Constructors

  1. new Hash2Vec()

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##: Int
    Definition Classes
    AnyRef → Any
  3. def +(other: String): String
    Implicit
    This member is added by an implicit conversion from Hash2Vec to any2stringadd[Hash2Vec] performed by method any2stringadd in scala.Predef.
    Definition Classes
    any2stringadd
  4. def ->[B](y: B): (Hash2Vec, B)
    Implicit
    This member is added by an implicit conversion from Hash2Vec to ArrowAssoc[Hash2Vec] performed by method ArrowAssoc in scala.Predef.
    Definition Classes
    ArrowAssoc
    Annotations
    @inline()
  5. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  6. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  7. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
  8. def ensuring(cond: (Hash2Vec) => Boolean, msg: => Any): Hash2Vec
    Implicit
    This member is added by an implicit conversion from Hash2Vec to Ensuring[Hash2Vec] performed by method Ensuring in scala.Predef.
    Definition Classes
    Ensuring
  9. def ensuring(cond: (Hash2Vec) => Boolean): Hash2Vec
    Implicit
    This member is added by an implicit conversion from Hash2Vec to Ensuring[Hash2Vec] performed by method Ensuring in scala.Predef.
    Definition Classes
    Ensuring
  10. def ensuring(cond: Boolean, msg: => Any): Hash2Vec
    Implicit
    This member is added by an implicit conversion from Hash2Vec to Ensuring[Hash2Vec] performed by method Ensuring in scala.Predef.
    Definition Classes
    Ensuring
  11. def ensuring(cond: Boolean): Hash2Vec
    Implicit
    This member is added by an implicit conversion from Hash2Vec to Ensuring[Hash2Vec] performed by method Ensuring in scala.Predef.
    Definition Classes
    Ensuring
  12. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  13. def equals(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef → Any
  14. final def getClass(): Class[_ <: AnyRef]
    Definition Classes
    AnyRef → Any
    Annotations
    @IntrinsicCandidate() @native()
  15. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @IntrinsicCandidate() @native()
  16. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  17. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  18. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @IntrinsicCandidate() @native()
  19. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @IntrinsicCandidate() @native()
  20. def run(rawData: DataFrame): DataFrame

    Runs the Hash2Vec algorithm on the input DataFrame containing sequences. The specified sequenceCol must contain arrays of elements (string or numeric). Produces a DataFrame with "id" (element ID, same type as elements) and "vector" (embedding vector, VectorType). Embeddings are summed across all partitions and occurrences.
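
    The final summation across partitions and occurrences can be pictured as a per-element, component-wise sum of partial vectors. The sketch below uses plain Scala collections instead of Spark, and `MergeSketch` and `merge` are hypothetical names chosen for illustration; how the library actually performs the aggregation is not stated here.

    ```scala
    object MergeSketch {
      // Sum all partial vectors that share the same element id, component by component.
      def merge[K](partials: Seq[(K, Array[Double])]): Map[K, Array[Double]] =
        partials.groupMapReduce(_._1)(_._2) { (a, b) =>
          a.zip(b).map { case (x, y) => x + y }
        }
    }
    ```

    For example, two partial vectors for the same id, (1.0, 0.0) and (1.0, 2.0), merge into (2.0, 2.0).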

  21. def setContextSize(value: Int): Hash2Vec.this.type

    Sets the context window size around each element to consider during training. Larger values incorporate more distant elements but increase computation time. Default: 5.
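
    As an illustration of what the window covers, this minimal sketch pairs each context element with its distance from a center position. A symmetric window clipped at the sequence boundaries is an assumption, and `ContextWindowSketch` and `context` are hypothetical names, not part of the Hash2Vec API.

    ```scala
    object ContextWindowSketch {
      // Return (element, distance) pairs for positions within contextSize of center,
      // excluding the center itself and clipping at the sequence boundaries.
      def context[A](sequence: IndexedSeq[A], center: Int, contextSize: Int = 5): Seq[(A, Int)] = {
        val lo = math.max(0, center - contextSize)
        val hi = math.min(sequence.length - 1, center + contextSize)
        for (i <- lo to hi if i != center) yield (sequence(i), math.abs(i - center))
      }
    }
    ```

    Each position contributes at most 2 * contextSize pairs, which is why larger windows increase computation time.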

  22. def setDecayFunction(value: String): Hash2Vec.this.type

    Sets the decay function used to weight context elements by distance. Supported values: "gaussian", "constant". Default: "gaussian".
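
    The two supported decay functions can be sketched as follows. The Gaussian form exp(-d^2 / (2 * sigma^2)) is an assumption about the weighting (the exact formula is not given here), and `DecaySketch` and `weight` are hypothetical names.

    ```scala
    object DecaySketch {
      // Weight of a context element at the given distance from the center element.
      def weight(distance: Int, decayFunction: String = "gaussian", sigma: Double = 1.0): Double =
        decayFunction match {
          // Assumed Gaussian kernel: weight shrinks with squared distance.
          case "gaussian" => math.exp(-(distance.toDouble * distance) / (2.0 * sigma * sigma))
          // Every context element counts equally, regardless of distance.
          case "constant" => 1.0
          case other => throw new IllegalArgumentException(s"Unsupported decay function: $other")
        }
    }
    ```

    Under this assumed form, a smaller sigma makes the weight fall off faster with distance, which matches the behavior described for setGaussianSigma.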

  23. def setDoNormalization(value: Boolean): Hash2Vec.this.type

    Convenience overload for setDoNormalization(doNorm, safeNorm) that uses safe‑mode (extra channel) by default. Equivalent to setDoNormalization(value, true).

    value

    If true, output vectors are L2‑normalized with safe (extra‑channel) semantics.

    returns

    This Hash2Vec instance for method chaining.

  24. def setDoNormalization(doNorm: Boolean, safeNorm: Boolean): Hash2Vec.this.type

    Sets whether final vectors are L2‑normalized after aggregation across partitions. When normalization is enabled, each vector is scaled to unit length (L2 norm = 1).

    When safeNorm is true (the value used by the single-argument overload), the method adds an extra channel to the vector equal to log(L2‑norm + 1) / sqrt(dim). This preserves some information about the original magnitude while still making vectors comparable via cosine similarity.

    When safeNorm is false, normalizes without adding an extra channel, discarding magnitude entirely.

    This setting applies globally to all output vectors.

    doNorm

    If true, output vectors are normalized.

    safeNorm

    If true (and doNorm is true), retains magnitude information in an extra dimension. If false, performs standard L2‑normalization.

    returns

    This Hash2Vec instance for method chaining.
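
    The two normalization modes can be sketched in plain Scala. This is a hedged illustration only: appending the extra channel as the last component after scaling, and the zero-vector guard, are assumptions, and `NormalizationSketch`, `l2Norm` and `normalize` are hypothetical names.

    ```scala
    object NormalizationSketch {
      def l2Norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

      def normalize(v: Array[Double], safeNorm: Boolean = true): Array[Double] = {
        val norm = l2Norm(v)
        // Guard against division by zero for all-zero vectors (an added assumption).
        val unit = if (norm == 0.0) v.clone() else v.map(_ / norm)
        // Safe mode: keep a trace of the original magnitude in one extra channel.
        if (safeNorm) unit :+ (math.log(norm + 1.0) / math.sqrt(v.length.toDouble))
        else unit
      }
    }
    ```

    In this sketch, safe mode returns a vector with one more entry than the input, while the plain mode keeps the input length and discards the magnitude.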

  25. def setEmbeddingsDim(value: Int): Hash2Vec.this.type

    Sets the dimensionality of the dense embedding vectors. Larger dimensions allow richer representations but require more memory. Corresponds to the hash table size. Default: 512.

  26. def setGaussianSigma(value: Double): Hash2Vec.this.type

    Sets the sigma parameter for Gaussian decay weighting. Smaller values decay weights faster with distance. Default: 1.0.

  27. def setHashingSeed(value: Int): Hash2Vec.this.type

    Sets the seed for hashing elements to embedding indices. Used for reproducibility of embeddings. Default: 42.

  28. def setMaxVectorsPerPartition(value: Int): Hash2Vec.this.type

    Limits the maximum number of distinct element vectors that can be processed inside a single partition before flushing intermediate results to the iterator.

    Partition processing uses a paged matrix to store vectors. When the number of allocated vectors reaches this limit within a partition, the current batch of vectors is returned (as part of the iterator) and a new empty batch is started for the remaining elements.

    This prevents a single partition from consuming unbounded memory while processing very large vocabularies, at the cost of producing multiple iterator batches per partition.

    The default value is 100000.

    value

    Upper bound on distinct vectors processed per batch inside a partition.

    returns

    This Hash2Vec instance for method chaining.
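
    The flushing behavior can be pictured with a small iterator that emits a batch as soon as the number of distinct elements it holds reaches the limit. This sketch uses an ordinary mutable map rather than the paged matrix mentioned above, and `BatchingSketch`, `batches` and the `(element, index, signedWeight)` update format are hypothetical, chosen only for illustration.

    ```scala
    import scala.collection.mutable

    object BatchingSketch {
      // Consume (element, index, signedWeight) updates and emit one map per batch.
      def batches(
          updates: Iterator[(String, Int, Double)],
          dim: Int,
          maxVectors: Int = 100000
      ): Iterator[Map[String, Array[Double]]] =
        new Iterator[Map[String, Array[Double]]] {
          def hasNext: Boolean = updates.hasNext

          def next(): Map[String, Array[Double]] = {
            if (!hasNext) throw new NoSuchElementException("no more batches")
            val current = mutable.LinkedHashMap.empty[String, Array[Double]]
            // Stop filling once the batch holds maxVectors distinct elements.
            while (updates.hasNext && current.size < maxVectors) {
              val (element, index, weight) = updates.next()
              val vector = current.getOrElseUpdate(element, new Array[Double](dim))
              vector(index) += weight
            }
            current.toMap
          }
        }
    }
    ```

    In this sketch the same element can appear in more than one batch, so the partial vectors still have to be summed downstream.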

  29. def setNumPartitions(value: Int): Hash2Vec.this.type

    Sets the number of partitions for RDDs to parallelize computation. More partitions distribute the workload and reduce memory per partition but increase aggregation and merging overhead across partitions. Default: 5.

  30. def setSequenceCol(value: String): Hash2Vec.this.type

    Sets the column name containing sequences of elements (as arrays). Default: "random_walk".

  31. def setSignHashSeed(value: Int): Hash2Vec.this.type

    Sets the seed for hashing elements to determine the sign of contributions. Used for reproducibility of embeddings. Default: 18.

  32. final def synchronized[T0](arg0: => T0): T0
    Definition Classes
    AnyRef
  33. def toString(): String
    Definition Classes
    AnyRef → Any
  34. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  35. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException]) @native()
  36. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])

Deprecated Value Members

  1. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.Throwable]) @Deprecated
    Deprecated

    (Since version 9)

  2. def formatted(fmtstr: String): String
    Implicit
    This member is added by an implicit conversion from Hash2Vec to StringFormat[Hash2Vec] performed by method StringFormat in scala.Predef.
    Definition Classes
    StringFormat
    Annotations
    @deprecated @inline()
    Deprecated

    (Since version 2.12.16) Use formatString.format(value) instead of value.formatted(formatString), or use the f"" string interpolator. In Java 15 and later, formatted resolves to the new method in String which has reversed parameters.

  3. def →[B](y: B): (Hash2Vec, B)
    Implicit
    This member is added by an implicit conversion from Hash2Vec to ArrowAssoc[Hash2Vec] performed by method ArrowAssoc in scala.Predef.
    Definition Classes
    ArrowAssoc
    Annotations
    @deprecated
    Deprecated

    (Since version 2.13.0) Use -> instead. If you still wish to display it as one character, consider using a font with programming ligatures such as Fira Code.

Inherited from Serializable

Inherited from AnyRef

Inherited from Any

Inherited by implicit conversion any2stringadd from Hash2Vec to any2stringadd[Hash2Vec]

Inherited by implicit conversion StringFormat from Hash2Vec to StringFormat[Hash2Vec]

Inherited by implicit conversion Ensuring from Hash2Vec to Ensuring[Hash2Vec]

Inherited by implicit conversion ArrowAssoc from Hash2Vec to ArrowAssoc[Hash2Vec]