class Hash2Vec extends Serializable
Implementation of Hash2Vec, an efficient word embedding technique using feature hashing. Based on: Argerich, Luis, Joaquín Torré Zaffaroni, and Matías J. Cano. "Hash2vec, feature hashing for word embeddings." arXiv preprint arXiv:1608.08940 (2016).
Produces embeddings for elements in sequences using a hash-based approach to avoid storing a vocabulary. Uses MurmurHash3 for hashing elements to embedding indices and signs.
Output DataFrame has columns "id" (element identifier, same type as sequence elements) and "vector" (dense vector of doubles, summed across all occurrences).
Tradeoffs: Higher numPartitions reduces local state and memory per partition but increases aggregation and merging overhead across partitions. Larger embeddingsDim provides richer representations but consumes more memory. Seeds control hashing for reproducibility.
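The hashing scheme described above can be illustrated with a small standalone sketch. It maps an element to an embedding index and a sign using two independently seeded MurmurHash3 calls, then adds a weighted contribution into a target vector. The object name HashingSketch, its helper names, and the exact way the index and sign are derived from the hash value are assumptions made for illustration and are not taken from this class's internals; the seeds 42 and 18 mirror the documented defaults.

```scala
import scala.util.hashing.MurmurHash3

object HashingSketch {
  // Map an element to a slot in [0, dim) with a seeded hash (assumed scheme).
  def index(element: String, dim: Int, seed: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(element, seed), dim)

  // Map an element to +1.0 or -1.0 with a second, independently seeded hash.
  def sign(element: String, seed: Int): Double =
    if ((MurmurHash3.stringHash(element, seed) & 1) == 0) 1.0 else -1.0

  // Add one weighted context contribution into the target's vector, in place.
  def accumulate(vec: Array[Double], context: String, weight: Double,
                 hashingSeed: Int = 42, signSeed: Int = 18): Unit = {
    val i = index(context, vec.length, hashingSeed)
    vec(i) += sign(context, signSeed) * weight
  }
}
```

Because no vocabulary is stored, two different context elements may land in the same slot; the sign hash is what lets such collisions partially cancel rather than always add up.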
Instance Constructors
- new Hash2Vec()
Value Members
- def run(rawData: DataFrame): DataFrame
Runs the Hash2Vec algorithm on the input DataFrame containing sequences. The specified sequenceCol must contain arrays of elements (string or numeric). Produces a DataFrame with "id" (element ID, same type as elements) and "vector" (embedding vector, VectorType). Embeddings are summed across all partitions and occurrences.
- def setContextSize(value: Int): Hash2Vec.this.type
Sets the context window size around each element to consider during training. Larger values incorporate more distant elements but increase computation time. Default: 5.
- def setDecayFunction(value: String): Hash2Vec.this.type
Sets the decay function used to weight context elements by distance. Supported values: "gaussian", "constant". Default: "gaussian".
- def setDoNormalization(value: Boolean): Hash2Vec.this.type
Convenience overload for setDoNormalization(doNorm, safeNorm) that uses safe mode (extra channel) by default. Equivalent to setDoNormalization(value, true).
- value
If true, output vectors are L2-normalized with safe (extra-channel) semantics.
- returns
This Hash2Vec instance for method chaining.
- def setDoNormalization(doNorm: Boolean, safeNorm: Boolean): Hash2Vec.this.type
Sets whether final vectors are L2-normalized after aggregation across partitions. When normalization is enabled, each vector is scaled to unit length (L2 norm = 1).
When safeNorm is true (default), the method adds an extra channel to the vector equal to log(L2-norm + 1) / sqrt(dim). This preserves some information about the original magnitude while still making vectors comparable via cosine similarity.
When safeNorm is false, normalizes without adding an extra channel, discarding magnitude entirely.
This setting applies globally to all output vectors.
- doNorm
If true, output vectors are normalized.
- safeNorm
If true (and doNorm is true), retains magnitude information in an extra dimension. If false, performs standard L2‑normalization.
- returns
This Hash2Vec instance for method chaining.
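The two normalization modes can be sketched as follows. The extra-channel formula log(L2-norm + 1) / sqrt(dim) comes from the description above; everything else is an assumption for illustration: the object name NormalizationSketch, the choice to read dim as the original vector length, appending the extra channel as the last entry, and returning zero vectors unchanged.

```scala
object NormalizationSketch {
  // Euclidean (L2) norm of a vector.
  def l2Norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

  // Standard L2 normalization; a zero vector is returned unchanged (assumption).
  def normalize(v: Array[Double]): Array[Double] = {
    val n = l2Norm(v)
    if (n == 0.0) v.clone() else v.map(_ / n)
  }

  // "Safe" variant: the unit-length vector plus one extra channel holding
  // log(norm + 1) / sqrt(dim), with dim assumed to be the original length.
  def safeNormalize(v: Array[Double]): Array[Double] = {
    val extra = math.log(l2Norm(v) + 1.0) / math.sqrt(v.length.toDouble)
    normalize(v) :+ extra
  }
}
```

Under these assumptions, a safe-normalized output has one more entry than the configured embedding dimension, which downstream code would need to account for.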
- def setEmbeddingsDim(value: Int): Hash2Vec.this.type
Sets the dimensionality of the dense embedding vectors. Larger dimensions allow richer representations but require more memory. Corresponds to the hash table size. Default: 512.
- def setGaussianSigma(value: Double): Hash2Vec.this.type
Sets the sigma parameter for Gaussian decay weighting. Smaller values decay weights faster with distance. Default: 1.0.
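To make the effect of the decay function and sigma concrete, here is a minimal sketch of distance-based context weights. This page does not state the exact Gaussian formula, so the form exp(-d^2 / (2 * sigma^2)) used below is an assumption, as are the object name DecaySketch and the choice of weight 1.0 for "constant".

```scala
object DecaySketch {
  // Weight for a context element at the given distance from the target.
  // The Gaussian form exp(-d^2 / (2 * sigma^2)) is an assumed formula.
  def weight(distance: Int, decay: String, sigma: Double): Double = decay match {
    case "gaussian" =>
      math.exp(-(distance.toDouble * distance) / (2.0 * sigma * sigma))
    case "constant" => 1.0
    case other =>
      throw new IllegalArgumentException(s"Unsupported decay function: $other")
  }

  // Weights for distances 1..contextSize on one side of the target.
  def windowWeights(contextSize: Int, decay: String, sigma: Double): Seq[Double] =
    (1 to contextSize).map(d => weight(d, decay, sigma))
}
```

With this assumed formula, a smaller sigma makes the weights fall off more steeply, so distant elements in the context window contribute very little.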
- def setHashingSeed(value: Int): Hash2Vec.this.type
Sets the seed for hashing elements to embedding indices. Used for reproducibility of embeddings. Default: 42.
- def setMaxVectorsPerPartition(value: Int): Hash2Vec.this.type
Limits the maximum number of distinct element vectors that can be processed inside a single partition before flushing intermediate results to the iterator.
Partition processing uses a paged matrix to store vectors. When the number of allocated vectors reaches this limit within a partition, the current batch of vectors is returned (as part of the iterator) and a new empty batch is started for the remaining elements.
This prevents a single partition from consuming unbounded memory while processing very large vocabularies, at the cost of producing multiple iterator batches per partition.
The default value is 100000.
- value
Upper bound on distinct vectors processed per batch inside a partition.
- returns
This Hash2Vec instance for method chaining.
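The flushing behaviour described for this setting can be sketched with a plain iterator. The sketch accumulates per-element sums in an in-memory map and emits the map as a batch once it holds the maximum number of distinct vectors. BatchingSketch and its signature are hypothetical, and a simple map stands in for the paged matrix mentioned above.

```scala
import scala.collection.mutable

object BatchingSketch {
  // Sum contributions per id, emitting a batch whenever the number of
  // distinct ids held in memory reaches maxVectors.
  def batches(contributions: Iterator[(String, Array[Double])],
              maxVectors: Int): Iterator[Map[String, Array[Double]]] = {
    require(maxVectors > 0, "maxVectors must be positive")
    new Iterator[Map[String, Array[Double]]] {
      private val current = mutable.LinkedHashMap.empty[String, Array[Double]]

      def hasNext: Boolean = contributions.hasNext || current.nonEmpty

      def next(): Map[String, Array[Double]] = {
        if (!hasNext) throw new NoSuchElementException("no more batches")
        // Fill the current batch until the limit is hit or input runs out.
        while (contributions.hasNext && current.size < maxVectors) {
          val (id, vec) = contributions.next()
          current.get(id) match {
            case Some(acc) => for (i <- acc.indices) acc(i) += vec(i)
            case None      => current(id) = vec.clone()
          }
        }
        val out = current.toMap
        current.clear()
        out
      }
    }
  }
}
```

In this sketch the same id can appear in more than one batch, so partial vectors still have to be summed afterwards, which matches the statement that embeddings are summed across all partitions and occurrences.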
- def setNumPartitions(value: Int): Hash2Vec.this.type
Sets the number of RDD partitions used to parallelize computation. More partitions distribute the workload and reduce memory per partition but increase aggregation and merging overhead across partitions. Default: 5.
- def setSequenceCol(value: String): Hash2Vec.this.type
Sets the column name containing sequences of elements (as arrays). Default: "random_walk".
- def setSignHashSeed(value: Int): Hash2Vec.this.type
Sets the seed for hashing elements to determine the sign of contributions. Used for reproducibility of embeddings. Default: 18.