GraphFrames 0.11.0 release
- Published: 2026-03-30T00:00:00Z
- Title: GraphFrames 0.11.0 release
- Summary: This release brings a new Connected Components algorithm based on Randomized Contraction, automatic Pregel optimization that skips unnecessary joins, graph embeddings via random walks with Word2Vec and Hash2Vec, a PySpark Property Graph API, approximate triangle counting with DataSketches, and various improvements.
Want to try these features now? Wherobots is the only cloud platform shipping a modern version of GraphFrames. Sign up and start running graph algorithms right away — no setup required.
Connected Components: new algorithm
The Connected Components API now offers three algorithm choices:
| Algorithm | Key | Description |
|---|---|---|
| GraphX | graphx | The original GraphX-based implementation |
| Two Phase | two_phase | The same big-star/small-star label propagation algorithm as before, now under a clearer name (default) |
| Randomized Contraction | randomized_contraction | Based on Bögeholz et al. (ICDE 2020) |
Randomized Contraction
The new `randomized_contraction` algorithm works by iteratively contracting the graph using random linear functions. At each step, vertices are mapped through a randomized `a*x + b` transformation (with overflow-safe arithmetic via a custom Spark SQL expression) and merged. The process repeats until no edges remain, then reconstructs the final component labels in reverse order.
The theoretical guarantees from Bögeholz, Brand, and Todor, "In-database connected component analysis" (IEEE ICDE 2020), look promising, as do the benchmark results on small graphs.
```scala
val components = graph.connectedComponents
  .setAlgorithm("randomized_contraction")
  .run()
```
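To illustrate the contraction idea, here is a minimal single-machine sketch in plain Python. It is not the distributed GraphFrames implementation: it keeps a running representative map instead of reconstructing labels in reverse order, and uses Python's `random` in place of the overflow-safe Spark SQL expression.

```python
import random

def connected_components(vertices, edges, seed=0):
    """Toy sketch of connected components via randomized contraction.
    Each round hashes vertices through a random linear map a*x + b mod p,
    merges every vertex into its smallest-key neighbour, and contracts
    the graph until no edges remain."""
    rng = random.Random(seed)
    p = (1 << 61) - 1                       # large prime modulus
    rep = {v: v for v in vertices}          # vertex -> current representative
    live = set(vertices)
    cur = [(u, v) for (u, v) in edges if u != v]
    while cur:
        a, b = rng.randrange(1, p), rng.randrange(p)
        key = {v: (a * v + b) % p for v in live}
        parent = {v: v for v in live}
        for u, v in cur:                    # point each endpoint at its
            if key[v] < key[parent[u]]:     # smallest-key neighbour
                parent[u] = v
            if key[u] < key[parent[v]]:
                parent[v] = u
        def root(v):                        # chase pointer chains; keys
            while parent[v] != v:           # strictly decrease, so no cycles
                v = parent[v]
            return v
        roots = {v: root(v) for v in live}
        rep = {v: roots[rep[v]] for v in rep}
        cur = list({(min(roots[u], roots[v]), max(roots[u], roots[v]))
                    for u, v in cur if roots[u] != roots[v]})
        live = {roots[v] for v in live}
    return rep
```

Every round, the larger-key endpoint of each edge merges into a smaller-key neighbour, so components shrink quickly in expectation.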
Pregel: automatic join skipping
A significant memory optimization was added to the Pregel API. GraphFrames now automatically analyzes message expressions at the start of run() to detect whether destination vertex columns are actually referenced. If they are not, as is the case for algorithms like PageRank and Shortest Paths that only read source vertex state, the expensive second join (attaching destination vertex state to triplets) is skipped entirely.
How it works
Standard Pregel triplet construction requires two joins:
- Join source vertices with edges
- Join the result with destination vertices
The optimization uses SparkShims.extractColumnReferences() to inspect the AST of all message expressions. If no Pregel.dst(...) columns are accessed (or only dst.id, which is always available from the edge itself), the second join is replaced with a lightweight struct construction from the edge column itself.
This is fully automatic and requires no changes to existing code. Algorithms that do reference destination state (like Label Propagation) continue to perform both joins as before.
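The detection logic can be sketched as follows, using a toy tuple-based expression tree; the real implementation inspects Catalyst expression ASTs via SparkShims.extractColumnReferences(), so these names and shapes are purely illustrative.

```python
def column_refs(expr):
    """Collect all column references from a toy expression tree,
    where a leaf is ("col", "<name>") and inner nodes are
    ("<op>", child1, child2, ...)."""
    if expr[0] == "col":
        return {expr[1]}
    return set().union(*(column_refs(e) for e in expr[1:]))

def needs_dst_join(message_exprs):
    """Return True only if some message expression reads a destination
    vertex column other than dst.id (which the edge already carries)."""
    refs = set().union(*(column_refs(e) for e in message_exprs))
    dst_refs = {c for c in refs if c.startswith("dst.")}
    return bool(dst_refs - {"dst.id"})
```

A PageRank-style message that only reads `src.rank` and `src.outDegree` would skip the join, while a Label Propagation message reading `dst.label` would keep it.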
Graph embeddings & random walks
This release delivers the graph ML features previewed in the 0.10.0 roadmap: a complete pipeline for generating vertex embeddings via random walks.
Random walks
The RandomWalkWithRestart algorithm generates vertex sequences by performing random walks across the graph. At each step, the walker either continues to a random neighbor or restarts from the origin with a configurable probability.
```scala
import org.graphframes.rw.RandomWalkWithRestart

val rw = new RandomWalkWithRestart()
  .setRestartProbability(0.1)
  .setNumWalksPerNode(5)
  .setNumBatches(5)
  .setBatchSize(10)
  .setTemporaryPrefix("/tmp/random_walks")
```
Key design choices:
- Batched execution: walks are generated in batches persisted to temporary parquet files, keeping memory bounded regardless of graph size
- Deterministic sampling: neighbors are sampled via KMinSampling (min-hash with xxhash64), avoiding the hub problem (vertices with thousands of neighbors) and ensuring fault-tolerant reproducibility
- Edge direction support: walks can respect or ignore edge directionality
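The per-step restart decision can be sketched in a few lines. This is a toy single-machine version under an adjacency-dict assumption; the real implementation is batched, samples neighbours deterministically via min-hash, and runs on DataFrames.

```python
import random

def random_walk_with_restart(adj, start, length, restart_prob, rng):
    """Generate one walk of the given length: at each step either
    restart at the origin with probability restart_prob (or when the
    current vertex has no outgoing neighbours), or hop to a uniformly
    random neighbour."""
    walk = [start]
    cur = start
    for _ in range(length - 1):
        if rng.random() < restart_prob or not adj.get(cur):
            cur = start                  # restart at the walk's origin
        else:
            cur = rng.choice(adj[cur])   # continue to a random neighbour
        walk.append(cur)
    return walk
```

With `restart_prob = 1.0` the walk never leaves the origin; with `0.0` it degenerates to a plain random walk that only restarts at dead ends.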
Embedding models
Two sequence-to-vector models are available:
| Model | Strengths | Typical dimensions | Scale |
|---|---|---|---|
| Word2Vec | Higher quality embeddings, well-studied | 50-300 | ~20M vertices |
| Hash2Vec | No vocabulary needed, constant memory per element | 512+ | Billions of vertices |
Hash2Vec is based on Argerich, Torré Zaffaroni, and Cano, "Hash2vec: feature hashing for word embeddings" (arXiv:1608.08940, 2016), and uses MurmurHash3 to avoid storing explicit vocabularies. Internally it uses a custom PagedMatrixDouble structure with 4096-element pages for cache-friendly, GC-efficient vector accumulation.
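The feature-hashing idea behind Hash2Vec can be sketched like this. It is a toy version: md5 stands in for MurmurHash3, plain Python lists stand in for PagedMatrixDouble, and the distance weighting is a simplification.

```python
import hashlib

def h(token, salt):
    """Deterministic integer hash of a token (md5 as a stand-in)."""
    return int(hashlib.md5(f"{salt}:{token}".encode()).hexdigest(), 16)

def hash2vec(sequences, dim=64, window=2):
    """Accumulate, for every vertex, a fixed-size hashed vector of its
    walk contexts: each context token hashes to an index and a sign,
    so no vocabulary is ever stored (classic feature hashing)."""
    vecs = {}
    for seq in sequences:
        for i, v in enumerate(seq):
            vec = vecs.setdefault(v, [0.0] * dim)
            for j in range(max(0, i - window), min(len(seq), i + window + 1)):
                if j == i:
                    continue
                c = seq[j]
                idx = h(c, "idx") % dim
                sign = 1.0 if h(c, "sign") % 2 == 0 else -1.0
                vec[idx] += sign / abs(i - j)   # closer contexts weigh more
    return vecs
```

Because everything is derived from hashes, memory per element is constant and the result is fully reproducible, which is what lets the approach scale to billions of vertices.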
End-to-end pipeline
The RandomWalkEmbeddings class ties everything together:
```scala
import org.graphframes.embeddings.RandomWalkEmbeddings

val embeddings = graph.randomWalksBasedEmbedding
  .setRandomWalks(rw)
  .setSequenceModel(Right(new Hash2Vec().setEmbeddingsDim(512)))
  .setAggregateNeighbors(true) // +20% quality via neighborhood averaging
  .run()

// Returns a DataFrame with all vertex columns + an "embedding" column
embeddings.show()
```
Enabling aggregateNeighbors applies a GraphSAGE-style convolution that averages sampled neighbor embeddings and concatenates them with the node's own vector, improving downstream task performance by 20%+ in synthetic benchmarks.
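The averaging step can be sketched as follows, with dictionaries standing in for DataFrames; the mean-aggregate-and-concatenate shape is the GraphSAGE-style part, everything else is illustrative.

```python
def aggregate_neighbors(embeddings, adj):
    """Concatenate each vertex's own vector with the element-wise mean
    of its neighbours' vectors, doubling the embedding dimension."""
    out = {}
    for v, own in embeddings.items():
        nbrs = [embeddings[u] for u in adj.get(v, []) if u in embeddings]
        if nbrs:
            mean = [sum(col) / len(nbrs) for col in zip(*nbrs)]
        else:
            mean = [0.0] * len(own)      # isolated vertex: zero context
        out[v] = own + mean              # list concatenation
    return out
```

Concatenation (rather than summation) keeps the vertex's own signal separable from its neighbourhood signal for downstream models.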
Use cases
- Node classification: predict vertex labels from learned representations
- Link prediction: score potential edges by embedding similarity
- Community detection: cluster vertices in embedding space
- Anomaly detection: identify outliers far from their neighborhood in embedding space
PySpark Property Graph API
A new PropertyGraphFrame API was added for PySpark, enabling multi-typed graphs where different vertex and edge groups coexist in a single logical structure.
Core classes
- VertexPropertyGroup: wraps a DataFrame of vertices with a name, a primary key, and optional ID masking (SHA256 hashing to prevent collisions across groups)
- EdgePropertyGroup: wraps edges with source/destination property groups, direction, and optional weights
- PropertyGraphFrame: manages collections of vertex and edge groups
Example
```python
from graphframes.pg import PropertyGraphFrame, VertexPropertyGroup, EdgePropertyGroup

users = VertexPropertyGroup("users", users_df, primary_key_column="user_id")
products = VertexPropertyGroup("products", products_df, primary_key_column="product_id")

purchases = EdgePropertyGroup(
    "purchases", purchases_df,
    src_property_group=users, dst_property_group=products,
    is_directed=True,
    src_column_name="user_id", dst_column_name="product_id",
    weight_column_name="amount",
)

pg = PropertyGraphFrame([users, products], [purchases])

# Convert to a standard GraphFrame and run algorithms
gf = pg.to_graphframe(["users", "products"], ["purchases"])
components = gf.connectedComponents()

# Join results back to original vertex data
result = pg.join_vertices(components, ["users"])
```
Bipartite projection
PropertyGraphFrame also supports bipartite graph projection, creating edges between vertices of the same type that share neighbors in the other partition:
```python
projected = pg.projection_by("users", "products", "purchases",
                             new_edge_weight=lambda w1, w2: w1 + w2)
```
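Conceptually, the projection connects same-side vertices that share a neighbour on the other side, combining the weights of the two incident edges. A toy sketch of that semantics (function and argument names here are illustrative, not the library API):

```python
from collections import defaultdict

def project_bipartite(edges, combine):
    """Project a bipartite edge list (src, dst, weight) onto the src
    side: every pair of src vertices sharing a dst gets an edge whose
    weight combines the two incident edge weights."""
    by_dst = defaultdict(list)
    for src, dst, w in edges:
        by_dst[dst].append((src, w))
    projected = defaultdict(float)
    for incident in by_dst.values():
        for i in range(len(incident)):
            for j in range(i + 1, len(incident)):
                (u, w1), (v, w2) = incident[i], incident[j]
                if u != v:
                    key = (min(u, v), max(u, v))   # undirected pair
                    projected[key] += combine(w1, w2)
    return dict(projected)
```

For the purchases example, two users who bought the same product end up connected, with their purchase amounts combined by the supplied function.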
Other new features
Approximate triangle counting
A new approximate triangle counting algorithm was added using Apache DataSketches. This provides fast, memory-efficient triangle count estimates for large graphs where exact counting would be prohibitively expensive. Note that this feature requires Spark 4.1.0 or later, as it relies on the theta_sketch_agg, theta_intersection, and theta_sketch_estimate SQL functions introduced in that version.
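The estimator rests on the classic per-edge neighbour-intersection identity: summing |N(u) ∩ N(v)| over all edges counts every triangle three times. Below is a toy exact version of that identity; the DataSketches variant replaces the exact neighbour sets with theta sketches and the set intersection with theta_intersection, trading exactness for bounded memory. This sketch is illustrative, not the library code.

```python
from collections import defaultdict

def triangles_by_edge_intersection(edges):
    """Exact triangle count over an undirected edge list: each triangle
    is counted once per incident edge, hence the division by 3."""
    nbrs = defaultdict(set)
    for u, v in edges:
        if u != v:
            nbrs[u].add(v)
            nbrs[v].add(u)
    total = sum(len(nbrs[u] & nbrs[v]) for u, v in edges if u != v)
    return total // 3
```

With sketches, `len(nbrs[u] & nbrs[v])` becomes a cardinality estimate, so memory per vertex stays constant no matter how many neighbours it has.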
Aggregate Neighbors API
A new AggregateNeighbors class implements multi-hop breadth-first traversal with customizable accumulators, stopping conditions, target conditions, and edge filters. This is useful for computing reachability, path-based features, and neighborhood aggregations.
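The accumulator idea can be sketched as a multi-hop BFS fold. This is illustrative only: the real API operates on DataFrames and additionally supports stopping conditions, target conditions, and edge filters.

```python
def aggregate_neighbors_bfs(adj, start, max_hops, accumulate, init):
    """Breadth-first traversal up to max_hops that folds a caller-supplied
    accumulator over every newly reached vertex."""
    acc = init
    frontier, seen = {start}, {start}
    for _ in range(max_hops):
        nxt = set()
        for v in frontier:
            for u in adj.get(v, []):
                if u not in seen:
                    seen.add(u)
                    nxt.add(u)
                    acc = accumulate(acc, u)   # fold in the new vertex
        frontier = nxt
        if not frontier:                       # nothing left to expand
            break
    return acc, seen
```

Counting reached vertices, collecting path features, or summing attribute values are all just different choices of `accumulate` and `init`.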
Pattern matching improvements
- Anonymous vertices in variable-length patterns: `()-[*1..3]->(v)`
- Undirected fixed-length patterns: `(u)-[*2]-(v)`
- Chaining with fixed-length patterns now works correctly
Future work
- GQL query engine: PropertyGraphFrame already carries full schema information: which vertex groups exist, which edge groups connect them, and how. This is exactly the metadata a query engine needs. A planned next step is to add a GQL (ISO/IEC 39075) subset query engine on top of PropertyGraphFrame that compiles MATCH statements into optimized Spark DataFrame join plans.
- GraphFrame.reverse(): a convenience method to reverse edge directions by swapping the src and dst columns, useful for computing in-degree centralities, reverse reachability, and transpose-based algorithms like Kosaraju's strongly connected components.
- Aggregate Neighbors algorithms: new graph algorithms built on top of the AggregateNeighbors API, such as all-paths computation.
- Random walk variants: additional random walk implementations, including weighted random walks and temporal random walks.