Quick-Start
This quick-start guide shows how to get started using GraphFrames. After you work through this guide, move on to the User Guide to learn more about the many queries and algorithms supported by GraphFrames.
The following example shows how to create a GraphFrame, query it, and run the PageRank algorithm.
Python API
# Create a Vertex DataFrame with unique ID column "id"
v = spark.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
], ["src", "dst", "relationship"])
# Create a GraphFrame
from graphframes import GraphFrame
g = GraphFrame(v, e)
# Query: Get in-degree of each vertex.
g.inDegrees.show()
# Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()
# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()
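To sanity-check the two queries above without a Spark session, the same quantities can be computed in plain Python on the example edge list. This is only an illustrative sketch, not how GraphFrames computes them; the variable names `in_degrees` and `follow_count` are made up for this example.

```python
from collections import Counter

# Same edge list as the quick-start example: (src, dst, relationship).
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
]

# In-degree: the number of edges pointing at each vertex.
in_degrees = Counter(dst for _src, dst, _rel in edges)

# Number of edges whose relationship is "follow".
follow_count = sum(1 for _src, _dst, rel in edges if rel == "follow")

print(dict(in_degrees))  # {'b': 2, 'c': 1}
print(follow_count)      # 2
```

Vertex "a" has no incoming edges, so it does not appear in the in-degree result of this sketch.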
Scala API
// import graphframes package
import org.graphframes._
// Create a Vertex DataFrame with unique ID column "id"
val v = spark.createDataFrame(List(
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30)
)).toDF("id", "name", "age")
// Create an Edge DataFrame with "src" and "dst" columns
val e = spark.createDataFrame(List(
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow")
)).toDF("src", "dst", "relationship")
// Create a GraphFrame (GraphFrame is already in scope via the org.graphframes._ import above)
val g = GraphFrame(v, e)
// Query: Get in-degree of each vertex.
g.inDegrees.show()
// Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()
// Run PageRank algorithm, and show results.
val results = g.pageRank.resetProbability(0.01).maxIter(20).run()
results.vertices.select("id", "pagerank").show()
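Both examples above run PageRank with a reset probability of 0.01 and 20 iterations. To give a feel for what these two parameters control, here is a minimal textbook power-iteration sketch in plain Python on the same graph. It is an assumption-laden illustration, not the GraphFrames or GraphX implementation: the function name `pagerank_sketch` is hypothetical, ranks are normalized to sum to 1, and the values GraphFrames reports may be scaled differently.

```python
def pagerank_sketch(vertices, edges, reset_probability=0.15, max_iter=20):
    """Textbook power-iteration PageRank (illustrative only)."""
    n = len(vertices)

    # Count outgoing edges per vertex.
    out_degree = {v: 0 for v in vertices}
    for src, _dst in edges:
        out_degree[src] += 1

    # Start from a uniform distribution.
    ranks = {v: 1.0 / n for v in vertices}

    for _ in range(max_iter):
        # Rank held by vertices without out-edges is spread evenly.
        dangling = sum(ranks[v] for v in vertices if out_degree[v] == 0)
        base = (reset_probability + (1.0 - reset_probability) * dangling) / n
        new_ranks = {v: base for v in vertices}

        # Each vertex passes its rank along its out-edges in equal shares.
        for src, dst in edges:
            share = ranks[src] / out_degree[src]
            new_ranks[dst] += (1.0 - reset_probability) * share

        ranks = new_ranks

    return ranks


# Same graph as the quick-start example, without edge attributes.
vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "b")]

ranks = pagerank_sketch(vertices, edges, reset_probability=0.01, max_iter=20)
for v in vertices:
    print(v, ranks[v])
```

In this sketch, `reset_probability` is the chance of jumping to a random vertex instead of following an edge, and `max_iter` fixes the number of update rounds. Vertex "a" receives no incoming rank, so it ends up with the lowest score.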
Graph Algorithms
Historically, Apache Spark had a built-in graph processing library named GraphX, which was based on RDDs (the pre-Spark 2.x way of doing things). GraphX provided a set of graph algorithms, such as PageRank and LabelPropagation. In Spark 4.0.x, GraphX was deprecated and is no longer recommended for use. In contrast, GraphFrames represents graphs using Spark's Dataset / DataFrame API. It also provides a set of standard graph algorithms, and this set is growing.
For algorithms implemented in GraphX but not yet supported natively in GraphFrames, the library also provides a conversion method (see the User Guide). The following table shows the currently supported algorithms:
Algorithm | GraphX Wrapper | GraphFrames Implementation | Recommendations |
---|---|---|---|
BFS | Yes | Yes | GraphFrames provides a smoother API |
Connected Components | Yes | Yes | GraphX for small graphs and streaming; otherwise GraphFrames |
Strongly Connected Components | Yes | No | GraphX |
Label Propagation Algorithm | Yes | Yes | GraphFrames is an order of magnitude faster |
PageRank | Yes | No | GraphX |
Parallel Personalized PageRank | Yes | No | GraphX |
Shortest Paths | Yes | Yes | GraphX for small graphs and streaming; otherwise GraphFrames |
Triangle Count | Yes | Yes | GraphFrames provides a smoother API |
SVD++ | Yes | No | GraphX |