Quick-Start

This quick-start guide shows how to get started using GraphFrames. After you work through this guide, move on to the User Guide to learn more about the many queries and algorithms supported by GraphFrames.

The following example shows how to create a GraphFrame, query it, and run the PageRank algorithm.

Python API

# Create a Vertex DataFrame with unique ID column "id"
v = spark.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
], ["id", "name", "age"])

# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
], ["src", "dst", "relationship"])

# Create a GraphFrame
from graphframes import GraphFrame

g = GraphFrame(v, e)

# Query: Get in-degree of each vertex.
g.inDegrees.show()

# Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()

# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()
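As a sanity check on the in-degree and "follow" queries above, the same quantities can be computed for this three-edge graph in plain Python, without Spark. This is only a sketch of what the queries compute, not of how GraphFrames executes them; the names `edges`, `in_degrees`, and `follow_count` are illustrative and not part of the GraphFrames API.

```python
from collections import Counter

# Same edge list as the Edge DataFrame above: (src, dst, relationship)
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
]

# In-degree: number of edges arriving at each vertex.
# Vertices with no incoming edges (here "a") never appear as a key.
in_degrees = Counter(dst for _, dst, _ in edges)

# Number of edges whose relationship is "follow"
follow_count = sum(1 for _, _, rel in edges if rel == "follow")

print(dict(in_degrees))
print(follow_count)
```

On this data, vertex "b" has two incoming edges, vertex "c" has one, and two edges are labeled "follow", which is what the GraphFrames queries should report for the same graph.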

Scala API

// import graphframes package

import org.graphframes._

// Create a Vertex DataFrame with unique ID column "id"
val v = spark.createDataFrame(List(
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30)
)).toDF("id", "name", "age")

// Create an Edge DataFrame with "src" and "dst" columns
val e = spark.createDataFrame(List(
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow")
)).toDF("src", "dst", "relationship")

// Create a GraphFrame (GraphFrame is already in scope via org.graphframes._)
val g = GraphFrame(v, e)

// Query: Get in-degree of each vertex.
g.inDegrees.show()

// Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()

// Run PageRank algorithm, and show results.
val results = g.pageRank.resetProbability(0.01).maxIter(20).run()
results.vertices.select("id", "pagerank").show()

Graph Algorithms

Historically, Apache Spark had a built-in graph processing library named GraphX, which was based on RDDs (the pre-Spark 2.x programming model). GraphX provided a set of graph algorithms such as PageRank and Label Propagation. In Spark 4.0.x, GraphX was deprecated and is no longer recommended for use. In contrast, GraphFrames represents graphs using Spark's Dataset / DataFrame API. It also provides a growing set of standard graph algorithms. For algorithms that are implemented in GraphX but not yet supported natively in GraphFrames, the library provides a conversion method (see the User Guide). The following table shows the currently supported algorithms:

Algorithm	GraphX Wrapper	GraphFrames Implementation	Recommendations
BFS	Yes	Yes	GraphFrames provides a smoother API
Connected Components	Yes	Yes	GraphX for small graphs and streaming, otherwise GraphFrames
Strongly Connected Components	Yes	No	GraphX
Label Propagation Algorithm	Yes	Yes	GraphX for small graphs and streaming, otherwise GraphFrames
PageRank	Yes	No	GraphX
Parallel Personalized PageRank	Yes	No	GraphX
Shortest Paths	Yes	Yes	GraphX for small graphs and streaming, otherwise GraphFrames
Triangle Count	Yes	Yes	GraphFrames provides a smoother API
SVD++	Yes	No	GraphX
Cycles Detection	No	Yes	GraphFrames
Triangle Count	No	Yes	GraphFrames
K-Core	No	Yes	GraphFrames
Maximal Independent Set	No	Yes	GraphFrames
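
PageRank appears both in the quick-start example and in the table above. For readers new to the algorithm, the following is a minimal plain-Python sketch of textbook PageRank by power iteration on the quick-start graph. It is not the GraphFrames or GraphX implementation: the function `pagerank` and its parameter names are hypothetical, it assumes every vertex has at least one outgoing edge, and it normalizes ranks to sum to 1, so its numbers need not match what `results.vertices` reports.

```python
def pagerank(edges, vertices, reset_probability=0.01, max_iter=20):
    """Textbook PageRank by power iteration; ranks are normalized to sum to 1.

    Assumes every vertex has at least one outgoing edge (no dangling vertices).
    """
    n = len(vertices)

    # Count outgoing edges per vertex.
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    # Start from a uniform distribution.
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(max_iter):
        # Every vertex starts each round with an equal share of the reset mass.
        new_ranks = {v: reset_probability / n for v in vertices}
        # Each vertex splits the rest of its rank evenly across its out-edges.
        for src, dst in edges:
            new_ranks[dst] += (1.0 - reset_probability) * ranks[src] / out_degree[src]
        ranks = new_ranks
    return ranks


# Quick-start graph, with edge attributes dropped
vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "b")]

ranks = pagerank(edges, vertices)
for vertex in vertices:
    print(vertex, ranks[vertex])
```

Because vertex "a" has no incoming edges, it only ever receives the reset share, so it ends up with the lowest rank, while "b" and "c" pass most of the rank back and forth between themselves.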