Installation
If you are new to using Apache Spark, refer to the Apache Spark Documentation and its Quick-Start Guide for more information.
Maven Central Coordinates
GraphFrames core is published to Maven Central under the io.graphframes namespace. Artifact IDs follow this naming pattern (the core artifact omits the component name):
graphframes-{component-name}-{spark-major-version}_{scala-version}
Examples:
graphframes-spark3_2.12: GraphFrames core for Spark 3.x and Scala 2.12
graphframes-graphx-spark4_2.13: GraphFrames internal fork of GraphX for Spark 4.x and Scala 2.13
graphframes-connect-spark3_2.13: GraphFrames Spark Connect plugin for Spark 3.x and Scala 2.13
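The naming pattern can be illustrated with a small sketch that assembles artifact IDs and full Maven coordinates. The helper names `artifact_id` and `maven_coordinate` are hypothetical and are not part of GraphFrames. The sketch assumes that the core artifact simply drops the component segment, as the first example above suggests.

```python
from typing import Optional


def artifact_id(spark_major: int, scala_version: str,
                component: Optional[str] = None) -> str:
    """Compose graphframes-{component-name}-{spark-major-version}_{scala-version}.

    Hypothetical helper: `component` is None for the core artifact,
    otherwise e.g. "connect" or "graphx".
    """
    parts = ["graphframes"]
    if component:
        # Core has no component segment; plugins and forks do.
        parts.append(component)
    parts.append(f"spark{spark_major}")
    return "-".join(parts) + f"_{scala_version}"


def maven_coordinate(artifact: str, version: str,
                     group: str = "io.graphframes") -> str:
    """Join group, artifact and version in the form accepted by --packages."""
    return f"{group}:{artifact}:{version}"


print(artifact_id(3, "2.12"))
print(artifact_id(4, "2.13", "graphx"))
print(maven_coordinate(artifact_id(3, "2.13", "connect"), "0.11.0"))
```

The three printed values correspond to the core, GraphX and Spark Connect examples listed above, with the last one expanded to a full coordinate for version 0.11.0.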
Core
GraphFrames core is the main package that should be used.
Spark-Connect plugin
Only for users who want to use GraphFrames with Spark Connect.
GraphFrames-GraphX
A runtime dependency of GraphFrames that should be resolved automatically. It contains an internal, modified, and updated fork of Apache Spark GraphX.
Spark Versions Compatibility
| Component | Spark 3.x (Scala 2.12) | Spark 3.x (Scala 2.13) | Spark 4.x (Scala 2.13) |
|---|---|---|---|
| graphframes | ✓ | ✓ | ✓ |
| graphframes-connect | ✓ | ✓ | ✓ |
The following example shows how to run the Spark shell with the GraphFrames package. We use the --packages argument to download the graphframes package and any dependencies automatically.
Spark 3.x
Spark Shell
$ ./bin/spark-shell --packages io.graphframes:graphframes-spark3_2.12:0.11.0
To use Scala 2.13 instead, run:
$ ./bin/spark-shell --packages io.graphframes:graphframes-spark3_2.13:0.11.0
PySpark
$ pip install graphframes-py==0.11.0
$ ./bin/pyspark --packages io.graphframes:graphframes-spark3_2.12:0.11.0
Spark 4.x
Spark Shell
$ ./bin/spark-shell --packages io.graphframes:graphframes-spark4_2.13:0.11.0
PySpark
$ pip install graphframes-py==0.11.0
$ ./bin/pyspark --packages io.graphframes:graphframes-spark4_2.13:0.11.0
Spark Connect Server Extension
To add GraphFrames to your Spark Connect server, specify the plugin class name:
For Spark 4.x:
./sbin/start-connect-server.sh \
--conf spark.connect.extensions.relation.classes=\
org.apache.spark.sql.graphframes.GraphFramesConnect \
--packages io.graphframes:graphframes-connect-spark4_2.13:0.11.0
For Spark 3.x:
./sbin/start-connect-server.sh \
--conf spark.connect.extensions.relation.classes=\
org.apache.spark.sql.graphframes.GraphFramesConnect \
--packages io.graphframes:graphframes-connect-spark3_2.12:0.11.0
WARNING: The GraphFrames Connect Server Extension is not compatible with the managed Spark Connect offered by Databricks. To make it work, build the GraphFrames Connect Server Extension from source with the following flag:
./build/sbt -Dvendor.name=dbx connect/assembly
Spark Connect Clients
At the moment, GraphFrames ships only a PySpark client, bundled with the Python package: pip install graphframes-py==0.11.0. At runtime, the GraphFrames PySpark client automatically handles the connection to the GraphFrames Connect Server Extension when it is running in a Spark Connect environment.
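As a minimal sketch of what client usage might look like, the function below opens a Spark Connect session and builds a tiny graph. It assumes a Connect server with the GraphFrames extension is reachable at `sc://localhost:15002` (Spark Connect's default port), and that pyspark with Connect support and graphframes-py are installed. The function name, the sample data and the URL are illustrative only.

```python
def in_degrees_over_connect(remote_url: str = "sc://localhost:15002"):
    """Build a small graph through Spark Connect and return its in-degrees.

    Illustrative only: the URL and the sample data are assumptions.
    """
    # Imports are local so this module loads even where pyspark is absent.
    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    # A remote session makes this a Spark Connect environment.
    spark = SparkSession.builder.remote(remote_url).getOrCreate()

    vertices = spark.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Carol")],
        ["id", "name"],
    )
    edges = spark.createDataFrame(
        [("a", "b", "follows"), ("b", "c", "follows")],
        ["src", "dst", "relationship"],
    )

    # Same constructor as in a classic (non-Connect) session.
    graph = GraphFrame(vertices, edges)
    return graph.inDegrees.collect()
```

Calling `in_degrees_over_connect()` against a running server should return one row per vertex that has incoming edges. No Connect-specific GraphFrames call appears in the user code, which is the point of the automatic handling described above.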
Messages
At the moment, the following APIs are exposed:
message GraphFramesAPI {
  bytes vertices = 1;
  bytes edges = 2;
  oneof method {
    AggregateMessages aggregate_messages = 3;
    BFS bfs = 4;
    ConnectedComponents connected_components = 5;
    DropIsolatedVertices drop_isolated_vertices = 6;
    DetectingCycles detecting_cycles = 7;
    FilterEdges filter_edges = 8;
    FilterVertices filter_vertices = 9;
    Find find = 10;
    LabelPropagation label_propagation = 11;
    PageRank page_rank = 12;
    ParallelPersonalizedPageRank parallel_personalized_page_rank = 13;
    PowerIterationClustering power_iteration_clustering = 14;
    Pregel pregel = 15;
    ShortestPaths shortest_paths = 16;
    StronglyConnectedComponents strongly_connected_components = 17;
    SVDPlusPlus svd_plus_plus = 18;
    TriangleCount triangle_count = 19;
    Triplets triplets = 20;
    KCore kcore = 21;
    MaximalIndependentSet mis = 22;
    RandomWalkEmbeddings rw_embeddings = 23;
    AggregateNeighbors aggregate_neighbors = 24;
  }
}
Building GraphFrames from Source
./build/sbt package
Nightly Builds
The GraphFrames project publishes snapshots (nightly builds) to the "Central Portal Snapshots" repository. Please read the relevant section of the Sonatype documentation to learn how to use snapshots in your project.
GroupId: io.graphframes
ArtifactIds:
graphframes-spark3_2.12
graphframes-spark3_2.13
graphframes-connect-spark3_2.12
graphframes-connect-spark3_2.13
graphframes-graphx-spark3_2.12
graphframes-graphx-spark3_2.13
graphframes-spark4_2.13
graphframes-connect-spark4_2.13
graphframes-graphx-spark4_2.13