Installation
If you are new to using Apache Spark, refer to the Apache Spark Documentation and its Quick-Start Guide for more information.
Spark Versions Compatibility
| Component | Spark 3.x (Scala 2.12) | Spark 3.x (Scala 2.13) | Spark 4.x (Scala 2.13) |
|---|---|---|---|
| graphframes | ✓ | ✓ | ✓ |
| graphframes-connect | ✓ | ✓ | ✓ |
The following examples show how to run the Spark shell with the GraphFrames package. We use the `--packages` argument to download the GraphFrames package and its dependencies automatically.
Spark 3.x
Spark Shell
$ ./bin/spark-shell --packages io.graphframes:graphframes-spark3_2.12:0.9.3
Or use the following command to force the use of Scala 2.13:
$ ./bin/spark-shell --packages io.graphframes:graphframes-spark3_2.13:0.9.3
PySpark
$ pip install graphframes-py==0.9.3
$ ./bin/pyspark --packages io.graphframes:graphframes-spark3_2.12:0.9.3
Spark 4.x
Spark Shell
$ ./bin/spark-shell --packages io.graphframes:graphframes-spark4_2.13:0.9.3
PySpark
$ pip install graphframes-py==0.9.3
$ ./bin/pyspark --packages io.graphframes:graphframes-spark4_2.13:0.9.3
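Once the shell starts with the package on the classpath, GraphFrames can be used right away. The following is a minimal sketch, assuming a working PySpark installation with the GraphFrames JVM package loaded as shown above; the vertex and edge data are illustrative only.

```python
# Minimal sketch: build a GraphFrame and run PageRank.
# Assumes graphframes-py is installed and the matching JVM package
# was loaded via --packages as shown above.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graphframes-demo").getOrCreate()

# Vertices must have an "id" column; edges must have "src" and "dst".
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"]
)
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"]
)

g = GraphFrame(vertices, edges)
g.inDegrees.show()

# PageRank returns a new GraphFrame with a "pagerank" vertex column.
results = g.pageRank(resetProbability=0.15, maxIter=5)
results.vertices.select("id", "pagerank").show()
```

The same code works unchanged on Spark 3.x and 4.x; only the `--packages` coordinate differs.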
Spark Connect Server Extension
To add GraphFrames to your Spark Connect server, you need to specify the plugin class name:
For Spark 4.x:
./sbin/start-connect-server.sh \
--conf spark.connect.extensions.relation.classes=\
org.apache.spark.sql.graphframes.GraphFramesConnect \
--packages io.graphframes:graphframes-connect-spark4_2.13:0.9.3
For Spark 3.x:
./sbin/start-connect-server.sh \
--conf spark.connect.extensions.relation.classes=\
org.apache.spark.sql.graphframes.GraphFramesConnect \
--packages io.graphframes:graphframes-connect-spark3_2.12:0.9.3
WARNING: The GraphFrames Connect Server Extension is not compatible with the managed Spark Connect offered by Databricks. To make it work, you need to build the GraphFrames Connect Server Extension from source with a vendor flag:
./build/sbt -Dvendor.name=dbx connect/assembly
Spark Connect Clients
At the moment, GraphFrames ships only a PySpark client, bundled with the package: `pip install graphframes-py==0.9.3`. At runtime, the GraphFrames PySpark client automatically handles the connection to the GraphFrames Connect Server Extension when it runs in a Spark Connect environment.
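As a sketch of what this looks like in practice: the client code is identical to the classic PySpark example, except that the session is created against a remote Connect endpoint. The host and port below are assumptions (port 15002 is the Spark Connect default); the server must have been started with the GraphFrames extension as shown above.

```python
# Sketch: using GraphFrames through a Spark Connect session.
# Assumes graphframes-py is installed locally and the Connect server
# was started with the GraphFramesConnect relation plugin (see above).
from pyspark.sql import SparkSession
from graphframes import GraphFrame

# "sc://localhost:15002" is the assumed endpoint of the Connect server.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

v = spark.createDataFrame([("a",), ("b",)], ["id"])
e = spark.createDataFrame([("a", "b")], ["src", "dst"])

g = GraphFrame(v, e)
# GraphFrames calls are routed through the server-side plugin;
# no code changes are needed compared to classic PySpark.
g.triplets.show()
```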
Messages
At the moment, the following APIs are exposed:
message GraphFramesAPI {
bytes vertices = 1;
bytes edges = 2;
oneof method {
AggregateMessages aggregate_messages = 3;
BFS bfs = 4;
ConnectedComponents connected_components = 5;
DropIsolatedVertices drop_isolated_vertices = 6;
FilterEdges filter_edges = 7;
FilterVertices filter_vertices = 8;
Find find = 9;
LabelPropagation label_propagation = 10;
PageRank page_rank = 11;
ParallelPersonalizedPageRank parallel_personalized_page_rank = 12;
PowerIterationClustering power_iteration_clustering = 13;
Pregel pregel = 14;
ShortestPaths shortest_paths = 15;
StronglyConnectedComponents strongly_connected_components = 16;
SVDPlusPlus svd_plus_plus = 17;
TriangleCount triangle_count = 18;
Triplets triplets = 19;
}
}
Building GraphFrames from Source
./build/sbt package
Nightly Builds
The GraphFrames project publishes SNAPSHOT (nightly) builds to the Central Portal Snapshots repository. Please read the corresponding section of the Sonatype documentation to learn how to use snapshots in your project.
GroupId: io.graphframes
ArtifactIds:
graphframes-spark3_2.12
graphframes-spark3_2.13
graphframes-connect-spark3_2.12
graphframes-connect-spark3_2.13
graphframes-spark4_2.13
graphframes-connect-spark4_2.13
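For sbt users, consuming a snapshot typically means adding the snapshots resolver and depending on a `-SNAPSHOT` version. The following build.sbt fragment is a sketch: the resolver URL is the Central Portal snapshots repository as documented by Sonatype, and the version shown is an assumption — check the repository for the current snapshot version.

```scala
// build.sbt sketch: pulling a GraphFrames nightly (SNAPSHOT) build.
// The version below is illustrative; look up the current snapshot.
resolvers += "Central Portal Snapshots" at
  "https://central.sonatype.com/repository/maven-snapshots/"

// %% appends the Scala binary version, resolving to e.g.
// graphframes-spark3_2.12 or graphframes-spark3_2.13.
libraryDependencies += "io.graphframes" %% "graphframes-spark3" % "0.9.4-SNAPSHOT"
```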