Contributing Guide
The GraphFrames project welcomes new contributors. This guide provides an end-to-end checklist—from preparing your workstation to submitting a pull request—so that you can iterate with confidence and keep the test suites green.
1. Prerequisites
Ensure the following tools are installed before cloning the repository:
| Tool | Recommended Version | Notes |
|---|---|---|
| Git | Latest stable | Required for version control and contribution workflows. |
| Java Development Kit (JDK) | 11 or 17 | Spark 3.x supports Java 8/11/17; GraphFrames CI runs on JDK 17. |
| Python | 3.10 – 3.12 | Required for the Python APIs and tests. |
| Poetry | ≥ 1.8 | Dependency manager used by the Python package. Install via pipx or pip. |
| Protocol Buffers compiler (protoc) | ≥ 3.21 | Required for the GraphFrames Connect protobuf build. |
| Buf CLI | Latest stable | Used to lint and generate protobuf sources. |
| Apache Spark (optional) | 3.5.x (default) or 4.0.x | Only required for the standalone Spark shell or spark-submit outside PySpark; poetry install provides a matching PySpark for the Python test suite. |
| Docker (optional) | Latest stable | Useful for isolated environments but not mandatory. |
1.1 Install required tooling
macOS (Homebrew)
```
brew update
brew install git openjdk@17 python@3.12 pipx protobuf bufbuild/buf/buf
pipx install poetry
```
Add the Java toolchain to your shell profile (for example ~/.zshrc or ~/.bashrc):
```
export JAVA_HOME="$(/usr/libexec/java_home -v17)"
export PATH="$JAVA_HOME/bin:$PATH"
```
Ubuntu / Debian
```
sudo apt update
sudo apt install -y git openjdk-17-jdk python3 python3-venv python3-pip curl protobuf-compiler
python3 -m pip install --user pipx
python3 -m pipx ensurepath
pipx install poetry
curl -sSL "https://github.com/bufbuild/buf/releases/latest/download/buf-Linux-x86_64.tar.gz" \
  | sudo tar -xzf - -C /usr/local --strip-components=1
```
Add the Java toolchain to your shell profile (for example ~/.zshrc or ~/.bashrc):
```
export JAVA_HOME="/usr/lib/jvm/java-17-openjdk-amd64"
export PATH="$JAVA_HOME/bin:$PATH"
```
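Regardless of platform, a quick sanity check confirms the toolchain is on your PATH (exact version strings will vary):

```
git --version
java -version      # expect 11 or 17
python3 --version  # expect 3.10 - 3.12
poetry --version
protoc --version   # expect >= 3.21
buf --version
```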
Optional: Standalone Apache Spark distribution
poetry install (described later) already brings in the matching version of PySpark and Spark
Connect. If you also want the standalone Spark shell or spark-submit, download the distribution
that matches the build’s spark.version (currently 3.5.6) and expose it via SPARK_HOME:
```
curl -O https://downloads.apache.org/spark/spark-3.5.6/spark-3.5.6-bin-hadoop3.tgz
mkdir -p "$HOME/.local/spark"
tar -xzf spark-3.5.6-bin-hadoop3.tgz -C "$HOME/.local/spark"
export SPARK_HOME="$HOME/.local/spark/spark-3.5.6-bin-hadoop3"
export PATH="$SPARK_HOME/bin:$PATH"
```
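If you installed the standalone distribution, you can confirm that SPARK_HOME is wired up correctly:

```
spark-submit --version   # should report Spark 3.5.6
```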
Tip: To build against Spark 4.x, pass `-Dspark.version=<version>` to `./build/sbt` or to the jar-building helper script referenced in the Python workflow.
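For example, to compile against a Spark 4.x release (the same invocation appears in the sbt task table in section 3.4):

```
./build/sbt -Dspark.version=4.0.1 compile
```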
2. Clone the repository
- Fork the repository on GitHub (recommended) and copy the clone URL.
- Clone your fork and switch into the workspace:

  ```
  git clone https://github.com/<your-user>/graphframes.git
  cd graphframes
  ```

- Configure the upstream remote to stay in sync:

  ```
  git remote add upstream https://github.com/graphframes/graphframes.git
  git fetch upstream
  ```

- Create a topic branch for your change:

  ```
  git checkout -b feature/my-change
  ```
3. Scala / JVM workflow
GraphFrames provides a checked-in sbt launcher at ./build/sbt; a separate sbt
installation is not required.
3.1 Compile the project
```
./build/sbt compile
```
The first run downloads all dependencies and may take several minutes. If
compilation fails with -Xfatal-warnings complaining about deprecated
java.net.URL constructors, ensure the build is using your Java 17 toolchain. You
can force it with:
```
./build/sbt -java-home "$JAVA_HOME" compile
```
3.2 Format Scala code
You can use the project’s pre-commit hooks to automatically format all Scala code at once, or run the equivalent sbt tasks directly, as shown below.
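The same formatting and linting tasks listed in the task table in section 3.4 can be invoked by hand:

```
./build/sbt scalafmtAll test:scalafmt scalafixAll
```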
3.3 Run Scala tests
Run the full test suite:
```
./build/sbt test
```
Focus on a specific suite while iterating:
./build/sbt "core/testOnly org.graphframes.lib.PageRankSuite"
3.4 Helpful sbt tasks
| Command | Purpose |
|---|---|
| ./build/sbt core/assembly | Builds the uber-jar required for Python tests. |
| ./build/sbt doc | Generates Scala API documentation. |
| ./build/sbt +package | Cross-builds artifacts for all supported Scala versions. |
| ./build/sbt scalafmtAll test:scalafmt scalafixAll | Formats and lints all Scala code using Scalafmt and Scalafix. |
| ./build/sbt docs/laikaPreview | Serves the documentation site locally at http://localhost:4242. |
| ./build/sbt -Dspark.version=4.0.1 compile | Compiles against Spark 4.x APIs. |
| ./build/sbt package -Dvendor.name=dbx | Produces Databricks-compatible Spark Connect jars. |
4. Python workflow
The Python package resides under python/ and uses Poetry for dependency
management.
To build the GraphFrames assembly jar and run the Python test suite, you can use the provided helper script and pytest directly.
- Install Python dependencies (from the python/ directory):

  ```
  poetry install --with dev,tutorials,docs
  ```

- Build the required GraphFrames JAR for your Spark version:

  ```
  poetry run python ./dev/build_jar.py
  ```

  This script automatically builds the correct JAR for Spark 3.5.x (or 4.x if specified); you do not need to run sbt directly.

- Run the Python test suite:

  ```
  poetry run pytest -vvv
  ```

  The test configuration automatically picks up the correct JAR and Spark version (see python/tests/conftest.py for details). Enable Spark Connect coverage by exporting SPARK_CONNECT_MODE_ENABLED=1:

  ```
  SPARK_CONNECT_MODE_ENABLED=1 poetry run pytest -vvv
  ```

- Optionally enforce formatting and linting:

  ```
  poetry run black graphframes tests
  poetry run isort graphframes tests
  poetry run flake8 graphframes tests
  ```
You can see this workflow in action in the CI configuration.
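For quicker iteration, pytest's `-k` flag runs only tests whose names match a keyword expression (the keyword below is just an example):

```
poetry run pytest -vvv -k "pagerank"
```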
4.1 PySpark smoke tests
After building the assembly jar you can validate your build in an interactive PySpark session.
- Export the assembly path from the repository root:

  ```
  export GRAPHFRAMES_ASSEMBLY=$(ls ../core/target/scala-${SCALA_VERSION%.*}/graphframes-assembly*.jar | tail -n 1)
  ```

  If the command does not find a jar, rerun ./build/sbt core/assembly and confirm that SCALA_VERSION matches the directory under target/ (for example, 2.12.20).

- Launch PySpark from the Poetry environment so the graphframes package is on the Python path:

  ```
  cd python
  poetry run pyspark \
    --driver-memory 2g \
    --jars "$GRAPHFRAMES_ASSEMBLY" \
    --conf spark.driver.extraClassPath="$GRAPHFRAMES_ASSEMBLY" \
    --conf spark.executor.extraClassPath="$GRAPHFRAMES_ASSEMBLY"
  ```

- In the shell, run a simple PageRank example:

  ```python
  from graphframes import GraphFrame
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  v = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
  e = spark.createDataFrame([(1, 2), (2, 3), (3, 1)], ["src", "dst"])
  g = GraphFrame(v, e)
  g.pageRank(resetProbability=0.15, tol=0.01).vertices.show()
  ```

  Exit the shell with spark.stop() or Ctrl+D when you are done.

- Return to the repository root:

  ```
  cd ..
  ```
4.2 PySpark Connect update
PySpark Connect Plugin messages are located in connect/src/main/protobuf. After making any changes to the messages, for example after adding a new API, the following steps are required:

- Re-compile the connect project, which triggers generation of the new Java classes:

  ```
  ./build/sbt connect/compile
  ```

- Re-generate the Python classes from protobuf via buf:

  ```
  buf generate
  ```
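Putting it together, a typical regeneration pass after editing a message might look like this (commands taken from the steps above and the Python workflow in section 4):

```
./build/sbt connect/compile                          # regenerate Java classes
buf generate                                         # regenerate Python classes
cd python
SPARK_CONNECT_MODE_ENABLED=1 poetry run pytest -vvv  # verify against Spark Connect
```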
5. Pre-commit hooks
Enable the bundled pre-commit hooks to catch formatting and lint issues before
pushing your branch:
```
pipx install pre-commit   # or: pip install pre-commit
pre-commit install
pre-commit run --all-files
```
6. Making and testing changes
- Edit the relevant files.
- Rerun the Scala and/or Python workflows relevant to your changes.
- Check your working tree:

  ```
  git status
  ```

- Stage and commit with a descriptive message:

  ```
  git add <files>
  git commit -m "feat: describe your change"
  ```

- Push your branch and open a pull request:

  ```
  git push origin feature/my-change
  ```

- Keep your branch current by rebasing on the latest upstream changes:

  ```
  git fetch upstream
  git rebase upstream/master
  ```
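If the rebase stops on conflicts, resolve them and continue; a typical sequence looks like this (the paths passed to git add are placeholders):

```
git status                  # list conflicted files
# edit the files to remove conflict markers, then:
git add <resolved-files>
git rebase --continue
git push --force-with-lease origin feature/my-change   # update the PR after a rebase
```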
7. Update documentation
GraphFrames documentation is built with Typelevel Laika. The documentation sources, written in Markdown, live under docs/src.
7.1 Laika Directives
The following custom Laika directives are provided on top of the built-in ones:

- @pydoc(class-name), for example @pydoc(graphframes.GraphFrame) -- references the PySpark API documentation for the class.
- @scaladoc(class-name), for example @scaladoc(org.graphframes.GraphFrame) -- references the Scala API documentation for the class.
- @srcLink(sub-path), for example @srcLink(python/graphframes/tutorials/stackexchange.py) -- links to the source code on GitHub.
The following built-in Laika directives may be useful.
@:image
An example is:
```
@:image(/img/graphframes-internals/graphframes-overview.png) {
  intrinsicWidth = 600
  alt = "An overview of GraphFrames and Apache Spark connection"
  title = "GraphFrames Overview"
}
```
A full list of built-in directives can be found in the Laika documentation.
7.2 Build and preview
To build the documentation and start a local preview server, run ./build/sbt docs/laikaPreview.
8. Quick reference
| Task | Command |
|---|---|
| Compile Scala code | ./build/sbt compile |
| Format Scala code | ./build/sbt scalafmtAll test:scalafmt |
| Run Scala tests | ./build/sbt test |
| Run a specific Scala suite | ./build/sbt "core/testOnly <SuiteName>" |
| Build assembly jar | ./build/sbt core/assembly |
| Install Python dependencies | cd python && poetry install --with dev,tutorials,docs |
| Run Python tests | cd python && poetry run pytest -vvv |
| Run Python formatters | poetry run black graphframes tests |
| Install pre-commit hooks | pre-commit install |
You are now ready to iterate on GraphFrames. Refer back to this guide whenever setting up a new machine or refreshing the local development workflow.