Contributing Guide
The GraphFrames project welcomes new contributors. This guide provides an end-to-end checklist—from preparing your workstation to submitting a pull request—so that you can iterate with confidence and keep the test suites green.
1. Prerequisites
Ensure the following tools are installed before cloning the repository:
| Tool | Recommended Version | Notes |
|---|---|---|
| Git | Latest stable | Required for version control and contribution workflows. |
| Java Development Kit (JDK) | 11 or 17 | Spark 3.x supports Java 8/11/17; GraphFrames CI runs on JDK 17. |
| Python | 3.10 – 3.12 | Required for the Python APIs and tests. |
| Poetry | ≥ 1.8 | Dependency manager used by the Python package. Install via pipx or pip. |
| Protocol Buffers compiler (protoc) | ≥ 3.21 | Required for the GraphFrames Connect protobuf build. |
| Buf CLI | Latest stable | Used to lint and generate protobuf sources. |
| Apache Spark (optional) | 3.5.x (default) or 4.0.x | Only required for the standalone Spark shell or spark-submit outside PySpark; poetry install provides a matching PySpark for the Python test suite. |
| Docker (optional) | Latest stable | Useful for isolated environments but not mandatory. |
1.1 Install required tooling
macOS (Homebrew)
```
brew update
brew install git openjdk@17 python@3.12 pipx protobuf bufbuild/buf/buf
pipx install poetry
```
Add the Java toolchain to your shell profile (for example ~/.zshrc or ~/.bashrc):
```
export JAVA_HOME="$(/usr/libexec/java_home -v17)"
export PATH="$JAVA_HOME/bin:$PATH"
```
Ubuntu / Debian
```
sudo apt update
sudo apt install -y git openjdk-17-jdk python3 python3-venv python3-pip curl protobuf-compiler
python3 -m pip install --user pipx
python3 -m pipx ensurepath
pipx install poetry
curl -sSL "https://github.com/bufbuild/buf/releases/latest/download/buf-Linux-x86_64.tar.gz" \
  | sudo tar -xzf - -C /usr/local --strip-components=1
```
Add the Java toolchain to your shell profile (for example ~/.zshrc or ~/.bashrc):
```
export JAVA_HOME="/usr/lib/jvm/java-17-openjdk-amd64"
export PATH="$JAVA_HOME/bin:$PATH"
```
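Regardless of platform, a quick sanity check confirms the toolchain is on your PATH (exact version strings will vary):

```
git --version
java -version      # expect 11 or 17
python3 --version  # expect 3.10 - 3.12
poetry --version
protoc --version   # expect >= 3.21
buf --version
```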
Optional: Standalone Apache Spark distribution
poetry install (described later) already brings in the matching version of PySpark and Spark
Connect. If you also want the standalone Spark shell or spark-submit, download the distribution
that matches the build’s spark.version (currently 3.5.6) and expose it via SPARK_HOME:
```
curl -O https://downloads.apache.org/spark/spark-3.5.6/spark-3.5.6-bin-hadoop3.tgz
mkdir -p "$HOME/.local/spark"
tar -xzf spark-3.5.6-bin-hadoop3.tgz -C "$HOME/.local/spark"
export SPARK_HOME="$HOME/.local/spark/spark-3.5.6-bin-hadoop3"
export PATH="$SPARK_HOME/bin:$PATH"
```
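If you installed the standalone distribution, you can confirm that SPARK_HOME is wired up correctly:

```
spark-submit --version   # should report Spark 3.5.6
```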
Tip: To build against Spark 4.x, pass `-Dspark.version=<version>` to `./build/sbt` or to the jar-building helper script referenced in the Python workflow.
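For example, to compile against a Spark 4.x release (the same invocation appears in the sbt task table in section 3.4):

```
./build/sbt -Dspark.version=4.0.1 compile
```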
2. Clone the repository
- Fork the repository on GitHub (recommended) and copy the clone URL.
- Clone your fork and switch into the workspace:

  ```
  git clone https://github.com/<your-user>/graphframes.git
  cd graphframes
  ```

- Configure the upstream remote to stay in sync:

  ```
  git remote add upstream https://github.com/graphframes/graphframes.git
  git fetch upstream
  ```

- Create a topic branch for your change:

  ```
  git checkout -b feature/my-change
  ```
3. Scala / JVM workflow
GraphFrames provides a checked-in sbt launcher at ./build/sbt; a separate sbt
installation is not required.
3.1 Compile the project
```
./build/sbt compile
```
The first run downloads all dependencies and may take several minutes. If
compilation fails with -Xfatal-warnings complaining about deprecated
java.net.URL constructors, ensure the build is using your Java 17 toolchain. You
can force it with:
```
./build/sbt -java-home "$JAVA_HOME" compile
```
3.2 Format Scala code
You can use the project’s pre-commit hooks to automatically format all Scala code at once, or run the equivalent sbt tasks directly, as shown below.
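The same formatting and linting tasks listed in the task table in section 3.4 can be invoked by hand:

```
./build/sbt scalafmtAll test:scalafmt scalafixAll
```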
3.3 Run Scala tests
Run the full test suite:
```
./build/sbt test
```
Focus on a specific suite while iterating:
./build/sbt "core/testOnly org.graphframes.lib.PageRankSuite"
3.4 Helpful sbt tasks
| Command | Purpose |
|---|---|
| ./build/sbt core/assembly | Builds the uber-jar required for Python tests. |
| ./build/sbt doc | Generates Scala API documentation. |
| ./build/sbt +package | Cross-builds artifacts for all supported Scala versions. |
| ./build/sbt scalafmtAll test:scalafmt scalafixAll | Formats and lints all Scala code using Scalafmt and Scalafix. |
| ./build/sbt docs/laikaPreview | Serves the documentation site locally at http://localhost:4242. |
| ./build/sbt -Dspark.version=4.0.1 compile | Compiles against Spark 4.x APIs. |
| ./build/sbt package -Dvendor.name=dbx | Produces Databricks-compatible Spark Connect jars. |
4. Python workflow
The Python package resides under python/ and uses Poetry for dependency
management.
To build the GraphFrames assembly jar and run the Python test suite, you can use the provided helper script and pytest directly.
- Install Python dependencies (from the python/ directory):

  ```
  poetry install --with dev,tutorials,docs
  ```

- Build the required GraphFrames JAR for your Spark version:

  ```
  poetry run python ./dev/build_jar.py
  ```

  This script automatically builds the correct JAR for Spark 3.5.x (or 4.x if specified); you do not need to run sbt directly.

- Run the Python test suite:

  ```
  poetry run pytest -vvv
  ```

  The test configuration automatically picks up the correct JAR and Spark version (see python/tests/conftest.py for details). Enable Spark Connect coverage by exporting SPARK_CONNECT_MODE_ENABLED=1:

  ```
  SPARK_CONNECT_MODE_ENABLED=1 poetry run pytest -vvv
  ```

- Optionally enforce formatting and linting:

  ```
  poetry run black graphframes tests
  poetry run isort graphframes tests
  poetry run flake8 graphframes tests
  ```
You can see this workflow in action in the CI configuration.
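For quicker iteration, pytest's `-k` flag runs only tests whose names match a keyword expression (the keyword below is just an example):

```
poetry run pytest -vvv -k "pagerank"
```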
4.1 PySpark smoke tests
After building the assembly jar you can validate your build in an interactive PySpark session.
- Export the assembly path from the repository root:

  ```
  export GRAPHFRAMES_ASSEMBLY=$(ls ../core/target/scala-${SCALA_VERSION%.*}/graphframes-assembly*.jar | tail -n 1)
  ```

  If the command does not find a jar, rerun ./build/sbt core/assembly and confirm that SCALA_VERSION matches the directory under target/ (for example, 2.12.20).

- Launch PySpark from the Poetry environment so the graphframes package is on the Python path:

  ```
  cd python
  poetry run pyspark \
    --driver-memory 2g \
    --jars "$GRAPHFRAMES_ASSEMBLY" \
    --conf spark.driver.extraClassPath="$GRAPHFRAMES_ASSEMBLY" \
    --conf spark.executor.extraClassPath="$GRAPHFRAMES_ASSEMBLY"
  ```

- In the shell, run a simple PageRank example:

  ```python
  from graphframes import GraphFrame
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  v = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
  e = spark.createDataFrame([(1, 2), (2, 3), (3, 1)], ["src", "dst"])
  g = GraphFrame(v, e)
  g.pageRank(resetProbability=0.15, tol=0.01).vertices.show()
  ```

  Exit the shell with spark.stop() or Ctrl+D when you are done.

- Return to the repository root:

  ```
  cd ..
  ```
4.2 PySpark Connect update
PySpark Connect Plugin messages are located in connect/src/main/protobuf. After making any changes to the messages, for example after adding a new API, the following steps are required:

- Re-compile the connect project, which triggers generation of the new Java classes:

  ```
  ./build/sbt connect/compile
  ```

- Re-generate the Python classes from protobuf via buf:

  ```
  buf generate
  ```
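Putting it together, a typical regeneration pass after editing a message might look like this (commands taken from the steps above and the Python workflow in section 4):

```
./build/sbt connect/compile                          # regenerate Java classes
buf generate                                         # regenerate Python classes
cd python
SPARK_CONNECT_MODE_ENABLED=1 poetry run pytest -vvv  # verify against Spark Connect
```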
5. Pre-commit hooks
Enable the bundled pre-commit hooks to catch formatting and lint issues before
pushing your branch:
```
pipx install pre-commit   # or: pip install pre-commit
pre-commit install
pre-commit run --all-files
```
6. Making and testing changes
- Edit the relevant files.
- Rerun the Scala and/or Python workflows relevant to your changes.
- Check your working tree:

  ```
  git status
  ```

- Stage and commit with a descriptive message:

  ```
  git add <files>
  git commit -m "feat: describe your change"
  ```

- Push your branch and open a pull request:

  ```
  git push origin feature/my-change
  ```

- Keep your branch current by rebasing on the latest upstream changes:

  ```
  git fetch upstream
  git rebase upstream/master
  ```
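If the rebase stops on conflicts, resolve them and continue; a typical sequence looks like this (the paths passed to git add are placeholders):

```
git status                  # list conflicted files
# edit the files to remove conflict markers, then:
git add <resolved-files>
git rebase --continue
git push --force-with-lease origin feature/my-change   # update the PR after a rebase
```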
7. Update documentation
GraphFrames documentation is built with Typelevel Laika. The documentation sources, written in Markdown, live under docs/src.
7.1 Laika Directives
The following custom Laika directives are provided on top of the built-in ones:

- @pydoc(class-name), for example @pydoc(graphframes.GraphFrame) -- references the PySpark API documentation for the class.
- @scaladoc(class-name), for example @scaladoc(org.graphframes.GraphFrame) -- references the Scala API documentation for the class.
- @srcLink(sub-path), for example @srcLink(python/graphframes/tutorials/stackexchange.py) -- links to the source code on GitHub.
The following built-in Laika directives may be useful.
@:image
An example is:
```
@:image(/img/graphframes-internals/graphframes-overview.png) {
  intrinsicWidth = 600
  alt = "An overview of GraphFrames and Apache Spark connection"
  title = "GraphFrames Overview"
}
```
A full list of built-in directives can be found in the Laika documentation.
7.2 Build and preview
To build the documentation and start a local preview server, run ./build/sbt docs/laikaPreview.
8. Quick reference
| Task | Command |
|---|---|
| Compile Scala code | ./build/sbt compile |
| Format Scala code | ./build/sbt scalafmtAll test:scalafmt |
| Run Scala tests | ./build/sbt test |
| Run a specific Scala suite | ./build/sbt "core/testOnly <SuiteName>" |
| Build assembly jar | ./build/sbt core/assembly |
| Install Python dependencies | cd python && poetry install --with dev,tutorials,docs |
| Run Python tests | cd python && poetry run pytest -vvv |
| Run Python formatters | poetry run black graphframes tests |
| Install pre-commit hooks | pre-commit install |
You are now ready to iterate on GraphFrames. Refer back to this guide whenever setting up a new machine or refreshing the local development workflow.