graphframes package¶

Subpackages¶

Contents¶

class graphframes.GraphFrame(v: DataFrame, e: DataFrame)[source]¶

Represents a graph with vertices and edges stored as DataFrames.

Parameters:

v – DataFrame holding vertex information. Must contain a column named “id” that stores unique vertex IDs.
e – DataFrame holding edge information. Must contain two columns “src” and “dst” storing source vertex IDs and destination vertex IDs of edges, respectively.

>>> localVertices = [(1,"A"), (2,"B"), (3, "C")]
>>> localEdges = [(1,2,"love"), (2,1,"hate"), (2,3,"follow")]
>>> v = spark.createDataFrame(localVertices, ["id", "name"])
>>> e = spark.createDataFrame(localEdges, ["src", "dst", "action"])
>>> g = GraphFrame(v, e)

Aggregates messages from the neighbours.

When specifying the messages and aggregation function, the user may reference columns using the static methods in graphframes.lib.AggregateMessages.

See Scala documentation for more details.

Parameters:

aggCol – the requested aggregation output either as pyspark.sql.Column or SQL expression string
sendToSrc – message sent to the source vertex of each triplet either as pyspark.sql.Column or SQL expression string (default: None)
sendToDst – message sent to the destination vertex of each triplet either as pyspark.sql.Column or SQL expression string (default: None)

Returns:

DataFrame with columns for the vertex ID and the resulting aggregated message

bfs(fromExpr: str, toExpr: str, edgeFilter: str | None = None, maxPathLength: int = 10) → DataFrame[source]¶

Breadth-first search (BFS).

See Scala documentation for more details.

Returns:: DataFrame with one Row for each shortest path between matching vertices.

cache() → GraphFrame[source]¶: Persist the dataframe representation of vertices and edges of the graph with the default storage level.

connectedComponents(algorithm: str = 'graphframes', checkpointInterval: int = 2, broadcastThreshold: int = 1000000, useLabelsAsComponents: bool = False) → DataFrame[source]¶

Computes the connected components of the graph.

See Scala documentation for more details.

Parameters:

algorithm – connected components algorithm to use (default: “graphframes”) Supported algorithms are “graphframes” and “graphx”.
checkpointInterval – checkpoint interval in terms of number of iterations (default: 2)
broadcastThreshold – broadcast threshold in propagating component assignments (default: 1000000)
useLabelsAsComponents – if True, uses the vertex labels as components, otherwise will use longs

Returns:

DataFrame with new vertices column “component”

property degrees: DataFrame¶

The degree of each vertex in the graph, returned as a DataFrame with two columns:

“id”: the ID of the vertex
‘degree’ (integer) the degree of the vertex

Note that vertices with 0 edges are not returned in the result.

Returns:: DataFrame with new vertices column “degree”

dropIsolatedVertices() → GraphFrame[source]¶

Drops isolated vertices, vertices are not contained in any edges.

Returns:: GraphFrame with filtered vertices.

property edges: DataFrame¶: DataFrame holding edge information, with unique columns “src” and “dst” storing source vertex IDs and destination vertex IDs of edges, respectively.

filterEdges(condition: str | Column) → GraphFrame[source]¶

Filters the edges based on expression, keep all vertices.

Parameters:: condition – String or Column describing the condition expression for filtering.
Returns:: GraphFrame with filtered edges.

filterVertices(condition: str | Column) → GraphFrame[source]¶

Filters the vertices based on expression, remove edges containing any dropped vertices.

Parameters:: condition – String or Column describing the condition expression for filtering.
Returns:: GraphFrame with filtered vertices and edges.

find(pattern: str) → DataFrame[source]¶

Motif finding.

See Scala documentation for more details.

Parameters:: pattern – String describing the motif to search for.
Returns:: DataFrame with one Row for each instance of the motif found

property inDegrees: DataFrame¶

The in-degree of each vertex in the graph, returned as a DataFame with two columns:

“id”: the ID of the vertex
“inDegree” (int) storing the in-degree of the vertex

Note that vertices with 0 in-edges are not returned in the result.

Returns:: DataFrame with new vertices column “inDegree”

labelPropagation(maxIter: int) → DataFrame[source]¶

Runs static label propagation for detecting communities in networks.

See Scala documentation for more details.

Parameters:: maxIter – the number of iterations to be performed
Returns:: DataFrame with new vertices column “label”

property outDegrees: DataFrame¶

The out-degree of each vertex in the graph, returned as a DataFrame with two columns:

“id”: the ID of the vertex
“outDegree” (integer) storing the out-degree of the vertex

Note that vertices with 0 out-edges are not returned in the result.

Returns:: DataFrame with new vertices column “outDegree”

pageRank(resetProbability: float = 0.15, sourceId: Any | None = None, maxIter: int | None = None, tol: float | None = None) → GraphFrame[source]¶

Runs the PageRank algorithm on the graph. Note: Exactly one of fixed_num_iter or tolerance must be set.

See Scala documentation for more details.

Parameters:

resetProbability – Probability of resetting to a random vertex.
sourceId – (optional) the source vertex for a personalized PageRank.
maxIter – If set, the algorithm is run for a fixed number of iterations. This may not be set if the tol parameter is set.
tol – If set, the algorithm is run until the given tolerance. This may not be set if the numIter parameter is set.

Returns:

GraphFrame with new vertices column “pagerank” and new edges column “weight”

parallelPersonalizedPageRank(resetProbability: float = 0.15, sourceIds: list[Any] | None = None, maxIter: int | None = None) → GraphFrame[source]¶

Run the personalized PageRank algorithm on the graph, from the provided list of sources in parallel for a fixed number of iterations.

See Scala documentation for more details.

Parameters:

resetProbability – Probability of resetting to a random vertex
sourceIds – the source vertices for a personalized PageRank
maxIter – the fixed number of iterations this algorithm runs

Returns:

GraphFrame with new vertices column “pageranks” and new edges column “weight”

persist(storageLevel: StorageLevel = StorageLevel(False, True, False, False, 1)) → GraphFrame[source]¶: Persist the dataframe representation of vertices and edges of the graph with the given storage level.

powerIterationClustering(k: int, maxIter: int, weightCol: str | None = None) → DataFrame[source]¶

Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.

Parameters:

k – the numbers of clusters to create
maxIter – param for maximum number of iterations (>= 0)
weightCol – optional name of weight column, 1.0 is used if not provided

Returns:

DataFrame with new column “cluster”

property pregel: Pregel¶

Get the graphframes.lib.Pregel object for running pregel.

See graphframes.lib.Pregel for more details.

shortestPaths(landmarks: list[Any]) → DataFrame[source]¶

Runs the shortest path algorithm from a set of landmark vertices in the graph.

See Scala documentation for more details.

Parameters:: landmarks – a set of one or more landmarks
Returns:: DataFrame with new vertices column “distances”

stronglyConnectedComponents(maxIter: int) → DataFrame[source]¶

Runs the strongly connected components algorithm on this graph.

See Scala documentation for more details.

Parameters:: maxIter – the number of iterations to run
Returns:: DataFrame with new vertex column “component”

svdPlusPlus(rank: int = 10, maxIter: int = 2, minValue: float = 0.0, maxValue: float = 5.0, gamma1: float = 0.007, gamma2: float = 0.007, gamma6: float = 0.005, gamma7: float = 0.015) → tuple[DataFrame, float][source]¶

Runs the SVD++ algorithm.

See Scala documentation for more details.

Returns:: Tuple of DataFrame with new vertex columns storing learned model, and loss value

triangleCount() → DataFrame[source]¶

Counts the number of triangles passing through each vertex in this graph.

See Scala documentation for more details.

Returns:: DataFrame with new vertex column “count”

property triplets: DataFrame¶

The triplets (source vertex)-[edge]->(destination vertex) for all edges in the graph.

Returned as a DataFrame with three columns:

“src”: source vertex with schema matching ‘vertices’
“edge”: edge with schema matching ‘edges’
‘dst’: destination vertex with schema matching ‘vertices’

Returns:: DataFrame with columns ‘src’, ‘edge’, and ‘dst’

unpersist(blocking: bool = False) → GraphFrame[source]¶: Mark the dataframe representation of vertices and edges of the graph as non-persistent, and remove all blocks for it from memory and disk.

property vertices: DataFrame¶: DataFrame holding vertex information, with unique column “id” for vertex IDs.

graphframes package¶

Subpackages¶

Contents¶

Table of Contents

Previous topic

Next topic

This Page