GraphFrames in Jupyter

GraphX

GraphFrames

GraphX is to RDDs as GraphFrames are to DataFrames.

A GraphFrame is always created from a vertex DataFrame (e.g. users) and an edges DataFrame (e.g. relationships between users). The schema of both DataFrames has some mandatory columns. The vertex DataFrame must contain a column named id that stores unique vertex IDs. The edges DataFrame must contain a column named src that stores the source of the edge and a column named dst that stores the destination of the edge. All other columns are optional and can be added depending on one’s needs.

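from graphframes import GraphFrame

# vertices and edges are existing Spark DataFrames with the columns described above
g = GraphFrame(vertices, edges)
# Take a look at the DataFrames
g.vertices.show()
g.edges.show()
# Check the degree (number of incident edges) of each vertex
g.degrees.show()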
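
To make concrete what the degree of a vertex means here, the following is a minimal plain-Python sketch rather than GraphFrames code. The edge list is hypothetical toy data that mirrors the src and dst columns. It counts incoming and outgoing edges separately (GraphFrames exposes these as g.inDegrees and g.outDegrees) and adds them up to get the total degree.

```python
from collections import Counter

# Hypothetical toy edge list: (src, dst) pairs, mirroring the src/dst columns.
edges = [("a", "b"), ("b", "c"), ("c", "b"), ("a", "c")]

# Outgoing edges are counted by source, incoming edges by destination.
out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)

# The total degree counts every incident edge, regardless of direction.
degree = out_degree + in_degree

for vertex in sorted(degree):
    print(vertex, degree[vertex])
```

A vertex that has no edges at all never shows up in such a count, so it has to be looked up in the vertex list instead.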

Directed vs undirected edges

The part on directed and undirected edges

A GraphFrame itself can’t be filtered, but the DataFrames derived from a graph can. Consequently, filter (or any other DataFrame function) can be used on g.vertices and g.edges just as you would use it with any other DataFrame.
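
The idea can be sketched in plain Python, with lists of dicts standing in for the two DataFrames. The rows and the age and relationship fields are hypothetical. With GraphFrames, the same steps would be applied to g.vertices and g.edges, and the results passed to GraphFrame(...) again to obtain a smaller graph.

```python
# Hypothetical toy data: the mandatory id / src / dst fields plus one optional field each.
vertices = [
    {"id": "a", "age": 34},
    {"id": "b", "age": 36},
    {"id": "c", "age": 19},
]
edges = [
    {"src": "a", "dst": "b", "relationship": "friend"},
    {"src": "b", "dst": "c", "relationship": "follow"},
    {"src": "c", "dst": "b", "relationship": "follow"},
]

# Filter the vertex rows, as one would with a filter on g.vertices.
kept_vertices = [v for v in vertices if v["age"] > 30]
kept_ids = {v["id"] for v in kept_vertices}

# Drop edges that would otherwise point at a removed vertex.
kept_edges = [e for e in edges if e["src"] in kept_ids and e["dst"] in kept_ids]

print(kept_vertices)
print(kept_edges)
```

Filtering only the vertices would leave edges that refer to missing IDs, which is why the edge list is filtered as well.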

Connected components of the graph
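
A connected component is a set of vertices that can all reach each other when edge direction is ignored. GraphFrames provides g.connectedComponents() for this. The sketch below is not that implementation, only a small union-find illustration in plain Python; the function name and the toy data are assumptions made for the example.

```python
def connected_components(vertex_ids, edges):
    """Map each vertex id to a representative id of its component (edge direction ignored)."""
    parent = {v: v for v in vertex_ids}

    def find(v):
        # Walk up to the root, halving the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            # Keep the smaller id as representative so the labels are deterministic.
            smaller, larger = sorted((root_src, root_dst))
            parent[larger] = smaller

    return {v: find(v) for v in vertex_ids}


# Hypothetical toy graph: one chain a-b-c, one pair d-e, and an isolated vertex f.
components = connected_components(
    ["a", "b", "c", "d", "e", "f"],
    [("a", "b"), ("b", "c"), ("d", "e")],
)
print(components)
```

Vertices that share a representative belong to the same component; the isolated vertex forms a component of its own.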

Motif finding

Motif finding lets you query a graph for structural patterns.

As an example, we can try to find the mutual friends for any pair of users a and c. For b to count as a mutual friend, b must be friends with both a and c, which means edges must run in both directions between a and b and between b and c (b merely being followed by c, for example, is not enough).

# Parentheses let the chained call span several lines
mutualFriends = (
    g.find("(a)-[]->(b); (b)-[]->(c); (c)-[]->(b); (b)-[]->(a)")
    .dropDuplicates()
)
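
To see what the pattern matches, here is a rough plain-Python equivalent over a hypothetical set of directed edges. It is only an illustration of the pattern's meaning, not how GraphFrames evaluates motifs. As far as I know, a motif does not force differently named vertices to be distinct, so a and c may turn out to be the same user unless that case is filtered out afterwards.

```python
from itertools import product

# Hypothetical directed "follows" edges as (src, dst) pairs.
edges = {
    ("alice", "bob"), ("bob", "alice"),
    ("bob", "carol"), ("carol", "bob"),
    ("carol", "dave"),
}
vertices = {v for edge in edges for v in edge}


def mutual(x, y):
    """True when edges run in both directions between x and y."""
    return (x, y) in edges and (y, x) in edges


# Same pattern as the motif: a <-> b and b <-> c.
matches = sorted(
    (a, b, c)
    for a, b, c in product(vertices, repeat=3)
    if mutual(a, b) and mutual(b, c)
)

# Keep only rows where a and c are different users.
distinct = [(a, b, c) for a, b, c in matches if a != c]
print(distinct)
```

Here bob is the only mutual friend: he is connected in both directions to alice and to carol, while dave is merely followed by carol and therefore never matches.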

TriangleCount and PageRank

Two algorithms that follow naturally from GraphFrames.
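
GraphFrames offers both as methods on the graph, g.triangleCount() and g.pageRank(...). The sketch below shows in plain Python what the two algorithms compute. The function names, parameters and toy edges are assumptions for illustration, and the PageRank variant is the textbook one whose ranks sum to 1, which may be scaled differently from what GraphFrames returns.

```python
from itertools import combinations


def triangle_count(edges):
    """Number of triangles through each vertex, ignoring edge direction."""
    neighbors = {}
    for src, dst in edges:
        if src != dst:
            neighbors.setdefault(src, set()).add(dst)
            neighbors.setdefault(dst, set()).add(src)
    # A triangle through v is a pair of v's neighbors that are themselves connected.
    return {
        v: sum(1 for x, y in combinations(sorted(nbrs), 2) if y in neighbors[x])
        for v, nbrs in neighbors.items()
    }


def pagerank(edges, reset_probability=0.15, max_iter=20):
    """Textbook power-iteration PageRank; the returned ranks sum to 1."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(max_iter):
        # Rank held by vertices without out-edges is spread evenly over all vertices.
        dangling = sum(rank[v] for v in vertices if not out_links[v])
        base = (reset_probability + (1 - reset_probability) * dangling) / n
        new_rank = {v: base for v in vertices}
        for src in vertices:
            for dst in out_links[src]:
                new_rank[dst] += (1 - reset_probability) * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical toy graph: a triangle a-b-c with one extra edge from c to d.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
print(triangle_count(toy_edges))
print(pagerank(toy_edges))
```

Triangle counts say how tightly a vertex's neighborhood is knit, while PageRank scores a vertex by how much rank flows into it along incoming edges.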