Run graph algorithms
One of the overarching goals of Kùzu is to function as the go-to graph database for data science
use cases. NetworkX is a popular library in Python for graph algorithms and data science. In this
section, we demonstrate how easily Kùzu exports subgraphs to the NetworkX format using the get_as_networkx() function in the Python API. We also demonstrate the following two capabilities:
- Graph Visualization: We visualize subgraphs of interest via Kùzu Explorer.
- PageRank: We compute PageRank on an extracted subgraph, store these values back in Kùzu’s node tables and query them.
The dataset we will use for this exercise is the MovieLens dataset, available here. The small version of the dataset is used, which contains 610 user nodes, 9724 movie nodes, 100863 rates edges, and 3684 tags edges. The schema of the dataset is shown below.
You can download the dataset locally via wget.
wget https://kuzudb.com/data/movie-lens/movies.csv
wget https://kuzudb.com/data/movie-lens/users.csv
wget https://kuzudb.com/data/movie-lens/ratings.csv
wget https://kuzudb.com/data/movie-lens/tags.csv
Place the CSV files in a directory named movie_data in the same directory in which you want the database to be stored.
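If wget is not available, the files can also be fetched from Python. The following is a minimal sketch using only the standard library. The helper names dataset_urls and download_dataset are hypothetical (they are not part of Kùzu), and the sketch assumes the four URLs listed above remain reachable. It saves the files into the movie_data directory.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and file names taken from the wget commands above
BASE_URL = "https://kuzudb.com/data/movie-lens"
FILES = ["movies.csv", "users.csv", "ratings.csv", "tags.csv"]


def dataset_urls():
    """Return the full download URL for each CSV file."""
    return [f"{BASE_URL}/{name}" for name in FILES]


def download_dataset(target_dir="movie_data"):
    """Download each CSV into target_dir, skipping files that already exist."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in dataset_urls():
        destination = target / url.rsplit("/", 1)[-1]
        if not destination.exists():
            urlretrieve(url, destination)
        paths.append(destination)
    return paths
```

Calling download_dataset() once from the directory that will hold the database should leave the four CSV files in movie_data.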
Insert data to Kùzu
The data is copied to a Kùzu database via the Python API as follows:
import shutil

import kuzu

db_path = './ml-small_db'
# Remove any existing database so the example starts from a clean state
shutil.rmtree(db_path, ignore_errors=True)
def load_data(connection):
    # Create the node and relationship tables
    connection.execute('CREATE NODE TABLE Movie (movieId INT64, year INT64, title STRING, genres STRING, PRIMARY KEY (movieId))')
    connection.execute('CREATE NODE TABLE User (userId INT64, PRIMARY KEY (userId))')
    connection.execute('CREATE REL TABLE Rating (FROM User TO Movie, rating DOUBLE, timestamp INT64)')
    connection.execute('CREATE REL TABLE Tags (FROM User TO Movie, tag STRING, timestamp INT64)')

    # Load the CSV files; the paths assume the files are in the movie_data directory
    connection.execute('COPY Movie FROM "./movie_data/movies.csv" (HEADER=TRUE)')
    connection.execute('COPY User FROM "./movie_data/users.csv" (HEADER=TRUE)')
    connection.execute('COPY Rating FROM "./movie_data/ratings.csv" (HEADER=TRUE)')
    connection.execute('COPY Tags FROM "./movie_data/tags.csv" (HEADER=TRUE)')
db = kuzu.Database(db_path)
conn = kuzu.Connection(db)
load_data(conn)
Visualize subgraphs in Kùzu Explorer
You can visualize the data in Kùzu Explorer as shown in the previous section. An example is shown below.
// Return the first two users, their movies and their ratings
MATCH (u:User)-[r:Rating]->(m:Movie)
WHERE u.userId IN [1, 2]
RETURN u, r, m
LIMIT 100;
Export subgraph to NetworkX
You can extract only the subgraph between users and movies (ignoring tags) and convert it to a NetworkX graph G. This assumes that the networkx package is installed via pip.
# pip install networkx
res = conn.execute('MATCH (u:User)-[r:Rating]->(m:Movie) RETURN u, r, m')
G = res.get_as_networkx(directed=False)
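We export an undirected graph because edge direction is not meaningful for this PageRank computation: NetworkX treats each undirected edge as a pair of directed edges, so rank flows both from users to movies and from movies to users.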
Compute PageRank
We can compute the PageRank of the subgraph G using NetworkX’s pagerank function.
import networkx as nx

pageranks = nx.pagerank(G)
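To build intuition for what this call computes, here is a minimal pure-Python sketch of PageRank by power iteration on a tiny, made-up user-movie graph. It is an illustration only, not NetworkX's implementation: the function name, the toy edges, and the stopping rule are assumptions made for this example. The damping factor of 0.85 matches the default alpha in nx.pagerank.

```python
def pagerank(edges, damping=0.85, max_iter=200, tol=1e-12):
    """Power-iteration PageRank on an undirected edge list (illustrative sketch)."""
    # Treat every undirected edge as two directed edges
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    nodes = sorted(neighbors)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(max_iter):
        # Every node receives the same teleportation share
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        # Each node splits its damped rank evenly among its neighbors
        for node in nodes:
            share = damping * rank[node] / len(neighbors[node])
            for neighbor in neighbors[node]:
                new_rank[neighbor] += share
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical toy data: three users, two movies
toy_edges = [
    ("User_1", "Movie_10"),
    ("User_1", "Movie_20"),
    ("User_2", "Movie_10"),
    ("User_3", "Movie_10"),
]
toy_ranks = pagerank(toy_edges)
```

Because every node in a graph built from an edge list has at least one neighbor, the sketch does not need special handling for dangling nodes. The resulting scores sum to 1, and the movie rated by the most users receives the highest score.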
The movie nodes’ PageRanks along with their IDs can then be put into a Pandas DataFrame as follows:
import pandas as pd

pagerank_df = pd.DataFrame.from_dict(pageranks, orient="index", columns=["pagerank"])
# Keep only the movie nodes and recover their integer IDs from the node labels
movie_df = pagerank_df[pagerank_df.index.str.contains("Movie")]
movie_df.index = movie_df.index.str.replace("Movie_", "").astype(int)
movie_df = movie_df.reset_index(names=["id"])
print(f"Calculated pageranks for {len(movie_df)} nodes\n")
print(movie_df.sort_values(by="pagerank", ascending=False).head())
Calculated pageranks for 9724 nodes
       id  pagerank
20    356  0.001155
232   318  0.001099
16    296  0.001075
166  2571  0.001006
34    593  0.000987
We can store the PageRanks for the user nodes in a Pandas DataFrame the same way:
user_df = pagerank_df[pagerank_df.index.str.contains("User")]
user_df.index = user_df.index.str.replace("User_", "").astype(int)
user_df = user_df.reset_index(names=["id"])
user_df.sort_values(by="pagerank", ascending=False).head()
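The DataFrame code above relies on the node keys having the form Movie_<id> and User_<id>. As a rough illustration of that convention, the sketch below splits such keys with plain Python. The function name split_by_label and the sample scores are made up for this example and are not part of Kùzu or NetworkX.

```python
def split_by_label(pageranks):
    """Group {"<Label>_<id>": score} entries into per-label rows sorted by score."""
    tables = {}
    for node, score in pageranks.items():
        # "Movie_356" -> label "Movie", raw_id "356"
        label, _, raw_id = node.partition("_")
        tables.setdefault(label, []).append({"id": int(raw_id), "pagerank": score})
    # Highest PageRank first within each label
    for rows in tables.values():
        rows.sort(key=lambda row: row["pagerank"], reverse=True)
    return tables


# Hypothetical scores, used only to show the shape of the result
sample = {"Movie_10": 0.4, "Movie_20": 0.1, "User_1": 0.3, "User_2": 0.2}
tables = split_by_label(sample)
```

Here tables["Movie"] and tables["User"] play the same role as movie_df and user_df: one row per node with an integer id and its pagerank.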
Write PageRank values back to Kùzu
To write the values back to Kùzu, first update the node table schemas to include a new property pagerank.
try:
    # Alter original node table schemas to add pageranks
    conn.execute("ALTER TABLE Movie ADD pagerank DOUBLE DEFAULT 0.0;")
    conn.execute("ALTER TABLE User ADD pagerank DOUBLE DEFAULT 0.0;")
except RuntimeError:
    # If the column already exists, do nothing
    pass
An important feature of Kùzu is its ability to natively scan Pandas DataFrames in a zero-copy manner. This allows for efficient data transfer between your data in Python and Kùzu. The following code snippet shows how this is done for the movie nodes.
# Copy pagerank values to movie nodes
x = conn.execute(
    """
    LOAD FROM movie_df
    MERGE (m:Movie {movieId: id})
    ON MATCH SET m.pagerank = pagerank
    RETURN m.movieId AS movieId, m.pagerank AS pagerank;
    """
)
   movieId  pagerank
0        1  0.000776
1        3  0.000200
2        6  0.000368
3       47  0.000707
4       50  0.000724
The same can be done for the user nodes.
# Copy user pagerank values to user nodes
y = conn.execute(
    """
    LOAD FROM user_df
    MERGE (u:User {userId: id})
    ON MATCH SET u.pagerank = pagerank
    RETURN u.userId AS userId, u.pagerank AS pagerank;
    """
)
   userId  pagerank
0       1  0.000867
1       2  0.000134
2       3  0.000254
3       4  0.000929
4       5  0.000151
Query PageRank values in Kùzu
You can run a query to print the top 5 movies by PageRank to check that the upload worked:
res1 = conn.execute(
    """
    MATCH (m:Movie)
    RETURN m.title, m.pagerank
    ORDER BY m.pagerank DESC
    LIMIT 5
    """
)
print(res1.get_as_df())
                            m.title  m.pagerank
0               Forrest Gump (1994)    0.001155
1  Shawshank Redemption, The (1994)    0.001099
2               Pulp Fiction (1994)    0.001075
3                Matrix, The (1999)    0.001006
4  Silence of the Lambs, The (1991)    0.000987
And similarly, for the user nodes:
res2 = conn.execute(
    """
    MATCH (u:User)
    RETURN u.userId, u.pagerank
    ORDER BY u.pagerank DESC
    LIMIT 5
    """
)
print(res2.get_as_df())
   u.userId  u.pagerank
0       599    0.016401
1       414    0.014711
2       474    0.014380
3       448    0.012942
4       610    0.008492
Further work
You’ve now seen how to use NetworkX to run algorithms on a Kùzu graph, and move data back and forth between Kùzu and Python.
You can perform many other computations in NetworkX and store their results in Kùzu. See the tutorial notebook on Google Colab to try it for yourself!