Vector search extension
The vector extension provides a native disk-based HNSW vector index for accelerating similarity search over vector embeddings (32-bit and 64-bit float arrays) stored in Kuzu.
The HNSW index is structured with two hierarchical layers. The lower layer includes all vectors, while the upper layer contains a sampled subset of the lower layer.
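To build intuition for this two-layer structure, here is a hedged, purely conceptual Python sketch of how an upper layer could be drawn as a random sample of the lower layer. The sampling fraction corresponds to the `pu` parameter described later on this page; the actual index construction in Kuzu is native and disk-based, so none of these names are real Kuzu APIs.

```python
import random

def sample_upper_layer(node_ids, pu=0.05, seed=0):
    # The lower layer holds every vector; the upper layer keeps a pu-fraction
    # sample that serves as long-range entry points for the search.
    rng = random.Random(seed)
    k = max(1, int(len(node_ids) * pu))
    return set(rng.sample(node_ids, k))

lower_layer = list(range(1000))  # all vectors live in the lower layer
upper_layer = sample_upper_layer(lower_layer, pu=0.05)
print(len(upper_layer))  # 50 of 1000 nodes end up in the upper layer
```

A search first navigates the sparse upper layer to find a good entry point, then descends into the dense lower layer for the fine-grained neighbor search.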
This extension provides the following functions:
CREATE_VECTOR_INDEX: Create a vector index
QUERY_VECTOR_INDEX: Query a vector index
DROP_VECTOR_INDEX: Drop a vector index
Usage
INSTALL VECTOR;
LOAD VECTOR;
Example dataset
Below is an example demonstrating two ways to create a dataset with vector embeddings. The first is to use an external library, such as sentence_transformers in Python, to create the embeddings. Alternatively, you can use Kuzu's llm extension to create the embeddings directly in Cypher.
# pip install sentence-transformers
import kuzu

from sentence_transformers import SentenceTransformer

db = kuzu.Database("example.kuzu")
conn = kuzu.Connection(db)

model = SentenceTransformer("all-MiniLM-L6-v2")

conn.execute("INSTALL vector; LOAD vector;")

conn.execute("CREATE NODE TABLE Book(id SERIAL PRIMARY KEY, title STRING, title_embedding FLOAT[384], published_year INT64);")
conn.execute("CREATE NODE TABLE Publisher(name STRING PRIMARY KEY);")
conn.execute("CREATE REL TABLE PublishedBy(FROM Book TO Publisher);")

titles = [
    "The Quantum World",
    "Chronicles of the Universe",
    "Learning Machines",
    "Echoes of the Past",
    "The Dragon's Call",
]
publishers = [
    "Harvard University Press",
    "Independent Publisher",
    "Pearson",
    "McGraw-Hill Ryerson",
    "O'Reilly",
]
published_years = [2004, 2022, 2019, 2010, 2015]

for title, published_year in zip(titles, published_years):
    embeddings = model.encode(title).tolist()
    conn.execute(
        """
        CREATE (b:Book {
            title: $title,
            title_embedding: $embeddings,
            published_year: $year
        });
        """,
        {"title": title, "year": published_year, "embeddings": embeddings},
    )
    print(f"Inserted book: {title}")

for publisher in publishers:
    conn.execute(
        """CREATE (p:Publisher {name: $publisher});""",
        {"publisher": publisher},
    )
    print(f"Inserted publisher: {publisher}")

for title, publisher in zip(titles, publishers):
    conn.execute(
        """
        MATCH (b:Book {title: $title})
        MATCH (p:Publisher {name: $publisher})
        CREATE (b)-[:PublishedBy]->(p);
        """,
        {"title": title, "publisher": publisher},
    )
    print(f"Created relationship between {title} and {publisher}")
import kuzu
import os

db = kuzu.Database("example.kuzu")
conn = kuzu.Connection(db)

conn.execute("INSTALL llm; LOAD llm;")
os.environ["OPENAI_API_KEY"] = "sk-proj-key"  # Replace with your own OpenAI API key

conn.execute("INSTALL vector; LOAD vector;")

conn.execute("CREATE NODE TABLE Book(id SERIAL PRIMARY KEY, title STRING, title_embedding FLOAT[384], published_year INT64);")
conn.execute("CREATE NODE TABLE Publisher(name STRING PRIMARY KEY);")
conn.execute("CREATE REL TABLE PublishedBy(FROM Book TO Publisher);")

titles = [
    "The Quantum World",
    "Chronicles of the Universe",
    "Learning Machines",
    "Echoes of the Past",
    "The Dragon's Call",
]
publishers = [
    "Harvard University Press",
    "Independent Publisher",
    "Pearson",
    "McGraw-Hill Ryerson",
    "O'Reilly",
]
published_years = [2004, 2022, 2019, 2010, 2015]

for title, published_year in zip(titles, published_years):
    conn.execute(
        """
        CREATE (b:Book {
            title: $title,
            title_embedding: create_embedding(
                $title,
                'open-ai',
                'text-embedding-3-small',
                384
            ),
            published_year: $year
        });
        """,
        {"title": title, "year": published_year},
    )
    print(f"Inserted book: {title}")

for publisher in publishers:
    conn.execute(
        """CREATE (p:Publisher {name: $publisher});""",
        {"publisher": publisher},
    )
    print(f"Inserted publisher: {publisher}")

for title, publisher in zip(titles, publishers):
    conn.execute(
        """
        MATCH (b:Book {title: $title})
        MATCH (p:Publisher {name: $publisher})
        CREATE (b)-[:PublishedBy]->(p);
        """,
        {"title": title, "publisher": publisher},
    )
    print(f"Created relationship between {title} and {publisher}")
The embeddings are generated on the title property of each Book node and ingested into the Kuzu database.
Creating a vector index
Create a vector index as follows:
CALL CREATE_VECTOR_INDEX(
    <TABLE_NAME>,
    <INDEX_NAME>,
    <PROPERTY_NAME>,
    mu := 30,
    ml := 60,
    pu := 0.05,
    metric := 'cosine',
    efc := 200,
    cache_embeddings := true
);
Required arguments:
TABLE_NAME: The node table containing the property on which the index is to be created.
INDEX_NAME: The name of the vector index.
PROPERTY_NAME: The name of the vector property on which the index is to be created. The property must be a LIST or ARRAY of type FLOAT or DOUBLE.
Optional arguments to tune the index:
mu
- Max degree of nodes in the upper graph. It should be smaller than ml.
- A higher value leads to a more accurate index, but increases the index size and construction time.
- Default: 30

ml
- Max degree of nodes in the lower graph. It should be larger than mu.
- A higher value leads to a more accurate index, but increases the index size and construction time.
- Default: 60

pu
- Percentage of nodes sampled into the upper graph.
- Supported values: [0.0, 1.0]
- Default: 0.05

metric
- Metric (distance computation) function.
- Supported values: cosine, l2, l2sq, dotproduct
- Default: cosine

efc
- The number of candidate vertices to consider during the construction of the index.
- A higher value will result in a more accurate index, but will also increase the time it takes to build the index.
- Default: 200

cache_embeddings
- Determines whether the embeddings column should be fully cached in memory during index construction.
- This will decrease the time needed to construct the index but will increase memory usage. We recommend keeping this value set to true unless you are in a memory-constrained environment.
- Default: true
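To make the metric options concrete, here is a hedged Python sketch of the standard definitions of the cosine, l2, and l2sq distance functions. These are the conventional formulas; Kuzu's native implementations are assumed to follow the same conventions, and the dotproduct metric is omitted because conventions for turning a dot product into a distance vary.

```python
import math

def cosine_distance(a, b):
    # 1 minus the cosine similarity of a and b.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def l2_distance(a, b):
    # Euclidean distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l2sq_distance(a, b):
    # Squared Euclidean distance: same nearest-neighbor ranking as l2,
    # but avoids the square root.
    return sum((x - y) ** 2 for x, y in zip(a, b))

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 1.0
print(l2_distance([0.0, 0.0], [3.0, 4.0]))      # -> 5.0
```

Note that cosine distance ignores vector magnitude, which is why it is a common default for text embeddings; l2 and l2sq are magnitude-sensitive.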
Example
You can create a vector index on the title_embedding column of the Book table as follows:
CALL CREATE_VECTOR_INDEX(
    'Book',
    'book_title_index',
    'title_embedding',
    metric := 'l2'
);
Query the vector index
To perform similarity search using the vector index, use the QUERY_VECTOR_INDEX
function:
CALL QUERY_VECTOR_INDEX(
    <TABLE_NAME>,
    <INDEX_NAME>,
    <QUERY_VECTOR>,
    <K>,
    efs := 200
)
RETURN node.id, distance;
Required arguments:
TABLE_NAME: The node table on which the index was created.
- Type: STRING

INDEX_NAME: The name of the vector index.
- Type: STRING

QUERY_VECTOR: The vector to search for.
- Type: LIST[FLOAT]

K: The number of nearest neighbors to return.
- Type: INT64
Optional arguments to tune the search behavior:
efs: The number of candidate vertices to consider during search. A higher value will result in a more accurate search, but will also increase the time it takes to search.
- Type: INT64
- Default: 200
Returns:
node: The node object.
distance: The distance between the query vector and the node's vector.
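For intuition, QUERY_VECTOR_INDEX returns approximately what an exact brute-force nearest-neighbor scan would return, without visiting every vector. The following hedged sketch shows the exact version being approximated, using the squared Euclidean (l2sq) metric; the function names here are illustrative, not Kuzu APIs.

```python
def l2sq(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_knn(query, vectors, k):
    # vectors: mapping of node id -> embedding.
    # Returns the k closest (node_id, distance) pairs, nearest first.
    scored = [(node_id, l2sq(query, vec)) for node_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

vectors = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
print(exact_knn([1.0, 0.1], vectors, 2))  # "a" is nearest, then "b"
```

The HNSW index trades a small amount of recall for a large speedup over this full scan, with efs controlling how much of the graph the search explores.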
Example
Let’s run some example search queries on our newly created vector index.
import kuzu
# Initialize the database
db = kuzu.Database("example.kuzu")
conn = kuzu.Connection(db)

# Install and load vector extension once again
conn.execute("INSTALL VECTOR;")
conn.execute("LOAD VECTOR;")

from sentence_transformers import SentenceTransformer

# Load a pre-trained embedding generation model
# https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
model = SentenceTransformer("all-MiniLM-L6-v2")

query_vector = model.encode("quantum machine learning").tolist()
result = conn.execute(
    """
    CALL QUERY_VECTOR_INDEX(
        'Book',
        'book_title_index',
        $query_vector,
        $limit,
        efs := 500
    )
    RETURN node.title
    ORDER BY distance;
    """,
    {"query_vector": query_vector, "limit": 2},
)
print(result.get_as_pl())
import kuzu
import os

# Initialize the database
db = kuzu.Database("example.kuzu")
conn = kuzu.Connection(db)

# Install and load vector extension once again
conn.execute("INSTALL VECTOR;")
conn.execute("LOAD VECTOR;")

conn.execute("INSTALL llm; LOAD llm;")
os.environ["OPENAI_API_KEY"] = "sk-proj-key"  # Replace with your own OpenAI API key

result = conn.execute(
    """
    CALL QUERY_VECTOR_INDEX(
        'Book',
        'book_title_index',
        create_embedding(
            'quantum machine learning',
            'open-ai',
            'text-embedding-3-small',
            384
        ),
        2
    )
    RETURN node.title
    ORDER BY distance;
    """
)
print(result.get_as_pl())
The above query asks for the 2 nearest neighbors of the query vector “quantum machine learning”. The result is a list of book titles that are most similar to this concept.
┌───────────────────┐
│ node.title        │
│ ---               │
│ str               │
╞═══════════════════╡
│ The Quantum World │
│ Learning Machines │
└───────────────────┘
Next, let’s use the vector index to find an entry point to the graph, following which we do a graph traversal to find the names of publishers of the books.
result = conn.execute(
    """
    CALL QUERY_VECTOR_INDEX('Book', 'book_title_index', $query_vector, 2)
    WITH node AS n, distance
    MATCH (n)-[:PublishedBy]->(p:Publisher)
    RETURN p.name AS publisher, n.title AS book, distance
    ORDER BY distance
    LIMIT 5;
    """,
    {"query_vector": query_vector},
)
print(result.get_as_pl())
The above query asks for the 2 nearest neighbors of the query vector "quantum machine learning". Then, it uses the node and distance variables to further query the graph and return rows sorted by distance.
┌──────────────────────────┬───────────────────┬──────────┐
│ publisher                ┆ book              ┆ distance │
│ ---                      ┆ ---               ┆ ---      │
│ str                      ┆ str               ┆ f64      │
╞══════════════════════════╪═══════════════════╪══════════╡
│ Harvard University Press ┆ The Quantum World ┆ 0.311872 │
│ Pearson                  ┆ Learning Machines ┆ 0.415366 │
└──────────────────────────┴───────────────────┴──────────┘
Using vector search in combination with graph traversal in this manner can be a powerful technique to find semantically related entities in a graph.
Index management
Drop an index
To remove a vector index, use the DROP_VECTOR_INDEX
function:
CALL DROP_VECTOR_INDEX('Book', 'book_title_index');
List all indexes
View all created indexes in the database using SHOW_INDEXES
:
CALL SHOW_INDEXES() RETURN *;
Example output:
┌────────────┬──────────────────┬────────────┬───────────────────┬──────────────────┬───────────────────────────────┐
│ table name │ index name       │ index type │ property names    │ extension loaded │ index definition              │
├────────────┼──────────────────┼────────────┼───────────────────┼──────────────────┼───────────────────────────────┤
│ Book       │ book_title_index │ HNSW       │ [title_embedding] │ True             │ CALL CREATE_VECTOR_INDEX(...) │
└────────────┴──────────────────┴────────────┴───────────────────┴──────────────────┴───────────────────────────────┘
Filtered vector search
Kuzu allows you to combine vector search with filter predicates by using projected graphs.
For example, we can search for books similar to “quantum world”, but only those that were published after 2010.
# Pass in an existing connection
# ...

# Step 1: Create a projected graph that filters books by publication year
conn.execute(
    """
    CALL PROJECT_GRAPH(
        'filtered_book',
        {'Book': 'n.published_year > 2010'},
        []
    );
    """
)

# Step 2: Perform vector similarity search on the filtered subset
query_vector = model.encode("quantum world").tolist()
result = conn.execute(
    """
    CALL QUERY_VECTOR_INDEX(
        'filtered_book',
        'book_title_index',
        $query_vector,
        2
    )
    WITH node AS n, distance AS dist
    MATCH (n)-[:PublishedBy]->(p:Publisher)
    RETURN n.title AS book, n.published_year AS year, p.name AS publisher
    ORDER BY dist;
    """,
    {"query_vector": query_vector},
)
print(result.get_as_pl())
export OPENAI_API_KEY=sk-proj-key # Replace with your own OpenAI API key
INSTALL llm;
LOAD llm;
// Step 1: Create a projected graph that filters books by publication year
CALL PROJECT_GRAPH(
    'filtered_book',                      // Name of the projected graph
    {'Book': 'n.published_year > 2010'},  // Projected node table Book with a filter on published_year. `n` is a placeholder here to reference the node table.
    []                                    // No relationship tables are projected.
);

// Step 2: Perform vector similarity search on the filtered subset
// In the `QUERY_VECTOR_INDEX` function, we can pass the name of the projected
// graph as the `table_name` parameter.
CALL QUERY_VECTOR_INDEX(
    'filtered_book',     // Name of the projected graph
    'book_title_index',  // Name of the index
    create_embedding('quantum world', 'open-ai', 'text-embedding-3-small', 384),
    2
)
WITH node AS n, distance AS dist
MATCH (n)-[:PublishedBy]->(p:Publisher)
RETURN n.title AS book, n.published_year AS year, p.name AS publisher
ORDER BY dist;
The result shows the two most similar books to the query “quantum world”. Although we have a book named “The Quantum World” in the original dataset, it does not appear in the results because it was published before 2010.
shape: (2, 3)
┌────────────────────────────┬──────┬───────────────────────┐
│ book                       ┆ year ┆ publisher             │
│ ---                        ┆ ---  ┆ ---                   │
│ str                        ┆ i64  ┆ str                   │
╞════════════════════════════╪══════╪═══════════════════════╡
│ Chronicles of the Universe ┆ 2022 ┆ Independent Publisher │
│ Learning Machines          ┆ 2019 ┆ Pearson               │
└────────────────────────────┴──────┴───────────────────────┘
Filtered vector search with arbitrary Cypher queries
To run filtered search with arbitrary Cypher queries, you can create a projected graph using project_graph_cypher.
CALL PROJECT_GRAPH_CYPHER(
    <GRAPH_NAME>,
    <CYPHER_STATEMENT>
);
GRAPH_NAME: Name of the projected graph
- Type: STRING

CYPHER_STATEMENT: A Cypher statement that returns a node variable
- Type: STRING
The Cypher statement can contain arbitrary pattern matching, but the RETURN clause must contain a single node variable whose label matches the table on which the vector index is built.
The following example creates a projected graph pearson_book that contains books published by the publisher Pearson.
CALL PROJECT_GRAPH_CYPHER(
    'pearson_book',  // Name of the projected graph
    'MATCH (b:Book)-[:PublishedBy]->(p:Publisher {name: "Pearson"}) RETURN b'
);
You can then replace filtered_book with pearson_book in the above examples.