
Vector search

The VECTOR extension provides an on-disk HNSW-based vector index for accelerating similarity search over float array columns in tables.

The vector extension provides the following functions:

| Function | Description |
| --- | --- |
| CREATE_VECTOR_INDEX | Create the index |
| QUERY_VECTOR_INDEX | Query the index |
| DROP_VECTOR_INDEX | Drop the index |

Installation

To get started with the vector extension, first install and load it, following the Using Extensions in Kuzu instructions.

INSTALL VECTOR;
LOAD VECTOR;

Basic Usage

Below is an example demonstrating how to create and use a vector index on a Book table. Because we’ll be working with natural language texts that need to be translated into vector embeddings, we’ll use the Python client to run our queries. In principle, you can use any client code that returns an array of floats (vector embeddings) to run the queries below.

create_embeddings.py
# pip install sentence-transformers
import kuzu
from sentence_transformers import SentenceTransformer
DB_NAME = "ex_kuzu_db"
# Load a pre-trained embedding generation model
# https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
model = SentenceTransformer("all-MiniLM-L6-v2")
# Initialize the database
db = kuzu.Database(DB_NAME)
conn = kuzu.Connection(db)
# Install and load vector extension
conn.execute("INSTALL vector; LOAD vector;")
# Create tables
conn.execute("CREATE NODE TABLE Book(id SERIAL PRIMARY KEY, title STRING, title_embedding FLOAT[384], published_year INT64);")
conn.execute("CREATE NODE TABLE Publisher(name STRING PRIMARY KEY);")
conn.execute("CREATE REL TABLE PublishedBy(FROM Book TO Publisher);")
# Sample data
titles = [
    "The Quantum World",
    "Chronicles of the Universe",
    "Learning Machines",
    "Echoes of the Past",
    "The Dragon's Call"
]
publishers = ["Harvard University Press", "Independent Publisher", "Pearson", "McGraw-Hill Ryerson", "O'Reilly"]
published_years = [2004, 2022, 2019, 2010, 2015]
# Insert sample data - Books with embeddings
for title, published_year in zip(titles, published_years):
    # Convert title to a 384-dimensional embedding vector
    embeddings = model.encode(title).tolist()
    conn.execute(
        """CREATE (b:Book {title: $title, title_embedding: $embeddings, published_year: $year});""",
        {"title": title, "embeddings": embeddings, "year": published_year}
    )
    print(f"Inserted book: {title}")
# Insert sample data - Publishers
for publisher in publishers:
    conn.execute(
        """CREATE (p:Publisher {name: $publisher});""",
        {"publisher": publisher}
    )
    print(f"Inserted publisher: {publisher}")
# Create relationships between Books and Publishers
for title, publisher in zip(titles, publishers):
    conn.execute(
        """
        MATCH (b:Book {title: $title})
        MATCH (p:Publisher {name: $publisher})
        CREATE (b)-[:PublishedBy]->(p);
        """,
        {"title": title, "publisher": publisher}
    )
    print(f"Created relationship between {title} and {publisher}")

The embeddings are generated on the title properties of each Book and ingested into the Kuzu database.

Create the Vector Index

Create a new vector index as follows:

CALL CREATE_VECTOR_INDEX(
    'table_name',                  // Name of the table containing the vector column
    'index_name',                  // Name to identify the vector index
    'column_name',                 // Name of the column containing vector embeddings
    [option_name := option_value]  // Optional parameters for index configuration
);

The following options are supported during index creation. The index is structured with two hierarchical layers: the lower layer includes all vectors, while the upper layer contains a sampled subset.

| Option | Description | Default |
| --- | --- | --- |
| mu | Max degree of nodes in the upper graph. Must be smaller than ml. A higher value leads to a more accurate index but increases the index size and construction time. | 30 |
| ml | Max degree of nodes in the lower graph. Must be larger than mu. A higher value leads to a more accurate index but increases the index size and construction time. | 60 |
| pu | Percentage of nodes sampled into the upper graph, in the range [0.0, 1.0]. | 0.05 |
| metric | Metric (distance computation) function. Supported values are cosine, l2, l2sq, and dotproduct. | cosine |
| efc | Number of candidate vertices to consider during index construction. A higher value results in a more accurate index but increases build time. | 200 |
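To make the metric option concrete, here is a small plain-Python sketch of what each supported distance computes between two vectors. This is for illustration only; Kuzu's internal implementation may differ in details such as normalization.

```python
import math

def dot(a, b):
    # Dot product of two equal-length vectors (the "dotproduct" metric
    # ranks by similarity rather than distance)
    return sum(x * y for x, y in zip(a, b))

def l2sq(a, b):
    # Squared Euclidean distance (the "l2sq" metric)
    return sum((x - y) ** 2 for x, y in zip(a, b))

def l2(a, b):
    # Euclidean distance (the "l2" metric)
    return math.sqrt(l2sq(a, b))

def cosine_distance(a, b):
    # Cosine distance: 1 - cosine similarity (the "cosine" metric)
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    return 1.0 - dot(a, b) / (norm_a * norm_b)

u, v = [1.0, 0.0], [0.0, 1.0]
print(cosine_distance(u, v))  # orthogonal vectors -> 1.0
print(l2sq(u, v))             # 2.0
print(l2(u, v))               # sqrt(2) ~= 1.414
```

Note that cosine distance ignores vector magnitude (two parallel vectors have distance 0 regardless of length), which is why it is a common default for text embeddings.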

In our example, we create a vector index on the title_embedding column of the Book table.

CALL CREATE_VECTOR_INDEX(
    'Book',
    'title_vec_index',
    'title_embedding'
);

Query the Vector Index

To perform similarity search using the vector index, use the QUERY_VECTOR_INDEX function:

// Syntax for querying the index
// Syntax for querying the index
CALL QUERY_VECTOR_INDEX(
    'table_name',                  // Name of the table
    'index_name',                  // Name of the vector index
    query_vector,                  // Vector to search for
    k,                             // Number of nearest neighbors to return
    [option_name := option_value]  // Optional parameters
) RETURN node.id ORDER BY distance;

The returned nodes can be referenced as node, and their distances from the query vector as distance. You can use YIELD to rename the result columns. By default, the result of QUERY_VECTOR_INDEX is not sorted; to sort by distance, explicitly add ORDER BY distance to the RETURN clause.
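For example, YIELD can rename the output columns before they are used in the rest of the query. This is a sketch based on the syntax above; node and distance are the column names QUERY_VECTOR_INDEX produces:

```cypher
// Rename node -> n and distance -> dist via YIELD, then sort explicitly
CALL QUERY_VECTOR_INDEX('Book', 'title_vec_index', $query_vector, 2)
YIELD node AS n, distance AS dist
RETURN n.title ORDER BY dist;
```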

Search Options

The following options can be used to tune the search behavior:

| Option | Description | Default |
| --- | --- | --- |
| efs | Number of candidate vertices to consider during search. Higher values increase accuracy but also increase search time. | 200 |
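For instance, to trade search speed for accuracy, you could raise efs at query time (a sketch following the optional-parameter syntax shown earlier):

```cypher
// Widen the candidate pool from the default 200 to 500 for higher recall
CALL QUERY_VECTOR_INDEX('Book', 'title_vec_index', $query_vector, 10, efs := 500)
RETURN node.id ORDER BY distance;
```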

Example search queries

Let’s run some example search queries on our newly created vector index. As before, we’ll use the Python client so that natural language queries can be translated into vector embeddings; in principle, any client that returns an array of floats (vector embeddings) can run the queries below.

import kuzu
from sentence_transformers import SentenceTransformer
DB_NAME = "ex_kuzu_db"
# Load a pre-trained embedding generation model
# https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
model = SentenceTransformer("all-MiniLM-L6-v2")
# Initialize the database
db = kuzu.Database(DB_NAME)
conn = kuzu.Connection(db)
# Install and load vector extension once again
conn.execute("INSTALL VECTOR;")
conn.execute("LOAD VECTOR;")
query_vector = model.encode("quantum machine learning").tolist()
result = conn.execute(
    """
    CALL QUERY_VECTOR_INDEX(
        'Book',
        'title_vec_index',
        $query_vector,
        2
    )
    RETURN node.title ORDER BY distance;
    """,
    {"query_vector": query_vector})
print(result.get_as_pl())

In the above query, we asked for the 2 nearest neighbors of the query vector “quantum machine learning”. The result is a list of book titles that are most similar to this concept.

┌───────────────────┐
│ node.title │
│ --- │
│ str │
╞═══════════════════╡
│ The Quantum World │
│ Learning Machines │
└───────────────────┘

Next, let’s use the vector index to find an entry point to the graph, following which we do a graph traversal to find the names of publishers of the books.

result = conn.execute(
    """
    CALL QUERY_VECTOR_INDEX('Book', 'title_vec_index', $query_vector, 2)
    WITH node AS n, distance
    MATCH (n)-[:PublishedBy]->(p:Publisher)
    RETURN p.name AS publisher, n.title AS book, distance
    ORDER BY distance LIMIT 5;
    """,
    {"query_vector": query_vector})
print(result.get_as_pl())

In the above query, we again asked for the 2 nearest neighbors of the query vector “quantum machine learning”. This time, we used the node and distance variables to return the book publishers, the book titles, and the distance between the query vector and each book title vector. The results are sorted by distance.

┌──────────────────────────┬───────────────────┬──────────┐
│ publisher ┆ book ┆ distance │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞══════════════════════════╪═══════════════════╪══════════╡
│ Harvard University Press ┆ The Quantum World ┆ 0.311872 │
│ Pearson ┆ Learning Machines ┆ 0.415366 │
└──────────────────────────┴───────────────────┴──────────┘

Using vector search in combination with graph traversal in this manner can be a powerful technique to find semantically related entities in a graph.

Index Management

Drop an Index

To remove a vector index, use the DROP_VECTOR_INDEX function:

// Remove an existing vector index
CALL DROP_VECTOR_INDEX('Book', 'title_vec_index');

List All Indexes

View all created indexes in the database using SHOW_INDEXES:

// Show all indexes and their properties
CALL SHOW_INDEXES() RETURN *;

Example output:

┌────────────┬─────────────────┬────────────┬───────────────────┬──────────────────┬───────────────────────────────┐
│ table name │ index name │ index type │ property names │ extension loaded │ index definition │
├────────────┼─────────────────┼────────────┼───────────────────┼──────────────────┼───────────────────────────────┤
│ Book │ title_vec_index │ HNSW │ [title_embedding] │ True │ CALL CREATE_VECTOR_INDEX(...) │
└────────────┴─────────────────┴────────────┴───────────────────┴──────────────────┴───────────────────────────────┘

Advanced Usage

Kuzu allows you to perform vector similarity search with filter predicates by combining the vector index with projected graphs.

What is a projected graph?

A projected graph is a subgraph (i.e., a subset of the original graph) that contains only the nodes and relationships that match the given table names and predicates.

You can define a projected graph in Kuzu as follows:

// Create a projected graph
CALL CREATE_PROJECTED_GRAPH(
    'projected_graph_name',     // Name of the projected graph
    {                           // Node tables to project
        'table_name': {
            'filter': 'predicate'  // Optional predicate to filter nodes
        }
    },
    [                           // Relationship tables to project
        'table_name'
    ]
);
// Drop a projected graph
CALL DROP_PROJECTED_GRAPH('projected_graph_name');

Predicate

There are several rules with projected graph predicates:

  • A predicate may depend only on its own node/relationship table; predicates involving multiple tables are not supported.
  • Since no variable is assigned to the node/relationship table, use n to reference the node table and r to reference the relationship table. Properties must therefore be written as n.property_name or r.property_name.

Life Cycle

A projected graph is kept alive until:

  • it is dropped explicitly; OR
  • the connection is closed (it is not persisted on disk).

Lazy evaluation

We evaluate predicates lazily: a predicate is only evaluated when the projected graph is queried. As a result, creating a projected graph always completes instantly.

Example filtered search using projected graph

Here’s an example from our books dataset that demonstrates how to find books similar to the concept of “quantum world” only among those published after 2010:

# Pass in an existing connection
# ...
# Step 1: Create a projected graph that filters books by publication year
conn.execute(
    """
    CALL CREATE_PROJECTED_GRAPH(
        'filtered_book',
        {'Book': {'filter': 'n.published_year > 2010'}},
        []
    );
    """
)
# Step 2: Perform vector similarity search on the filtered subset
query_vector = model.encode("quantum world").tolist()
result = conn.execute(
    """
    CALL QUERY_VECTOR_INDEX(
        'filtered_book',
        'title_vec_index',
        $query_vector,
        2
    )
    WITH node AS n, distance AS dist
    MATCH (n)-[:PublishedBy]->(p:Publisher)
    RETURN n.title AS book,
           n.published_year AS year,
           p.name AS publisher
    ORDER BY dist;
    """,
    {"query_vector": query_vector})
print(result.get_as_pl())

The result shows the two most similar books to the query “quantum world”. Although the original dataset contains a book named “The Quantum World”, it is excluded from the result because it was published in 2004.

shape: (2, 3)
┌────────────────────────────┬──────┬───────────────────────┐
│ book ┆ year ┆ publisher │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞════════════════════════════╪══════╪═══════════════════════╡
│ Chronicles of the Universe ┆ 2022 ┆ Independent Publisher │
│ Learning Machines ┆ 2019 ┆ Pearson │
└────────────────────────────┴──────┴───────────────────────┘