Vector search extension

The vector extension provides a native disk-based HNSW vector index for accelerating similarity search over vector embeddings (32-bit and 64-bit float arrays) stored in Kuzu.

The HNSW index is structured with two hierarchical layers. The lower layer includes all vectors, while the upper layer contains a sampled subset of the lower layer.

This extension provides the following functions:

  • CREATE_VECTOR_INDEX: Create a vector index
  • QUERY_VECTOR_INDEX: Query a vector index
  • DROP_VECTOR_INDEX: Drop a vector index

Usage

INSTALL VECTOR;
LOAD VECTOR;

Example dataset

Below is an example dataset of books and publishers, along with two ways to create it. The first is to use an external library, such as sentence_transformers in Python, to generate the embeddings. Alternatively, you can use Kuzu’s llm extension to create the embeddings directly in Cypher, as sketched after the script below.

create_embeddings.py
# pip install sentence-transformers
import kuzu
from sentence_transformers import SentenceTransformer

db = kuzu.Database("example.kuzu")
conn = kuzu.Connection(db)

# Load a pre-trained embedding generation model
model = SentenceTransformer("all-MiniLM-L6-v2")

conn.execute("INSTALL vector; LOAD vector;")

# Define the schema: books with a 384-dimensional title embedding, publishers,
# and a relationship between them
conn.execute("CREATE NODE TABLE Book(id SERIAL PRIMARY KEY, title STRING, title_embedding FLOAT[384], published_year INT64);")
conn.execute("CREATE NODE TABLE Publisher(name STRING PRIMARY KEY);")
conn.execute("CREATE REL TABLE PublishedBy(FROM Book TO Publisher);")

titles = [
    "The Quantum World",
    "Chronicles of the Universe",
    "Learning Machines",
    "Echoes of the Past",
    "The Dragon's Call",
]
publishers = ["Harvard University Press", "Independent Publisher", "Pearson", "McGraw-Hill Ryerson", "O'Reilly"]
published_years = [2004, 2022, 2019, 2010, 2015]

# Embed each title and insert the Book nodes
for title, published_year in zip(titles, published_years):
    embeddings = model.encode(title).tolist()
    conn.execute(
        """
        CREATE (b:Book {
            title: $title,
            title_embedding: $embeddings,
            published_year: $year
        });
        """,
        {"title": title, "year": published_year, "embeddings": embeddings},
    )
    print(f"Inserted book: {title}")

# Insert the Publisher nodes
for publisher in publishers:
    conn.execute(
        """CREATE (p:Publisher {name: $publisher});""",
        {"publisher": publisher},
    )
    print(f"Inserted publisher: {publisher}")

# Connect each book to its publisher
for title, publisher in zip(titles, publishers):
    conn.execute(
        """
        MATCH (b:Book {title: $title})
        MATCH (p:Publisher {name: $publisher})
        CREATE (b)-[:PublishedBy]->(p);
        """,
        {"title": title, "publisher": publisher},
    )
    print(f"Created relationship between {title} and {publisher}")

The embeddings are generated from the title property of each Book node and stored in the Kuzu database.
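
Alternatively, embedding generation can stay inside the database via the llm extension mentioned above. The sketch below is illustrative only: the provider and model arguments to CREATE_EMBEDDING are placeholders, its exact signature may differ across versions, the chosen model’s output dimension must match the FLOAT[384] column declared above, and a cast may be required depending on the return type. Consult the llm extension documentation before relying on it.

# Hypothetical sketch: generate embeddings inside Kuzu using the llm extension.
# <provider> and <model> are placeholders; check the llm extension docs for the
# exact CREATE_EMBEDDING signature and the providers supported by your version.
import kuzu

db = kuzu.Database("example.kuzu")
conn = kuzu.Connection(db)
conn.execute("INSTALL llm; LOAD llm;")

# Compute and store an embedding for every book title in a single Cypher statement.
# The model's output dimension must match title_embedding FLOAT[384].
conn.execute(
    """
    MATCH (b:Book)
    SET b.title_embedding = CREATE_EMBEDDING(b.title, '<provider>', '<model>');
    """
)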

Creating a vector index

Create a vector index as follows:

CALL CREATE_VECTOR_INDEX(
    <TABLE_NAME>,
    <INDEX_NAME>,
    <PROPERTY_NAME>,
    mu := 30,
    ml := 60,
    pu := 0.05,
    metric := 'cosine',
    efc := 200,
    cache_embeddings := true
);

Required arguments:

  • TABLE_NAME: The node table containing a property on which the index is to be created.
  • INDEX_NAME: The name of the vector index.
  • PROPERTY_NAME: The name of the vector property on which the index is to be created. The property must be a LIST or ARRAY of type FLOAT or DOUBLE.

Optional arguments to tune the index:

  • mu
    • Max degree of nodes in the upper graph. It should be smaller than ml.
    • A higher value leads to a more accurate index, but increases the index size and construction time.
    • Default: 30
  • ml
    • Max degree of nodes in the lower graph. It should be larger than mu.
    • A higher value leads to a more accurate index, but increases the index size and construction time.
    • Default: 60
  • pu
    • Percentage of nodes sampled into the upper graph.
    • Supported values: [0.0, 1.0]
    • Default: 0.05
  • metric
    • The metric (distance function) used to compare vectors.
    • Supported values: cosine, l2, l2sq, dotproduct
    • Default: cosine
  • efc
    • The number of candidate vertices to consider during the construction of the index.
    • A higher value will result in a more accurate index, but will also increase the time it takes to build the index.
    • Default: 200
  • cache_embeddings
    • Determines whether the embeddings column should be fully cached in memory during the index construction.
    • This will decrease the amount of time needed to construct the index but will increase the memory usage. We recommend keeping this value set to true unless you are in a memory-constrained environment.
    • Default: true

Example

You can create a vector index on the title_embedding property of the Book table as follows:

CALL CREATE_VECTOR_INDEX(
    'Book',
    'book_title_index',
    'title_embedding',
    metric := 'l2'
);
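
The same index creation can also be issued from Python on the connection used earlier, for example:

# Build the HNSW index on Book.title_embedding using L2 distance.
conn.execute(
    """
    CALL CREATE_VECTOR_INDEX(
        'Book',
        'book_title_index',
        'title_embedding',
        metric := 'l2'
    );
    """
)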

Query the vector index

To perform similarity search using the vector index, use the QUERY_VECTOR_INDEX function:

CALL QUERY_VECTOR_INDEX(
    <TABLE_NAME>,
    <INDEX_NAME>,
    <QUERY_VECTOR>,
    <K>,
    efs := 200
)
RETURN node.id, distance;

Required arguments:

  • TABLE_NAME: The node table on which the index was created.
    • Type: STRING
  • INDEX_NAME: The name of the vector index.
    • Type: STRING
  • QUERY_VECTOR: The vector to search for.
    • Type: LIST[FLOAT]
  • K: The number of nearest neighbors to return.
    • Type: INT64

Optional arguments to tune the search behavior:

  • efs: The number of candidate vertices to consider during search. A higher value will result in a more accurate search, but will also increase the time it takes to search.
    • Type: INT64
    • Default: 200

Returns:

  • node: The node object.
  • distance: The distance between the query vector and the node’s vector.

Example

Let’s run some example search queries on our newly created vector index.

import kuzu
from sentence_transformers import SentenceTransformer

# Initialize the database
db = kuzu.Database("example.kuzu")
conn = kuzu.Connection(db)

# Install and load the vector extension once again
conn.execute("INSTALL VECTOR;")
conn.execute("LOAD VECTOR;")

# Load a pre-trained embedding generation model
# https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed the search phrase with the same model used for the book titles
query_vector = model.encode("quantum machine learning").tolist()

result = conn.execute(
    """
    CALL QUERY_VECTOR_INDEX(
        'Book',
        'book_title_index',
        $query_vector,
        $limit,
        efs := 500
    )
    RETURN node.title
    ORDER BY distance;
    """,
    {"query_vector": query_vector, "limit": 2},
)
print(result.get_as_pl())

The above query asks for the 2 nearest neighbors of the query vector “quantum machine learning”. The result is a list of book titles that are most similar to this concept.

┌───────────────────┐
│ node.title        │
│ ---               │
│ str               │
╞═══════════════════╡
│ The Quantum World │
│ Learning Machines │
└───────────────────┘

Next, let’s use the vector index to find entry points into the graph, and then traverse the graph to find the names of the publishers of those books.

result = conn.execute(
    """
    CALL QUERY_VECTOR_INDEX('Book', 'book_title_index', $query_vector, 2)
    WITH node AS n, distance
    MATCH (n)-[:PublishedBy]->(p:Publisher)
    RETURN p.name AS publisher, n.title AS book, distance
    ORDER BY distance
    LIMIT 5;
    """,
    {"query_vector": query_vector},
)
print(result.get_as_pl())

The above query asks for the 2 nearest neighbors of the query vector “quantum machine learning”. Then, it uses the node and distance variables to further query the graph and return rows sorted by distance.

┌──────────────────────────┬───────────────────┬──────────┐
│ publisher                ┆ book              ┆ distance │
│ ---                      ┆ ---               ┆ ---      │
│ str                      ┆ str               ┆ f64      │
╞══════════════════════════╪═══════════════════╪══════════╡
│ Harvard University Press ┆ The Quantum World ┆ 0.311872 │
│ Pearson                  ┆ Learning Machines ┆ 0.415366 │
└──────────────────────────┴───────────────────┴──────────┘

Using vector search in combination with graph traversal in this manner can be a powerful technique to find semantically related entities in a graph.

Index management

Drop an index

To remove a vector index, use the DROP_VECTOR_INDEX function:

CALL DROP_VECTOR_INDEX('Book', 'book_title_index');
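
The extension’s functions shown above are create, query, and drop; to rebuild an index with different parameters, drop it and create it again. A minimal Python sketch using the same functions:

# Drop the existing index, then rebuild it (here with the cosine metric instead of l2).
conn.execute("CALL DROP_VECTOR_INDEX('Book', 'book_title_index');")
conn.execute(
    """
    CALL CREATE_VECTOR_INDEX(
        'Book',
        'book_title_index',
        'title_embedding',
        metric := 'cosine'
    );
    """
)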

List all indexes

View all created indexes in the database using SHOW_INDEXES:

CALL SHOW_INDEXES() RETURN *;

Example output:

┌────────────┬──────────────────┬────────────┬───────────────────┬──────────────────┬───────────────────────────────┐
│ table name │ index name       │ index type │ property names    │ extension loaded │ index definition              │
├────────────┼──────────────────┼────────────┼───────────────────┼──────────────────┼───────────────────────────────┤
│ Book       │ book_title_index │ HNSW       │ [title_embedding] │ True             │ CALL CREATE_VECTOR_INDEX(...) │
└────────────┴──────────────────┴────────────┴───────────────────┴──────────────────┴───────────────────────────────┘
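
From Python, the same listing can be fetched and inspected as a dataframe, for example:

# List all indexes and load the result into a Polars dataframe.
result = conn.execute("CALL SHOW_INDEXES() RETURN *;")
print(result.get_as_pl())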

Filtered vector search

Kuzu allows you to combine vector search with filter predicates by using projected graphs.

For example, we can search for books similar to “quantum world”, but only those that were published after 2010.

# Pass in an existing connection
# ...
# Step 1: Create a projected graph that filters books by publication year
conn.execute(
    """
    CALL PROJECT_GRAPH(
        'filtered_book',
        {'Book': 'n.published_year > 2010'},
        []
    );
    """
)

# Step 2: Perform vector similarity search on the filtered subset
query_vector = model.encode("quantum world").tolist()
result = conn.execute(
    """
    CALL QUERY_VECTOR_INDEX(
        'filtered_book',
        'book_title_index',
        $query_vector,
        2
    )
    WITH node AS n, distance AS dist
    MATCH (n)-[:PublishedBy]->(p:Publisher)
    RETURN n.title AS book,
           n.published_year AS year,
           p.name AS publisher
    ORDER BY dist;
    """,
    {"query_vector": query_vector},
)
print(result.get_as_pl())

The result shows the two most similar books to the query “quantum world”. Although we have a book named “The Quantum World” in the original dataset, it does not appear in the results because it was published before 2010.

shape: (2, 3)
┌────────────────────────────┬──────┬───────────────────────┐
│ book                       ┆ year ┆ publisher             │
│ ---                        ┆ ---  ┆ ---                   │
│ str                        ┆ i64  ┆ str                   │
╞════════════════════════════╪══════╪═══════════════════════╡
│ Chronicles of the Universe ┆ 2022 ┆ Independent Publisher │
│ Learning Machines          ┆ 2019 ┆ Pearson               │
└────────────────────────────┴──────┴───────────────────────┘

Filtered vector search with arbitrary Cypher queries

To run filtered search with arbitrary Cypher queries, you can create a projected graph using PROJECT_GRAPH_CYPHER.

CALL PROJECT_GRAPH_CYPHER(
    <GRAPH_NAME>,
    <CYPHER_STATEMENT>
);

  • GRAPH_NAME: Name of the projected graph
    • Type: STRING
  • CYPHER_STATEMENT: A Cypher statement that returns a single node variable
    • Type: STRING

The Cypher statement can contain arbitrary pattern matching, but its RETURN clause must contain a single node variable whose label matches the table on which the vector index is built.

The following example creates a projected graph pearson_book that contains only the books published by Pearson.

CALL PROJECT_GRAPH_CYPHER(
    'pearson_book', // Name of the projected graph
    'MATCH (b:Book)-[:PublishedBy]->(p:Publisher {name: "Pearson"}) RETURN b'
);

You can then replace filtered_book with pearson_book in the above examples.
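
For instance, here is a sketch of the filtered query against pearson_book, assuming the conn and model objects from the earlier Python examples, the same "quantum world" query text, and that the pearson_book projected graph above has been created on the same connection:

# Search only among books published by Pearson via the pearson_book projected graph.
query_vector = model.encode("quantum world").tolist()
result = conn.execute(
    """
    CALL QUERY_VECTOR_INDEX(
        'pearson_book',
        'book_title_index',
        $query_vector,
        2
    )
    RETURN node.title AS book, distance
    ORDER BY distance;
    """,
    {"query_vector": query_vector},
)
print(result.get_as_pl())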