Vector search
The VECTOR extension provides an on-disk HNSW-based vector index for accelerating similarity search over float array columns in tables.
The vector extension provides the following functions:
Function | Description |
---|---|
CREATE_VECTOR_INDEX | Create the index |
QUERY_VECTOR_INDEX | Query the index |
DROP_VECTOR_INDEX | Drop the index |
Installation
To get started with the vector index extension, you need to first install and load the extension following the Using Extensions in Kuzu instructions.
INSTALL VECTOR;
LOAD VECTOR;
Basic Usage
Below is an example demonstrating how to create and use a vector index on a Book table.
Because we’ll be working with natural language texts that need to be translated into vector embeddings,
we’ll use the Python client to run our queries. In principle, you can use any client code that returns
an array of floats (vector embeddings) to run the queries below.
# pip install sentence-transformers
import kuzu
from sentence_transformers import SentenceTransformer

DB_NAME = "ex_kuzu_db"

# Load a pre-trained embedding generation model
# https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
model = SentenceTransformer("all-MiniLM-L6-v2")

# Initialize the database
db = kuzu.Database(DB_NAME)
conn = kuzu.Connection(db)

# Install and load vector extension
conn.execute("INSTALL vector; LOAD vector;")

# Create tables
conn.execute("CREATE NODE TABLE Book(id SERIAL PRIMARY KEY, title STRING, title_embedding FLOAT[384], published_year INT64);")
conn.execute("CREATE NODE TABLE Publisher(name STRING PRIMARY KEY);")
conn.execute("CREATE REL TABLE PublishedBy(FROM Book TO Publisher);")

# Sample data
titles = [
    "The Quantum World",
    "Chronicles of the Universe",
    "Learning Machines",
    "Echoes of the Past",
    "The Dragon's Call"
]
publishers = ["Harvard University Press", "Independent Publisher", "Pearson", "McGraw-Hill Ryerson", "O'Reilly"]
published_years = [2004, 2022, 2019, 2010, 2015]

# Insert sample data - Books with embeddings
for title, published_year in zip(titles, published_years):
    # Convert title to a 384-dimensional embedding vector
    embeddings = model.encode(title).tolist()
    conn.execute(
        """CREATE (b:Book {title: $title, title_embedding: $embeddings, published_year: $year});""",
        {"title": title, "embeddings": embeddings, "year": published_year}
    )
    print(f"Inserted book: {title}")

# Insert sample data - Publishers
for publisher in publishers:
    conn.execute(
        """CREATE (p:Publisher {name: $publisher});""",
        {"publisher": publisher}
    )
    print(f"Inserted publisher: {publisher}")

# Create relationships between Books and Publishers
for title, publisher in zip(titles, publishers):
    conn.execute("""
        MATCH (b:Book {title: $title})
        MATCH (p:Publisher {name: $publisher})
        CREATE (b)-[:PublishedBy]->(p);
        """,
        {"title": title, "publisher": publisher}
    )
    print(f"Created relationship between {title} and {publisher}")
The embeddings are generated from the title property of each Book and ingested into the Kuzu database.
Create the Vector Index
Create a new vector index as follows:
CALL CREATE_VECTOR_INDEX(
    'table_name',   // Name of the table containing the vector column
    'index_name',   // Name to identify the vector index
    'column_name',  // Name of the column containing vector embeddings
    [option_name := option_value]  // Optional parameters for index configuration
);
We support the following options during index creation. Our index is structured with two hierarchical layers. The lower layer includes all vectors, while the upper layer contains a sampled subset.
Option | Description | Default |
---|---|---|
mu | Max degree of nodes in the upper graph. It should be smaller than ml. A higher value leads to a more accurate index, but increases the index size and construction time. | 30 |
ml | Max degree of nodes in the lower graph. It should be larger than mu. A higher value leads to a more accurate index, but increases the index size and construction time. | 60 |
pu | Fraction of nodes sampled into the upper graph, in the range [0.0, 1.0]. | 0.05 |
metric | Metric (distance computation) function. Supported values are cosine, l2, l2sq, and dotproduct. | cosine |
efc | The number of candidate vertices to consider during the construction of the index. A higher value results in a more accurate index, but also increases the time it takes to build the index. | 200 |
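To make the metric option more concrete, the sketch below computes the four supported metrics in plain Python using their common textbook definitions. Treat these definitions as an assumption: the exact conventions Kuzu applies internally (for example, whether dotproduct is negated so that smaller values mean closer vectors) are not stated on this page, and the function names are hypothetical.

```python
import math

def l2sq(a, b):
    """Squared Euclidean distance (assumed meaning of 'l2sq')."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def l2(a, b):
    """Euclidean distance (assumed meaning of 'l2')."""
    return math.sqrt(l2sq(a, b))

def dot_product(a, b):
    """Plain inner product (assumed basis of 'dotproduct')."""
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus cosine similarity (assumed meaning of 'cosine')."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return 1.0 - dot_product(a, b) / (norm_a * norm_b)

# Two orthogonal unit vectors as a small worked example
a = [1.0, 0.0]
b = [0.0, 1.0]
print(l2sq(a, b), l2(a, b), dot_product(a, b), cosine_distance(a, b))
```

Note that l2 and l2sq rank neighbors identically, since the square root is monotonic; l2sq simply avoids computing it.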
In our example, we create a vector index over the title_embedding column of the Book table.
CALL CREATE_VECTOR_INDEX(
    'Book',
    'title_vec_index',
    'title_embedding'
);

# Define connection to the database
# ...
conn.execute(
    """
    CALL CREATE_VECTOR_INDEX(
        'Book',
        'title_vec_index',
        'title_embedding'
    );
    """
)
print("Vector index created")
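If you want to pass non-default options, a small helper can assemble the statement. The sketch below is hypothetical (build_create_index_query is not part of Kuzu) and assumes that options follow the option_name := option_value form shown in the syntax above, with string values such as the metric quoted. It interpolates names directly into the query text, so it should only be used with trusted identifiers.

```python
def build_create_index_query(table, index, column, **options):
    """Assemble a CREATE_VECTOR_INDEX call as a string (hypothetical helper)."""
    args = [f"'{table}'", f"'{index}'", f"'{column}'"]
    for name, value in options.items():
        # Assumption: string options are quoted, numeric options are passed as-is.
        rendered = f"'{value}'" if isinstance(value, str) else str(value)
        args.append(f"{name} := {rendered}")
    return "CALL CREATE_VECTOR_INDEX(" + ", ".join(args) + ");"

query = build_create_index_query(
    "Book", "title_vec_index", "title_embedding", metric="l2", efc=100
)
print(query)
```

The resulting string can then be passed to conn.execute(query) on an open connection.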
Query the Vector Index
To perform similarity search using the vector index, use the QUERY_VECTOR_INDEX function:
// Syntax for querying the index
CALL QUERY_VECTOR_INDEX(
    'table_name',   // Name of the table
    'index_name',   // Name of the vector index
    query_vector,   // Vector to search for
    k,              // Number of nearest neighbors to return
    [option_name := option_value]  // Optional parameters
)
RETURN node.id
ORDER BY distance;
The function returns the matching nodes, which can be referenced by node, and their distance from the query vector, which can be referenced by distance.
You can use YIELD to rename the result columns. See the documentation on YIELD for more details.
By default, the result returned from QUERY_VECTOR_INDEX is not sorted.
To get results sorted by distance, you need to explicitly specify ORDER BY distance in the RETURN clause.
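For intuition about what a nearest-first result looks like, here is a minimal brute-force sketch in plain Python. It is not how the HNSW index works (the index is approximate and avoids scanning every vector); it assumes cosine distance defined as one minus cosine similarity, and the function name and toy vectors are hypothetical. An exact scan like this can serve as a ground truth when checking index results on small datasets.

```python
import math

def exact_top_k(query, vectors, k):
    """Return the k (id, distance) pairs closest to query, nearest first."""
    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / (norm_a * norm_b)

    scored = [(vec_id, cosine_distance(query, vec)) for vec_id, vec in vectors.items()]
    # Equivalent of ORDER BY distance: sort ascending before truncating to k
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

# Toy 2-dimensional vectors keyed by a made-up id
vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
top = exact_top_k([1.0, 0.1], vectors, 2)
print(top)
```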
Search Options
The following options can be used to tune the search behavior:
Option | Description | Default |
---|---|---|
efs | Number of candidate vertices to consider during search. Higher values increase accuracy but also increase search time. | 200 |
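One common way to choose a value for efs is to measure recall against an exact search for a sample of queries, and raise efs until recall is acceptable. The sketch below shows the recall@k computation only; the function name and the example ids are hypothetical, and how you obtain the exact neighbors (for instance, a brute-force scan) is left to you.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact nearest neighbors that the approximate search found."""
    exact = set(exact_ids)
    if not exact:
        return 1.0  # Nothing to find, so nothing was missed
    return len(exact.intersection(approx_ids)) / len(exact)

# Made-up ids: the approximate search missed one of the three true neighbors
print(recall_at_k([3, 7, 9], [3, 7, 8]))
```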
Example search queries
Let’s run some example search queries on our newly created vector index. Because we’ll be working with natural language queries that need to be translated into vector embeddings, we’ll use the Python client to run our queries. In principle, you can use any client code that returns an array of floats (vector embeddings) to run the queries below.
import kuzu
from sentence_transformers import SentenceTransformer

DB_NAME = "ex_kuzu_db"

# Load a pre-trained embedding generation model
# https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
model = SentenceTransformer("all-MiniLM-L6-v2")

# Initialize the database
db = kuzu.Database(DB_NAME)
conn = kuzu.Connection(db)

# Install and load vector extension once again
conn.execute("INSTALL VECTOR;")
conn.execute("LOAD VECTOR;")

query_vector = model.encode("quantum machine learning").tolist()
result = conn.execute(
    """
    CALL QUERY_VECTOR_INDEX(
        'Book',
        'title_vec_index',
        $query_vector,
        2
    )
    RETURN node.title ORDER BY distance;
    """,
    {"query_vector": query_vector}
)
print(result.get_as_pl())
In the above query, we asked for the 2 nearest neighbors of the query vector “quantum machine learning”. The result is a list of book titles that are most similar to this concept.
┌───────────────────┐
│ node.title        │
│ ---               │
│ str               │
╞═══════════════════╡
│ The Quantum World │
│ Learning Machines │
└───────────────────┘
Next, let’s use the vector index to find an entry point to the graph, following which we do a graph traversal to find the names of publishers of the books.
result = conn.execute(
    """
    CALL QUERY_VECTOR_INDEX('Book', 'title_vec_index', $query_vector, 2)
    WITH node AS n, distance
    MATCH (n)-[:PublishedBy]->(p:Publisher)
    RETURN p.name AS publisher, n.title AS book, distance
    ORDER BY distance LIMIT 5;
    """,
    {"query_vector": query_vector}
)
print(result.get_as_pl())
In the above query, we once again asked for the 2 nearest neighbors of the query vector “quantum machine learning”.
But this time, we use the node and distance variables to return the book publishers, the book titles, and the distance between the query vector and the book title vector. The results are sorted by distance.
┌──────────────────────────┬───────────────────┬──────────┐
│ publisher                ┆ book              ┆ distance │
│ ---                      ┆ ---               ┆ ---      │
│ str                      ┆ str               ┆ f64      │
╞══════════════════════════╪═══════════════════╪══════════╡
│ Harvard University Press ┆ The Quantum World ┆ 0.311872 │
│ Pearson                  ┆ Learning Machines ┆ 0.415366 │
└──────────────────────────┴───────────────────┴──────────┘
Using vector search in combination with graph traversal in this manner can be a powerful technique to find semantically related entities in a graph.
Index Management
Drop an Index
To remove a vector index, use the DROP_VECTOR_INDEX function:
// Remove an existing vector index
CALL DROP_VECTOR_INDEX('Book', 'title_vec_index');
List All Indexes
View all created indexes in the database using SHOW_INDEXES:
// Show all indexes and their properties
CALL SHOW_INDEXES() RETURN *;
Example output:
┌────────────┬─────────────────┬────────────┬───────────────────┬──────────────────┬───────────────────────────────┐
│ table name │ index name      │ index type │ property names    │ extension loaded │ index definition              │
├────────────┼─────────────────┼────────────┼───────────────────┼──────────────────┼───────────────────────────────┤
│ Book       │ title_vec_index │ HNSW       │ [title_embedding] │ True             │ CALL CREATE_VECTOR_INDEX(...) │
└────────────┴─────────────────┴────────────┴───────────────────┴──────────────────┴───────────────────────────────┘
Advanced Usage
Filtered vector search
Kuzu allows you to perform vector similarity search with filter predicates by combining the vector index with projected graphs.
What is a projected graph?
A projected graph is a subgraph (i.e., a subset of the original graph) that contains only the nodes and relationships that match the given table names and predicates.
You can define a projected graph in Kuzu as follows:
// Create a projected graph
CALL CREATE_PROJECTED_GRAPH(
    'projected_graph_name',  // Name of the projected graph
    {                        // Node tables to project
        'table_name': {
            'filter': 'predicate'  // Optional predicate to filter nodes
        }
    },
    [                        // Relationship tables to project
        'table_name'
    ]
);

// Drop a projected graph
CALL DROP_PROJECTED_GRAPH('projected_graph_name');
Predicate
There are several rules for projected graph predicates:
- The predicate must depend only on its own node/relationship table, i.e., predicates involving multiple tables are not supported.
- Since we don’t assign a variable to the node/relationship table, we use n to reference the node table and r to reference the relationship table. Properties therefore need to be written in the form n.property_name or r.property_name.
Life Cycle
A projected graph is kept alive until:
- it is dropped explicitly; OR
- the connection is closed (projected graphs are not persisted on disk).
Lazy evaluation
We use lazy evaluation to evaluate the predicate. The predicate is only evaluated when the projected graph is queried. As a result, the operation that creates a projected graph always runs instantly.
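The idea of lazy evaluation can be illustrated with a toy Python class that stores the predicate when the projection is created and applies it only when the projection is queried. This is a conceptual sketch under our own assumptions, not Kuzu's implementation; the LazyProjection class and its evaluations counter are hypothetical.

```python
class LazyProjection:
    """Stores a predicate at creation time and applies it only when queried."""

    def __init__(self, rows, predicate):
        # Creation only records the inputs; no row is inspected here.
        self.rows = rows
        self.predicate = predicate
        self.evaluations = 0

    def query(self):
        """Evaluate the predicate against every row and return the matches."""
        result = []
        for row in self.rows:
            self.evaluations += 1
            if self.predicate(row):
                result.append(row)
        return result

books = [
    {"title": "The Quantum World", "published_year": 2004},
    {"title": "Learning Machines", "published_year": 2019},
]
projection = LazyProjection(books, lambda n: n["published_year"] > 2010)
evaluations_before_query = projection.evaluations  # still zero at this point
matching = projection.query()
print(evaluations_before_query, [b["title"] for b in matching])
```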
Example filtered search using projected graph
Here’s an example from our books dataset that demonstrates how to find books similar to the concept of “quantum world” only among those published after 2010:
# Pass in an existing connection
# ...

# Step 1: Create a projected graph that filters books by publication year
conn.execute(
    """
    CALL CREATE_PROJECTED_GRAPH(
        'filtered_book',
        {'Book': {'filter': 'n.published_year > 2010'}},
        []
    );
    """
)

# Step 2: Perform vector similarity search on the filtered subset
query_vector = model.encode("quantum world").tolist()
result = conn.execute("""
    CALL QUERY_VECTOR_INDEX(
        'filtered_book',
        'title_vec_index',
        $query_vector,
        2
    )
    WITH node AS n, distance AS dist
    MATCH (n)-[:PublishedBy]->(p:Publisher)
    RETURN n.title AS book,
           n.published_year AS year,
           p.name AS publisher
    ORDER BY dist;
    """,
    {"query_vector": query_vector})
print(result.get_as_pl())
// Step 1: Create a projected graph that filters books by publication year
CALL CREATE_PROJECTED_GRAPH(
    'filtered_book',  // Name of the projected graph
    // Projected node table Book with a filter on published_year.
    // `n` is a placeholder here to reference the node table.
    {'Book': {'filter': 'n.published_year > 2010'}},
    []  // No relationship tables can be projected.
);
// Step 2: Perform vector similarity search on the filtered subset
// In the `QUERY_VECTOR_INDEX` function, we can pass in the name of the projected graph as the `table_name` parameter.
CALL QUERY_VECTOR_INDEX(
    'filtered_book',    // Name of the projected graph
    'title_vec_index',  // Name of the index
    [-0.09874727576971054, -0.019663318991661072, 0.026379680261015892, 0.03300049901008606, -0.08525621145963669, 0.03913292661309242, -0.008519773371517658, -0.10472023487091064, -0.07655946910381317, -0.02551322802901268, -0.008726594969630241, -0.004044667351990938, -0.034863777458667755, 0.019577588886022568, -0.053255144506692886, 0.06925362348556519, 0.005541415419429541, -0.007712417747825384, -0.04724293202161789, -0.07092397660017014, -0.00885140523314476, 0.057485658675432205, -0.01551036350429058, 0.007519498933106661, 0.08363034576177597, -0.011691797524690628, 0.07842060178518295, 0.0699118822813034, 0.03983190655708313, -0.04736096411943436, 0.05344963073730469, 0.018938563764095306, -0.007491335738450289, -0.06082497537136078, -0.0507962740957737, 0.04239414632320404, -0.018963998183608055, 0.007558973040431738, 0.042515095323324203, -0.026247873902320862, 0.04227885231375694, 0.031660303473472595, -0.04580925032496452, 0.030136926099658012, 0.06916909664869308, 0.1163673847913742, 0.08413344621658325, 0.00152504607103765, -0.024930620566010475, -0.10900826007127762, -0.055435676127672195, 0.034742217510938644, -0.0032851770520210266, 0.0435388945043087, 0.006796618457883596, -0.00048406978021375835, 0.07152412086725235, -0.01019048597663641, -0.03555936738848686, -0.06622810661792755, -0.025580912828445435, -0.09381087869405746, -0.09504663944244385, 0.02437695302069187, 0.134092777967453, 0.027306102216243744, -0.013635482639074326, 0.04523690044879913, -0.045877039432525635, -0.010559560731053352, -0.034854717552661896, -0.01888713240623474, -0.045843612402677536, 0.023764878511428833, 0.061249908059835434, -0.039338767528533936, 0.07242103666067123,
0.0527820959687233, 0.043525248765945435, -0.0003186847607139498, -0.02600417472422123, -0.10006128996610641, 0.016926968470215797, 0.008328817784786224, 0.09273956716060638, 0.019676057621836662, -0.08953194320201874, 0.006387700792402029, -0.081199049949646, -0.06670287251472473, -0.0417928621172905, -0.045379575341939926, -0.0008021353278309107, -0.09036444127559662, -0.029995467513799667, 0.029478128999471664, 0.026864662766456604, -0.06496299803256989, 0.012791202403604984, 0.09779316186904907, 0.03545205295085907, 0.04149294272065163, 0.019831914454698563, 0.002529511693865061, 0.0461372546851635, -0.008579960092902184, 0.026133403182029724, 0.01814635843038559, 0.07846872508525848, -0.09352493286132812, 0.022968290373682976, 0.01182691939175129, 0.012175423093140125, -0.004965497180819511, 7.896034367149696e-05, -0.021726293489336967, 0.036760929971933365, 0.08447402715682983, -0.046395111829042435, 0.05701972171664238, -0.050493594259023666, 6.76907438901253e-05, -0.002953510032966733, 0.0860934630036354, -0.02953658252954483, -0.03056415542960167, -0.12899217009544373, -2.064106550261511e-33, 0.015488152392208576, -0.03342021629214287, 0.015619009733200073, -0.010455731302499771, 0.05864442139863968, -0.02653684839606285, 0.07258284837007523, -0.08778553456068039, 0.018826346844434738, -0.016216641291975975, -0.007404595613479614, -0.03554968908429146, 0.008319997228682041, -0.02855130471289158, 0.03193259984254837, 0.018047798424959183, -0.06740124523639679, -0.007899004966020584, 0.03631879389286041, -0.02133457362651825, 0.04704872891306877, -0.05590340495109558, -0.045616090297698975, 0.021392134949564934, -0.07849449664354324, -0.023068571463227272, 0.11903978884220123, -0.0018944559851661325, -0.04567766189575195, 0.008722291328012943, 0.006837415508925915, 0.07824835181236267, -0.08694804459810257, -0.005149406846612692, 0.0592142678797245, 0.020647693425416946, 0.019302235916256905, 0.05669999122619629, 0.004759966395795345, -0.06363297998905182, 
-0.03863728791475296, -0.07707936316728592, 0.06844320148229599, -0.005478093400597572, -0.020625025033950806, 0.0005605336045846343, 0.02365974709391594, -0.0857347846031189, 0.0040472401306033134, -0.02703179605305195, 0.07371301203966141, -0.08515466749668121, -0.05958851799368858, -0.00036977347917854786, 0.045746318995952606, 0.037566229701042175, 0.037339646369218826, 0.01565920189023018, -0.029106922447681427, -0.026880435645580292, 0.02145109511911869, 0.020097855478525162, 0.004611643496900797, 0.05472870543599129, -0.016886215656995773, 0.06590874493122101, 0.013731150887906551, -0.018520070239901543, 0.003863745369017124, 0.09537586569786072, 0.006862230598926544, 0.04087948799133301, 0.048243604600429535, -0.07596695423126221, 0.08784883469343185, -0.04977760463953018, -0.01720358058810234, -0.04158437252044678, -0.007139154709875584, 0.005755842197686434, -0.005658647511154413, -0.03917710483074188, -0.047673579305410385, 0.022293440997600555, -0.02965836226940155, -0.1400938630104065, -0.06738895922899246, -0.06697329133749008, -0.016933375969529152, -0.0017664943588897586, -0.0932610034942627, -0.024097386747598648, 0.11550511419773102, 0.05578150972723961, -0.12074088305234909, 1.1273514286713838e-33, -0.11707723885774612, 0.07049695402383804, 0.03651612997055054, 0.06814820319414139, -0.021599706262350082, -0.01940377987921238, -0.02033681608736515, -0.011085398495197296, -0.00800390262156725, 0.05900853872299194, 0.05606481805443764, 0.07885222882032394, 0.08607453852891922, 0.05710339546203613, -0.04390961676836014, 0.09391780197620392, 0.022916318848729134, -0.01837819069623947, 0.06715855002403259, 0.012958995997905731, 0.0018168555106967688, 0.009906788356602192, 0.028201734647154808, -0.014642609283328056, 0.02585628256201744, 0.058014560490846634, 0.05878483131527901, 0.03323346748948097, 0.026972956955432892, -0.03851533681154251, 0.011257020756602287, -0.10183282196521759, -0.06838149577379227, 0.02577545866370201, -0.09795572608709335, 
0.03309926018118858, 0.02638947032392025, 0.038076382130384445, 0.0023347269743680954, 0.025551890954375267, -0.034650497138500214, -0.01582970656454563, -0.08356215059757233, -0.009231070056557655, 0.004009858705103397, -0.036289189010858536, 0.003509229514747858, 0.011391056701540947, -0.061331845819950104, 0.01540123950690031, -0.0011264500208199024, 0.10097002238035202, -0.013182193972170353, 0.011979017406702042, -0.13199655711650848, 0.02036047913134098, -0.017511453479528427, 0.05499761179089546, 0.11070820689201355, 0.014930548146367073, -0.09360553324222565, -0.019297722727060318, 0.007204619701951742, 0.09441079944372177, -0.00927896611392498, -0.04180166870355606, -0.04578719288110733, 0.05872377008199692, 0.03292301297187805, -0.020148921757936478, -0.034446872770786285, 0.024236811324954033, 0.021329764276742935, -0.01295411866158247, 0.019372068345546722, -0.007271720562130213, -0.019697433337569237, -0.053335949778556824, -0.011210829950869083, -0.0036429560277611017, 0.018726753070950508, -0.01846127398312092, -0.026368536055088043, 0.06836844235658646, 0.03631875663995743, 0.04551452025771141, 0.09708166867494583, -0.0013076072791591287, 0.038096629083156586, -0.07843992859125137, 0.02037520334124565, 0.04456569254398346, 0.026434291154146194, -0.032406117767095566, -0.03527767211198807, -1.0356031587832604e-08, -0.017991339787840843, -0.038019757717847824, 0.046414148062467575, -0.04335968568921089, 0.13680949807167053, 0.01109931617975235, 0.004367447458207607, 0.018392788246273994, -0.07819430530071259, -0.05922280624508858, 0.02490466833114624, -0.024358082562685013, -0.06704910099506378, 0.02667614072561264, 0.03479555994272232, 0.027538081631064415, -0.02667420543730259, -0.010268918238580227, -0.004408948123455048, 0.01541890762746334, 0.09345494210720062, 0.03375879302620888, 0.0555303655564785, 0.02469305880367756, 0.009483843110501766, -0.057700056582689285, -0.03969407081604004, 0.0016294082161039114, 0.02770363539457321, 
0.08777989447116852, -0.07391007244586945, 0.07368578761816025, 0.07130586355924606, 0.05937817692756653, -0.029533831402659416, -0.008655997924506664, -0.010082626715302467, -0.0797964483499527, -0.025537142530083656, 0.0783148780465126, -0.06385865062475204, 0.059156548231840134, -0.0749022364616394, 0.0023516560904681683, 0.0034361558500677347, 0.003786548040807247, 0.05568376183509827, -0.09200930595397949, 0.09327029436826706, 0.08218848705291748, 0.07302675396203995, 0.02313903532922268, 0.0677151307463646, -0.0498654730618, 0.0007056242320686579, 0.07477487623691559, 0.008759301155805588, -0.06420709192752838, -0.035795655101537704, 0.053932011127471924, 0.031011711806058884, 0.025613466277718544, -0.036662112921476364, -0.008528689853847027],
    2
)
WITH node AS n, distance AS dist
MATCH (n)-[:PublishedBy]->(p:Publisher)
RETURN n.title AS book,
       n.published_year AS year,
       p.name AS publisher
ORDER BY dist;
The result shows the two books most similar to the query “quantum world” among those published after 2010. Although we have a book named “The Quantum World” in the original dataset, it’s left out of the result because it was published in 2004.
shape: (2, 3)
┌────────────────────────────┬──────┬───────────────────────┐
│ book                       ┆ year ┆ publisher             │
│ ---                        ┆ ---  ┆ ---                   │
│ str                        ┆ i64  ┆ str                   │
╞════════════════════════════╪══════╪═══════════════════════╡
│ Chronicles of the Universe ┆ 2022 ┆ Independent Publisher │
│ Learning Machines          ┆ 2019 ┆ Pearson               │
└────────────────────────────┴──────┴───────────────────────┘