Full-text search extension

The full-text search (FTS) extension can be used to efficiently search within the contents of any string property and return results based on a similarity score. The extension works by building an index on specified string properties.

Currently, Kuzu supports FTS indexes only on STRING properties of node tables.

Usage

INSTALL FTS;
LOAD FTS;

Example dataset

Let’s create a Book table containing each book’s title and abstract.

CREATE NODE TABLE Book (ID SERIAL PRIMARY KEY, abstract STRING, title STRING);
CREATE (b:Book {
  abstract: 'An exploration of quantum mechanics.',
  title: 'The Quantum World'
});
CREATE (b:Book {
  abstract: 'A magic journey through time and space.',
  title: 'Chronicles of the Universe'
});
CREATE (b:Book {
  abstract: 'An introduction to machine learning techniques.',
  title: 'Learning Machines'
});
CREATE (b:Book {
  abstract: 'A deep dive into the history of ancient civilizations.',
  title: 'Echoes of the Past'
});
CREATE (b:Book {
  abstract: 'A fantasy tale of dragons and magic.',
  title: 'The Dragon\'s Call'
});

Create an FTS index

The CREATE_FTS_INDEX function can be used to create a full-text search index on a node table.

CALL CREATE_FTS_INDEX(
  <TABLE_NAME>,
  <INDEX_NAME>,
  [<PROPERTY1>, <PROPERTY2>, ...], // PROPERTIES
  stemmer := 'porter',
  stopwords := <STRING>
);

Required arguments:

TABLE_NAME: The name of the node table to build an FTS index on.
- Type: STRING
INDEX_NAME: The name of the FTS index to create.
- Type: STRING
PROPERTIES: A list of properties in the table to build the FTS index on. Full-text search only searches these properties.
- Type: STRING[]

Optional arguments:

stemmer: The text normalization technique to use.
- Type: STRING
- Accepted values: arabic, basque, catalan, danish, dutch, english, finnish, french, german, greek, hindi, hungarian, indonesian, irish, italian, lithuanian, nepali, norwegian, porter, portuguese, romanian, russian, serbian, spanish, swedish, tamil, or turkish.
  - Use none if you do not want to use any stemming.
- Default: english, which uses a Snowball stemmer.
stopwords: A list of words, termed “stopwords”, that are excluded when building and querying the full-text search index. Excluding very common words can make the search results more relevant.
- Type: STRING
- Default: A list of built-in English stopwords.
- To use a custom stopwords list, provide it via the stopwords parameter in one of the following formats:
  - A node table with only a single column of stopwords.
  - A Parquet/CSV file with only a single string column of stopwords (no header required). This file can be stored in cloud storage platforms like Amazon S3 or Google Cloud Storage (GCS) or made accessible via HTTPS. If hosted remotely, ensure the httpfs extension is enabled and valid credentials (e.g., access keys) are configured to authenticate and securely access the file.
- If the provided stopwords parameter matches both a node table and a file with the same name, the node table takes precedence and will be used.
- For best accuracy, we suggest providing custom stopwords in their stemmed form.
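
As a rough illustration of the CSV format described above (a single string column, no header), the following Python sketch writes a custom stopwords file. The helper name write_stopwords_csv and the word list are hypothetical and not part of Kuzu; adapt the words to your own corpus.

```python
import csv


def write_stopwords_csv(path, words):
    """Write one stopword per row: a single string column, no header row."""
    # Drop blank entries and duplicates while keeping the original order.
    cleaned = [w.strip() for w in words if w.strip()]
    unique_words = list(dict.fromkeys(cleaned))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for word in unique_words:
            writer.writerow([word])
    return unique_words


# Hypothetical custom list; the duplicate "the" is written only once.
custom_stopwords = ["a", "an", "the", "of", "and", "to", "the"]
print(write_stopwords_csv("stopwords.csv", custom_stopwords))
# ['a', 'an', 'the', 'of', 'and', 'to']
```

The resulting file path can then be passed as the stopwords argument, for example './stopwords.csv'.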

Example

The example below shows how to create an FTS index on the abstract and title properties of the Book node table, using the porter stemmer and a custom stopwords list.

CALL CREATE_FTS_INDEX(
    'Book',
    'book_index',
    ['abstract', 'title'],
    stemmer := 'porter',
    stopwords := './stopwords.csv'
);

Query an FTS index

The QUERY_FTS_INDEX function can be used to query an FTS index. Internally, it uses the Okapi BM25 scoring algorithm.

CALL QUERY_FTS_INDEX(
  <TABLE_NAME>,
  <INDEX_NAME>,
  <QUERY>,
  conjunctive := false,
  K := 1.2,
  B := 0.75,
  TOP := 3
)
RETURN node, score;

Required arguments:

TABLE_NAME: The name of the table to query.
- Type: STRING
INDEX_NAME: The name of the FTS index to query.
- Type: STRING
QUERY: The query string that contains the keywords to search.
- Type: STRING

Optional arguments:

conjunctive: Whether a string must contain all keywords in the query to be retrieved. When false, a string containing any of the keywords can be retrieved.
- Type: BOOLEAN
- Default: false
K: Controls the influence of term frequency saturation. This limits the effect of multiple occurrences of a term within a string.
- Type: DOUBLE
- Default: 1.2
B: Controls how strongly scores are normalized by string length. In BM25, a value of 0 disables length normalization and a value of 1 applies it fully.
- Type: DOUBLE
- Default: 0.75
TOP: The maximum number of highest-scoring documents to retrieve. The order of the returned documents is not guaranteed, so add ORDER BY score DESC if you need them ranked.
- Type: UINT64
- Default: Retrieves all documents in the table.

The K and B parameters correspond to the k1 and b parameters of the Okapi BM25 ranking function.
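
To build intuition for K and B, here is a minimal Python sketch of a textbook Okapi BM25 scorer. It is an assumption-laden illustration, not Kuzu's implementation: the tokenizer, the IDF variant, and the function name bm25_scores are all chosen for the example, so the numbers it produces will not necessarily match the scores Kuzu returns.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Toy tokenizer (assumption): lowercase alphabetic runs, no stemming.
    return re.findall(r"[a-z]+", text.lower())


def bm25_scores(docs, query, k=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # b scales the penalty for documents longer than average.
        norm = 1.0 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # k controls how quickly repeated occurrences stop adding score.
            score += idf * tf[term] * (k + 1.0) / (tf[term] + k * norm)
        scores.append(score)
    return scores


docs = [
    "quantum physics",
    "quantum physics and many other long topics",
    "gardening basics",
]
print(bm25_scores(docs, "quantum"))         # shorter match scores higher
print(bm25_scores(docs, "quantum", b=0.0))  # length no longer matters
```

With the default b, the shorter matching document outranks the longer one; with b set to 0, both matching documents receive the same score because length normalization is switched off.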

Example

The example below shows how to find books related to quantum machine and return the results ordered by the similarity score:

CALL
  QUERY_FTS_INDEX('Book', 'book_index', 'quantum machine')
RETURN
  node.title as title,
  node.abstract as abstract,
  score
ORDER BY score DESC;

┌───────────────────┬─────────────────────────────────────────────────┬──────────┐
│ title             │ abstract                                        │ score    │
│ STRING            │ STRING                                          │ DOUBLE   │
├───────────────────┼─────────────────────────────────────────────────┼──────────┤
│ The Quantum World │ An exploration of quantum mechanics.            │ 0.868546 │
│ Learning Machines │ An introduction to machine learning techniques. │ 0.827832 │
└───────────────────┴─────────────────────────────────────────────────┴──────────┘

The conjunctive option can be used when you want to retrieve only the books containing all the keywords in the query.

CALL
  QUERY_FTS_INDEX('Book', 'book_index', 'dragon magic', conjunctive := true)
RETURN
  node.title as title,
  node.abstract as abstract,
  score
ORDER BY score DESC;

┌───────────────────┬──────────────────────────────────────┬──────────┐
│ title             │ abstract                             │ score    │
│ STRING            │ STRING                               │ DOUBLE   │
├───────────────────┼──────────────────────────────────────┼──────────┤
│ The Dragon's Call │ A fantasy tale of dragons and magic. │ 1.208044 │
└───────────────────┴──────────────────────────────────────┴──────────┘
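
The difference between the default (disjunctive) behaviour and conjunctive := true can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the normalize function is a toy stand-in for real stemming (it just strips a trailing "s"), and match is a hypothetical helper, not a Kuzu API.

```python
import re


def normalize(text):
    """Toy normalization (assumption): lowercase, letters only, strip a trailing 's'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens}


def match(books, query, conjunctive=False):
    """Return titles whose title or abstract contains any (or all) query terms."""
    terms = normalize(query)
    combine = all if conjunctive else any
    return [
        title
        for title, abstract in books.items()
        if combine(term in normalize(title + " " + abstract) for term in terms)
    ]


books = {
    "Chronicles of the Universe": "A magic journey through time and space.",
    "The Dragon's Call": "A fantasy tale of dragons and magic.",
    "Learning Machines": "An introduction to machine learning techniques.",
}
print(match(books, "dragon magic"))
# ['Chronicles of the Universe', "The Dragon's Call"]
print(match(books, "dragon magic", conjunctive=True))
# ["The Dragon's Call"]
```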

The top option can be used when you are only interested in retrieving the top-k documents with the highest scores. For example, to find the document most likely to contain the keywords dragon magic, you can run the following query:

CALL
  QUERY_FTS_INDEX('Book', 'book_index', 'dragon magic', top := 1)
RETURN
  node.title as title,
  score;

┌───────────────────┬──────────┐
│ title             │ score    │
│ STRING            │ DOUBLE   │
├───────────────────┼──────────┤
│ The Dragon's Call │ 1.208044 │
└───────────────────┴──────────┘
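
Conceptually, limiting the results to the top-k documents amounts to selecting the k highest scores. The Python sketch below shows that idea on the client side using the scores from the quantum machine example earlier in this page; top_k is a hypothetical helper and says nothing about how Kuzu implements the option internally.

```python
import heapq


def top_k(scored, k):
    """Return the k highest-scoring (title, score) pairs, best first."""
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Scores copied from the 'quantum machine' example output.
results = [("Learning Machines", 0.827832), ("The Quantum World", 0.868546)]
print(top_k(results, 1))
# [('The Quantum World', 0.868546)]
```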

Drop an FTS index

Use the function DROP_FTS_INDEX to drop the FTS index on a table:

CALL DROP_FTS_INDEX(<TABLE_NAME>, <INDEX_NAME>);

TABLE_NAME: The name of the table to drop the FTS index from.
INDEX_NAME: The name of the FTS index to drop.

Example

The example below shows how to drop the book_index index from the Book table:

CALL DROP_FTS_INDEX('Book', 'book_index');

Show FTS indexes

There is no function to specifically show only FTS indexes. However, you can use SHOW_INDEXES to show all the available indexes in Kuzu, including FTS indexes.

CALL SHOW_INDEXES() RETURN *;

┌────────────┬────────────┬────────────┬──────────────────┬──────────────────┬──────────────────────────────────────────────────────────────────────────────────────────┐
│ table name │ index name │ index type │ property names   │ extension loaded │ index definition                                                                         │
│ STRING     │ STRING     │ STRING     │ STRING[]         │ BOOL             │ STRING                                                                                   │
├────────────┼────────────┼────────────┼──────────────────┼──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ Book       │ book_index │ FTS        │ [abstract,title] │ True             │ CALL CREATE_FTS_INDEX('Book', 'book_index', ['abstract', 'title'], stemmer := 'porter'); │
└────────────┴────────────┴────────────┴──────────────────┴──────────────────┴──────────────────────────────────────────────────────────────────────────────────────────┘