Import data

There are multiple ways to import data in Kùzu. The only prerequisite for inserting data into a database is that you first create a graph schema, i.e., the structure of your node and relationship tables.
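For example, a minimal schema with one node table and one relationship table could be created as follows (a sketch; the City and ConnectedTo tables here are illustrative and are not used by the later examples):

CREATE NODE TABLE City(name STRING, population INT64, PRIMARY KEY(name));
CREATE REL TABLE ConnectedTo(FROM City TO City);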

For small graphs (a few thousand nodes), the CREATE and MERGE Cypher clauses can be used to insert nodes and relationships. These are similar to SQL's INSERT statements, but bear in mind that they are slower than the bulk import options shown below; CREATE and MERGE are intended for small additions or updates made on a sporadic basis.
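For example, individual records could be inserted one at a time as follows (a sketch using the illustrative City and ConnectedTo tables above; all values are made up):

CREATE (:City {name: 'Toronto', population: 2930000});
MERGE (:City {name: 'Waterloo'});
MATCH (a:City {name: 'Toronto'}), (b:City {name: 'Waterloo'})
CREATE (a)-[:ConnectedTo]->(b);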

For larger graphs (millions of nodes and beyond), the recommended approach is to use COPY FROM rather than creating or merging records one by one. This is the fastest way to bulk insert data into Kùzu.
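For example, assuming a city.csv file whose columns match the illustrative City node table sketched above, a full-table bulk import is a single statement:

COPY City FROM 'city.csv';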

The following sections show how to bulk import data using COPY FROM on various data formats.

COPY FROM a partial subset

In certain cases, you may want to only partially fill a Kùzu table with the data from your input source. This is common when your DDL defines extra columns that will initially hold a default value, so only a subset of the table's columns are filled during the COPY operation.

This is better illustrated with an example. Consider a CSV file person.csv that has 3 columns: id, name and age. Assume your data model attaches an additional property to each Person node, i.e., the address they live at, which is initially populated with nulls and will be filled in with appropriate values at a later time.

The DDL for the Person node table is as follows:

CREATE NODE TABLE Person(id INT64, name STRING, age INT64, address STRING, PRIMARY KEY(id));

However, the given CSV file to COPY FROM has only 3 columns — the address information will be obtained from another source.

In such a case, you can explicitly tell the COPY pipeline to partially fill the node table and only map the named columns from the CSV file as follows:

COPY Person(id, name, age) FROM 'person.csv'

This will copy only the id, name and age columns from the CSV file, while filling the fourth column, address, with null values. These values can be populated at a later time via MERGE, as shown below.
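For example, once the address information becomes available from the other source, it could be filled in as follows (a sketch; the id and address values here are illustrative):

MERGE (p:Person {id: 1})
ON MATCH SET p.address = '22 Main St';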

Copy large relationship tables

Importing large relationship tables from external files (100M+ scale) can use significant amounts of memory. If the memory used to prepare the imported relationship table data comes close to exceeding the buffer pool size, Kùzu will spill some of this data to a .tmp file in the database directory. You can disable spilling to disk by running the following statement:

CALL spill_to_disk=false;

This feature is disabled for in-memory or read-only databases.
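Note that bulk-importing a relationship table uses the same COPY FROM syntax as for node tables (a sketch, assuming the illustrative ConnectedTo table from earlier and a connected_to.csv file whose first two columns contain the primary keys of the FROM and TO nodes):

COPY ConnectedTo FROM 'connected_to.csv';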

Ignore erroneous rows

By default, your COPY FROM and LOAD FROM clauses will fail if there are any erroneous rows/tuples in your input file. You can instead skip these rows by setting the IGNORE_ERRORS parameter to true (it is false by default). There are several possible reasons for errors when scanning a row/tuple from a file: the row/tuple may be malformed (e.g., in a CSV file), or, in the case of COPY FROM, the row may lead to duplicate primary keys. Different types of errors can be skipped for different file formats, as documented below.

Here is an example of using IGNORE_ERRORS when reading from a CSV file. Consider the following User node table:

CREATE NODE TABLE User (name STRING, age INT32, PRIMARY KEY (name));

Let the CSV file user.csv contain the following rows:

Alice,4
Bob,2147483650

Note that 2147483650 does not fit into an INT32 and will cause an error. The following statement will load only the first row of user.csv into the User table, skipping the malformed second row.

COPY User FROM "user.csv" (header=false, ignore_errors=true);

To check that only one User node has been inserted, we can run the following query:

MATCH (a:User) RETURN count(*)

Output:

┌──────────┐
│ count(*) │
│ UINT64   │
├──────────┤
│ 1        │
└──────────┘

Warnings table: Inspecting skipped rows

Information about skipped rows is kept in a system-level Warnings table. The Warnings table is a connection-level table that accumulates the warnings generated by erroneous and skipped rows across all COPY FROM and LOAD FROM executions during the lifetime of a database connection. You can inspect the contents of this table, e.g., to see the warnings about malformed rows, or clear it.

For example, to see the warning messages generated about the malformed line when importing user.csv using COPY FROM above, you can run the SHOW_WARNINGS function:

CALL show_warnings() RETURN *;

Output:

┌──────────┬─────────────────────────────────────────────────────────────────────────────┬───────────┬─────────────┬────────────────────────┐
│ query_id │ message                                                                     │ file_path │ line_number │ skipped_line_or_record │
│ UINT64   │ STRING                                                                      │ STRING    │ UINT64      │ STRING                 │
├──────────┼─────────────────────────────────────────────────────────────────────────────┼───────────┼─────────────┼────────────────────────┤
│ 1        │ Conversion exception: Cast failed. Could not convert "2147483650" to INT32. │ user.csv  │ 2           │ Bob,2147483650         │
└──────────┴─────────────────────────────────────────────────────────────────────────────┴───────────┴─────────────┴────────────────────────┘

At any point in time you can also call the CLEAR_WARNINGS function to clear the Warnings table.

CALL clear_warnings();

After clearing the warnings, the Warnings table will be empty.

CALL show_warnings() RETURN *;

Output:

┌──────────┬─────────┬───────────┬─────────────┬────────────────────────┐
│ query_id │ message │ file_path │ line_number │ skipped_line_or_record │
│ UINT64   │ STRING  │ STRING    │ UINT64      │ STRING                 │
├──────────┼─────────┼───────────┼─────────────┼────────────────────────┤
└──────────┴─────────┴───────────┴─────────────┴────────────────────────┘

By default, Kùzu stores a limited number of warnings per connection, determined by the warning_limit connection configuration parameter. You can change this configuration as follows (see the Connection Configuration section for more details):

CALL warning_limit=1024;

Skippable errors by source

Currently, the IGNORE_ERRORS option works only when scanning files (not when scanning in-memory data frames or when running COPY/LOAD FROM on subqueries). The errors that can be skipped differ across file formats; if an error is not skippable for a given format, COPY/LOAD FROM will error and fail instead. The table below shows the errors that are skippable for each source, followed by an example of using IGNORE_ERRORS with LOAD FROM.

Source               Parsing Errors   Casting Errors   Duplicate/Null/Missing Primary-Key Errors
CSV                  X                X                X
JSON/Numpy/Parquet                                     X
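
As an example of the LOAD FROM case, the same parameter can be passed when scanning a file directly (a sketch, reusing user.csv from above and enforcing column types so that the same casting error is triggered and skipped):

LOAD WITH HEADERS (name STRING, age INT32) FROM 'user.csv' (ignore_errors=true)
RETURN *;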