Import data
There are multiple ways to import data in Kùzu. The only prerequisite for inserting data into a database is that you first create a graph schema, i.e., the structure of your node and relationship tables.
For small graphs (a few thousand nodes), the CREATE
and MERGE
Cypher clauses
can be used to insert nodes and
relationships. These are similar to SQL’s INSERT
statements, but bear in mind that they are slower than the bulk import
options shown below. The CREATE
/MERGE
clauses are intended to do small additions or updates on a sporadic basis.
In general, the recommended approach is to use COPY FROM
(rather than creating or
merging nodes one by one), for larger graphs of millions of nodes and beyond.
This is the fastest way to bulk insert data into Kùzu.
The following sections show how to bulk import data using COPY FROM
on various data formats.
COPY FROM
a partial subset
In certain cases, you may only want to partially fill your Kùzu table using the data from your input
source. This is common in cases where you want to create extra columns during DDL that will hold
a default value, and only a subset of the Kùzu table’s columns are filled during the COPY
operation.
This is better illustrated with an example. Consider a case where you have a CSV file person.csv
that has 3 columns: id
, name
and age
.
Assume you have a data model where you want to attach and addtional property to a Person
node,
i.e., the address they live in, initially populated as nulls but to be filled in with appropriate
values at a later time.
The DDL for the Person
node table is as follows:
However, the given CSV file to COPY FROM
has only 3 columns — the address information will be
obtained from another source.
In such a case, you can explicitly tell the COPY
pipeline to partially fill the node table and
only map the named columns from the CSV file as follows:
This will only copy the id
, name
and age
columns from the CSV file, while filling the fourth
column, address
with null values. These values can be populated at a later time via MERGE
.
Copy large relationship tables
Importing large relationship tables from external files (100M+ scale) can use significant amounts of memory.
If the memory used to prepare the
imported relationship table data comes close to exceeding the buffer pool size,
Kùzu will spill some of this data to a .tmp
file in the database directory.
You can disable spilling to disk by CALL spill_to_disk=false
.
This feature is disabled for in-memory or read-only databases.
Ignore erroneous rows
By default, your COPY FROM
and LOAD FROM
clauses will fail if there are any erroneous rows/tuples
in your input file.
You can instead skip these rows by setting the IGNORE_ERRORS
parameter to true
.
By default this parameter is set to false
.
There can be several reasons for errors when scanning a row/tuple from
a file, such as the row/tuple being malformed (e.g., in a CSV file) or that the row leads
to duplicate primary keys if the clause was COPY FROM. We support skipping over different types
of errors for different files, which are documented below.
Here is an example of using IGNORE_ERRORS
when reading from a CSV file.
Consider the following User
node table:
Let the CSV file user.csv
contain the following rows:
Note that 2147483650
does not fit into an INT32 and will cause an error.
The following statement will load only the first row of user.csv
into User
table, skipping the malformed second row.
To check that only one User
node has been inserted, we can run the following query:
Output:
Warnings table: Inspecting skipped rows
Information about skipped rows are kept in a system-level Warnings
table.
Warnings table is a connection-level table that accumulates warnings that are
generated due to erroneous and skipped rows during all COPY FROM
and LOAD FROM
executions during the
lifetime of a database connection. You can inspect the contents of this table, e.g., to see the warnings
about the malformed rows, or clear this table.
For example, to see the warning messages generated about the malformed line when importing user.csv
using COPY FROM
above,
you can run the SHOW_WARNINGS
function:
Output:
At any point in time you can also call the CLEAR_WARNINGS
function to clear the Warnings table.
After clearing the warnings, the Warnings table will be empty.
Output:
By default, Kùzu stores a limited number of warnings per connection, determined by the warning_limit
connection configuration parameter.
You can change this configuration as follows (see the Connection Configuration section for more details):
Skippable Errors By Source
Currently IGNORE_ERRORS
option works when scanning files (and not in-memory data frames or when running
COPY/LOAD FROM
on sub-queries). For different files, the errors that can be skipped can be different.
If the error is not skippable in a file format, COPY/LOAD FROM
will instead error and fail.
Below is a table that shows the errors that are skippable by each source.
Parsing Errors | Casting Errors | Duplicate/Null/Missing Primary-Key errors | |
---|---|---|---|
CSV | X | X | X |
JSON/Numpy/Parquet | X |