Import data
There are multiple ways to import data in Kùzu. The only prerequisite for inserting data into a database is that you first create a graph schema, i.e., the structure of your node and relationship tables.
For small graphs (a few thousand nodes), the CREATE
and MERGE
Cypher clauses
can be used to insert nodes and
relationships. These are similar to SQL’s INSERT
statements, but bear in mind that they are slower than the bulk import
options shown below. The CREATE
/MERGE
clauses are intended to do small additions or updates on a sporadic basis.
In general, the recommended approach is to use COPY FROM
(rather than creating or
merging nodes one by one), for larger graphs of millions of nodes and beyond.
This is the fastest way to bulk insert data into Kùzu.
COPY FROM
CSV
The COPY FROM
command is used to bulk import data from a CSV file into a node or relationship table.
See the linked card below for more information and examples.
COPY FROM
Parquet
Similar to CSV, the COPY FROM
command is used to bulk import data from a Parquet file into a node or relationship table.
See the linked card below for more information and examples.
COPY FROM
NumPy
Importing from NumPy is a specific use case that allows you to import numeric data from a NumPy file into a node table.
COPY FROM
subquery results
You can copy data from a subquery’s results into a node or relationship table.
This is useful when you need to modify existing data and re-insert it into the database, or if youw
want to copy data from a LOAD FROM
scan operation on another data format.
COPY FROM
DataFrames
You can copy data from a Pandas or Polars DataFrame into a node or relationship table. This is useful when you are already doing your data transformations in either Pandas or Polars DataFrames and want to directly insert results from their columns into a Kùzu table.
COPY FROM
JSON
Using Kùzu’s JSON extension, you can copy data directly from JSON files that store nested data. This is useful when you have semi-structured data that you want to analyze as a graph.
COPY FROM
a partial subset
In certain cases, you may only want to partially fill your Kùzu table using the data from your input
source. This is common in cases where you want to create extra columns during DDL that will hold
a default value, and only a subset of the Kùzu table’s columns are filled during the COPY
operation.
This is better illustrated with an example. Consider a case where you have a CSV file person.csv
that has 3 columns: id
, name
and age
.
Assume you have a data model where you want to attach and addtional property to a Person
node,
i.e., the address they live in, initially populated as nulls but to be filled in with appropriate
values at a later time.
The DDL for the Person
node table is as follows:
However, the given CSV file to COPY FROM
has only 3 columns — the address information will be
obtained from another source.
In such a case, you can explicitly tell the COPY
pipeline to partially fill the node table and
only map the named columns from the CSV file as follows:
This will only copy the id
, name
and age
columns from the CSV file, while filling the fourth
column, address
with null values. These values can be populated at a later time via MERGE
.