Import Parquet
Apache Parquet is an open-source, column-oriented persistent storage format
designed for efficient data storage and retrieval. Kùzu supports bulk data import from Parquet files
using the `COPY FROM` command. You can use `COPY FROM` to import data into an empty table or to append data to an existing table.
Import to node table
Similar to CSV import, the order of columns in a Parquet file needs to match the order of the predefined properties of the node table in the catalog, i.e., the order used when defining the schema of the node table.
The following example is for a file named `user.parquet`. The output is obtained by using `print(pyarrow.Table)`.
To load this Parquet file into a `User` table, simply run:
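A minimal sketch of that load step issued from Python. It assumes the Kùzu Python API's `Connection.execute` method and a `User` node table that has already been created; `load_users` is a hypothetical helper name, not part of Kùzu.

```python
def load_users(conn, path="user.parquet"):
    """Bulk-load a Parquet file into the existing User node table.

    Assumption: `conn` is a kuzu.Connection, or any object that exposes
    execute(query). `load_users` is a hypothetical helper, not a Kùzu API.
    """
    # The file's column order must follow the property order used when
    # the User table schema was defined.
    statement = f'COPY User FROM "{path}"'
    conn.execute(statement)
    return statement


# Usage, assuming the `kuzu` package is installed and User already exists:
#   import kuzu
#   conn = kuzu.Connection(kuzu.Database("demo_db"))
#   load_users(conn)
```

The helper returns the statement it ran so that it can be logged or inspected.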
Import to relationship table
Similar to CSV import, the first two columns of a relationship file should be the `from` and `to` columns, which hold the primary keys of existing nodes.
The following example is for a file named `follows.parquet`. The output is obtained by using `print(pyarrow.Table)`.
To load this Parquet file into a `Follows` table, simply run:
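Because the `from` and `to` columns must refer to nodes that already exist, the node file has to be loaded before the relationship file. The sketch below shows that ordering from Python. It assumes the Kùzu Python API's `Connection.execute` and that the `User` and `Follows` tables are already defined; `load_graph` is a hypothetical helper name.

```python
def load_graph(conn, node_file="user.parquet", rel_file="follows.parquet"):
    """Load User nodes first, then the Follows relationships between them.

    Assumption: `conn` is a kuzu.Connection, or any object that exposes
    execute(query). `load_graph` is a hypothetical helper, not a Kùzu API.
    """
    statements = [
        # Nodes first: relationship endpoints are looked up by primary key.
        f'COPY User FROM "{node_file}"',
        # The first two columns of this file are the from/to primary keys.
        f'COPY Follows FROM "{rel_file}"',
    ]
    for statement in statements:
        conn.execute(statement)
    return statements
```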
Import multiple files to a single table
It is common practice to divide large Parquet files into several smaller files for cleaner data management. Kùzu can read multiple files with the same structure, consolidating their data into a single node or relationship table. You can specify that multiple files are loaded in the following ways:
Glob pattern
This is similar to the Unix glob pattern, where you specify file paths that match a given pattern. The following wildcard characters are supported:
| Wildcard | Description |
|---|---|
| `*` | match any number of any characters (including none) |
| `?` | match any single character |
| `[abc]` | match any one of the characters enclosed within the brackets |
| `[a-z]` | match any one of the characters within the range |
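To get a feel for which files each wildcard selects, the sketch below uses Python's standard `fnmatch` module as a stand-in matcher. It supports the same four wildcards, but Kùzu's own matching may differ in edge cases, and the file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, for illustration only.
FILES = [
    "User0.parquet",
    "User1.parquet",
    "User12.parquet",
    "Userb.parquet",
    "Follows0.parquet",
]


def matching(pattern, names=FILES):
    """Return the names that match a glob pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


print(matching("User*.parquet"))      # every User file, including User12
print(matching("User?.parquet"))      # exactly one character after "User"
print(matching("User[01].parquet"))   # User0.parquet and User1.parquet
print(matching("User[a-z].parquet"))  # Userb.parquet only

# With Kùzu, the pattern goes directly into the COPY statement, e.g.
# (assuming a kuzu.Connection named `conn`):
#   conn.execute('COPY User FROM "User*.parquet"')
```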
List of files
Alternatively, you can just specify a list of files to be loaded.
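As a sketch, the list form can be assembled from Python like this. It assumes Kùzu's bracketed list syntax for `COPY FROM`; `copy_from_files` and the file names are hypothetical.

```python
def copy_from_files(table, paths):
    """Build a COPY statement that loads several files into one table.

    Assumption: Kùzu accepts a bracketed list of quoted paths after FROM.
    Paths are inserted verbatim, so they must not contain double quotes.
    """
    if not paths:
        raise ValueError("at least one file path is required")
    file_list = ", ".join(f'"{path}"' for path in paths)
    return f"COPY {table} FROM [{file_list}]"


statement = copy_from_files("User", ["User0.parquet", "User1.parquet", "User2.parquet"])
print(statement)
# Then, assuming a kuzu.Connection named `conn`:
#   conn.execute(statement)
```

All files in the list need the same column structure, because they are consolidated into a single table.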
Ignore erroneous rows
See the Ignore erroneous rows section for more details.