Load (Scan)
The LOAD FROM clause performs a direct scan over an input file without copying it into the database.
This clause is useful for inspecting a subset of a larger file, either to display it or to load it into a node table, and for
simple transformation tasks such as rearranging column order.
LOAD FROM is designed to be used in the exact same way as MATCH, meaning that it can be followed
by arbitrary clauses like CREATE, WHERE, RETURN, and so on.
Example usage
Some example usages of the LOAD FROM clause are shown below.
Filtering/aggregating
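As a minimal sketch of filtering and aggregating during a scan, the query below counts the rows that pass a WHERE condition. The file name user.csv and its name and age columns are assumptions made for illustration. The explicit CAST is there because CSV columns may be read as STRING when no type information is available.

```cypher
// Hypothetical input: user.csv with a header row containing name and age.
LOAD FROM "user.csv" (header = true)
WHERE CAST(age AS INT64) > 25
RETURN COUNT(*);
```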
To skip the first 2 lines of the CSV file, you can use the SKIP parameter as follows:
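A sketch of the SKIP parameter, again assuming a hypothetical user.csv file. The option is written here in lowercase inside the parenthesized option list that follows the file name.

```cypher
// Skip the first 2 lines of the (hypothetical) file before scanning.
LOAD FROM "user.csv" (skip = 2)
RETURN *;
```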
Create nodes from input file
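Because LOAD FROM can be followed by CREATE, each scanned row can be turned into a node. The sketch below assumes a hypothetical User node table and a user.csv file with name and age columns. Neither comes from this page, so adjust them to your own schema.

```cypher
// Hypothetical node table used only for this illustration.
CREATE NODE TABLE User(name STRING, age INT64, PRIMARY KEY (name));

// Create one User node per scanned row; age is cast in case it was read as STRING.
LOAD FROM "user.csv" (header = true)
CREATE (:User {name: name, age: CAST(age AS INT64)});

// Inspect the result.
MATCH (u:User) RETURN u.name, u.age;
```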
Reorder and subset columns
You can also use the scan functionality to reorder and subset columns from a given dataset. For
example, the following query will return just the age and name columns, in that order, even if the
input file has more columns specified in a different order.
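A minimal sketch of such a query, assuming a hypothetical user.csv whose header includes name and age alongside other columns:

```cypher
// Only age and name are returned, in that order, regardless of the file's column layout.
LOAD FROM "user.csv" (header = true)
RETURN age, name
LIMIT 3;
```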
Enforce Schema
By default, Kùzu will infer the column names and data types from the scan source automatically.
- For Parquet, Pandas, Polars and PyArrow, column names and data types are always available in the data source.
- For CSV, we use header names as properties if available; otherwise, we fall back to naming the columns column0, column1, .... We also assume that all data types are STRING if no data type information is available in the header.
- For JSON, we use keys as column names, and infer a common data type from each key's values. To use LOAD FROM with JSON, you need to have the JSON extension installed. More details on using LOAD FROM with JSON files are provided on the documentation page for the JSON extension.
To enforce specific column names and data types when reading, you can use the LOAD WITH HEADERS (<name> <dataType>, ...) FROM ... syntax.
The following query binds the first column to the name name with the STRING type, and the second column to the name age with the INT64 type.
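A sketch of the LOAD WITH HEADERS form described above, assuming a hypothetical two-column user.csv:

```cypher
// Bind the first column as name (STRING) and the second as age (INT64).
LOAD WITH HEADERS (name STRING, age INT64) FROM "user.csv" (header = true)
RETURN name, age;
```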
You can combine this with a WHERE clause to filter the data as needed.
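For instance, with the schema enforced, age can be compared as an integer without a cast. The file name and the threshold below are illustrative assumptions.

```cypher
// age is already INT64 here, so no explicit CAST is needed in the filter.
LOAD WITH HEADERS (name STRING, age INT64) FROM "user.csv" (header = true)
WHERE age > 25
RETURN name, age;
```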
Scan Data Formats
CSV
When loading from a CSV file, you can use a similar syntax to the COPY FROM statement.
If no header row is available, you can simply pass the CSV file name to the statement, and Kùzu will parse
each column as STRING type with the names column0, column1, ....
Example:
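A minimal sketch of the headerless case, assuming a hypothetical user.csv with no header row:

```cypher
// Without a header, columns are exposed as column0, column1, ...
LOAD FROM "user.csv"
RETURN *;
```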
If header names are available in the file, you can ask Kùzu to parse the header and use data types and names as specified in the header.
Example:
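A minimal sketch of the header case, assuming the first line of a hypothetical user.csv holds the column names:

```cypher
// header = true tells the scan to treat the first line as the header row.
LOAD FROM "user.csv" (header = true)
RETURN *;
```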
Parquet
Since Parquet files contain schema information in their metadata, Kùzu will always use the available schema information when loading from Parquet files.
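As a sketch, scanning a Parquet file needs no schema-related options. The file name user.parquet is an assumption for illustration.

```cypher
// Column names and types come from the Parquet metadata.
LOAD FROM "user.parquet"
RETURN *;
```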
Pandas
Kùzu allows zero-copy access to Pandas DataFrames. The data types within a Pandas DataFrame will be
used to infer the schema of the data. The Pandas DataFrame can be scanned using the LOAD FROM
clause just like an external file.
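A sketch of the query side only: it assumes a Python session in which a DataFrame is bound to a variable named df (a hypothetical name), and that the query is passed as a string to the connection's execute method.

```cypher
// df refers to a Pandas DataFrame variable in the calling Python session (assumed name).
LOAD FROM df
RETURN *;
```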
Polars
Kùzu can also scan Polars DataFrames via the underlying PyArrow layer.
Arrow tables
You can scan an existing PyArrow table as follows:
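A sketch of the query side only: it assumes the PyArrow table is bound to a Python variable named pa_table (a hypothetical name) in the session that executes the query.

```cypher
// pa_table refers to a PyArrow table variable in the calling Python session (assumed name).
LOAD FROM pa_table
RETURN *;
```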
JSON
Kùzu can scan JSON files using LOAD FROM. All JSON-related features are part of the JSON extension. See the documentation on the JSON extension for details.