# Load (Scan)
The `LOAD FROM` clause performs a direct scan over an input file without copying it into the database.
This clause is useful for inspecting a subset of a larger file, either to display it or to load it into a node table, or for performing simple transformation tasks like rearranging column order.
`LOAD FROM` can be followed by arbitrary clauses such as `MATCH`, `CREATE`, `WHERE`, `RETURN`, and so on.
Suppose the input source has tuples with k columns. `LOAD FROM` will bind each tuple t = (col0, col1, …, col(k-1)) of the scanned input source to k variables with certain names and data types. The names and data types of these variables can either be specified in the `LOAD FROM` statement using the `WITH HEADERS` clause, or they will be automatically inferred from the source.
## Example usage

Some example usage of the `LOAD FROM` clause is shown below.
### Filtering/aggregating
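For instance, a query that filters scanned rows and aggregates over them might look like this (a sketch; the file name `user.csv` and its `age` column are assumptions):

```cypher
LOAD FROM 'user.csv' (header = true)
WHERE age >= 18
RETURN COUNT(*);
```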
To skip the first 2 lines of the CSV file, you can use the `SKIP` parameter as follows:
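A sketch, assuming `SKIP` is passed as a scan parameter the same way other CSV options are:

```cypher
LOAD FROM 'user.csv' (skip = 2)
RETURN *;
```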
### Create nodes from input file
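A sketch of loading scanned tuples into a node table (the `User` table schema and the file name are assumptions):

```cypher
CREATE NODE TABLE User(name STRING, age INT64, PRIMARY KEY (name));
LOAD FROM 'user.csv' (header = true)
CREATE (:User {name: name, age: age});
```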
### Reorder and subset columns
You can also use the scan functionality to reorder and subset columns from a given dataset. For example, the following query will return just the `age` and `name` columns, in that order, even if the input file has more columns specified in a different order.
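A sketch of such a query, assuming a `user.csv` file whose header declares `name` before `age`:

```cypher
LOAD FROM 'user.csv' (header = true)
RETURN age, name;
```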
## Bound variable names and data types
By default, Kùzu will infer the column names and data types from the scan source automatically.
- For Parquet, Pandas, Polars and PyArrow, column names and data types are always available in the data source.
- For CSV: the behavior is determined by the CSV scanning configuration, which is specified at the end of `LOAD FROM`, inside `( )`, similar to `COPY FROM` statements. We review the details of this behavior below.
- For JSON, we use keys as column names and infer a common data type from each key's values. To use `LOAD FROM` with JSON, you need to have the JSON extension installed. More details on using `LOAD FROM` with JSON files are provided on the documentation page for the JSON extension.
You can enforce specific column names and data types when reading by using the `LOAD WITH HEADERS (<name> <dataType>, ...) FROM ...` syntax.
The following query will bind the first column `name` to the `STRING` type and the second column `age` to the `INT64` type.
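A sketch of such a query (the file name is an assumption):

```cypher
LOAD WITH HEADERS (name STRING, age INT64) FROM 'user.csv'
RETURN *;
```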
You can combine this with a `WHERE` clause to filter the data as needed.
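For instance (the file name and the filter threshold are illustrative):

```cypher
LOAD WITH HEADERS (name STRING, age INT64) FROM 'user.csv'
WHERE age > 25
RETURN name;
```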
## Ignore erroneous rows
Errors can happen when scanning the lines or elements of an input file with `LOAD FROM`.
Errors can occur for several reasons, for example because a line in the scanned file is malformed (e.g., in CSV files)
or because a field in a scanned line cannot be cast into its expected data type (e.g., due to an integer overflow).
You can skip erroneous lines when scanning large files by setting the `IGNORE_ERRORS` configuration to `true`. This configuration is also supported when using `COPY FROM`, and the details of this feature are documented in the ignoring erroneous rows section of `COPY FROM`.
Here is an example. Suppose the CSV file `user.csv` contains the following rows:
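Illustrative contents (only the `Bob` row is taken from the discussion below; the first row is an assumed placeholder):

```
Alice,25
Bob,2147483650
```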
Suppose we write a `LOAD FROM` statement that tries to read the second column as an `INT32`.
The second row `(Bob,2147483650)` would be malformed, because 2147483650 does not fit into an `INT32`, and would cause an error.
By setting `IGNORE_ERRORS` to `true`, instead of erroring, we can make `LOAD FROM` simply skip over this line:
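A sketch, assuming the `user.csv` file above and `IGNORE_ERRORS` spelled as a scan parameter:

```cypher
LOAD WITH HEADERS (name STRING, age INT32) FROM 'user.csv' (ignore_errors = true)
RETURN *;
```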
You can also see the details of any warnings generated by the skipped lines using the `SHOW_WARNINGS` function. See the ignoring erroneous rows section of `COPY FROM` for more details.
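For example (assuming `show_warnings` is exposed as a callable table function, as in the `COPY FROM` documentation):

```cypher
CALL show_warnings() RETURN *;
```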
## Scan Data Formats
`LOAD FROM` can scan several raw or in-memory data formats, such as CSV, Parquet, Pandas, Polars, Arrow tables, and JSON.
Below we give examples of using `LOAD FROM` to scan data from each of these formats. We assume `WITH HEADERS` is not used in the examples below, so we discuss how Kùzu infers the names and data types of the variables that bind to the scanned tuples.
### CSV
The syntax for using `LOAD FROM` to scan a CSV file is similar to that of `COPY FROM` with CSV files.
#### CSV header
If (i) the CSV file has a header line, i.e., a first line that should not be interpreted as a tuple to be scanned; and (ii) `(header = true)` is set, then the column names in the first line provide the names of the columns. The data types are always automatically inferred from the CSV file (except, of course, if `LOAD WITH HEADERS (...) FROM` is used, in which case the data types provided inside the `(...)` are used as described above).
Suppose `user.csv` is a CSV file with the following contents:
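An assumed sample (names and values are illustrative):

```
name,age
Alice,25
Bob,30
```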
Then if you run the following query, Kùzu will infer the column names `name` and `age` from the first line of the CSV:
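A sketch of such a query:

```cypher
LOAD FROM 'user.csv' (header = true)
RETURN *;
```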
If `(header = false)` is set, then the names of the columns will be `column0, column1, …, column(k-1)`, where k is the number of columns in the CSV file. Suppose `user.csv` instead has the following contents:
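Illustrative headerless contents and a matching scan (both are sketches):

```
Alice,25
Bob,30
```

```cypher
LOAD FROM 'user.csv' (header = false)
RETURN *;
```

Here the two bound variables would be named `column0` and `column1`.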
### Parquet
Since Parquet files contain schema information in their metadata, Kùzu will always use the available schema information when loading from Parquet files (except, again, if `LOAD WITH HEADERS (...) FROM` is used). Suppose we have a Parquet file `user.parquet` with two columns `f0` and `f1` and the same content as in the `user.csv` file above. Then the query below will scan the Parquet file and output its contents:
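A sketch:

```cypher
LOAD FROM 'user.parquet'
RETURN *;
```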
### Pandas
Kùzu allows zero-copy access to Pandas DataFrames. The variable names and data types of scanned columns within a Pandas DataFrame will be inferred from the schema information of the data frame. Here is an example:
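A minimal sketch using the Kùzu Python API (the in-memory database path `":memory:"` and the sample data are assumptions):

```python
import kuzu
import pandas as pd

# Assumption: an in-memory Kùzu database.
db = kuzu.Database(":memory:")
conn = kuzu.Connection(db)

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

# LOAD FROM can refer to the DataFrame by its Python variable name.
result = conn.execute("LOAD FROM df RETURN *;")
print(result.get_as_df())
```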
Here `name` and `age` have string and integer types in the defined Pandas DataFrame, so the output table contains two columns with the same names and data types.
### Polars
Kùzu can also scan Polars DataFrames via the underlying PyArrow layer. The rules for determining the variable names and data types are identical to scanning Pandas data frames. Here is an example:
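A sketch analogous to the Pandas case, again assuming an in-memory database and illustrative sample data:

```python
import kuzu
import polars as pl

# Assumption: an in-memory Kùzu database.
db = kuzu.Database(":memory:")
conn = kuzu.Connection(db)

df = pl.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

# The Polars DataFrame is referenced by its Python variable name.
result = conn.execute("LOAD FROM df RETURN *;")
print(result.get_as_df())
```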
### Arrow tables
You can scan an existing PyArrow table as follows:
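A sketch, assuming an in-memory database and an illustrative PyArrow table:

```python
import kuzu
import pyarrow as pa

# Assumption: an in-memory Kùzu database.
db = kuzu.Database(":memory:")
conn = kuzu.Connection(db)

tab = pa.table({"name": ["Alice", "Bob"], "age": [25, 30]})

# The PyArrow table is referenced by its Python variable name.
result = conn.execute("LOAD FROM tab RETURN *;")
print(result.get_as_df())
```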
### JSON
Kùzu can scan JSON files using `LOAD FROM`.
All JSON-related features are part of the JSON extension. See the documentation on the JSON extension for details.
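A sketch, assuming the extension installation syntax from Kùzu's extension documentation and an illustrative file name:

```cypher
INSTALL json;
LOAD EXTENSION json;
LOAD FROM 'user.json' RETURN *;
```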