Skip to content
Blog

Load (Scan)

The LOAD FROM clause performs a direct scan over an input file without copying it into the database. This clause is very useful to inspect a subset of a larger file to display or load into a node table, or to perform simple transformation tasks like rearranging column order.

LOAD FROM can be followed by arbitrary clauses like MATCH, CREATE, WHERE, RETURN, and so on. Suppose the input source has tuples with k columns. LOAD FROM will bind each tuple t=(col0, col1, …, col(k-1)) of the scanned input source to k variables with some names and data types. Names and data types of the variables can either be specified in the LOAD FROM statement using the WITH HEADERS clause. Or they will be automatically inferred from the source.

Example usage

Some example usage for the LOAD FROM clause is shown below.

Filtering/aggregating

LOAD FROM "user.csv" (header = true)
WHERE age > 25
RETURN COUNT(*);

This returns:

┌──────────────┐
│ COUNT_STAR() │
│ INT64 │
├──────────────┤
│ 3 │
└──────────────┘

Skipping lines

To skip the first 2 lines of the CSV file, you can use the SKIP parameter as follows:

LOAD FROM "user.csv" (header = true, skip = 2)
RETURN *;

Create nodes from input file

You can pass the contents of LOAD FROM to a

// Create a node table
// Scan file and use its contents to create nodes
LOAD FROM "user.csv" (header = true)
CREATE (:User {name: name, age: CAST(age AS INT64)});
// Return the nodes we just created
MATCH (u:User) RETURN u.name, u.age;
┌─────────┬───────┐
│ u.name │ u.age │
│ STRING │ INT64 │
├─────────┼───────┤
│ Adam │ 30 │
│ Karissa │ 40 │
│ Zhang │ 50 │
│ Noura │ 25 │
└─────────┴───────┘

Reorder and subset columns

You can also use the scan functionality to reorder and subset columns from a given dataset. For example, the following query will return just the age and name in that order, even if the input file has more columns specified in a different order.

// Return age column before the name column
LOAD FROM "user.csv" (header = true)
RETURN age, name LIMIT 3;
┌───────┬─────────┐
│ age │ name │
│ INT64 │ STRING │
├───────┼─────────┤
│ 30 │ Adam │
│ 40 │ Karissa │
│ 50 │ Zhang │
└───────┴─────────┘

Bound variable names and data types

By default, Kùzu will infer the column names and data types from the scan source automatically.

  • For Parquet, Pandas, Polars and PyArrow, column names and data types are always available in the data source
  • For CSV: The behavior is determined by the CSV scanning configuration, which are specified at the end of LOAD FROM, inside (), similar to COPY FROM statements. We review the details of this behavior below.
  • For JSON, we use keys as column names, and infer a common data type from each key’s values. To use LOAD FROM with JSON, you need to have the JSON extension installed. More details on using LOAD FROM with JSON files is provided on the documentation page for the JSON extension.

You can enforce specific column names and data types when reading, by using the LOAD WITH HEADERS (<name> <dataType>, ...) FROM ... syntax.

The following query will first bind the column name to the STRING type and second column age to the INT64 type. You can combine this with a WHERE clause to filter the data as needed.

LOAD WITH HEADERS (name STRING, age INT64) FROM "user.csv" (header = true)
WHERE name =~ 'Adam*'
RETURN name, age;
┌────────┬───────┐
│ name │ age │
│ STRING │ INT64 │
├────────┼───────┤
│ Adam │ 30 │
└────────┴───────┘

Ignore erroneous rows

Errors can happen when scanning different lines or elements of an input file with LOAD FROM. Error can happen for several reasons, such as a line in the scanned file is malformed (e.g., in CSV files) or a field in the scanned line cannot be cast into its expected data type (e.g., due to an integer overflow). You can skip erroneous lines when scanning large files by setting IGNORE_ERRORS configuration to true. This configuration is also supported when using COPY FROM and the details of this feature is documented in the ignoring erroneous rows section of COPY FROM.

Here is an example. Suppose the CSV file user.csv contain the following rows:

Alice,4
Bob,2147483650

Suppose we write a LOAD FROM statement that tries to read the second column as an INT32. The second row (Bob,2147483650) would be malformed because 2147483650 does not fit into an INT32 and will cause an error. By setting IGNORE_ERRORS to true, instead of erroring, we can make LOAD FROM simply skip over this line:

LOAD WITH HEADERS (name STRING, age INT32) FROM "user.csv" (ignore_errors = true)
RETURN name, age;
┌────────┬───────┐
│ name │ age │
│ STRING │ INT32 │
├────────┼───────┤
│ Alice │ 4 │
└────────┴───────┘

You can also see the details of any warnings generated by the skipped lines using the SHOW_WARNINGS function. See the ignoring erroneous rows section of COPY FROM for more details.

Scan Data Formats

LOAD FROM can scan several raw or in-memory file formats, such as CSV, Parquet, Pandas, Polars, Arrow tables, and JSON.

File format detection

LOAD FROM determines the file format based on the file extension if the file_format option is not given. For instance, files with a .csv extension are automatically recognized as CSV format.

If the file format cannot be inferred from the extension, or if you need to override the default sniffing behaviour, the file_format option can be used.

For example, to load a CSV file that has a .tsv extension (for tab-separated data), you must explicitly specify the file format using the file_format option, as shown below:

LOAD FROM 'data.tsv' (file_format='csv')
RETURN *

Below we give examples of using LOAD FROM to scan data from each of these formats. We assume WITH HEADERS is not used in the examples below, so we discuss how Kùzu infers the variable names and data types of that bind to the scanned tuples.

CSV

The syntax for using LOAD FROM to scan a CSV file is similar to the one used for using COPY FROM with CSV files.

CSV header

If (i) the CSV file has a header line, i.e., a first line that should not be interpreted as a tuple to be scanned; and (ii) (header = true) set, then the column names in the first line provide the names of the columns. The data types are always automatically inferred from the CSV file (except of course if LOAD WITH HEADERS (...) FROM is used, in which case the data types provided inside the (...) are used as described above).

Suppose user.csv is a CSV file with the following contents:

name,age
Adam,30
Karissa,40
Zhang,50

Then if you run the following query, Kùzu will infer the column names name and age from the first line of the CSV:

LOAD FROM "user.csv" (header = true) RETURN *;
┌─────────┬───────┐
nameage
STRINGINT64
├─────────┼───────┤
Adam30
Karissa40
Zhang50
└─────────┴───────┘

If (header = false), then the names of the columns will be column0, column1, …, column(k-1), where k is the number of columns in the CSV file. Suppose user.csv has instead the following contents:

Adam,30
Karissa,40
Zhang,50
LOAD FROM "user.csv" (header = false) RETURN *;
┌─────────┬─────────┐
column0column1
STRINGSTRING
├─────────┼─────────┤
nameage
Adam30
Karissa40
Zhang50
└─────────┴─────────┘

Parquet

Since Parquet files contain schema information in their metadata, Kùzu will always use the available schema information when loading from Parquet files (except again if LOAD WITH HEADERS (...) FROM is used). Suppose we have a Parquet file user.parquet with two columns f0 and f1 and the same content as in the user.csv file above. Then the query below will scan the Parquet file and output the following:

LOAD FROM "user.parquet" RETURN *;
┌─────────┬───────┐
f0f1
STRINGINT64
├─────────┼───────┤
Adam30
Karissa40
Zhang50
└─────────┴───────┘

Pandas

Kùzu allows zero-copy access to Pandas DataFrames. The variable names and data types of scanned columns within a Pandas DataFrame will be inferred from the schema information of the data frame. Here is an example:

main.py
import kuzu
import pandas as pd
db = kuzu.Database("persons")
conn = kuzu.Connection(db)
df = pd.DataFrame({
"name": ["Adam", "Karissa", "Zhang"],
"age": [30, 40, 50]
})
result = conn.execute("LOAD FROM df RETURN *;")
print(result.get_as_df())
name age
0 Adam 30
1 Karissa 40
2 Zhang 50

Here name and age have string and integer types in the define Pandas Dataframe, and so the output table contains two columns with the same names and data types.

Polars

Kùzu can also scan Polars DataFrames via the underlying PyArrow layer. The rules for determining the variable names and data types is identical to scanning Pandas data frames. Here is an example:

import kuzu
import polars as pl
db = kuzu.Database("tmp")
conn = kuzu.Connection(db)
df = pl.DataFrame({
"name": ["Adam", "Karissa", "Zhang"],
"age": [30, 40, 50]
})
res = conn.execute("LOAD FROM df RETURN *")
print(res.get_as_pl())
shape: (3, 2)
┌─────────┬─────┐
│ name ┆ age │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════╪═════╡
│ Adam ┆ 30 │
│ Karissa ┆ 40 │
│ Zhang ┆ 50 │
└─────────┴─────┘

Arrow tables

You can scan an existing PyArrow table as follows:

import kuzu
import pyarrow as pa
db = kuzu.Database("tmp")
conn = kuzu.Connection(db)
pa_table = pa.table({
"name": ["Adam", "Karissa", "Zhang"],
"age": [30, 40, 50]
})
print(pa_table)
res = conn.execute("LOAD FROM pa_table RETURN *")
print(res.get_as_arrow())
pyarrow.Table
name: string
age: int64
----
name: [["Adam","Karissa","Zhang"]]
age: [[30,40,50]]

JSON

Kùzu can scan JSON files using LOAD FROM, but only upon installation of the JSON extension. See the documentation on the JSON extension for details.