Load (Scan)
The LOAD FROM
clause performs a direct scan over an input file without copying it into the database.
This clause is very useful to inspect a subset of a larger file to display or load into a node table, or to
perform simple transformation tasks like rearranging column order.
LOAD FROM
is designed to be used in the exact same way as MATCH
, meaning that it can be followed
by arbitrary clauses like CREATE
, WHERE
, RETURN
, and so on.
LOAD FROM
Some example usage for the LOAD FROM
clause is shown below.
Filtering/aggregating
LOAD FROM "user.csv" (header = true)WHERE CAST(age, INT64) > 25RETURN COUNT(*);----------------| COUNT_STAR() |----------------| 3 |----------------
Create nodes from input file
LOAD FROM "user.csv" (header = true)CREATE (:User {name: name, age: CAST(age, INT64)});
MATCH (u:User) RETURN u;----------------------------------------------------| u |----------------------------------------------------| {_ID: 0:0, _LABEL: User, name: Adam, age: 30} |----------------------------------------------------| {_ID: 0:1, _LABEL: User, name: Karissa, age: 40} |----------------------------------------------------| {_ID: 0:2, _LABEL: User, name: Zhang, age: 50} |----------------------------------------------------| {_ID: 0:3, _LABEL: User, name: Noura, age: 25} |----------------------------------------------------
Reorder and subset columns
You can also use the scan functionality to reorder and subset columns from a given dataset. For
example, the following query will return just the age
and name
in that order, even if the
input file has more columns specified in a different order.
// Return age column before the name columnLOAD FROM "user.csv" (header = true)RETURN age, name LIMIT 3;
--------------------| age | name |--------------------| 30 | Adam |--------------------| 40 | Karissa |--------------------| 50 | Zhang |--------------------
Enforce Schema
To enforce a specific schema and data types when reading from CSV, you can use the LOAD WITH HEADERS (<name> <dataType>, ...) FROM ...
syntax.
The following query will bind first column name
to the STRING type and second column age
to the INT64 type.
You can combine this with a WHERE
clause to filter the data as needed.
LOAD WITH HEADERS (name STRING, age INT64) FROM "user.csv" (header = true)WHERE name =~ 'Adam*'RETURN name, age;--------------| name | age |--------------| Adam | 30 |--------------
Scan Data Formats
CSV
When loading from a CSV file, you can use a similar syntax to the COPY FROM
statement.
If no header row is available, you can simply pass in the CSV file name to the statment and Kùzu will parse
each column as STRING
type with name column0, column1, ...
.
Example:
LOAD FROM "test.csv" RETURN *;-----------| column0 |-----------| a |-----------| b |-----------
If header names are available in the file, you can ask Kùzu to parse the header and use data types and names as specified in the header.
Example:
LOAD FROM "user.csv" (header = true) RETURN *;-----------------| name | age |-----------------| Adam | 30 |-----------------| Karissa | 40 |-----------------| Zhang | 50 |-----------------| Noura | 25 |-----------------
Parquet
Since Parquet files contain schema information in their metadata, Kùzu will always use the available schema information when loading from Parquet files.
LOAD FROM "user.parquet" RETURN *;----------------| f0 | f1 |----------------| Adam | 30 |----------------| Karissa | 40 |----------------| Zhang | 50 |----------------| Noura | 25 |----------------
Pandas
Kùzu allows zero-copy access to Pandas DataFrames. The data types within a Pandas DataFrame will be used to infer the schema of the data.
Because Pandas is a Python-only DataFrame library, the following example is for Python users. We first create a Pandas DataFrame as follows:
import kuzuimport pandas as pd
db = kuzu.Database("persons")conn = kuzu.Connection(db)
df = pd.DataFrame({ "name": ["Adam", "Karissa", "Zhang", "Noura"], "age": [30, 40, 50, 25]})
The Pandas DataFrame can be scanned using the LOAD FROM
clause just like we would from
an external file. The data access occurs in a zero-copy manner, meaning that Kùzu natively scans
the underlying Pandas data objects.
result = conn.execute("LOAD FROM df RETURN *;")print(result.get_as_df())
# Result name age0 Adam 301 Karissa 402 Zhang 503 Noura 25