S3 storage
You can read, write, or glob files hosted on object storage servers using the Amazon S3 API.
Usage
Accessing S3 storage is implemented as a feature of the httpfs extension.

```cypher
INSTALL httpfs;
LOAD httpfs;
```
Configure the connection
Once the httpfs extension is loaded, you can configure the S3 connection using CALL statements in Cypher:

```cypher
CALL <option_name>=<option_value>
```
The following options are supported:
| Option name | Description |
|---|---|
| s3_access_key_id | S3 access key ID |
| s3_secret_access_key | S3 secret access key |
| s3_endpoint | S3 endpoint |
| s3_region | S3 region |
| s3_url_style | S3 URL style (either vhost or path) |
| s3_uploader_max_num_parts_per_file | Maximum number of parts per uploaded file (used for part size calculation) |
| s3_uploader_max_filesize | Maximum size of an uploaded file (used for part size calculation) |
| s3_uploader_threads_limit | Maximum number of uploader threads |
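For example, a typical configuration for AWS S3 might look like the following. The values shown are placeholders; substitute your own credentials and region.

```cypher
// Placeholder values -- replace with your own credentials and region.
CALL s3_access_key_id='<your_access_key_id>';
CALL s3_secret_access_key='<your_secret_access_key>';
CALL s3_region='us-east-1';
```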
You can alternatively set the following environment variables:
| Environment variable | Description |
|---|---|
| S3_ACCESS_KEY_ID | S3 access key ID |
| S3_SECRET_ACCESS_KEY | S3 secret access key |
| S3_ENDPOINT | S3 endpoint |
| S3_REGION | S3 region |
| S3_URL_STYLE | S3 URL style |
Scanning data from S3
The example below shows how to scan data from a Parquet file hosted on S3.
```cypher
LOAD FROM 's3://kuzu-datasets/follows.parquet'
RETURN *;
```
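Scans compose with the rest of Cypher, so you can also filter and project while reading from S3. The sketch below assumes the file has a column named since (a hypothetical name used here for illustration); adjust it to match your file's schema.

```cypher
// 'since' is a hypothetical column name -- adjust to your file's schema.
LOAD FROM 's3://kuzu-datasets/follows.parquet'
WHERE since >= 2020
RETURN *;
```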
Glob data from S3
You can glob data from S3 just as you would from a local file system. Globbing is implemented using the S3 ListObjectsV2 API.
```cypher
CREATE NODE TABLE tableOfTypes (
    id INT64,
    int64Column INT64,
    doubleColumn DOUBLE,
    booleanColumn BOOLEAN,
    dateColumn DATE,
    stringColumn STRING,
    listOfInt64 INT64[],
    listOfString STRING[],
    listOfListOfInt64 INT64[][],
    structColumn STRUCT(ID int64, name STRING),
    PRIMARY KEY (id)
);
COPY tableOfTypes FROM "s3://kuzu-datasets/types/types_50k_*.parquet";
```
Writing data to S3
Writing to S3 uses the AWS multipart upload API.
```cypher
COPY (
    MATCH (p:Location)
    RETURN p.*
) TO 's3://kuzu-datasets/saved/location.parquet';
```
Additional configurations
Requirements on the S3 server APIs
S3 offers a standard set of APIs for read and write operations. The httpfs
extension uses these APIs to communicate with remote storage services and thus should also work
with other services that are compatible with the S3 API (such as Cloudflare R2).
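For instance, pointing the extension at an S3-compatible service typically only requires overriding the endpoint and URL style. The endpoint below is a placeholder, and whether vhost or path style is needed depends on the provider; check its documentation.

```cypher
// Placeholder endpoint -- substitute your provider's values.
// Whether 'vhost' or 'path' style is required depends on the provider.
CALL s3_endpoint='<account_id>.r2.cloudflarestorage.com';
CALL s3_url_style='path';
```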
The table below shows which parts of the S3 API are needed for each feature of the extension to work.
| Feature | Required S3 API |
|---|---|
| Public file reads | HTTP range request |
| Private file reads | Secret key authentication |
| File glob | ListObjectsV2 |
| File writes | Multipart upload |
Local cache
Scanning the same file multiple times can be slow and redundant. To avoid this, you can locally cache remote files to improve performance for repeated scans. See the local cache section for more details.