
S3 storage

You can read, write, or glob files hosted on object storage servers using the Amazon S3 API.

Usage

Accessing S3 storage is implemented as a feature of the httpfs extension.

INSTALL httpfs;
LOAD httpfs;

Configure the connection

Once the httpfs extension is loaded, you can configure the S3 connection using CALL statements in Cypher.

CALL <option_name>=<option_value>

The following options are supported:

| Option name | Description |
|-------------|-------------|
| `s3_access_key_id` | S3 access key ID |
| `s3_secret_access_key` | S3 secret access key |
| `s3_endpoint` | S3 endpoint |
| `s3_region` | S3 region |
| `s3_url_style` | S3 URL style (either `vhost` or `path`) |
| `s3_uploader_max_num_parts_per_file` | Used for part size calculation during multipart upload |
| `s3_uploader_max_filesize` | Used for part size calculation during multipart upload |
| `s3_uploader_threads_limit` | Maximum number of uploader threads |
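For example, the following statements configure a connection to a bucket in the `us-east-1` region. The credential values shown are placeholders; substitute your own:

```cypher
CALL s3_access_key_id='your-access-key-id';
CALL s3_secret_access_key='your-secret-access-key';
CALL s3_region='us-east-1';
```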

You can alternatively set the following environment variables:

| Environment variable | Description |
|----------------------|-------------|
| `S3_ACCESS_KEY_ID` | S3 access key ID |
| `S3_SECRET_ACCESS_KEY` | S3 secret access key |
| `S3_ENDPOINT` | S3 endpoint |
| `S3_REGION` | S3 region |
| `S3_URL_STYLE` | S3 URL style |
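For example, in a POSIX shell you can export these variables before starting Kuzu. The values shown are placeholders; substitute your own:

```shell
# Placeholder credentials -- replace with your own before use
export S3_ACCESS_KEY_ID="your-access-key-id"
export S3_SECRET_ACCESS_KEY="your-secret-access-key"
export S3_ENDPOINT="s3.amazonaws.com"
export S3_REGION="us-east-1"
export S3_URL_STYLE="vhost"
```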

Scanning data from S3

The example below shows how to scan data from a Parquet file hosted on S3.

LOAD FROM 's3://kuzu-datasets/follows.parquet'
RETURN *;

Glob data from S3

You can glob data from S3 just as you would from a local file system. Globbing is implemented using the S3 ListObjectsV2 API.

CREATE NODE TABLE tableOfTypes (
    id INT64,
    int64Column INT64,
    doubleColumn DOUBLE,
    booleanColumn BOOLEAN,
    dateColumn DATE,
    stringColumn STRING,
    listOfInt64 INT64[],
    listOfString STRING[],
    listOfListOfInt64 INT64[][],
    structColumn STRUCT(ID INT64, name STRING),
    PRIMARY KEY (id)
);
COPY tableOfTypes
FROM "s3://kuzu-datasets/types/types_50k_*.parquet";

Writing data to S3

Writing to S3 uses the AWS multipart upload API.

COPY (
    MATCH (p:Location)
    RETURN p.*
)
TO 's3://kuzu-datasets/saved/location.parquet';

Additional configurations

Requirements on the S3 server APIs

S3 offers a standard set of APIs for read and write operations. The httpfs extension uses these APIs to communicate with remote storage services and thus should also work with other services that are compatible with the S3 API (such as Cloudflare R2).
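For example, to point the extension at an S3-compatible service such as Cloudflare R2, override the endpoint when configuring the connection. The account ID and credentials below are hypothetical placeholders; R2 endpoints follow the pattern `<account_id>.r2.cloudflarestorage.com`:

```cypher
CALL s3_endpoint='myaccountid.r2.cloudflarestorage.com';
CALL s3_access_key_id='your-r2-access-key-id';
CALL s3_secret_access_key='your-r2-secret-access-key';
```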

The table below shows which parts of the S3 API are needed for each feature of the extension to work.

| Feature | Required S3 API |
|---------|-----------------|
| Public file reads | HTTP range requests |
| Private file reads | Secret key authentication |
| File globbing | ListObjectsV2 |
| File writes | Multipart upload |

Local cache

Scanning the same file multiple times can be slow and redundant. To avoid this, you can locally cache remote files to improve performance for repeated scans. See the local cache section for more details.
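As a sketch, assuming the cache is toggled with the `HTTP_CACHE_FILE` option (the exact option name may vary across versions; see the local cache section for the authoritative syntax), repeated scans could look like:

```cypher
CALL HTTP_CACHE_FILE=TRUE;
// The first scan downloads the file; subsequent scans read the local copy
LOAD FROM 's3://kuzu-datasets/follows.parquet' RETURN *;
LOAD FROM 's3://kuzu-datasets/follows.parquet' RETURN *;
```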