S3 storage
You can read, write, or glob files hosted on object storage servers using the Amazon S3 API.
Usage
Accessing S3 storage is implemented as a feature of the httpfs extension.

```cypher
INSTALL httpfs;
LOAD httpfs;
```
Configure the connection
Once the httpfs extension is loaded, you can configure the S3 connection using CALL statements in Cypher:

```cypher
CALL <option_name>=<option_value>
```
The following options are supported:
| Option name | Description |
|---|---|
| s3_access_key_id | S3 access key ID |
| s3_secret_access_key | S3 secret access key |
| s3_endpoint | S3 endpoint |
| s3_region | S3 region |
| s3_url_style | S3 URL style (either vhost or path) |
| s3_uploader_max_num_parts_per_file | Maximum number of parts per uploaded file (used for part size calculation) |
| s3_uploader_max_filesize | Maximum size of an uploaded file (used for part size calculation) |
| s3_uploader_threads_limit | Maximum number of uploader threads |
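For example, a typical configuration for AWS S3 might look like the following. The values shown are placeholders; substitute your own credentials and region.

```cypher
// Placeholder values -- replace with your own credentials and region.
CALL s3_access_key_id='<your_access_key_id>';
CALL s3_secret_access_key='<your_secret_access_key>';
CALL s3_region='us-east-1';
```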
You can alternatively set the following environment variables:
| Environment variable | Description |
|---|---|
| S3_ACCESS_KEY_ID | S3 access key ID |
| S3_SECRET_ACCESS_KEY | S3 secret access key |
| S3_ENDPOINT | S3 endpoint |
| S3_REGION | S3 region |
| S3_URL_STYLE | S3 URL style |
Scanning data from S3
The example below shows how to scan data from a Parquet file hosted on S3.
```cypher
LOAD FROM 's3://kuzu-datasets/follows.parquet'
RETURN *;
```
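Scans compose with the rest of Cypher, so you can also filter and project while reading from S3. The sketch below assumes the file has a column named since (a hypothetical name used here for illustration); adjust it to match your file's schema.

```cypher
// 'since' is a hypothetical column name -- adjust to your file's schema.
LOAD FROM 's3://kuzu-datasets/follows.parquet'
WHERE since >= 2020
RETURN *;
```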
Glob data from S3
You can glob data from S3 just as you would from a local file system. Globbing is implemented using the S3 ListObjectsV2 API.
```cypher
CREATE NODE TABLE tableOfTypes (
    id INT64,
    int64Column INT64,
    doubleColumn DOUBLE,
    booleanColumn BOOLEAN,
    dateColumn DATE,
    stringColumn STRING,
    listOfInt64 INT64[],
    listOfString STRING[],
    listOfListOfInt64 INT64[][],
    structColumn STRUCT(ID int64, name STRING),
    PRIMARY KEY (id)
);
COPY tableOfTypes FROM "s3://kuzu-datasets/types/types_50k_*.parquet";
```
Writing data to S3
Writing to S3 uses the AWS multipart upload API.
```cypher
COPY (
    MATCH (p:Location)
    RETURN p.*
) TO 's3://kuzu-datasets/saved/location.parquet';
```
Additional configurations
Requirements on the S3 server APIs
S3 offers a standard set of APIs for read and write operations. The httpfs
extension uses these APIs to communicate with remote storage services and thus should also work
with other services that are compatible with the S3 API (such as Cloudflare R2).
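For instance, pointing the extension at an S3-compatible service typically only requires overriding the endpoint and URL style. The endpoint below is a placeholder, and whether vhost or path style is needed depends on the provider; check its documentation.

```cypher
// Placeholder endpoint -- substitute your provider's values.
// Whether 'vhost' or 'path' style is required depends on the provider.
CALL s3_endpoint='<account_id>.r2.cloudflarestorage.com';
CALL s3_url_style='path';
```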
The table below shows which parts of the S3 API are needed for each feature of the extension to work.
| Feature | Required S3 API |
|---|---|
| Public file reads | HTTP range request |
| Private file reads | Secret key authentication |
| File glob | ListObjectsV2 |
| File writes | Multipart upload |
Local cache
Scanning the same file multiple times can be slow and redundant. To avoid this, you can locally cache remote files to improve performance for repeated scans. See the local cache section for more details.