HTTP File System (httpfs)

The httpfs extension extends the Kuzu file system to allow reading from and writing to files hosted on remote file systems. The following remote file systems are supported:

  • Plain HTTP(S)
  • Object storage via the AWS S3 API
  • Object storage via the Google Cloud Storage (GCS) API

Over plain HTTP(S), the extension only supports reading files. When using object storage via the S3 or GCS API, the extension supports reading, writing and globbing files. See the subsections below for more details.

Usage

httpfs is an official extension developed and maintained by Kuzu. It can be installed and loaded using the following commands:

INSTALL httpfs;
LOAD EXTENSION httpfs;

HTTP(S) file system

httpfs allows you to read from a file hosted on an HTTP(S) server in the same way as from a local file. Example:

LOAD FROM "https://extension.kuzudb.com/dataset/test/city.csv"
RETURN *;

Result:

Waterloo|150000
Kitchener|200000
Guelph|75000
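
Remote files can be used anywhere a local file path is accepted. As a sketch, assuming a City node table whose two columns match the file above, you could also copy the remote CSV directly into a table:

// Hypothetical City table matching the two columns of city.csv shown above.
CREATE NODE TABLE City(name STRING, population INT64, PRIMARY KEY (name));
// COPY accepts the remote URL just like a local path.
COPY City FROM "https://extension.kuzudb.com/dataset/test/city.csv";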

Improve performance via caching

Scanning the same file multiple times can be slow and redundant. To avoid this, you can locally cache remote files to improve performance for repeated scans. See the Local cache section for more details.

AWS S3 file system

The extension also allows users to read/write/glob files hosted on object storage servers using the S3 API. Before reading from or writing to S3, you have to configure your AWS credentials using the CALL statement, as shown in the example after the list of options below.

The following options are supported:

  • s3_access_key_id: S3 access key ID
  • s3_secret_access_key: S3 secret access key
  • s3_endpoint: S3 endpoint
  • s3_url_style: S3 URL style (either vhost or path)
  • s3_region: S3 region
  • s3_uploader_max_num_parts_per_file: used for part size calculation
  • s3_uploader_max_filesize: used for part size calculation
  • s3_uploader_threads_limit: maximum number of uploader threads
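
For example, to configure credentials and a region before scanning a private bucket, you could run the following CALL statements (the ${...} values are placeholders to replace with your own):

// Placeholder values; substitute your own AWS credentials and region.
CALL s3_access_key_id='${access_key_id}';
CALL s3_secret_access_key='${secret_access_key}';
CALL s3_region='${region}';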

Environment Variables

You can also set the necessary AWS configuration parameters through environment variables. The supported environment variables are:

  • S3 key ID: S3_ACCESS_KEY_ID
  • S3 secret key: S3_SECRET_ACCESS_KEY
  • S3 endpoint: S3_ENDPOINT
  • S3 region: S3_REGION
  • S3 URL style: S3_URL_STYLE

Scan data from S3

Scanning from S3 is as simple as scanning from regular files:

LOAD FROM 's3://kuzu-datasets/follows.parquet'
RETURN *;

Glob data from S3

Globbing is implemented using the S3 ListObjectsV2 API and allows you to match files on S3 just as you would on a local file system.

CREATE NODE TABLE tableOfTypes (
    id INT64,
    int64Column INT64,
    doubleColumn DOUBLE,
    booleanColumn BOOLEAN,
    dateColumn DATE,
    stringColumn STRING,
    listOfInt64 INT64[],
    listOfString STRING[],
    listOfListOfInt64 INT64[][],
    structColumn STRUCT(ID INT64, name STRING),
    PRIMARY KEY (id)
);
COPY tableOfTypes FROM "s3://kuzu-datasets/types/types_50k_*.parquet";

Write data to S3

Writing to S3 uses the AWS multipart upload API.

COPY (
    MATCH (p:Location)
    RETURN p.*
)
TO 's3://kuzu-datasets/saved/location.parquet';
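
If you need to tune how large files are split and uploaded, the uploader options listed earlier can also be set via CALL statements. A sketch with placeholder values (illustrative, not recommended defaults):

// Placeholder values; tune them to your workload and file sizes.
CALL s3_uploader_threads_limit=${num_upload_threads};
CALL s3_uploader_max_num_parts_per_file=${max_parts_per_file};
CALL s3_uploader_max_filesize=${max_file_size};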

Additional configurations

Requirements on the S3 server API

S3 offers a standard set of APIs for read and write operations. The httpfs extension uses these APIs to communicate with remote storage services and thus should also work with other services that are compatible with the S3 API (such as Cloudflare R2).
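
For instance, pointing the extension at an S3-compatible service typically means overriding the endpoint and, if needed, the URL style; the endpoint below is a hypothetical placeholder:

// Hypothetical endpoint for an S3-compatible service.
CALL s3_endpoint='${your_s3_compatible_endpoint}';
// Use 'vhost' or 'path', depending on what your service expects.
CALL s3_url_style='path';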

The list below shows which parts of the S3 API each feature of the extension requires.

  • Public file reads: HTTP range requests
  • Private file reads: secret key authentication
  • File globbing: ListObjectsV2
  • File writes: multipart upload

Improve performance via caching

Scanning the same file multiple times can be slow and redundant. To avoid this, you can locally cache remote files to improve performance for repeated scans. See the Local cache section for more details.

GCS file system

This section shows how to scan from/write to files hosted on Google Cloud Storage.

Before reading from or writing to private GCS buckets, you will need to configure Kuzu with your Google Cloud credentials. You can do this by configuring the following options with the CALL statement:

  • gcs_access_key_id: GCS access key ID
  • gcs_secret_access_key: GCS secret access key

For example, to set the access key ID, you would run:

CALL gcs_access_key_id='${access_key_id}';
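
The secret access key can be set the same way (again substituting your own value):

CALL gcs_secret_access_key='${secret_access_key}';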

Environment Variables

Another way to provide the credentials is through environment variables:

  • GCS access key ID: GCS_ACCESS_KEY_ID
  • GCS secret access key: GCS_SECRET_ACCESS_KEY

Additional configurations

Since Kuzu communicates with GCS using its interoperability mode, the following S3 settings also apply when uploading files to GCS. More detailed descriptions of these settings can be found in the AWS S3 file system section above.

  • s3_uploader_max_num_parts_per_file
  • s3_uploader_max_filesize
  • s3_uploader_threads_limit

Scan data from GCS

Files in GCS can be accessed through URLs with the formats

  • gs://⟨gcs_bucket⟩/⟨path_to_file_in_bucket⟩
  • gcs://⟨gcs_bucket⟩/⟨path_to_file_in_bucket⟩

For example, if you wish to scan the file follows.parquet located in the root directory of the bucket kuzu-datasets, you could use the following query:

LOAD FROM 'gs://kuzu-datasets/follows.parquet'
RETURN *;

Glob data from GCS

You can glob data from GCS just as you would from a local file system.

For example, if the following files are in the bucket tinysnb:

gs://tinysnb/vPerson.csv
gs://tinysnb/vPerson2.csv

The following query will copy the contents of both vPerson.csv and vPerson2.csv into the table person:

COPY person FROM "gs://tinysnb/vPerson*.csv"(header=true);

Write data to GCS

You can also write to files in GCS using URLs in the formats

  • gs://⟨gcs_bucket⟩/⟨path_to_file_in_bucket⟩
  • gcs://⟨gcs_bucket⟩/⟨path_to_file_in_bucket⟩

For example, the following query will write to the file located at path saved/location.parquet in the bucket kuzu-datasets:

COPY (
    MATCH (p:Location)
    RETURN p.*
)
TO 'gcs://kuzu-datasets/saved/location.parquet';

Improve performance via caching

Scanning the same file multiple times can be slow and redundant. To avoid this, you can locally cache remote files to improve performance for repeated scans. See the Local cache section for more details.


Local cache

Remote file system calls can be expensive and are highly dependent on your network conditions (bandwidth, latency). Queries involving a large number of file operations (read, write, glob) can therefore be slow. To speed up such queries, Kuzu provides the HTTP_CACHE_FILE option. A local cache file is initialized when Kuzu requests a remote file for the first time; subsequent operations on that remote file are then translated into local operations on the cache file. For example, the following CALL statement enables the local cache for remote files:

CALL HTTP_CACHE_FILE=TRUE;

If you need to scan the same remote file multiple times and benefit from caching across multiple scans, you can run all the LOAD FROM statements in the same transaction. Here is an example:

BEGIN TRANSACTION;
LOAD FROM "https://example.com/city.csv" RETURN *;
LOAD FROM "https://example.com/city.csv" RETURN *;
COMMIT;

The second LOAD FROM statement runs much faster because it is in the same transaction as the first scan, so the file has already been downloaded and cached.