HTTP File System (httpfs)

The httpfs extension extends the Kùzu file system by allowing reading from/writing to files hosted on remote file systems. Over plain HTTP(S), the extension only supports reading files. When using object storage via the S3 API, the extension supports reading, writing and globbing files.

Usage

httpfs is an official extension developed and maintained by Kùzu. It can be installed and loaded using the following commands:

INSTALL httpfs;
LOAD EXTENSION httpfs;

HTTP(S) file system

httpfs allows users to read from a file hosted on an HTTP(S) server in the same way as from a local file. Example:

LOAD FROM "https://extension.kuzudb.com/dataset/test/city.csv"
RETURN *;

Result:

Waterloo|150000
Kitchener|200000
Guelph|75000

S3 file system

The extension also allows users to read/write/glob files hosted on object storage servers using the S3 API.

S3 file system configuration

Before reading from or writing to S3, users must configure the connection options using the CALL statement.

Supported options:

| Option name | Description |
| --- | --- |
| s3_access_key_id | S3 access key ID |
| s3_secret_access_key | S3 secret access key |
| s3_endpoint | S3 endpoint |
| s3_url_style | S3 URL style (should be either vhost or path) |
| s3_region | S3 region |
| s3_uploader_max_num_parts_per_file | Used for part size calculation |
| s3_uploader_max_filesize | Used for part size calculation |
| s3_uploader_threads_limit | Maximum number of uploader threads |
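
As a rough sketch of how these options might be set, the snippet below assigns each one with its own CALL statement, following the CALL option=value form used for HTTP_CACHE_FILE on this page. Every value is a placeholder, and quoting the values as strings is an assumption; substitute your own credentials, endpoint, and region.

```cypher
// Placeholder values only; none of these are working credentials or defaults.
CALL s3_access_key_id='<your-access-key-id>';
CALL s3_secret_access_key='<your-secret-access-key>';
CALL s3_endpoint='<your-s3-endpoint>';
// The URL style must be one of the two documented values: vhost or path.
CALL s3_url_style='vhost';
CALL s3_region='<your-region>';
```

The uploader options are left out here because they only matter when writing files.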

Requirements on the S3 server API

| Feature | Required S3 API features |
| --- | --- |
| Public file reads | HTTP Range requests |
| Private file reads | Secret key authentication |
| File glob | ListObjectsV2 |
| File writes | Multipart upload |

Reading from S3

Reading from S3 is as simple as reading from regular files:

LOAD FROM 's3://kuzu-test/follows.parquet'
RETURN *;

Glob

Globbing is implemented using the S3 ListObjectsV2 API, and allows users to glob files as they would in their local file system.

CREATE NODE TABLE tableOfTypes (
    id INT64,
    int64Column INT64,
    doubleColumn DOUBLE,
    booleanColumn BOOLEAN,
    dateColumn DATE,
    stringColumn STRING,
    listOfInt64 INT64[],
    listOfString STRING[],
    listOfListOfInt64 INT64[][],
    structColumn STRUCT(ID INT64, name STRING),
    PRIMARY KEY (id)
);
COPY tableOfTypes FROM "s3://kuzu-dataset-us/glob-test/types_50k_*.parquet";

Uploading to S3

Writing to S3 uses the AWS multipart upload API.

COPY (
    MATCH (p:Location)
    RETURN p.*
)
TO 's3://kuzu-dataset-us/output/location.parquet';

AWS credential management

Users can also set configuration parameters through environment variables. The supported environment variables are:

| Setting | System environment variable |
| --- | --- |
| AWS S3 key ID | S3_ACCESS_KEY_ID |
| AWS S3 secret key | S3_SECRET_ACCESS_KEY |
| AWS S3 default endpoint | S3_ENDPOINT |
| AWS S3 default region | S3_REGION |

Local cache for remote files

Remote file system calls can be expensive, and their cost depends heavily on the user’s network conditions (bandwidth, latency). Queries involving a large number of file operations (read, write, glob) can therefore be slow. To speed up such queries, Kùzu provides the HTTP_CACHE_FILE option. When it is enabled, a local cache file is initialized the first time Kùzu requests a remote file, and subsequent operations on that file are translated into local file operations on the cache file. For example, the following CALL statement enables the local cache for remote files:

CALL HTTP_CACHE_FILE=TRUE;
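
To illustrate how the option fits into a session, here is a minimal sketch that enables the cache and then scans the remote CSV file used in the HTTP(S) example earlier on this page. How long the cache file persists, and whether a later query reuses it, is not specified here, so treat any speedup as something to measure rather than assume.

```cypher
// Enable the local cache before issuing remote file operations.
CALL HTTP_CACHE_FILE=TRUE;

// Same scan as the HTTP(S) example; the first request initializes the cache file.
LOAD FROM "https://extension.kuzudb.com/dataset/test/city.csv"
RETURN *;
```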