HTTP File System (httpfs)
The httpfs extension extends the Kuzu file system by allowing reading from/writing to files hosted on
remote file systems. The following remote file systems are supported:
- Plain HTTP(S)
- Object storage via the AWS S3 API
- Object storage via the Google Cloud Storage (GCS) API
Over plain HTTP(S), the extension only supports reading files. When using object storage via the S3 or GCS API, the extension supports reading, writing and globbing files. See the subsections below for more details.
Usage
httpfs is an official extension developed and maintained by Kuzu.
It can be installed and loaded using the following commands:
INSTALL httpfs;
LOAD EXTENSION httpfs;
HTTP(S) file system
httpfs allows you to read from a file hosted on an HTTP(S) server in the same way as from a local file.
Example:
LOAD FROM "https://extension.kuzudb.com/dataset/test/city.csv"
RETURN *;
Result:
Waterloo|150000
Kitchener|200000
Guelph|75000
Improve performance via caching
Scanning the same file multiple times can be slow and redundant. To avoid this, you can locally cache remote files to improve performance for repeated scans. See the Local cache section for more details.
AWS S3 file system
The extension also allows users to read/write/glob files hosted on object storage servers using the S3 API. Before reading from or writing to S3, you have to configure your AWS credentials using the CALL statement (see the example after the table below).
The following options are supported:
Option name | Description |
---|---|
s3_access_key_id | S3 access key id |
s3_secret_access_key | S3 secret access key |
s3_endpoint | S3 endpoint |
s3_url_style | S3 URL style (should be either vhost or path) |
s3_region | S3 region |
s3_uploader_max_num_parts_per_file | Used for part size calculation |
s3_uploader_max_filesize | Used for part size calculation |
s3_uploader_threads_limit | Maximum number of uploader threads |
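For example, the following CALL statements set the credentials and region. This mirrors the CALL syntax used in the GCS section below; the ${...} values are placeholders for your own credentials, not literal values:
CALL s3_access_key_id=${access_key_id};
CALL s3_secret_access_key=${secret_access_key};
CALL s3_region=${region};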
Environment Variables
You can also set the necessary AWS configuration parameters through environment variables. The supported variables are:
Setting | System environment variable |
---|---|
S3 key ID | S3_ACCESS_KEY_ID |
S3 secret key | S3_SECRET_ACCESS_KEY |
S3 endpoint | S3_ENDPOINT |
S3 region | S3_REGION |
S3 URL style | S3_URL_STYLE |
Scan data from S3
Scanning from S3 is as simple as scanning from regular files:
LOAD FROM 's3://kuzu-datasets/follows.parquet'
RETURN *;
Glob data from S3
Globbing is implemented using the S3 ListObjectsV2 API and allows you to glob files just as you would on a local file system.
CREATE NODE TABLE tableOfTypes (
    id INT64,
    int64Column INT64,
    doubleColumn DOUBLE,
    booleanColumn BOOLEAN,
    dateColumn DATE,
    stringColumn STRING,
    listOfInt64 INT64[],
    listOfString STRING[],
    listOfListOfInt64 INT64[][],
    structColumn STRUCT(ID int64, name STRING),
    PRIMARY KEY (id)
);
COPY tableOfTypes FROM "s3://kuzu-datasets/types/types_50k_*.parquet"
Write data to S3
Writing to S3 uses the AWS multipart upload API.
COPY (
    MATCH (p:Location)
    RETURN p.*
) TO 's3://kuzu-datasets/saved/location.parquet'
Additional configurations
Requirements on the S3 server API
S3 offers a standard set of APIs for read and write operations. The httpfs
extension uses these APIs to communicate with remote storage services and thus should also work
with other services that are compatible with the S3 API (such as Cloudflare R2).
The table below shows which parts of the S3 API are needed for each feature of the extension to work.
Feature | Required S3 API |
---|---|
Public file reads | HTTP Range request |
Private file reads | Secret key authentication |
File glob | ListObjectsV2 |
File writes | Multipart upload |
Improve performance via caching
Scanning the same file multiple times can be slow and redundant. To avoid this, you can locally cache remote files to improve performance for repeated scans. See the Local cache section for more details.
GCS file system
This section shows how to scan from/write to files hosted on Google Cloud Storage.
Before reading from or writing to private GCS buckets, you will need to configure Kuzu with your Google Cloud credentials. You can do this by configuring the following options with the CALL statement:
Option name | Description |
---|---|
gcs_access_key_id | GCS access key id |
gcs_secret_access_key | GCS secret access key |
For example, to set the access key ID, you would run:
CALL gcs_access_key_id=${access_key_id};
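The secret access key can be set in the same way; as above, the ${...} value is a placeholder for your own credential:
CALL gcs_secret_access_key=${secret_access_key};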
Environment Variables
Another way to provide the credentials is through environment variables:
Setting | System environment variable |
---|---|
GCS access key ID | GCS_ACCESS_KEY_ID |
GCS secret access key | GCS_SECRET_ACCESS_KEY |
Additional configurations
Since Kuzu communicates with GCS using its interoperability mode, the following S3 settings also apply when uploading files to GCS. More detailed descriptions of these settings can be found in the AWS S3 file system section above; an example of setting one of them follows the table below.
Option name |
---|
s3_uploader_max_num_parts_per_file |
s3_uploader_max_filesize |
s3_uploader_threads_limit |
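For example, to cap the number of concurrent upload threads, you could run the following CALL statement (the value 8 is purely illustrative):
CALL s3_uploader_threads_limit=8;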
Scan data from GCS
Files in GCS can be accessed through URLs with the formats
gs://⟨gcs_bucket⟩/⟨path_to_file_in_bucket⟩
gcs://⟨gcs_bucket⟩/⟨path_to_file_in_bucket⟩
For example, if you wish to scan the file follows.parquet
located in the root directory of the bucket kuzu-datasets, you could use the following query:
LOAD FROM 'gs://kuzu-datasets/follows.parquet'
RETURN *;
Glob data from GCS
You can glob data from GCS just as you would from a local file system.
For example, if the following files are in the bucket tinysnb:
gs://tinysnb/vPerson.csv
gs://tinysnb/vPerson2.csv
The following query will copy the contents of both vPerson.csv
and vPerson2.csv
into the table person:
COPY person FROM "gs://tinysnb/vPerson*.csv" (header=true);
Write data to GCS
You can also write to files in GCS using URLs in the formats
gs://⟨gcs_bucket⟩/⟨path_to_file_in_bucket⟩
gcs://⟨gcs_bucket⟩/⟨path_to_file_in_bucket⟩
For example, the following query will write to the file located at path saved/location.parquet
in the bucket kuzu-datasets:
COPY (
    MATCH (p:Location)
    RETURN p.*
) TO 'gcs://kuzu-datasets/saved/location.parquet'
Improve performance via caching
Scanning the same file multiple times can be slow and redundant. To avoid this, you can locally cache remote files to improve performance for repeated scans. See the Local cache section for more details.
Local cache
Remote file system calls can be expensive and highly dependent on your network conditions (bandwidth, latency).
Queries involving a large number of file operations (read, write, glob) can be slow.
To expedite such queries, we introduce a new option: HTTP_CACHE_FILE.
A local file cache is initialized when Kuzu requests the file for the first time.
Subsequent remote file operations will be translated into local file operations on the cache file.
For example, the following CALL statement enables the local cache for remote files:
CALL HTTP_CACHE_FILE=TRUE;
If you need to scan the same remote file multiple times and benefit from caching across multiple scans,
you can run all the LOAD FROM
statements in the same transaction. Here is an example:
BEGIN TRANSACTION;
LOAD FROM "https://example.com/city.csv" RETURN *;
LOAD FROM "https://example.com/city.csv" RETURN *;
COMMIT;
Now the second LOAD FROM
statement will run much faster because the file has already been downloaded and cached, and the second scan is within the same transaction as the first one.