HTTP file system extension
The httpfs extension extends the Kuzu file system by allowing reading from or writing to files hosted on
remote file systems. The httpfs extension provides support for the following remote file systems:
- This extension: Plain HTTP(S) file systems
- S3 extension: Object storage via the AWS S3 API
- GCS extension: Object storage via the Google Cloud Storage (GCS) API
If you’re looking to use Kuzu with Azure Blob Storage, see the Azure extension.
Usage
INSTALL httpfs;
LOAD httpfs;
HTTP(S) file system
Using the httpfs extension, you can read from a file hosted on an HTTP(S) server in the same way as you would
from a local file. The example below shows how to scan data from a CSV file hosted on an HTTP(S) server.
LOAD FROM "https://extension.kuzudb.com/dataset/test/city.csv"
RETURN *;
Waterloo|150000
Kitchener|200000
Guelph|75000
Local cache
Remote file system calls can be expensive and highly dependent on your network conditions (bandwidth, latency).
Queries involving a large number of file operations (read, write, glob) can be slow.
To speed up such queries, we provide the HTTP_CACHE_FILE option.
When it is enabled, a local cache file is initialized the first time Kuzu requests a remote file.
Subsequent operations on that remote file are translated into local file operations on the cache file.
The example below shows how to enable the local cache for remote files.
CALL HTTP_CACHE_FILE=TRUE;
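To make the caching idea concrete, here is a minimal Python sketch of a "download once, then read locally" wrapper. It is only an illustration of the concept described above, not Kuzu's implementation: the `CachedRemoteFile` class, the `fake_fetch` function, and the `remote_calls` counter are hypothetical names introduced for this example, and the fetch is faked so the sketch runs without network access.

```python
import os
import tempfile


class CachedRemoteFile:
    """Illustrative only: serves reads from a local copy after the first fetch."""

    def __init__(self, url, fetch):
        self.url = url
        self._fetch = fetch        # hypothetical callable: url -> bytes
        self._cache_path = None    # set when the file is first requested
        self.remote_calls = 0      # counts how often the remote side is hit

    def _ensure_cached(self):
        # First request: download the whole file once and store it locally.
        if self._cache_path is None:
            data = self._fetch(self.url)
            self.remote_calls += 1
            fd, path = tempfile.mkstemp(prefix="http_cache_")
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            self._cache_path = path
        return self._cache_path

    def read(self, offset=0, size=-1):
        # Every read is served from the local cache file.
        with open(self._ensure_cached(), "rb") as f:
            f.seek(offset)
            return f.read(size)

    def close(self):
        # Remove the cache file when it is no longer needed.
        if self._cache_path is not None:
            os.remove(self._cache_path)
            self._cache_path = None


def fake_fetch(url):
    # Stand-in for an HTTP GET (assumption: real code would perform a request).
    return b"Waterloo,150000\nKitchener,200000\nGuelph,75000\n"


remote = CachedRemoteFile("https://example.com/city.csv", fake_fetch)
first = remote.read()                   # triggers the single remote fetch
second = remote.read(offset=0, size=8)  # served from the local cache file
remote.close()
```

In this sketch the remote side is contacted once no matter how many reads follow, which is the effect the cache option aims for when a query issues many file operations against the same remote file.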
If you need to scan the same remote file multiple times and benefit from caching across multiple scans, you
can run all the LOAD FROM statements in the same transaction. The example below shows how to do this.
BEGIN TRANSACTION;
LOAD FROM "https://example.com/city.csv" RETURN *;
LOAD FROM "https://example.com/city.csv" RETURN *;
COMMIT;
The second LOAD FROM statement will run much faster because it executes in the same transaction as the
first one, so the file has already been downloaded and cached.