HTTP File System (httpfs)
The httpfs extension extends the Kùzu file system to support reading from and writing to files hosted on
remote file systems. Over plain HTTP(S), the extension only supports reading files.
When using object storage via the S3 API, the extension supports reading, writing and globbing files.
Usage
httpfs is an official extension developed and maintained by Kùzu.
It can be installed and loaded using the following commands:
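A minimal sketch of those commands, assuming the standard Kùzu extension syntax (the exact form of the load statement may differ between Kùzu versions):

```cypher
// Download the extension, then load it into the current session.
INSTALL httpfs;
LOAD EXTENSION httpfs;
```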
HTTP(S) file system
httpfs allows users to read a file hosted on an HTTP(S) server in the same way as a local file.
Example:
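A minimal sketch, assuming the extension is already loaded. The URL is a hypothetical placeholder, not a real dataset:

```cypher
// Scan a remote CSV file over HTTPS exactly as if it were a local file.
// The URL below is a placeholder; substitute a file you can reach.
LOAD FROM "https://example.com/data/city.csv"
RETURN *;
```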
Result:
S3 file system
The extension also allows users to read/write/glob files hosted on object storage servers using the S3 API.
S3 file system configuration
Before reading from or writing to S3, users must configure the S3 options using the CALL statement.
Supported options:
Option name | Description |
---|---|
s3_access_key_id | S3 access key id |
s3_secret_access_key | S3 secret access key |
s3_endpoint | S3 endpoint |
s3_url_style | S3 URL style (either vhost or path) |
s3_region | S3 region |
s3_uploader_max_num_parts_per_file | Used for part size calculation |
s3_uploader_max_filesize | Used for part size calculation |
s3_uploader_threads_limit | Maximum number of uploader threads |
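As an illustration, the options in the table above can be set one CALL statement at a time. This is a sketch: every value is a placeholder, and the region is only an example.

```cypher
// Placeholder credentials and region; substitute your own values.
CALL s3_access_key_id='<your-access-key-id>';
CALL s3_secret_access_key='<your-secret-access-key>';
CALL s3_region='us-east-1';
// Either 'vhost' or 'path', depending on what the S3 server expects.
CALL s3_url_style='vhost';
```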
Requirements on the S3 server API
Feature | Required S3 API features |
---|---|
Public file reads | HTTP Range request |
Private file reads | Secret key authentication |
File glob | ListObjectsV2 |
File writes | Multipart upload |
Reading from S3
Reading from S3 is as simple as reading from regular files:
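A minimal sketch, assuming the S3 options are already configured. The bucket and object names are hypothetical:

```cypher
// Scan a Parquet file stored in S3; 'my-bucket' is a placeholder bucket name.
LOAD FROM 's3://my-bucket/data/follows.parquet'
RETURN *;
```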
Glob
Globbing is implemented using the S3 ListObjectsV2 API, and allows users to glob files as they would on their local file system.
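A sketch of a glob-based import. The table schema, bucket, and file naming pattern are all assumptions made for illustration:

```cypher
// Hypothetical node table matching the columns of the CSV files.
CREATE NODE TABLE City(name STRING, population INT64, PRIMARY KEY (name));
// Import every object whose key matches the pattern, e.g. city_1.csv, city_2.csv.
COPY City FROM 's3://my-bucket/cities/city_*.csv';
```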
Uploading to S3
Writing to S3 uses the AWS multipart upload API.
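A sketch of exporting query results to S3, assuming a hypothetical `City` node table exists and write credentials are configured. The bucket and object key are placeholders:

```cypher
// Write the query result to a Parquet object in S3.
COPY (MATCH (c:City) RETURN c.*)
TO 's3://my-bucket/export/city.parquet';
```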
AWS credential management
Users can set configuration parameters through environment variables. The supported environment variables are:
Setting | System environment variable |
---|---|
AWS S3 key ID | S3_ACCESS_KEY_ID |
AWS S3 secret key | S3_SECRET_ACCESS_KEY |
AWS S3 default endpoint | S3_ENDPOINT |
AWS S3 default region | S3_REGION |
Local cache for remote files
Remote file system calls can be expensive and highly dependent on the user’s network conditions (bandwidth, latency).
Queries involving a large number of file operations (read, write, glob) can be slow.
To expedite such queries, we introduce a new option: HTTP_CACHE_FILE.
A local file cache is initialized when Kùzu requests the file for the first time.
Subsequent remote file operations are translated into local file operations on the cache file.
For example, the following CALL statement enables the local cache for remote files:
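A minimal sketch of that statement, assuming the option takes a boolean value like other Kùzu settings:

```cypher
// Turn on the local file cache for remote files.
CALL HTTP_CACHE_FILE=TRUE;
```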