Google Cloud Storage

You can read, write, or glob files hosted on object storage servers using the Google Cloud Storage API.

Usage

Access to GCS is implemented as a feature of the httpfs extension, which must be installed and loaded first:

INSTALL httpfs;
LOAD httpfs;

Configure the connection

Before reading from or writing to private GCS buckets, you have to configure the connection using CALL statements of the following form:

CALL <option_name>=<option_value>

The following options are supported:

Option                   Description
gcs_access_key_id        GCS access key ID
gcs_secret_access_key    GCS secret access key
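
As a minimal sketch of the CALL syntax applied to these two options: the values below are placeholders rather than real credentials, and passing them as quoted strings is an assumption about how string-valued options are written.

```cypher
// Placeholder values: substitute the access key ID and secret for your own account.
CALL gcs_access_key_id='<your_access_key_id>';
CALL gcs_secret_access_key='<your_secret_access_key>';
```

Both statements need to run on the same connection before any query that touches a private bucket.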

Alternatively, you can set the following environment variables:

Environment variable     Description
GCS_ACCESS_KEY_ID        GCS access key ID
GCS_SECRET_ACCESS_KEY    GCS secret access key

Kuzu communicates with GCS through GCS's S3 interoperability mode, so the following S3 settings also apply when uploading files to GCS. For more detailed descriptions of these settings, see the documentation for the S3 extension.

Option name
s3_uploader_max_num_parts_per_file
s3_uploader_max_filesize
s3_uploader_threads_limit
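
Assuming these settings are configured with the same CALL syntax as the GCS options, changing one might look like the sketch below. The value 4 is purely illustrative; it is neither a documented default nor a recommendation.

```cypher
// Illustrative value only, not a recommended setting or a documented default.
CALL s3_uploader_threads_limit=4;
```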

Scan data from GCS

Files in GCS can be accessed through URLs in either of the following formats:

  • gs://⟨gcs_bucket⟩/⟨path_to_file_in_bucket⟩
  • gcs://⟨gcs_bucket⟩/⟨path_to_file_in_bucket⟩

For example, to scan the file follows.parquet located in the root directory of the bucket kuzu-datasets, you could use the following query:

LOAD FROM 'gs://kuzu-datasets/follows.parquet'
RETURN *;
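
A lighter-weight variant, which can serve as a quick check that the connection is configured, counts the rows instead of returning them. This sketch assumes the same bucket and file as the query above and uses the gcs:// form of the URL.

```cypher
// Same file as above, addressed with the gcs:// scheme; returns only the row count.
LOAD FROM 'gcs://kuzu-datasets/follows.parquet'
RETURN COUNT(*);
```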

Glob data from GCS

You can glob data from GCS just as you would from a local file system.

For example, the following query will copy the contents of all files matching the pattern vPerson*.csv into the table person:

COPY person FROM "gs://tinysnb/vPerson*.csv" (header=true);

Write data to GCS

You can also write to files in GCS using URLs in either of the following formats:

  • gs://⟨gcs_bucket⟩/⟨path_to_file_in_bucket⟩
  • gcs://⟨gcs_bucket⟩/⟨path_to_file_in_bucket⟩

For example, the following query will write to the file located at path saved/location.parquet in the bucket kuzu-datasets:

COPY (
    MATCH (p:Location)
    RETURN p.*
)
TO 'gcs://kuzu-datasets/saved/location.parquet';
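
To confirm the export, the written file can be scanned back with LOAD FROM. This is a sketch that assumes the COPY statement above succeeded and that the configured credentials also permit reads from the bucket.

```cypher
// Read back the file written by the COPY ... TO statement above.
LOAD FROM 'gcs://kuzu-datasets/saved/location.parquet'
RETURN *;
```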

Local cache

Scanning the same file multiple times can be slow and redundant. To avoid this, you can locally cache remote files to improve performance for repeated scans. See the local cache section for more details.