Iceberg extension
The iceberg extension adds support for scanning and copying from the Apache Iceberg format.
Iceberg is an open-source table format originally developed at Netflix for large-scale analytical datasets.
Usage
INSTALL iceberg;
LOAD iceberg;
Example dataset
Download iceberg_tables.zip and unzip it to the /tmp directory.
cd /tmp
wget https://kuzudb.com/data/iceberg-extension/iceberg_tables.zip
unzip iceberg_tables.zip
Scan Iceberg tables
LOAD FROM is a Cypher clause that scans a file or object element by element without copying the data into a Kuzu table.
LOAD FROM '/tmp/iceberg_tables/university' (file_format='iceberg', allow_moved_paths=true)
RETURN *;
┌────────────┬──────┬─────────┐
│ University │ Rank │ Funding │
├────────────┼──────┼─────────┤
│ Stanford   │ 2    │ 250.300 │
│ Yale       │ 6    │ 190.700 │
│ Harvard    │ 1    │ 210.500 │
│ Cambridge  │ 5    │ 280.200 │
│ MIT        │ 3    │ 170.000 │
│ Oxford     │ 4    │ 300.000 │
└────────────┴──────┴─────────┘
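Because LOAD FROM is a regular reading clause, it can be followed by other Cypher clauses to filter and project the scanned rows. The query below is a minimal sketch, assuming the column names University and Funding shown in the output above and assuming that Funding is stored as a numeric type; the threshold of 200.0 is an arbitrary illustrative value.

```cypher
// Scan the Iceberg table, keep only rows above an illustrative funding
// threshold, and return two of the three columns, highest funding first.
LOAD FROM '/tmp/iceberg_tables/university' (file_format='iceberg', allow_moved_paths=true)
WHERE Funding > 200.0
RETURN University, Funding
ORDER BY Funding DESC;
```

As with the plain scan, nothing is written to a Kuzu table; the filter is applied to the scanned rows only.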
Copy Iceberg tables into Kuzu
You can use a COPY FROM statement to copy the contents of an Iceberg table into Kuzu.
CREATE NODE TABLE university (name STRING PRIMARY KEY, age INT64);
COPY university FROM '/tmp/iceberg_tables/person_table' (file_format='iceberg', allow_moved_paths=true);
┌────────────────────────────────────────────────────┐
│ result                                             │
│ STRING                                             │
├────────────────────────────────────────────────────┤
│ 6 tuples have been copied to the university table. │
└────────────────────────────────────────────────────┘
Access Iceberg metadata
At the heart of Iceberg's table structure is the metadata, which tracks everything from the schema and partition information to snapshots of the table's state.
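The ICEBERG_METADATA function lists the metadata files for an Iceberg table. Each row describes one manifest entry, including the manifest path, the data file it references, the file format, and the record count.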
CALL ICEBERG_METADATA('/tmp/iceberg_tables/lineitem_iceberg', allow_moved_paths := true)
RETURN *;
┌──────────────────────────┬──────────────────────────┬──────────────────┬─────────┬──────────┬──────────────────────────┬─────────────┬──────────────┐
│ manifest_path            │ manifest_sequence_number │ manifest_content │ status  │ content  │ file_path                │ file_format │ record_count │
│ STRING                   │ INT64                    │ STRING           │ STRING  │ STRING   │ STRING                   │ STRING      │ INT64        │
├──────────────────────────┼──────────────────────────┼──────────────────┼─────────┼──────────┼──────────────────────────┼─────────────┼──────────────┤
│ lineitem_iceberg/meta... │ 2                        │ DATA             │ ADDED   │ EXISTING │ lineitem_iceberg/data... │ PARQUET     │ 51793        │
│ lineitem_iceberg/meta... │ 2                        │ DATA             │ DELETED │ EXISTING │ lineitem_iceberg/data... │ PARQUET     │ 60175        │
└──────────────────────────┴──────────────────────────┴──────────────────┴─────────┴──────────┴──────────────────────────┴─────────────┴──────────────┘
List Iceberg snapshots
Iceberg tables maintain a series of snapshots, which are consistent views of the table at a specific point in time. Snapshots are the core of Iceberg's versioning system, allowing you to track, query, and manage changes to your table over time.
The ICEBERG_SNAPSHOTS function lists the snapshots for an Iceberg table.
Note that for snapshots, you do not need to specify the allow_moved_paths option.
CALL ICEBERG_SNAPSHOTS('/tmp/iceberg_tables/lineitem_iceberg') RETURN *;
┌─────────────────┬─────────────────────┬─────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────┐
│ sequence_number │ snapshot_id         │ timestamp_ms            │ manifest_list                                                                                  │
│ UINT64          │ UINT64              │ TIMESTAMP               │ STRING                                                                                         │
├─────────────────┼─────────────────────┼─────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1               │ 3776207205136740581 │ 2023-02-15 15:07:54.504 │ lineitem_iceberg/metadata/snap-3776207205136740581-1-cf3d0be5-cf70-453d-ad8f-48fdc412e608.avro │
│ 2               │ 7635660646343998149 │ 2023-02-15 15:08:14.73  │ lineitem_iceberg/metadata/snap-7635660646343998149-1-10eaca8a-1e1c-421e-ad6d-b232e5ee23d3.avro │
└─────────────────┴─────────────────────┴─────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────┘
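If you only need part of the snapshot listing, the result of the function can be projected and sorted like any other query result. The following is a sketch rather than a verified recipe: it assumes that a CALL to ICEBERG_SNAPSHOTS can be followed by a regular RETURN clause with ORDER BY and LIMIT, and it uses the column names snapshot_id and timestamp_ms from the output above.

```cypher
// Return only the ID and timestamp of the most recent snapshot.
// Assumes the projection body (ORDER BY / LIMIT) is accepted after CALL.
CALL ICEBERG_SNAPSHOTS('/tmp/iceberg_tables/lineitem_iceberg')
RETURN snapshot_id, timestamp_ms
ORDER BY timestamp_ms DESC
LIMIT 1;
```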
Access Iceberg tables hosted on S3
Kuzu also supports scanning and copying Iceberg tables hosted on S3.
Configure the S3 connection
Before reading from S3, you have to configure the connection using a CALL statement for each option you want to set.
CALL <option_name>='<option_value>'
Option name | Description |
---|---|
s3_access_key_id | S3 access key id |
s3_secret_access_key | S3 secret access key |
s3_endpoint | S3 endpoint |
s3_region | S3 region |
s3_url_style | S3 URL style (either vhost or path) |
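As a concrete illustration of the template above, a connection could be configured with one CALL per option. Every value here is a placeholder or an illustrative assumption (the endpoint, region, and URL style in particular depend on your S3 provider), so substitute your own settings.

```cypher
// Credentials: replace the placeholders with your own keys.
CALL s3_access_key_id='<your-access-key-id>';
CALL s3_secret_access_key='<your-secret-access-key>';
// Illustrative endpoint, region, and URL style; adjust to your provider.
CALL s3_endpoint='s3.amazonaws.com';
CALL s3_region='us-east-1';
CALL s3_url_style='vhost';
```

Only the options that differ from your environment's defaults need to be set; the credentials are required for private file reads.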
Requirements on the S3 server APIs
Feature | Required S3 API features |
---|---|
Public file reads | HTTP Range request |
Private file reads | Secret key authentication |
Scan Iceberg tables from S3
LOAD FROM 's3://path/to/iceberg_table' (file_format='iceberg', allow_moved_paths=true)
RETURN *;
Copy Iceberg tables from S3 into Kuzu
CREATE NODE TABLE student (ID INT64 PRIMARY KEY, name STRING);
COPY student FROM 's3://path/to/iceberg_table' (file_format='iceberg', allow_moved_paths=true);
Optional parameters
The following optional parameters are supported when using the functions from the iceberg extension.
allow_moved_paths
- Type: BOOLEAN
- Default: false
Allows scanning Iceberg tables that are not located in their original directory.
metadata_compression_codec
- Type: STRING
- Allowed values: gzip
- Default: ''
By default, this extension looks for v{version}.metadata.json and {version}.metadata.json files for metadata.
When metadata_compression_codec = 'gzip' is specified, it looks for v{version}.gz.metadata.json and {version}.gz.metadata.json files instead.
LOAD FROM '/tmp/iceberg_tables/lineitem_iceberg_gz' (
    file_format='iceberg',
    allow_moved_paths=true,
    metadata_compression_codec = 'gzip'
)
RETURN *;
version
- Type: STRING
- Default: determined from version-hint.txt
You can specify an explicit Iceberg metadata version:
LOAD FROM '/tmp/iceberg_tables/lineitem_iceberg' (
    file_format='iceberg',
    allow_moved_paths=true,
    version='2'
)
RETURN *;
version_name_format
- Type: STRING
- Default: 'v%s%s.metadata.json,%s%s.metadata.json'
You can specify a custom metadata file name format.
For example, if your metadata file is named rev-2.metadata.json:
LOAD FROM '/tmp/iceberg_tables/lineitem_iceberg_alter_name' (
    file_format='iceberg',
    allow_moved_paths=true,
    version_name_format = 'rev-%s.metadata.json'
)
RETURN *;
Limitations
Currently, the iceberg extension does not support:
- Exporting Kuzu tables to Iceberg.
- Scanning or copying nested data (i.e., columns of type STRUCT) in Iceberg tables.