Iceberg extension

The iceberg extension adds support for scanning and copying data from Apache Iceberg tables. Iceberg is an open-source table format, originally developed at Netflix, for large-scale analytical datasets.

Usage

INSTALL iceberg;
LOAD iceberg;

Example dataset

Download iceberg_tables.zip and unzip it to the /tmp directory.

cd /tmp
wget https://kuzudb.com/data/iceberg-extension/iceberg_tables.zip
unzip iceberg_tables.zip

Scan Iceberg tables

LOAD FROM is a Cypher clause that scans a file or object row by row without copying the data into a Kuzu table.

LOAD FROM
'/tmp/iceberg_tables/university'
(file_format='iceberg', allow_moved_paths=true)
RETURN *;
┌────────────┬──────┬─────────┐
│ University │ Rank │ Funding │
├────────────┼──────┼─────────┤
│ Stanford   │ 2    │ 250.300 │
│ Yale       │ 6    │ 190.700 │
│ Harvard    │ 1    │ 210.500 │
│ Cambridge  │ 5    │ 280.200 │
│ MIT        │ 3    │ 170.000 │
│ Oxford     │ 4    │ 300.000 │
└────────────┴──────┴─────────┘

Copy Iceberg tables into Kuzu

You can use a COPY FROM statement to copy the contents of an Iceberg table into Kuzu.

CREATE NODE TABLE university (name STRING PRIMARY KEY, age INT64);
COPY university FROM
'/tmp/iceberg_tables/person_table'
(file_format='iceberg', allow_moved_paths=true);
┌────────────────────────────────────────────────────┐
│ result                                             │
│ STRING                                             │
├────────────────────────────────────────────────────┤
│ 6 tuples have been copied to the university table. │
└────────────────────────────────────────────────────┘

Access Iceberg metadata

At the heart of Iceberg’s table structure is the metadata, which tracks the schema, the partition information, and the snapshots of the table’s state.

The ICEBERG_METADATA function lists the metadata files for an Iceberg table.

CALL ICEBERG_METADATA(
'/tmp/iceberg_tables/lineitem_iceberg',
allow_moved_paths := true
)
RETURN *;
┌──────────────────────────┬──────────────────────────┬──────────────────┬─────────┬──────────┬──────────────────────────┬─────────────┬──────────────┐
│ manifest_path            │ manifest_sequence_number │ manifest_content │ status  │ content  │ file_path                │ file_format │ record_count │
│ STRING                   │ INT64                    │ STRING           │ STRING  │ STRING   │ STRING                   │ STRING      │ INT64        │
├──────────────────────────┼──────────────────────────┼──────────────────┼─────────┼──────────┼──────────────────────────┼─────────────┼──────────────┤
│ lineitem_iceberg/meta... │ 2                        │ DATA             │ ADDED   │ EXISTING │ lineitem_iceberg/data... │ PARQUET     │ 51793        │
│ lineitem_iceberg/meta... │ 2                        │ DATA             │ DELETED │ EXISTING │ lineitem_iceberg/data... │ PARQUET     │ 60175        │
└──────────────────────────┴──────────────────────────┴──────────────────┴─────────┴──────────┴──────────────────────────┴─────────────┴──────────────┘

List Iceberg snapshots

Iceberg tables maintain a series of snapshots, each a consistent view of the table at a specific point in time. Snapshots are the core of Iceberg’s versioning system, allowing you to track, query, and manage changes to your table over time.

The ICEBERG_SNAPSHOTS function lists the snapshots for an Iceberg table. Note that for snapshots, you do not need to specify the allow_moved_paths option.

CALL ICEBERG_SNAPSHOTS('/tmp/iceberg_tables/lineitem_iceberg') RETURN *;
┌─────────────────┬─────────────────────┬─────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────┐
│ sequence_number │ snapshot_id         │ timestamp_ms            │ manifest_list                                                                                  │
│ UINT64          │ UINT64              │ TIMESTAMP               │ STRING                                                                                         │
├─────────────────┼─────────────────────┼─────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1               │ 3776207205136740581 │ 2023-02-15 15:07:54.504 │ lineitem_iceberg/metadata/snap-3776207205136740581-1-cf3d0be5-cf70-453d-ad8f-48fdc412e608.avro │
│ 2               │ 7635660646343998149 │ 2023-02-15 15:08:14.73  │ lineitem_iceberg/metadata/snap-7635660646343998149-1-10eaca8a-1e1c-421e-ad6d-b232e5ee23d3.avro │
└─────────────────┴─────────────────────┴─────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────┘

Access Iceberg tables hosted on S3

Kuzu also supports scanning and copying Iceberg tables hosted on S3.

Configure the S3 connection

Before reading from S3, you have to configure the connection using CALL statements of the following form.

CALL <option_name>='<option_value>'

Option name             Description
s3_access_key_id        S3 access key id
s3_secret_access_key    S3 secret access key
s3_endpoint             S3 endpoint
s3_region               S3 region
s3_url_style            Uses S3 url style (should either be vhost or path)
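
As an illustration of the statement form above, the following Python sketch renders one CALL statement per option. The helper name s3_config_statements and all values are hypothetical placeholders, not part of the extension; the option names are the ones listed in the table. The resulting statements would then be run through whichever Kuzu client you use.

```python
# Option names taken from the table above.
S3_OPTION_NAMES = (
    "s3_access_key_id",
    "s3_secret_access_key",
    "s3_endpoint",
    "s3_region",
    "s3_url_style",
)


def s3_config_statements(options):
    """Render one CALL statement per S3 option (hypothetical helper)."""
    statements = []
    for name, value in options.items():
        if name not in S3_OPTION_NAMES:
            raise ValueError(f"unknown S3 option: {name}")
        if name == "s3_url_style" and value not in ("vhost", "path"):
            raise ValueError("s3_url_style must be 'vhost' or 'path'")
        if "'" in value:
            # Keep the sketch simple: refuse values that would need escaping.
            raise ValueError(f"value for {name} must not contain a single quote")
        statements.append(f"CALL {name}='{value}';")
    return statements


# Placeholder values only; substitute your own credentials and region.
example = s3_config_statements({
    "s3_access_key_id": "<your-access-key-id>",
    "s3_secret_access_key": "<your-secret-access-key>",
    "s3_region": "us-east-1",
    "s3_url_style": "vhost",
})
print("\n".join(example))
```

Validating the option name and the s3_url_style value up front is a design choice of this sketch, so that a typo fails before any statement is sent.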

Requirements on the S3 server APIs

Feature               Required S3 API features
Public file reads     HTTP Range request
Private file reads    Secret key authentication
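
For readers unfamiliar with the first requirement, here is a minimal Python sketch of what an HTTP Range request looks like. It only builds the request and sends nothing; the URL is a hypothetical placeholder and range_request is an illustrative helper, not something the extension exposes. Presumably the extension relies on such requests to fetch only the parts of a file it needs, but that is an assumption rather than something stated here.

```python
from urllib.request import Request


def range_request(url, start, length):
    """Build a GET request for `length` bytes beginning at offset `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1  # Range end offsets are inclusive
    return Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical public object URL, used only for illustration.
req = range_request("https://example.com/bucket/data/file.parquet", 0, 4)
print(req.get_method(), req.get_header("Range"))  # prints "GET bytes=0-3"
```

A server that supports range requests answers such a request with status 206 (Partial Content) and only the requested bytes.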

Scan Iceberg tables from S3

LOAD FROM
's3://path/to/iceberg_table'
(file_format='iceberg', allow_moved_paths=true)
RETURN *;

Copy Iceberg tables from S3 into Kuzu

CREATE NODE TABLE student (ID INT64 PRIMARY KEY, name STRING);
COPY student FROM
's3://path/to/iceberg_table'
(file_format='iceberg', allow_moved_paths=true);

Optional parameters

The following optional parameters are supported when using the functions from the iceberg extension.

allow_moved_paths

  • Type: BOOLEAN
  • Default: false

Allows scanning Iceberg tables that are not located in their original directory.

metadata_compression_codec

  • Type: STRING
  • Allowed values: gzip
  • Default: ''

By default, this extension will look for v{version}.metadata.json and {version}.metadata.json files for metadata. When metadata_compression_codec = 'gzip' is specified, it will look for v{version}.gz.metadata.json and {version}.gz.metadata.json files instead.

LOAD FROM '/tmp/iceberg_tables/lineitem_iceberg_gz' (
file_format='iceberg',
allow_moved_paths=true,
metadata_compression_codec = 'gzip'
)
RETURN *;

version

  • Type: STRING
  • Default: determined from version-hint.txt

You can specify an explicit Iceberg metadata version:

LOAD FROM '/tmp/iceberg_tables/lineitem_iceberg' (
file_format='iceberg',
allow_moved_paths=true,
version='2'
)
RETURN *;

version_name_format

  • Type: STRING
  • Default: 'v%s%s.metadata.json,%s%s.metadata.json'

You can specify a custom metadata file name format. For example, if your metadata file is named rev-2.metadata.json:

LOAD FROM '/tmp/iceberg_tables/lineitem_iceberg_alter_name' (
file_format='iceberg',
allow_moved_paths=true,
version_name_format = 'rev-%s.metadata.json'
)
RETURN *;
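
To make the interplay of version, metadata_compression_codec, and version_name_format concrete, the Python sketch below computes the metadata file names that would be looked up. It is not the extension's implementation. It assumes that the first %s in each comma-separated pattern stands for the version and the second for the codec extension ('.gz' or empty), which is inferred from the default file names described above. The function name metadata_file_candidates is hypothetical.

```python
DEFAULT_NAME_FORMAT = "v%s%s.metadata.json,%s%s.metadata.json"


def metadata_file_candidates(version, codec="", name_format=DEFAULT_NAME_FORMAT):
    """List the metadata file names implied by the three options."""
    if codec not in ("", "gzip"):
        raise ValueError("metadata_compression_codec must be '' or 'gzip'")
    extension = ".gz" if codec == "gzip" else ""
    candidates = []
    for pattern in name_format.split(","):
        # Assumed placeholder order: version first, codec extension second.
        name = pattern.replace("%s", str(version), 1)
        name = name.replace("%s", extension, 1)
        candidates.append(name)
    return candidates


print(metadata_file_candidates("2"))
# ['v2.metadata.json', '2.metadata.json']
print(metadata_file_candidates("2", codec="gzip"))
# ['v2.gz.metadata.json', '2.gz.metadata.json']
print(metadata_file_candidates("2", name_format="rev-%s.metadata.json"))
# ['rev-2.metadata.json']
```

A pattern with a single %s, such as the rev-%s.metadata.json example, simply leaves no slot for the codec extension in this sketch.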

Limitations

Currently, the iceberg extension has the following limitations:

  • Exporting from Kuzu to Iceberg tables is not supported.
  • Scanning or copying nested data (i.e., of type STRUCT) in Iceberg table columns is not supported.