Iceberg
The `iceberg` extension adds support for scanning and copying from the Apache Iceberg format.
Iceberg is an open-source table format originally developed at Netflix for large-scale analytical datasets.
Using this extension, you can interact with Iceberg tables from within Kùzu using the `LOAD FROM` and `COPY FROM` clauses.
The Iceberg functionality is not available by default, so you first need to install and load the `iceberg` extension by running the following commands:
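A minimal sketch of those commands, assuming the usual Kùzu extension workflow of installing once and then loading in each session. Depending on the Kùzu version, the load statement may instead be written as `LOAD iceberg;`.

```cypher
// Download and install the extension (needed once per Kùzu installation)
INSTALL iceberg;
// Load the extension into the current session
LOAD EXTENSION iceberg;
```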
At a high level, the `iceberg` extension provides the following functionality:
- Scanning an Iceberg table
- Copying an Iceberg table into a node table
- Accessing the Iceberg metadata
- Listing the Iceberg snapshots
Usage
To run the examples below, download the iceberg_tables.zip file, unzip it, and place the contents in the `/tmp` directory.
Scan the Iceberg table
`LOAD FROM` is a Cypher clause that scans a file or object element by element, but doesn't actually move the data into a Kùzu table.
Here’s how you would scan an Iceberg table:
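A sketch of such a scan. The directory name `university` is a hypothetical stand-in for one of the tables unpacked from the zip file; `allow_moved_paths` is set because the table no longer sits in the directory where it was originally written.

```cypher
// Scan the Iceberg table without importing it into Kùzu.
// '/tmp/iceberg_tables/university' is a hypothetical path; substitute your own.
LOAD FROM '/tmp/iceberg_tables/university' (
    file_format = 'iceberg',
    allow_moved_paths = true
)
RETURN *;
```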
Copy the Iceberg table into a node table
You can then use a `COPY FROM` statement to directly copy the contents of the Iceberg table into a node table.
As with `LOAD FROM`, the `file_format` parameter is mandatory in the `COPY FROM` clause.
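A sketch of the copy, assuming a hypothetical `university` table whose columns match the node table schema declared here. The table name, column names, and types are illustrative and must be adapted to the actual Iceberg schema.

```cypher
// Create a node table whose columns match the Iceberg table (hypothetical schema)
CREATE NODE TABLE university (name STRING, enrollment INT64, PRIMARY KEY (name));

// Copy the Iceberg table contents into the node table
COPY university FROM '/tmp/iceberg_tables/university' (
    file_format = 'iceberg',
    allow_moved_paths = true
);

// Inspect the imported nodes
MATCH (u:university) RETURN u.*;
```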
Result:
Access Iceberg metadata
At the heart of Iceberg’s table structure is the metadata, which tracks everything from the schema, to partition information and snapshots of the table’s state. This is particularly useful for understanding the underlying structure, tracking data changes, and debugging issues in Iceberg datasets.
The `ICEBERG_METADATA` function lists the metadata files for an Iceberg table. It is invoked via the `CALL` clause in Kùzu.
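A sketch of a metadata listing. The table path is hypothetical, and the `:=` syntax for passing the optional parameter to a table function is an assumption that should be checked against your Kùzu version.

```cypher
// List the metadata files of an Iceberg table (hypothetical path)
CALL ICEBERG_METADATA(
    '/tmp/iceberg_tables/university',
    allow_moved_paths := true
)
RETURN *;
```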
List Iceberg snapshots
Iceberg tables maintain a series of snapshots, which are consistent views of the table at a specific point in time. Snapshots are the core of Iceberg’s versioning system, allowing you to track, query, and manage changes to your table over time.
The `ICEBERG_SNAPSHOTS` function lists the snapshots for an Iceberg table. It is also invoked via the `CALL` clause.
Note that for snapshots, you do not need to specify the `allow_moved_paths` option.
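A sketch of a snapshot listing, again using a hypothetical table path. No `allow_moved_paths` argument is passed here.

```cypher
// List the snapshots recorded for an Iceberg table (hypothetical path)
CALL ICEBERG_SNAPSHOTS('/tmp/iceberg_tables/university')
RETURN *;
```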
Optional parameters
The following optional parameters are supported in the Iceberg extension:
Parameter | Type | Default | Description |
---|---|---|---|
allow_moved_paths | BOOLEAN | false | Allows scanning Iceberg tables that are not located in their original directory |
metadata_compression_codec | STRING | '' | Specifies the compression codec used for the metadata files (currently only gzip is supported) |
version | STRING | '?' | Provides an explicit Iceberg version number; if not provided, the version number is determined from version-hint.text |
version_name_format | STRING | 'v%s%s.metadata.json,%s%s.metadata.json' | Provides the comma-separated name patterns (with %s placeholders) used to find the correct metadata file |
More details on usage are provided below.
Select metadata version
By default, the `iceberg` extension will look for a `version-hint.text` file to identify the proper metadata version to use.
This can be overridden by explicitly supplying a version number via the `version` parameter to Iceberg table functions.
Example:
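A sketch of pinning the metadata version in a scan. The path and the version value `'2'` are hypothetical, and passing `version` alongside the other scan options in this way is an assumption based on the parameter table above.

```cypher
// Read the table using metadata version 2 instead of the version hint file
LOAD FROM '/tmp/iceberg_tables/university' (
    file_format = 'iceberg',
    allow_moved_paths = true,
    version = '2'
)
RETURN *;
```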
Change metadata compression codec
By default, this extension will look for both `v{version}.metadata.json` and `{version}.metadata.json` files for metadata, or `v{version}.gz.metadata.json` and `{version}.gz.metadata.json` when `metadata_compression_codec = 'gzip'` is specified.
Other compression codecs are NOT supported.
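A sketch of scanning a table whose metadata files are gzip-compressed. The path is hypothetical, and supplying the codec as a scan option in this form is an assumption based on the parameter table above.

```cypher
// Tell the extension to look for gzip-compressed metadata files
LOAD FROM '/tmp/iceberg_tables/university_gz' (
    file_format = 'iceberg',
    allow_moved_paths = true,
    metadata_compression_codec = 'gzip'
)
RETURN *;
```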
Change metadata name format
To change the metadata naming format, use the `version_name_format` option. For example, if your metadata file is named `rev-2.metadata.json`, set this option as `version_name_format = 'rev-%s.metadata.json'` to make sure the metadata file can be found.
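A sketch that combines a custom name format with an explicit version, so that the file `rev-2.metadata.json` would be the one resolved. The path is hypothetical, and combining the two options in a single scan is an assumption.

```cypher
// Resolve metadata from rev-2.metadata.json rather than the default naming
LOAD FROM '/tmp/iceberg_tables/university_rev' (
    file_format = 'iceberg',
    allow_moved_paths = true,
    version = '2',
    version_name_format = 'rev-%s.metadata.json'
)
RETURN *;
```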
Access an Iceberg table hosted on S3
Kùzu also supports scanning and copying an Iceberg table hosted on S3 in the same way as from a local file system. Before reading from S3, you have to configure the connection using the CALL statement.
Supported options
Option name | Description |
---|---|
s3_access_key_id | S3 access key id |
s3_secret_access_key | S3 secret access key |
s3_endpoint | S3 endpoint |
s3_url_style | S3 URL style (either vhost or path) |
s3_region | S3 region |
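A sketch of configuring these options, one `CALL` statement per option. All values shown are placeholders or illustrative choices, not real credentials or required settings.

```cypher
// Credentials (placeholders; substitute your own)
CALL s3_access_key_id = '<your-access-key-id>';
CALL s3_secret_access_key = '<your-secret-access-key>';

// Connection settings (illustrative values)
CALL s3_endpoint = 's3.amazonaws.com';
CALL s3_url_style = 'vhost';
CALL s3_region = 'us-east-1';
```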
Requirements on the S3 server API
Feature | Required S3 API features |
---|---|
Public file reads | HTTP Range request |
Private file reads | Secret key authentication |
Scan Iceberg table from S3
Reading or scanning an Iceberg table that's on S3 is as simple as reading from regular files:
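A sketch of such a scan, assuming the S3 connection options have already been configured. The bucket and table path are hypothetical.

```cypher
// Scan an Iceberg table stored in an S3 bucket (hypothetical URI)
LOAD FROM 's3://my-bucket/iceberg/university' (
    file_format = 'iceberg',
    allow_moved_paths = true
)
RETURN *;
```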
Copy Iceberg table hosted on S3 into a local node table
Copying from Iceberg tables on S3 is also as simple as copying from regular files:
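A sketch of the copy, assuming the S3 connection options have already been configured. The bucket, table path, and node table schema are hypothetical and must match the actual Iceberg table.

```cypher
// Create a local node table matching the Iceberg schema (hypothetical columns)
CREATE NODE TABLE university (name STRING, enrollment INT64, PRIMARY KEY (name));

// Copy from the S3-hosted Iceberg table into the local node table
COPY university FROM 's3://my-bucket/iceberg/university' (
    file_format = 'iceberg',
    allow_moved_paths = true
);
```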
Limitations
When using the Iceberg extension in Kùzu, keep the following limitations in mind.
- Writing (i.e., exporting to) Iceberg tables from Kùzu is currently not supported.
- We currently do not support scanning/copying nested data (i.e., of type `STRUCT`) in the Iceberg table columns.