Iceberg

Important Capabilities

Capability	Status	Notes
Data Profiling	✅	Optionally enabled via configuration.
Descriptions	✅	Enabled by default.
Detect Deleted Entities	✅	Enabled via stateful ingestion.
Domains	❌	Currently not supported.
Extract Ownership	✅	Optionally enabled via configuration by specifying which Iceberg table property holds user or group ownership.
Partition Support	❌	Currently not supported.
Platform Instance	✅	Optionally enabled via configuration. An Iceberg instance represents the name of the data lake where the table is stored.

Integration Details

The DataHub Iceberg source plugin extracts metadata from Iceberg tables stored in a distributed or local file system. Typically, Iceberg tables are stored in a distributed file system such as S3 or Azure Data Lake Storage (ADLS) and registered in a catalog. Several catalog implementations exist, including filesystem-based, RDBMS-based and REST-based catalogs. This source plugin relies on the Iceberg python_legacy library, whose catalog support is currently limited. A new version of the Iceberg Python library is in development and should lift this limitation. Until then, this source plugin only ingests HadoopCatalog-based tables that have a version-hint.text metadata file.

Ingestion of tables happens in two steps:

1. Discover Iceberg tables stored in the file system.
2. Load the discovered tables using the Iceberg python_legacy library.

The current implementation of the Iceberg source plugin will only discover tables stored in a local file system or in ADLS. Support for S3 could be added fairly easily.
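
The discovery step can be pictured with a short Python sketch for the local file system case. This is an illustration only, not the plugin's actual code: the function name `discover_tables` is hypothetical, the `metadata/version-hint.text` location follows the usual HadoopCatalog table layout, and the way depth is counted against `max_path_depth` (the crawl root is depth 0) is an assumption.

```python
from pathlib import Path
from typing import Iterator


def discover_tables(root: Path, max_depth: int = 2) -> Iterator[Path]:
    """Yield folders under `root` that look like HadoopCatalog Iceberg tables.

    A folder counts as a table when it contains metadata/version-hint.text.
    Depth counting is an assumption: `root` is depth 0, its children depth 1.
    """

    def walk(folder: Path, depth: int) -> Iterator[Path]:
        if (folder / "metadata" / "version-hint.text").is_file():
            yield folder
            return  # do not look for tables nested inside a table folder
        if depth >= max_depth:
            return  # deeper folders are silently ignored
        for child in sorted(folder.iterdir()):
            if child.is_dir():
                yield from walk(child, depth + 1)

    yield from walk(root, 0)
```

With the default depth of 2, a table stored at `<root>/marketing/customers` would be found, while one stored at `<root>/a/b/c` would be skipped.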

CLI based Ingestion

Install the Plugin

pip install 'acryl-datahub[iceberg]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: "iceberg"
  config:
    env: PROD
    adls:
      # Will be translated to https://{account_name}.dfs.core.windows.net
      account_name: my_adls_account
      # Can use sas_token or account_key
      sas_token: "${SAS_TOKEN}"
      # account_key: "${ACCOUNT_KEY}"
      container_name: warehouse
      base_path: iceberg
    platform_instance: my_iceberg_catalog
    table_pattern:
      allow:
        - marketing.*
    profiling:
      enabled: true

sink:
  # sink configs
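
To make the effect of `table_pattern` in the recipe above more concrete, here is a minimal Python sketch of allow/deny filtering. It is not DataHub's own `AllowDenyPattern` class: the function `is_allowed` is hypothetical, and it assumes that patterns are matched from the start of the table name and that a deny match takes precedence over an allow match.

```python
import re
from typing import Iterable


def is_allowed(
    name: str,
    allow: Iterable[str] = (".*",),
    deny: Iterable[str] = (),
    ignore_case: bool = True,
) -> bool:
    """Return True if `name` matches an allow pattern and no deny pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    # Assumption: a deny match always wins over an allow match.
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    # Assumption: patterns are anchored at the start of the name (re.match).
    return any(re.match(pattern, name, flags) for pattern in allow)


print(is_allowed("marketing.customers", allow=["marketing.*"]))
print(is_allowed("sales.orders", allow=["marketing.*"]))
```

Because `.` is a regex wildcard, `marketing.*` also matches a name such as `marketing_archive.events`. A stricter pattern for the `marketing` namespace alone would be `marketing\..*`.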

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

Field	Description
group_ownership_property string	Iceberg table property to look for a `CorpGroup` owner. Can only hold a single group value. If property has no value, no owner information will be emitted.
localfs string	Local path to crawl for Iceberg tables. This is one filesystem type supported by this source and only one can be configured.
max_path_depth integer	Maximum folder depth to crawl for Iceberg tables. Folders deeper than this value will be silently ignored. Default: 2
platform_instance string	The instance of the platform that all assets produced by this recipe belong to.
user_ownership_property string	Iceberg table property to look for a `CorpUser` owner. Can only hold a single user value. If property has no value, no owner information will be emitted. Default: owner
env string	The environment that all assets produced by this connector belong to. Default: PROD
adls AdlsSourceConfig	Azure Data Lake Storage to crawl for Iceberg tables. This is one filesystem type supported by this source and only one can be configured.
adls.account_name ❓ string	Name of the Azure storage account. See Microsoft official documentation on how to create a storage account.
adls.container_name ❓ string	Azure storage account container name.
adls.account_key string	Azure storage account access key that can be used as a credential. An account key, a SAS token or a client secret is required for authentication.
adls.base_path string	Base folder in hierarchical namespaces to start from. Default: /
adls.client_id string	Azure client (Application) ID required when a `client_secret` is used as a credential.
adls.client_secret string	Azure client secret that can be used as a credential. An account key, a SAS token or a client secret is required for authentication.
adls.sas_token string	Azure storage account Shared Access Signature (SAS) token that can be used as a credential. An account key, a SAS token or a client secret is required for authentication.
adls.tenant_id string	Azure tenant (Directory) ID required when a `client_secret` is used as a credential.
table_pattern AllowDenyPattern	Regex patterns for tables to filter in ingestion. Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
table_pattern.allow array(string)	List of regex patterns to include in ingestion.
table_pattern.deny array(string)	List of regex patterns to exclude from ingestion.
table_pattern.ignoreCase boolean	Whether to ignore case sensitivity during pattern matching. Default: True
profiling IcebergProfilingConfig	Default: {'enabled': False, 'include_field_null_count': Tru...
profiling.enabled boolean	Whether profiling should be done. Default: False
profiling.include_field_max_value boolean	Whether to profile for the max value of numeric columns. Default: True
profiling.include_field_min_value boolean	Whether to profile for the min value of numeric columns. Default: True
profiling.include_field_null_count boolean	Whether to profile for the number of nulls for each column. Default: True
stateful_ingestion StatefulStaleMetadataRemovalConfig	Iceberg Stateful Ingestion Config.
stateful_ingestion.enabled boolean	Whether stateful ingestion is enabled. Default: False
stateful_ingestion.ignore_new_state boolean	If set to True, ignores the current checkpoint state. Default: False
stateful_ingestion.ignore_old_state boolean	If set to True, ignores the previous checkpoint state. Default: False
stateful_ingestion.remove_stale_metadata boolean	Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True
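
The filesystem and credential rules described in the table can be summarized in a small pre-flight check. The following Python sketch is illustrative only and is not part of DataHub: `check_source_config` is a hypothetical helper that takes the `config` mapping of a recipe as a plain dictionary. Treating a recipe with neither `adls` nor `localfs` as an error is an assumption.

```python
from typing import Any, List, Mapping


def check_source_config(config: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in an Iceberg source `config` mapping."""
    problems: List[str] = []

    # Only one filesystem type can be configured.
    filesystems = [key for key in ("adls", "localfs") if config.get(key) is not None]
    if len(filesystems) > 1:
        problems.append("only one of adls and localfs can be configured")
    elif not filesystems:
        # Assumption: a recipe without any filesystem is treated as an error.
        problems.append("one of adls or localfs must be configured")

    adls = config.get("adls")
    if adls is not None:
        for required in ("account_name", "container_name"):
            if not adls.get(required):
                problems.append(f"adls.{required} is required")
        # An account key, a SAS token or a client secret is required.
        if not any(adls.get(key) for key in ("account_key", "sas_token", "client_secret")):
            problems.append("adls needs account_key, sas_token or client_secret")
        # client_id and tenant_id are required with a client secret.
        if adls.get("client_secret"):
            for needed in ("client_id", "tenant_id"):
                if not adls.get(needed):
                    problems.append(f"adls.{needed} is required with client_secret")
    return problems
```

Running this check on the `config` section of the starter recipe, with a SAS token set, would return an empty list.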

The JSONSchema for this configuration is inlined below.

{
  "title": "IcebergSourceConfig",
  "description": "Base configuration class for stateful ingestion for source configs to inherit from.",
  "type": "object",
  "properties": {
    "env": {
      "title": "Env",
      "description": "The environment that all assets produced by this connector belong to",
      "default": "PROD",
      "type": "string"
    },
    "platform_instance": {
      "title": "Platform Instance",
      "description": "The instance of the platform that all assets produced by this recipe belong to",
      "type": "string"
    },
    "stateful_ingestion": {
      "title": "Stateful Ingestion",
      "description": "Iceberg Stateful Ingestion Config.",
      "allOf": [
        {
          "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig"
        }
      ]
    },
    "adls": {
      "title": "Adls",
      "description": "[Azure Data Lake Storage](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) to crawl for Iceberg tables.  This is one filesystem type supported by this source and **only one can be configured**.",
      "allOf": [
        {
          "$ref": "#/definitions/AdlsSourceConfig"
        }
      ]
    },
    "localfs": {
      "title": "Localfs",
      "description": "Local path to crawl for Iceberg tables. This is one filesystem type supported by this source and **only one can be configured**.",
      "type": "string"
    },
    "max_path_depth": {
      "title": "Max Path Depth",
      "description": "Maximum folder depth to crawl for Iceberg tables.  Folders deeper than this value will be silently ignored.",
      "default": 2,
      "type": "integer"
    },
    "table_pattern": {
      "title": "Table Pattern",
      "description": "Regex patterns for tables to filter in ingestion.",
      "default": {
        "allow": [
          ".*"
        ],
        "deny": [],
        "ignoreCase": true
      },
      "allOf": [
        {
          "$ref": "#/definitions/AllowDenyPattern"
        }
      ]
    },
    "user_ownership_property": {
      "title": "User Ownership Property",
      "description": "Iceberg table property to look for a `CorpUser` owner.  Can only hold a single user value.  If property has no value, no owner information will be emitted.",
      "default": "owner",
      "type": "string"
    },
    "group_ownership_property": {
      "title": "Group Ownership Property",
      "description": "Iceberg table property to look for a `CorpGroup` owner.  Can only hold a single group value.  If property has no value, no owner information will be emitted.",
      "type": "string"
    },
    "profiling": {
      "title": "Profiling",
      "default": {
        "enabled": false,
        "include_field_null_count": true,
        "include_field_min_value": true,
        "include_field_max_value": true
      },
      "allOf": [
        {
          "$ref": "#/definitions/IcebergProfilingConfig"
        }
      ]
    }
  },
  "additionalProperties": false,
  "definitions": {
    "DynamicTypedStateProviderConfig": {
      "title": "DynamicTypedStateProviderConfig",
      "type": "object",
      "properties": {
        "type": {
          "title": "Type",
          "description": "The type of the state provider to use. For DataHub use `datahub`",
          "type": "string"
        },
        "config": {
          "title": "Config",
          "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)."
        }
      },
      "required": [
        "type"
      ],
      "additionalProperties": false
    },
    "StatefulStaleMetadataRemovalConfig": {
      "title": "StatefulStaleMetadataRemovalConfig",
      "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
      "type": "object",
      "properties": {
        "enabled": {
          "title": "Enabled",
          "description": "The type of the ingestion state provider registered with datahub.",
          "default": false,
          "type": "boolean"
        },
        "ignore_old_state": {
          "title": "Ignore Old State",
          "description": "If set to True, ignores the previous checkpoint state.",
          "default": false,
          "type": "boolean"
        },
        "ignore_new_state": {
          "title": "Ignore New State",
          "description": "If set to True, ignores the current checkpoint state.",
          "default": false,
          "type": "boolean"
        },
        "remove_stale_metadata": {
          "title": "Remove Stale Metadata",
          "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
          "default": true,
          "type": "boolean"
        }
      },
      "additionalProperties": false
    },
    "AdlsSourceConfig": {
      "title": "AdlsSourceConfig",
      "description": "Common Azure credentials config.\n\nhttps://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-python",
      "type": "object",
      "properties": {
        "base_path": {
          "title": "Base Path",
          "description": "Base folder in hierarchical namespaces to start from.",
          "default": "/",
          "type": "string"
        },
        "container_name": {
          "title": "Container Name",
          "description": "Azure storage account container name.",
          "type": "string"
        },
        "account_name": {
          "title": "Account Name",
          "description": "Name of the Azure storage account.  See [Microsoft official documentation on how to create a storage account.](https://docs.microsoft.com/en-us/azure/storage/blobs/create-data-lake-storage-account)",
          "type": "string"
        },
        "account_key": {
          "title": "Account Key",
          "description": "Azure storage account access key that can be used as a credential. **An account key, a SAS token or a client secret is required for authentication.**",
          "type": "string"
        },
        "sas_token": {
          "title": "Sas Token",
          "description": "Azure storage account Shared Access Signature (SAS) token that can be used as a credential. **An account key, a SAS token or a client secret is required for authentication.**",
          "type": "string"
        },
        "client_secret": {
          "title": "Client Secret",
          "description": "Azure client secret that can be used as a credential. **An account key, a SAS token or a client secret is required for authentication.**",
          "type": "string"
        },
        "client_id": {
          "title": "Client Id",
          "description": "Azure client (Application) ID required when a `client_secret` is used as a credential.",
          "type": "string"
        },
        "tenant_id": {
          "title": "Tenant Id",
          "description": "Azure tenant (Directory) ID required when a `client_secret` is used as a credential.",
          "type": "string"
        }
      },
      "required": [
        "container_name",
        "account_name"
      ],
      "additionalProperties": false
    },
    "AllowDenyPattern": {
      "title": "AllowDenyPattern",
      "description": "A class to store allow deny regexes",
      "type": "object",
      "properties": {
        "allow": {
          "title": "Allow",
          "description": "List of regex patterns to include in ingestion",
          "default": [
            ".*"
          ],
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "deny": {
          "title": "Deny",
          "description": "List of regex patterns to exclude from ingestion.",
          "default": [],
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "ignoreCase": {
          "title": "Ignorecase",
          "description": "Whether to ignore case sensitivity during pattern matching.",
          "default": true,
          "type": "boolean"
        }
      },
      "additionalProperties": false
    },
    "IcebergProfilingConfig": {
      "title": "IcebergProfilingConfig",
      "type": "object",
      "properties": {
        "enabled": {
          "title": "Enabled",
          "description": "Whether profiling should be done.",
          "default": false,
          "type": "boolean"
        },
        "include_field_null_count": {
          "title": "Include Field Null Count",
          "description": "Whether to profile for the number of nulls for each column.",
          "default": true,
          "type": "boolean"
        },
        "include_field_min_value": {
          "title": "Include Field Min Value",
          "description": "Whether to profile for the min value of numeric columns.",
          "default": true,
          "type": "boolean"
        },
        "include_field_max_value": {
          "title": "Include Field Max Value",
          "description": "Whether to profile for the max value of numeric columns.",
          "default": true,
          "type": "boolean"
        }
      },
      "additionalProperties": false
    }
  }
}

Concept Mapping

This ingestion source maps the following Source System Concepts to DataHub Concepts:

Source Concept	DataHub Concept	Notes
`iceberg`	Data Platform
Table	Dataset	Each Iceberg table maps to a Dataset named using the parent folders. If a table is stored under `my/namespace/table`, the dataset name will be `my.namespace.table`. If a Platform Instance is configured, it will be used as a prefix: `<platform_instance>.my.namespace.table`.
Table property	User (a.k.a. CorpUser)	The value of a table property can be used as the name of a CorpUser owner. This table property name can be configured with the source option `user_ownership_property`.
Table property	CorpGroup	The value of a table property can be used as the name of a CorpGroup owner. This table property name can be configured with the source option `group_ownership_property`.
Table parent folders (excluding warehouse catalog location)	Container	Available in a future release
Table schema	SchemaField	Maps to the fields defined within the Iceberg table schema definition.
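
The dataset naming rule from the mapping table can be written down as a few lines of Python. This is a sketch of the rule as described, not the plugin's own code: `dataset_name` is a hypothetical helper, and it assumes the table path is given relative to the warehouse location with `/` as the separator.

```python
from pathlib import PurePosixPath
from typing import Optional


def dataset_name(table_path: str, platform_instance: Optional[str] = None) -> str:
    """Build a dotted dataset name from a table path relative to the warehouse."""
    # Drop a leading "/" so absolute-looking paths behave like relative ones.
    parts = [part for part in PurePosixPath(table_path).parts if part != "/"]
    name = ".".join(parts)
    # A configured platform instance is used as a prefix.
    return f"{platform_instance}.{name}" if platform_instance else name


print(dataset_name("my/namespace/table"))
print(dataset_name("my/namespace/table", "my_iceberg_catalog"))
```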

Code Coordinates

Class Name: datahub.ingestion.source.iceberg.iceberg.IcebergSource
Browse on GitHub

Questions

If you've got any questions on configuring ingestion for Iceberg, feel free to ping us on our Slack.
