
Iceberg User Documentation

  • Table schema changes
  • Viewing table metadata
  • Table property changes

Table schema changes

Iceberg supports schema changes through the ALTER TABLE … ALTER COLUMN syntax, for example:

-- spark sql
-- change a column's type
ALTER TABLE prod.db.sample ALTER COLUMN measurement TYPE double;
-- change a column's type and comment
ALTER TABLE prod.db.sample ALTER COLUMN measurement TYPE double COMMENT 'unit is bytes per second';
-- change column order with FIRST/AFTER
ALTER TABLE prod.db.sample ALTER COLUMN col FIRST;
ALTER TABLE prod.db.sample ALTER COLUMN nested.col AFTER other_col;
-- change nullability (not supported if the column is part of the primary key)
ALTER TABLE prod.db.sample ALTER COLUMN id DROP NOT NULL;
ALTER TABLE prod.db.sample ALTER COLUMN id SET NOT NULL;
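
Beyond ALTER COLUMN, Iceberg's Spark DDL extensions also cover adding, renaming, and dropping columns. A minimal sketch reusing the prod.db.sample table from above; new_col and throughput are hypothetical names:

-- spark sql
-- add an optional column (appended at the end unless FIRST/AFTER is given)
ALTER TABLE prod.db.sample ADD COLUMN new_col string COMMENT 'hypothetical column';
-- rename a column
ALTER TABLE prod.db.sample RENAME COLUMN measurement TO throughput;
-- drop a column
ALTER TABLE prod.db.sample DROP COLUMN new_col;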

Viewing table metadata

Iceberg follows an MVCC design, so a table keeps multiple snapshots, and Spark SQL can be used to query a table's historical snapshots and the changes between them. Iceberg also tracks metadata at the file level, so per-column statistics such as max and min for every data file can be queried directly with SQL. Metadata queries against Iceberg tables can be run from the ad-hoc query page of 工场Alpha. The sections below show how to query a table's history, snapshots, partitions, data files, and manifests with SQL.
Note: when querying metadata, always reference the table by its fully qualified three-level name, for example: catalog_name.db_name.table_name.history
catalog_name.db_name.table_name.snapshots

  • Query a table's history: SELECT * FROM catalog_name.db_name.table_name.history
+-------------------------+---------------------+---------------------+---------------------+
| made_current_at         | snapshot_id         | parent_id           | is_current_ancestor |
+-------------------------+---------------------+---------------------+---------------------+
| 2019-02-08 03:29:51.215 | 5781947118336215154 | NULL                | true                |
| 2019-02-08 03:47:55.948 | 5179299526185056830 | 5781947118336215154 | true                |
| 2019-02-09 16:24:30.13  | 296410040247533544  | 5179299526185056830 | false               |
| 2019-02-09 16:32:47.336 | 2999875608062437330 | 5179299526185056830 | true                |
| 2019-02-09 19:42:03.919 | 8924558786060583479 | 2999875608062437330 | true                |
| 2019-02-09 19:49:16.343 | 6536733823181975045 | 8924558786060583479 | true                |
+-------------------------+---------------------+---------------------+---------------------+
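Because is_current_ancestor marks whether a snapshot lies on the current table lineage, the history table can be filtered to spot snapshots abandoned by a rollback; a minimal sketch using the same table reference as above:

-- snapshots that are no longer ancestors of the current table state (e.g. after a rollback)
SELECT made_current_at, snapshot_id, parent_id
FROM catalog_name.db_name.table_name.history
WHERE is_current_ancestor = false;
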
  • Query a table's snapshots

-- Shows the table's currently valid snapshots; expired snapshots are not included.
SELECT * FROM catalog_name.db_name.table_name.snapshots

+-------------------------+----------------+-----------+-----------+----------------------------------------------------+-------------------------------------------------------+
| committed_at            | snapshot_id    | parent_id | operation | manifest_list                                      | summary                                               |
+-------------------------+----------------+-----------+-----------+----------------------------------------------------+-------------------------------------------------------+
| 2019-02-08 03:29:51.215 | 57897183625154 | null      | append    | s3://.../table/metadata/snap-57897183625154-1.avro | { added-records -> 2478404, total-records -> 2478404, |
|                         |                |           |           |                                                    |   added-data-files -> 438, total-data-files -> 438,   |
|                         |                |           |           |                                                    |   spark.app.id -> application_1520379288616_155055 }  |
| ...                     | ...            | ...       | ...       | ...                                                | ...                                                   |
+-------------------------+----------------+-----------+-----------+----------------------------------------------------+-------------------------------------------------------+

Query the data as of a given snapshot:
select * from table /*+ OPTIONS('snapshot-id'='284660747962682079') */;
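
If you are querying with Spark SQL rather than a hint-based engine, the same snapshot can be pinned with time-travel syntax; a sketch assuming Spark 3.3 or later and the snapshot id and timestamp shown above:

-- spark sql time travel
SELECT * FROM catalog_name.db_name.table_name VERSION AS OF 284660747962682079;
SELECT * FROM catalog_name.db_name.table_name TIMESTAMP AS OF '2019-02-09 16:32:47.336';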

The summary column is very useful: it records the number of files and records added by the snapshot and the partitions that changed. You can also select individual keys from summary to get just the information you need:

select snapshot_id, summary['total-data-files'] from catalog_name.db_name.table_name.snapshots;
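
Since summary is a string-to-string map, its values can be cast before doing arithmetic; for example, a hedged query tracking file and record growth across snapshots:

-- cast summary values (strings) before aggregating
SELECT committed_at,
       CAST(summary['added-data-files'] AS BIGINT) AS added_files,
       CAST(summary['total-records'] AS BIGINT)    AS total_records
FROM catalog_name.db_name.table_name.snapshots
ORDER BY committed_at;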

  • Query a table's partitions

-- Iceberg does not support Hive-style partition listing:
show partitions table_name --❌
-- Query partition information in Iceberg like this instead:
SELECT * FROM catalog_name.db_name.table_name.partitions --✅

+--------------------+--------------+------------+
| partition          | record_count | file_count |
+--------------------+--------------+------------+
| {"date":20210407}  | 1000000      | 1          |
| {"date":20210420}  | 1000000      | 1          |
| {"date":20210421}  | 1000000      | 1          |
| {"date":20210930}  | 1000000      | 1          |
| {"date":20210418}  | 1000000      | 1          |
+--------------------+--------------+------------+
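The partitions metadata table can be filtered and sorted like any other table, which helps when looking for skew; a minimal example using the columns shown above:

-- largest partitions first
SELECT * FROM catalog_name.db_name.table_name.partitions
ORDER BY record_count DESC;
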
  • Query data file information
    Iceberg records the per-column max, min, and other statistics of every data file in manifest files, so they can be queried directly with SQL:

SELECT * FROM catalog_name.db_name.table_name.files

+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+
| file_path                                                               | file_format | record_count | file_size_in_bytes | column_sizes       | value_counts     | null_value_counts | nan_value_counts | lower_bounds    | upper_bounds    | key_metadata | split_offsets |
+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+
| s3:/.../table/data/00000-3-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | []               | [1 -> , 2 -> c] | [1 -> , 2 -> c] | null         | [4]           |
| s3:/.../table/data/00001-4-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | []               | [1 -> , 2 -> b] | [1 -> , 2 -> b] | null         | [4]           |
| s3:/.../table/data/00002-5-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | []               | [1 -> , 2 -> a] | [1 -> , 2 -> a] | null         | [4]           |
+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+
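Because every data file's size and record count are tracked, the files table is a convenient place to look for small files before planning compaction; a sketch, where the 32 MB threshold is only an illustrative value:

-- count data files smaller than an (arbitrary) 32 MB threshold
SELECT count(*) AS small_files, sum(file_size_in_bytes) AS small_bytes
FROM catalog_name.db_name.table_name.files
WHERE file_size_in_bytes < 32 * 1024 * 1024;
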
  • Query manifest file information

SELECT * FROM catalog_name.db_name.table_name.manifests

+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+
| path                                                                 | length | partition_spec_id | added_snapshot_id   | added_data_files_count | existing_data_files_count | deleted_data_files_count | partition_summaries                  |
+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+
| s3://.../table/metadata/45b5290b-ee61-4788-b324-b1e2735c0e10-m0.avro | 4479   | 0                 | 6668963634911763636 | 8                      | 0                         | 0                        | [[false,null,2019-05-13,2019-05-15]] |
+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+
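
The added/existing/deleted counters shown above can also be aggregated to see how much each partition spec contributes; a minimal example:

SELECT partition_spec_id,
       sum(added_data_files_count)    AS added_files,
       sum(existing_data_files_count) AS existing_files,
       sum(deleted_data_files_count)  AS deleted_files
FROM catalog_name.db_name.table_name.manifests
GROUP BY partition_spec_id;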

The fields inside partition_summaries correspond, in order, to:

| Field name    | Type    | Description                                                                                          | v1       | v2       |
| contains_null | boolean | Whether the manifest contains at least one row where this partition field is null                   | required | required |
| contains_nan  | boolean | Whether the manifest contains at least one row where this partition field is NaN                    | optional | optional |
| lower_bound   | bytes   | Lower bound of non-null, non-NaN values of this partition field; null if all values are null or NaN | optional | optional |
| upper_bound   | bytes   | Upper bound of non-null, non-NaN values of this partition field; null if all values are null or NaN | optional | optional |

Table property changes

Iceberg supports changing table properties through Spark DDL, for example:

ALTER TABLE prod.db.sample SET TBLPROPERTIES ('read.split.target-size'='268435456');
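
Properties can also be removed again with UNSET, and SET TBLPROPERTIES accepts any table property, such as a table comment; a short sketch on the same example table:

-- spark sql
ALTER TABLE prod.db.sample UNSET TBLPROPERTIES ('read.split.target-size');
ALTER TABLE prod.db.sample SET TBLPROPERTIES ('comment' = 'A table comment.');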
