Export partition to Apache Iceberg #1618
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5b0e833565
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```cpp
auto * object_storage = dynamic_cast<StorageObjectStorageCluster *>(dest_storage.get());

auto * iceberg_metadata = dynamic_cast<IcebergMetadata *>(object_storage->getExternalMetadata(query_context));
```
Guard IcebergMetadata casts with USE_AVRO
IcebergMetadata is defined only under #if USE_AVRO in IcebergMetadata.h, but this new dynamic_cast<IcebergMetadata *> is compiled unconditionally. In non-AVRO builds (USE_AVRO=0), this translation unit (and the same pattern added in StorageReplicatedMergeTree.cpp) cannot compile, so the change breaks optional-AVRO build configurations.
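A minimal sketch of the suggested guard, assuming the usual ClickHouse pattern of compiling optional-format code behind the `USE_AVRO` macro. `resolveIcebergMetadata` is an illustrative stand-in, not the PR's actual function:

```cpp
#include <stdexcept>
#include <string>

// Default the macro off so the sketch is self-contained; in a real build the
// build system defines USE_AVRO.
#ifndef USE_AVRO
#define USE_AVRO 0
#endif

// Compile the Iceberg-specific branch only when Avro support is available,
// and fail with a clear runtime error otherwise instead of breaking the build.
std::string resolveIcebergMetadata()
{
#if USE_AVRO
    // Real code would dynamic_cast to IcebergMetadata here.
    return "iceberg-metadata";
#else
    throw std::runtime_error("Iceberg export requires a build with USE_AVRO=1");
#endif
}
```

The same `#if USE_AVRO` fence would wrap both cast sites (this one and the one in StorageReplicatedMergeTree.cpp).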
```cpp
const String sidecar_path = replaceFileExtensionWithAvro(
    filename_generator.convertMetadataPathToStoragePath(path));
```
Use storage paths directly when reading export sidecars
The export path list is populated from filename.path_in_storage, but commit now treats each entry as a metadata path and calls convertMetadataPathToStoragePath before reading sidecars. With write_full_path_in_iceberg_metadata=1, table_dir is a URI prefix (for example s3://...) while these entries are plain storage paths (/...), so the conversion throws and EXPORT PARTITION cannot commit.
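A hypothetical illustration of the mismatch: entries taken from filename.path_in_storage are already plain storage paths, so they must not go through convertMetadataPathToStoragePath again. `sidecarReadPath` is an illustrative helper, not the PR's code:

```cpp
#include <stdexcept>
#include <string>

// Read sidecars from the storage path as-is. A URI-style entry here would
// indicate the caller wrongly passed a metadata path (the conversion that
// throws in the reported bug).
std::string sidecarReadPath(const std::string & entry)
{
    if (entry.find("://") != std::string::npos)
        throw std::invalid_argument("expected a storage path, got a URI: " + entry);
    return entry;
}
```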
@codex review
💡 Codex Review
Reviewed commit: 6c6194814d
@codex review
💡 Codex Review
Reviewed commit: 675061716f
```cpp
manifest.lock_inside_the_task = json->getValue<bool>("lock_inside_the_task");

manifest.write_full_path_in_iceberg_metadata = json->getValue<bool>("write_full_path_in_iceberg_metadata");
```
Preserve backward compatibility for manifest parsing
Treating write_full_path_in_iceberg_metadata as mandatory breaks reading manifests that were written before this field existed. During rolling upgrades, existing ZooKeeper metadata.json entries for in-flight exports won't have this key, so getValue<bool> throws and status/polling paths that call fromJsonString cannot process those tasks. Please make this field optional with a default (false) when absent.
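A sketch of the optional-with-default pattern the comment asks for. A `std::map` stands in for the JSON object so the example is self-contained; the real fix would use the parser's presence check (or an `optValue`-style lookup) instead:

```cpp
#include <map>
#include <string>

// Stand-in for the parsed manifest JSON.
using Json = std::map<std::string, std::string>;

// Treat newly added manifest keys as optional: manifests written by older
// versions simply lack them, and parsing must still succeed with a default.
bool getBoolOr(const Json & json, const std::string & key, bool def)
{
    auto it = json.find(key);
    if (it == json.end())
        return def;            // field absent in pre-upgrade manifests
    return it->second == "true";
}
```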
```cpp
auto dot_pos = data_file_storage_path.rfind('.');
auto slash_pos = data_file_storage_path.rfind('/');
if (dot_pos != String::npos && (slash_pos == String::npos || dot_pos > slash_pos))
    return data_file_storage_path.substr(0, dot_pos) + ".avro";
```
Avoid overwriting Avro data files with sidecars
sidecarStoragePath rewrites any extension to .avro; if the Iceberg table writes Avro data files, the data file already ends with .avro, so the sidecar path equals the data file path. writeDataFileSidecar then writes in Rewrite mode, replacing the actual data file with sidecar metadata during onFinish, which corrupts exported data for Avro-formatted Iceberg tables.
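One collision-safe variant, as a sketch: append a distinct suffix instead of rewriting the extension, so the sidecar name can never equal an Avro data file's name. The `.sidecar.avro` suffix is an illustrative choice, not the PR's naming scheme:

```cpp
#include <string>

// Strip a trailing extension (if any) and add a suffix that cannot collide
// with the data file itself, even when the data file is already ".avro".
std::string sidecarStoragePath(const std::string & data_file_storage_path)
{
    auto dot_pos = data_file_storage_path.rfind('.');
    auto slash_pos = data_file_storage_path.rfind('/');
    std::string stem =
        (dot_pos != std::string::npos
         && (slash_pos == std::string::npos || dot_pos > slash_pos))
        ? data_file_storage_path.substr(0, dot_pos)
        : data_file_storage_path;
    return stem + ".sidecar.avro";
}
```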
I think there is a race condition in the commit procedure, and that is why I am observing failures indicating more data was written than expected. The normal flow of exporting partitions is that the replica that finishes the last part export tries to commit. At the same time, another replica may be running the cleanup routine and find a task in the pseudo "pending commit" state. In that case, it will try to commit as well, concurrently. On plain object storage that is not a problem because the commit filename guarantees idempotency. On Iceberg, we don't have that control. I think the simplest fix is to introduce a commit lock on ZooKeeper.
Locking on ZooKeeper is not enough because the locks are ephemeral: there is a chance replica A briefly loses its connection to ZooKeeper, the lock is released, some other replica B acquires it, and replica A never hears about it and still tries to commit. The real fix is to re-check, after acquiring the lock, whether the export's transaction id is already present in the Iceberg metadata before committing.
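A minimal sketch of that re-check, assuming the transaction id written to the manifest (this PR's `f_clickhouse_export_partition_transaction_id`) can be enumerated from the destination table's committed snapshots; `shouldCommit` and the set-based lookup are illustrative, not the PR's implementation:

```cpp
#include <set>
#include <string>

// After acquiring the commit lock, re-read the committed transaction ids from
// the Iceberg metadata. If ours is already there, another replica committed in
// the window where we lost the lock, and this attempt must become a no-op.
bool shouldCommit(const std::set<std::string> & committed_txn_ids,
                  const std::string & my_txn_id)
{
    return committed_txn_ids.count(my_txn_id) == 0;
}
```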
Audit Report: PR #1618

AI audit note: This review comment was generated by AI (gpt-5.3-codex). Scope: Export partition to Apache Iceberg.

Confirmed defects

High: Iceberg write path can report success after exhausting metadata commit retries
```cpp
size_t i = 0;
while (i < MAX_TRANSACTION_RETRIES)
{
    if (initializeMetadata())
        break;
    ++i;
}
```

Medium: Export commit aggregates can overflow due to 64-bit-to-32-bit narrowing
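A hedged sketch of the fix for the retry loop above: make retry exhaustion an error instead of falling through to the success path. `commitMetadataOrThrow` and the function-pointer stub are illustrative names:

```cpp
#include <cstddef>
#include <stdexcept>

constexpr std::size_t MAX_TRANSACTION_RETRIES = 3;

// Retry initializeMetadata(); if every attempt fails, throw so the caller
// cannot report the export as successful.
void commitMetadataOrThrow(bool (*initialize_metadata)())
{
    for (std::size_t i = 0; i < MAX_TRANSACTION_RETRIES; ++i)
        if (initialize_metadata())
            return;
    throw std::runtime_error("Iceberg metadata commit failed after retries");
}
```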
```cpp
Int32 total_rows = 0;
Int32 total_chunks_size = 0;
...
total_rows += static_cast<Int32>(sidecar.record_count);
total_chunks_size += static_cast<Int32>(sidecar.file_size_in_bytes);
```

Medium: Export commit depends on mutable source partition state instead of persisted manifest state
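A sketch of the straightforward repair for the narrowing above: accumulate in 64-bit and narrow (if the manifest format requires it) only after a range check. Names mirror the snippet but are illustrative:

```cpp
#include <cstdint>

struct SidecarTotals
{
    // 64-bit accumulators: large exports routinely exceed INT32_MAX bytes.
    std::int64_t total_rows = 0;
    std::int64_t total_bytes = 0;
};

void addSidecar(SidecarTotals & t, std::int64_t record_count, std::int64_t file_size_in_bytes)
{
    t.total_rows += record_count;
    t.total_bytes += file_size_in_bytes;
}
```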
```cpp
const auto parts = storage.getDataPartsVectorInPartitionForInternalUsage(
    MergeTreeDataPartState::Active, partition_id, lock);
if (parts.empty())
    throw Exception(ErrorCodes::BAD_ARGUMENTS,
        "Cannot find active part for partition_id '{}' to derive Iceberg partition values...",
        partition_id);
```

```cpp
if (!manifest.iceberg_metadata_json.empty())
{
    iceberg_args.metadata_json_string = manifest.iceberg_metadata_json;
    if (source_storage.getInMemoryMetadataPtr()->hasPartitionKey())
        iceberg_args.partition_values =
            getPartitionValuesForIcebergCommit(source_storage, manifest.partition_id);
}
```

Medium: Backward-incompatible manifest parsing breaks in-flight exports after upgrade
```cpp
manifest.lock_inside_the_task = json->getValue<bool>("lock_inside_the_task");
manifest.write_full_path_in_iceberg_metadata = json->getValue<bool>("write_full_path_in_iceberg_metadata");
```

Medium: Commit-phase failures are swallowed without terminal state transition
```cpp
try
{
    auto context = ExportPartitionUtils::getContextCopyWithTaskSettings(storage.getContext(), manifest);
    ExportPartitionUtils::commit(manifest, destination_storage, zk, storage.log.load(), export_path, context, storage);
}
catch (const Exception & e)
{
    const auto error_requests = getErrorRequests(export_path, storage.replica_name, zk, storage.log.load(), part_name, e);
    // logs + error bookkeeping only, no terminal status transition here
}
```

Low: KILL can be silently overwritten by an in-flight commit
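A sketch of the missing terminal transition: on a commit-phase failure, record a terminal state for the task instead of only logging. The state enum and `runCommit` wrapper are illustrative:

```cpp
#include <stdexcept>
#include <string>

enum class TaskState { PendingCommit, Committed, Failed };

// Wrap the commit so every outcome maps to an explicit terminal state.
TaskState runCommit(bool (*commit)(), std::string & error_out)
{
    try
    {
        if (commit())
            return TaskState::Committed;
        throw std::runtime_error("commit returned failure");
    }
    catch (const std::exception & e)
    {
        error_out = e.what();       // keep the existing error bookkeeping...
        return TaskState::Failed;   // ...plus an explicit terminal transition
    }
}
```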
AI audit note: This review comment was generated by AI (gpt-5.3-codex). Follow-up audit delta for PR #1618 — only newly found issues from the latest state-model-driven fault-injection pass:
But given that we rely on the Iceberg write, do you think it does not make sense to fix it?
The Iceberg spec says value_counts (total values per column, including nulls) is part of the standard per-file stats used for pruning. I don't see this field here (only column_sizes, null_value_counts, lower_bounds, upper_bounds). I have a failing test that tries to use column counts for file pruning. @arthurpassos This is not really an export-partition bug, but I thought it's worth mentioning.
I vote for ignoring this for now - perhaps you can create an issue upstream so we don't lose track of this?
Possible issue with AI report: Manifest file_path ignores write_full_path_in_iceberg_metadata. The Iceberg spec (v2, table 4) says data_file.file_path is a "Location URI with FS scheme". FileNamesGenerator.h explicitly distinguishes the two: path_in_metadata is what should live in Iceberg metadata (it carries s3://… when the setting is on); path_in_storage is the raw object-storage path. convertMetadataPathToStoragePath even throws "Paths in Iceberg must use a consistent format — either /your/path or s3://your/path. Use the write_full_path_in_iceberg_metadata setting to control this behavior", but the manifest writer never honors that on file_path.

Effect I observed on a real export with write_full_path_in_iceberg_metadata = 1: table.location() = 's3://warehouse/data/iceberg_/' (correct), but data_file.file_path = '/data/iceberg_/data/data-<…>.parquet' (missing scheme and bucket). External readers that pick a FileIO from the URI scheme will mis-route reads for that table.

EDIT: Looks like this upstream PR fixes this issue: ClickHouse#100420
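A hypothetical illustration of the expected behavior: with the setting on, file_path should carry the full location URI derived from the table location, not the bare storage path. `manifestFilePath` and its parameters are illustrative, not the PR's code:

```cpp
#include <string>

// Build data_file.file_path: a full "Location URI with FS scheme" when
// write_full_path is set, a plain storage path otherwise.
std::string manifestFilePath(const std::string & table_location_uri,
                             const std::string & relative_path,
                             bool write_full_path)
{
    if (!write_full_path)
        return "/" + relative_path;
    std::string base = table_location_uri;
    if (!base.empty() && base.back() == '/')
        base.pop_back();
    return base + "/" + relative_path;
}
```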
```cpp
Int32 added_files_size,
Int32 num_partitions,
Int32 added_delete_files,
Int32 num_deleted_rows,
```
I should probably undo these changes
Actually, they did the same in recent upstream. I'll keep it like this. ClickHouse#96666
@codex review
💡 Codex Review
Reviewed commit: e09291a626
```cpp
manifest.create_time = json->getValue<time_t>("create_time");
manifest.ttl_seconds = json->getValue<size_t>("ttl_seconds");
manifest.task_timeout_seconds = json->getValue<size_t>("task_timeout_seconds");
```
Handle missing fields when deserializing export manifests
fromJsonString now requires newly added keys via getValue(...) calls, but manifests already stored in ZooKeeper from older versions do not contain these fields. During rolling upgrade (or while old entries still exist), manifest reads will throw and the updater/status paths that call this parser will repeatedly fail on those entries instead of progressing. Please treat new fields as optional with safe defaults to keep old manifests readable.
```cpp
if (!dest_storage->isDataLake())
{
    if (query_to_string(source_metadata_ptr->getPartitionKeyAST()) != query_to_string(destination_metadata_ptr->getPartitionKeyAST()))
        throw Exception(ErrorCodes::BAD_ARGUMENTS, "Tables have different partition key");
```
Validate Iceberg partition compatibility for direct EXPORT PART
This branch skips partition-key validation for all data-lake destinations, but the direct EXPORT PART flow does not perform an equivalent Iceberg spec compatibility check later. As a result, exporting to an Iceberg table with a different transform/order can still proceed and commit using data_part->partition.value, producing incorrect partition metadata instead of failing fast. The same compatibility validation used in the partition-export path should be enforced here too.
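A minimal sketch of enforcing the same check for data-lake destinations instead of skipping it. Here the partition "spec" is just a serialized string; the real check would compare Iceberg transforms against the MergeTree partition key, as the partition-export path does:

```cpp
#include <stdexcept>
#include <string>

// Fail fast when source and destination partitioning disagree, rather than
// committing incorrect partition metadata from data_part->partition.value.
void validatePartitionCompatibility(const std::string & source_spec,
                                    const std::string & destination_spec)
{
    if (source_spec != destination_spec)
        throw std::invalid_argument("Tables have different partition key");
}
```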
```cpp
void IcebergImportSink::finalizeBuffers()
{
    writer->finalize();
```
Skip writer finalization when no file was opened
IcebergImportSink::finalizeBuffers() always calls writer->finalize(), but MultipleFileWriter::finalize() assumes output_format/buffer were initialized by consume()/startNewFile(). If export reads zero rows (e.g., all rows masked by deletes), no file is opened and this becomes a null dereference path during onFinish(). Add a guard so finalize is only called after at least one file was started.
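A sketch of the suggested guard: only finalize when at least one file was started, so a zero-row export does not hit the null-dereference path. `Writer` is a stand-in for MultipleFileWriter, with a `started` flag as the illustrative guard condition:

```cpp
struct Writer
{
    bool started = false;    // set by consume()/startNewFile() in the real sink
    bool finalized = false;
    void finalize() { finalized = true; }
};

// Guarded finalization: a writer that never opened a file is left untouched.
void finalizeBuffers(Writer * writer)
{
    if (writer && writer->started)
        writer->finalize();
}
```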
Export partition mechanics changes:
- `export_merge_tree_partition_system_table_prefer_remote_information` is false by default (I am considering removing it completely)
- `getContextCopyWithTaskSettings` to avoid code duplication

Apache Iceberg specifics:
- Store `write_full_path_in_iceberg_metadata` in the ZooKeeper task
- Add `f_clickhouse_export_partition_transaction_id` to the Apache Iceberg manifest so we can check it before committing twice

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
...
Documentation entry for user-facing changes
...
CI/CD Options
Exclude tests:
Regression jobs to run: