HDDS-15105. Revise snapshot defrag design doc and add implementation details#10119
HDDS-15105. Revise snapshot defrag design doc and add implementation details#10119smengcl wants to merge 11 commits intoapache:masterfrom
Conversation
…details Generated-by: GPT-5.5
|
I eyeballed it and looks good. Inaccuraties are corrected in the process. Need to at least give it another pass before this can be merged. |
jojochuang
left a comment
There was a problem hiding this comment.
This looks like a complete rewrite and hard to know what exactly was added.
That's fair. I should convert it into a user doc |
|
Uh oh, I just realized that design doc should be under |
There was a problem hiding this comment.
Pull request overview
This PR revises the Snapshot Defragmentation documentation to align the original design proposal with the current implementation details in Ozone (local YAML metadata, locking, on-disk layout, and service workflows).
Changes:
- Document implemented snapshot-local YAML metadata (
OmSnapshotLocalData) and how defrag state/versioning is tracked locally (not via RatisSnapshotInfo). - Update the described locking, directory layout, diff computation paths, and purge/cleanup behavior to match current code.
- Refresh the workflow and mermaid diagrams to reflect the implemented defrag flow.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| structure and calls out where the implemented behavior differs from the initial | ||
| proposal. For the current user-facing behavior, configuration, and on-disk | ||
| layout, see | ||
| [Snapshot Defragmentation]({{< ref "feature/Snapshot-Defragmentation.md" >}}). |
There was a problem hiding this comment.
The Hugo ref points to feature/Snapshot-Defragmentation.md, but there is no such doc under hadoop-hdds/docs/content/feature/ (only SnapshotDefragmentation.md). This will render as a broken link; update the reference to an existing page (or add the missing target doc).
| [Snapshot Defragmentation]({{< ref "feature/Snapshot-Defragmentation.md" >}}). | |
| [Snapshot Defragmentation]({{< ref "feature/SnapshotDefragmentation.md" >}}). |
| 1. **Acquire the bootstrap read lock** and load `SnapshotInfo` plus local YAML. | ||
| 2. **Create a temporary checkpoint** in `tmp_defrag`. | ||
| * If this is the first snapshot in the bucket path chain, checkpoint the | ||
| current snapshot. | ||
| * Otherwise, checkpoint the current version of the path previous snapshot. | ||
| 3. **Drop non-incremental column families** from the temporary checkpoint. They | ||
| are reloaded from the current snapshot later. | ||
| 4. **Defrag tracked tables** (`keyTable`, `fileTable`, `directoryTable`). | ||
| * For the first snapshot in the path chain, perform full defrag by deleting | ||
| ranges outside the bucket prefix and compacting the tracked tables. | ||
| * For later snapshots, compute incremental delta SSTs, write changed records | ||
| or tombstones into temporary SST files, and ingest those files. | ||
| 5. **Acquire `SNAPSHOT_DB_CONTENT_LOCK`** for the current snapshot. Snapshot | ||
| reads and deep-clean writes follow the same DAG-based lock hierarchy. | ||
| 6. **Reload non-incremental tables** from the current snapshot into the | ||
| checkpoint. | ||
| 7. **Move the checkpoint directory** to | ||
| `db.snapshots/checkpointState/om.db-<snapshot_id>-<next_version>`. | ||
| 8. **Update local YAML**, incrementing `version`, adding the new | ||
| `versionSstFileInfos` entry, and clearing `needsDefrag`. | ||
| 9. **Delete older checkpoint directories** for the same snapshot under | ||
| `SNAPSHOT_DB_CONTENT_LOCK` and the snapshot DB cache write lock. Older YAML | ||
| version metadata is removed later by `OmSnapshotLocalDataManagerService` | ||
| only when no local version depends on it. | ||
| 10. **Release `SNAPSHOT_DB_CONTENT_LOCK`**. |
There was a problem hiding this comment.
Markdown list nesting looks inconsistent in the updated workflow section: the 10-step numbered list (starting with "1. Acquire the bootstrap read lock") is not indented under item "5. ### Snapshot Defragmentation Workflow". This will typically render as a separate top-level list (and can restart numbering unexpectedly). Consider indenting the sub-steps to be children of item 5, or convert item 5 into its own heading (not part of the outer numbered list) to avoid ambiguous rendering.
| 1. **Acquire the bootstrap read lock** and load `SnapshotInfo` plus local YAML. | |
| 2. **Create a temporary checkpoint** in `tmp_defrag`. | |
| * If this is the first snapshot in the bucket path chain, checkpoint the | |
| current snapshot. | |
| * Otherwise, checkpoint the current version of the path previous snapshot. | |
| 3. **Drop non-incremental column families** from the temporary checkpoint. They | |
| are reloaded from the current snapshot later. | |
| 4. **Defrag tracked tables** (`keyTable`, `fileTable`, `directoryTable`). | |
| * For the first snapshot in the path chain, perform full defrag by deleting | |
| ranges outside the bucket prefix and compacting the tracked tables. | |
| * For later snapshots, compute incremental delta SSTs, write changed records | |
| or tombstones into temporary SST files, and ingest those files. | |
| 5. **Acquire `SNAPSHOT_DB_CONTENT_LOCK`** for the current snapshot. Snapshot | |
| reads and deep-clean writes follow the same DAG-based lock hierarchy. | |
| 6. **Reload non-incremental tables** from the current snapshot into the | |
| checkpoint. | |
| 7. **Move the checkpoint directory** to | |
| `db.snapshots/checkpointState/om.db-<snapshot_id>-<next_version>`. | |
| 8. **Update local YAML**, incrementing `version`, adding the new | |
| `versionSstFileInfos` entry, and clearing `needsDefrag`. | |
| 9. **Delete older checkpoint directories** for the same snapshot under | |
| `SNAPSHOT_DB_CONTENT_LOCK` and the snapshot DB cache write lock. Older YAML | |
| version metadata is removed later by `OmSnapshotLocalDataManagerService` | |
| only when no local version depends on it. | |
| 10. **Release `SNAPSHOT_DB_CONTENT_LOCK`**. | |
| 1. **Acquire the bootstrap read lock** and load `SnapshotInfo` plus local YAML. | |
| 2. **Create a temporary checkpoint** in `tmp_defrag`. | |
| * If this is the first snapshot in the bucket path chain, checkpoint the | |
| current snapshot. | |
| * Otherwise, checkpoint the current version of the path previous snapshot. | |
| 3. **Drop non-incremental column families** from the temporary checkpoint. They | |
| are reloaded from the current snapshot later. | |
| 4. **Defrag tracked tables** (`keyTable`, `fileTable`, `directoryTable`). | |
| * For the first snapshot in the path chain, perform full defrag by deleting | |
| ranges outside the bucket prefix and compacting the tracked tables. | |
| * For later snapshots, compute incremental delta SSTs, write changed records | |
| or tombstones into temporary SST files, and ingest those files. | |
| 5. **Acquire `SNAPSHOT_DB_CONTENT_LOCK`** for the current snapshot. Snapshot | |
| reads and deep-clean writes follow the same DAG-based lock hierarchy. | |
| 6. **Reload non-incremental tables** from the current snapshot into the | |
| checkpoint. | |
| 7. **Move the checkpoint directory** to | |
| `db.snapshots/checkpointState/om.db-<snapshot_id>-<next_version>`. | |
| 8. **Update local YAML**, incrementing `version`, adding the new | |
| `versionSstFileInfos` entry, and clearing `needsDefrag`. | |
| 9. **Delete older checkpoint directories** for the same snapshot under | |
| `SNAPSHOT_DB_CONTENT_LOCK` and the snapshot DB cache write lock. Older YAML | |
| version metadata is removed later by `OmSnapshotLocalDataManagerService` | |
| only when no local version depends on it. | |
| 10. **Release `SNAPSHOT_DB_CONTENT_LOCK`**. |
jojochuang
left a comment
There was a problem hiding this comment.
a few future sidenotes but the doc itself is good. Let's merge it.
| Snapshots currently reside under `db.snapshots/checkpointState/` directory. The proposal introduces a `db.snapshots/checkpointStateDefragged/` directory for defragged snapshots. The directory format should be as follows: | ||
| Snapshots reside under the OM metadata directory in | ||
| `db.snapshots/checkpointState/`. The implementation does not use a separate | ||
| `checkpointStateDefragged` directory. Defragged versions are sibling |
There was a problem hiding this comment.
btw, this is no longer used checkpointStateDefragged / OM_SNAPSHOT_CHECKPOINT_DEFRAGGED_DIR
consider removing it in the future.
| The important YAML fields are `snapshotId`, `previousSnapshotId`, `version`, | ||
| `needsDefrag`, `versionSstFileInfos`, `dbTxSequenceNumber`, | ||
| `transactionInfo`, `lastDefragTime`, `checksum`, and `isSSTFiltered`. | ||
| `lastDefragTime` is serialized, but current defrag decisions are based on |
There was a problem hiding this comment.
maybe somewhere in the code we want to mark lastDefragTime deprecated.
Generated-by: GPT-5.5
What changes were proposed in this pull request?
Revise snapshot defrag design doc. And add implementation details. GPT-5.5 generated most and I tweaked some.
Current design doc was merged in #8514
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15105
How was this patch tested?