HDDS-14859. Use RocksDb secondary instance for validating volumes. #9947
ptlrs wants to merge 9 commits into apache:master
Conversation
|
Thanks for looking into this @ptlrs. I think we may want to use a secondary instance instead of a read-only instance for this check. It looks like it will meet the same goal of reading CURRENT and MANIFEST files, but will only fail if the DB is truly in bad health. We will need to provide it a directory to write its own log files, but we can use the volume's specific
|
Thanks for taking a look @errose28. Using a secondary instance had crossed my mind while writing this PR, but that page was the reason I didn't make the change. In fact I saw this comment in a different discussion which says:
So based on the documentation and the comment, it appears to me that unless you actually perform reads in RO mode, there won't be a problem. We don't perform any reads during the volume check, but RocksDb does read the metadata from the footer of all SST files, as well as the MANIFEST files and the WAL, when it is opened in RO mode. If that is where the problem lies, I'm not sure using a secondary instance would solve it. A secondary instance appears to be the same as an RO instance with the extra capability of bringing the instance up to speed with the RW instance via a manual command invocation. I don't mind changing to a secondary instance, as at worst we would get the same behavior, but it would be good to brainstorm what the differences could be.
|
@ptlrs the main documentation is unfortunately vague here, but there's also this excerpt from the FAQ:
In our case we are only using one process with multiple handles. However, since writes will be going to the DB while we are checking the volume, this seems to indicate that a secondary instance is still what we want to use here.
|
I have updated the PR to use a secondary RocksDb instance instead of a read-only instance.
|
Hi @errose28 @ChenSammi @yandrey321, I have pushed some updates to this PR. Could you please take another look at it?
|
So, we know for sure that openReadOnly might fail due to some internal work performed by RocksDB (for example, if there was log rotation in the middle, it might fail with an FNF exception). Does opening as a secondary suffer from the same problems? If not, do we really need a second check? If it does, then how do we know that there will be no such failure during the second check?
|
@ss77892 you are right, we don't know whether a secondary instance will face the same fate. There is no documentation which clearly says that a secondary instance behaves differently from an RO instance when it comes to opening a new instance. The core contribution of this PR is attempting to open the db twice before declaring failure. We can update the PR to make the choice between RO and secondary instance configurable if we think such a fallback would be helpful here.
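The "open twice before declaring failure" idea discussed here could be sketched as a small retry helper. The class and method names below are illustrative only, not code from the patch:

```java
import java.util.concurrent.Callable;

public class RetryingCheck {
  /**
   * Runs the check once, and retries exactly once on failure before
   * rethrowing. A transient error (e.g. a WAL rotated mid-open) may
   * succeed on the second attempt.
   */
  static <T> T runWithOneRetry(Callable<T> check) throws Exception {
    try {
      return check.call();
    } catch (Exception firstFailure) {
      // Second and final attempt; its failure propagates to the caller.
      return check.call();
    }
  }
}
```

A caller would wrap the RocksDB open in the `Callable`, so a single transient failure does not mark the volume bad.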
I think the excerpt from the FAQ I mentioned in this comment is sufficient to indicate that secondary is expected to open cleanly while writes are ongoing and read-only is not. Whether the phrasing they've used is "clear" is debatable, and it would be nice if this was in the official doc page for the feature and not the FAQ. However from my point of view this is sufficient to design this around the assumption that secondary is expected to open cleanly while writes are happening unless the global DB files are corrupted or there is a transient IO error, which our sliding windows account for. |
|
@errose I would not infer "secondary is expected to open cleanly" based just on that FAQ. @ss77892 In RO/Secondary mode, the SST files are opened and their metadata is checked: each SST file's footer, metaindex block, and properties block are read. No data-related checks are performed though. So the integrity and correctness of the SST files are partially verified, but not for the main contents of the SST files.
|
Hi @yandrey321 @ChenSammi @ss77892 @errose28, is there anything else we would like to update for this PR? |
It seems that this statement is not correct. I just checked the RO open with strace, and it doesn't touch SST files at all. So opening it RO looks quite useless, because it reads only the metadata and logs, and wastes CPU reading log files into memory.
|
Hi @yandrey321 @ChenSammi @ss77892 @errose28, can I get a review for this PR? |
|
Besides opening RocksDB as secondary, I'm not sure we need to blindly retry the DB open 2 times. We already have a failure tolerance of 2 in the IO check, which can cover transient errors too. And since we have the 10 minute timeout for a volume check task, I feel this retry will increase the possibility of a task timeout when the DN volume is very busy.
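The existing tolerance @ChenSammi refers to can be pictured with a minimal counter: fail the volume only after more than N consecutive check failures. This is an illustration of that semantic, not Ozone's actual implementation:

```java
// Illustrative sketch: class name and reset-on-success semantics are
// assumptions, modeling "tolerate N consecutive failures" as described above.
public class FailureTolerance {
  private final int tolerance;
  private int consecutiveFailures = 0;

  FailureTolerance(int tolerance) {
    this.tolerance = tolerance;
  }

  /** Records one check result; returns true if the volume should be failed. */
  boolean record(boolean checkPassed) {
    if (checkPassed) {
      consecutiveFailures = 0;  // any success clears the streak
      return false;
    }
    return ++consecutiveFailures > tolerance;
  }
}
```

Under this model, a retry inside the check is redundant: a transient error consumes one of the tolerated failures and the next scheduled check clears it.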
|
@ChenSammi @yandrey321 I have removed the changes to retry opening the db based on @ChenSammi's suggestion and #9954 |
)
private boolean isDiskCheckEnabled = true;

@Config(key = "hdds.datanode.rocksdb.disk.check.io.test.enabled",
How about we rename it to "hdds.datanode.disk.check.rocksdb.io.test.enabled", so that all the disk check property will share the "hdds.datanode.disk.check" prefix?
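A sketch of what the renamed property might look like; the annotation fields, default, and field name below are assumptions for illustration, not code from the patch:

```java
// Illustrative only: the key is renamed to share the
// "hdds.datanode.disk.check" prefix, as suggested in the review.
@Config(key = "hdds.datanode.disk.check.rocksdb.io.test.enabled",
    defaultValue = "true",
    type = ConfigType.BOOLEAN,
    tags = {ConfigTag.DATANODE},
    description = "Whether the volume disk check opens the volume's "
        + "RocksDB instance to validate its health.")
private boolean rocksDbIoTestEnabled = true;
```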
try (ManagedOptions managedOptions = new ManagedOptions();
-    ManagedRocksDB ignored = ManagedRocksDB.openReadOnly(managedOptions, dbFile.toString())) {
+    ManagedRocksDB ignored =
+        ManagedRocksDB.openAsSecondary(managedOptions, dbFile.toString(), getTmpDir().getPath())) {
Use diskCheckDir instead of TmpDir, diskCheckDir directory will be cleanup on DN start, TmpDir doesn't currently.
|
@ptlrs, thanks for updating the patch. I'm sorry that I didn't seriously check the difference between OpenAsSecondary and OpenReadOnly in my last comment. From this link https://github.com/facebook/rocksdb/wiki/rocksdb-faq, I cannot find an obvious advantage of OpenAsSecondary over OpenReadOnly for our current DB check case. And OpenAsSecondary is obviously much more expensive than OpenReadOnly: since it has the capability to catch up with the normal read/write DB instance, it requires an extra directory to save its own data, which needs cleanup on each success or failure (which is not covered in this patch yet). If we don't clean up, and the DN runs for a long time, I'm not sure how much space it will consume. I believe disk check operations should be simple and quick, and should avoid changing the disk state as much as possible.
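The cleanup gap mentioned above could be closed with a recursive delete of the secondary instance's scratch directory after each check. This is a sketch using only `java.nio.file`; the class name is illustrative and the directory layout is assumed:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class SecondaryDirCleanup {
  /**
   * Recursively deletes the given directory if it exists, visiting
   * children before parents so non-empty directories can be removed.
   */
  static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    try (Stream<Path> tree = Files.walk(dir)) {
      tree.sorted(Comparator.reverseOrder())
          .forEach(p -> p.toFile().delete());
    }
  }
}
```

Calling this in a finally block around the secondary open would bound the space consumed across repeated checks.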
What changes were proposed in this pull request?
In the volume scanner, we open the RocksDb that is present on each volume.
There could be errors when opening this RocksDb in readonly mode.
The volume scanner should instead open the RocksDb as a secondary instance.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-14859
How was this patch tested?
https://github.com/ptlrs/ozone/actions/runs/23264628727