HDDS-14859. Use RocksDb secondary instance for validating volumes. #9947
ptlrs wants to merge 9 commits into apache:master
Conversation
|
Thanks for looking into this @ptlrs. I think we may want to use a secondary instance instead of a read-only instance for this check. It looks like it will meet the same goal of reading CURRENT and MANIFEST files, but will only fail if the DB is truly in bad health. We will need to provide it a directory to write its own log files, but we can use the volume's specific
|
Thanks for taking a look @errose28. Using a secondary instance had crossed my mind while writing this PR, but that page was the reason I didn't make the change. In fact I saw this comment in a different discussion which says:
So based on the documentation and the comment, it appears to me that unless you actually perform reads in RO mode, there won't be a problem. We don't perform any reads during the volume check, but RocksDb does read the metadata from the footer of all SST files, as well as the MANIFEST files and the WAL, when it is opened in RO mode. If that is where the problem lies, I'm not sure using a secondary instance would solve it. A secondary instance appears to be the same as an RO instance with the extra capability of bringing the instance up to speed with the RW instance via a manual command invocation. I don't mind changing to a secondary instance, as at worst we would get the same behavior, but it would be good to brainstorm what the differences could be.
|
@ptlrs the main documentation is unfortunately vague here, but there's also this excerpt from the FAQ:
In our case we are only using one process with multiple handles. However, since writes will be going to the DB while we are checking the volume, this seems to indicate that a secondary instance is still what we want to use here.
|
I have updated the PR to use a secondary RocksDb instance instead of a read-only instance.
|
Hi @errose28 @ChenSammi @yandrey321, I have pushed some updates to this PR. Could you please take another look at it?
|
So, we know for sure that openReadOnly might fail due to some internal work performed by RocksDB (for example, if there was log rotation in the middle, it might fail with an FNF exception). Does opening as a secondary suffer from the same problems? If not, do we really need a second check? If it does, then how do we know that there will be no such failure during the second check?
|
@ss77892 you are right, we don't know whether a secondary instance will face the same fate. There is no documentation which clearly says that a secondary instance behaves differently from an RO instance when it comes to opening a new instance. The core contribution of this PR is attempting to open the db twice before declaring failure. We can update the PR to make the choice between RO and secondary instance configurable if we think such a fallback would be helpful here.
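The "open twice before declaring failure" idea discussed here could be sketched as a small retry helper. The class and method names below are illustrative only, not code from the patch:

```java
import java.util.concurrent.Callable;

public class RetryingCheck {
  /**
   * Runs the check once, and retries exactly once on failure before
   * rethrowing. A transient error (e.g. a WAL rotated mid-open) may
   * succeed on the second attempt.
   */
  static <T> T runWithOneRetry(Callable<T> check) throws Exception {
    try {
      return check.call();
    } catch (Exception firstFailure) {
      // Second and final attempt; its failure propagates to the caller.
      return check.call();
    }
  }
}
```

A caller would wrap the RocksDB open in the `Callable`, so a single transient failure does not mark the volume bad.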
I think the excerpt from the FAQ I mentioned in this comment is sufficient to indicate that secondary is expected to open cleanly while writes are ongoing and read-only is not. Whether the phrasing they've used is "clear" is debatable, and it would be nice if this was in the official doc page for the feature and not the FAQ. However from my point of view this is sufficient to design this around the assumption that secondary is expected to open cleanly while writes are happening unless the global DB files are corrupted or there is a transient IO error, which our sliding windows account for. |
|
@errose I would not infer "secondary is expected to open cleanly" based just on that FAQ. @ss77892 In RO/Secondary mode, the SST files are opened and their metadata is checked: each SST file's footer, metaindex block, and properties block are read. No data-related checks are performed though. So the integrity and correctness of the SST files are partially verified, but not for the main contents of the SST files.
|
Hi @yandrey321 @ChenSammi @ss77892 @errose28, is there anything else we would like to update for this PR? |
It seems that this statement is not correct. I just checked the RO open with strace, and it doesn't touch SST files at all. So opening it RO looks quite useless, because it reads only the metadata and logs, and wastes CPU reading log files into memory.
|
Hi @yandrey321 @ChenSammi @ss77892 @errose28, can I get a review for this PR? |
|
Besides opening RocksDB as secondary, I'm not sure we need to blindly retry the DB open 2 times. We already have a failure tolerance of 2 in the IO check, which can cover transient errors too. And since we have the 10 minute timeout for a volume check task, I feel this retry will increase the possibility of a task timeout when the DN volume is very busy.
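The existing tolerance @ChenSammi refers to can be pictured with a minimal counter: fail the volume only after more than N consecutive check failures. This is an illustration of that semantic, not Ozone's actual implementation:

```java
// Illustrative sketch: class name and reset-on-success semantics are
// assumptions, modeling "tolerate N consecutive failures" as described above.
public class FailureTolerance {
  private final int tolerance;
  private int consecutiveFailures = 0;

  FailureTolerance(int tolerance) {
    this.tolerance = tolerance;
  }

  /** Records one check result; returns true if the volume should be failed. */
  boolean record(boolean checkPassed) {
    if (checkPassed) {
      consecutiveFailures = 0;  // any success clears the streak
      return false;
    }
    return ++consecutiveFailures > tolerance;
  }
}
```

Under this model, a retry inside the check is redundant: a transient error consumes one of the tolerated failures and the next scheduled check clears it.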
|
@ChenSammi @yandrey321 I have removed the changes to retry opening the db based on @ChenSammi's suggestion and #9954 |
)
private boolean isDiskCheckEnabled = true;

@Config(key = "hdds.datanode.rocksdb.disk.check.io.test.enabled",
How about we rename it to "hdds.datanode.disk.check.rocksdb.io.test.enabled", so that all the disk check property will share the "hdds.datanode.disk.check" prefix?
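A sketch of what the renamed property might look like; the annotation fields, default, and field name below are assumptions for illustration, not code from the patch:

```java
// Illustrative only: the key is renamed to share the
// "hdds.datanode.disk.check" prefix, as suggested in the review.
@Config(key = "hdds.datanode.disk.check.rocksdb.io.test.enabled",
    defaultValue = "true",
    type = ConfigType.BOOLEAN,
    tags = {ConfigTag.DATANODE},
    description = "Whether the volume disk check opens the volume's "
        + "RocksDB instance to validate its health.")
private boolean rocksDbIoTestEnabled = true;
```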
try (ManagedOptions managedOptions = new ManagedOptions();
-    ManagedRocksDB ignored = ManagedRocksDB.openReadOnly(managedOptions, dbFile.toString())) {
+    ManagedRocksDB ignored =
+        ManagedRocksDB.openAsSecondary(managedOptions, dbFile.toString(), getTmpDir().getPath())) {
Use diskCheckDir instead of TmpDir, diskCheckDir directory will be cleanup on DN start, TmpDir doesn't currently.
|
@ptlrs, thanks for updating the patch. I'm sorry that I didn't seriously check the difference between OpenAsSecondary and OpenReadOnly in my last comment. From this link https://github.com/facebook/rocksdb/wiki/rocksdb-faq, I cannot find an obvious advantage of OpenAsSecondary over OpenReadOnly for our current DB check case. And OpenAsSecondary is obviously much more expensive than OpenReadOnly: since it has the capability to catch up with the normal read/write DB instance, it requires an extra directory to save its own data, which needs cleanup on each success or failure (which is not covered in this patch yet). If we don't clean up, and the DN runs for a long time, I'm not sure how much space it will consume. I believe disk check operations should be simple and quick, and should avoid changing the disk state as much as possible.
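The cleanup gap mentioned above could be closed with a recursive delete of the secondary instance's scratch directory after each check. This is a sketch using only `java.nio.file`; the class name is illustrative and the directory layout is assumed:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class SecondaryDirCleanup {
  /**
   * Recursively deletes the given directory if it exists, visiting
   * children before parents so non-empty directories can be removed.
   */
  static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    try (Stream<Path> tree = Files.walk(dir)) {
      tree.sorted(Comparator.reverseOrder())
          .forEach(p -> p.toFile().delete());
    }
  }
}
```

Calling this in a finally block around the secondary open would bound the space consumed across repeated checks.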
What changes were proposed in this pull request?
In the volume scanner, we open the RocksDb that is present on each volume.
There could be errors when opening this RocksDb in readonly mode.
The volume scanner should instead open the RocksDb as a secondary instance.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-14859
How was this patch tested?
https://github.com/ptlrs/ozone/actions/runs/23264628727