Conversation
rogerkuou
left a comment
Nice implementation @SarahAlidoost !
I only have two minor comments. Just see if they are useful. Feel free to merge!
```python
    model_path,
)
if verbose:
    print(f"Model saved to {model_path}")
```
Something I just found: the print function does not export status live to the SLURM log file.
Shall we replace the print functions with logging?
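For illustration, a minimal sketch of what a logging-based replacement could look like (the logger configuration, the `model_path` value, and the format string are assumptions here, not the project's actual code):

```python
import logging
import sys

# Hypothetical sketch: a logger streaming to stdout flushes each record as it
# is emitted, so messages show up live in the SLURM log file, unlike buffered
# print() output.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger(__name__)

model_path = "model.pt"  # placeholder path for illustration

# Would replace: print(f"Model saved to {model_path}")
logger.info("Model saved to %s", model_path)
```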
To add to this: executing the Python script with -u does help, but logging still seems the more structural solution, since it also gives more context per message.
Actually, it is good that the print statement is not in the SLURM log file. The print statement is mainly for the example notebook; on HPC, the verbose variable should be False. Instead, we implemented proper logging using torch.utils.tensorboard in #34.
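For reference, a hedged sketch of TensorBoard-style logging with `torch.utils.tensorboard` (the log directory, the tag name, and the synthetic loss values are illustrative assumptions, not the actual code from #34):

```python
from torch.utils.tensorboard import SummaryWriter

# Hypothetical sketch: log one scalar per epoch; the log_dir and the
# placeholder loss values are assumptions for illustration only.
writer = SummaryWriter(log_dir="runs/demo")
for epoch in range(3):
    train_loss = 1.0 / (epoch + 1)  # placeholder value, not a real result
    writer.add_scalar("Loss/train", train_loss, epoch)
writer.close()
```

Scalars logged this way can then be inspected with `tensorboard --logdir runs`.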
Co-authored-by: Ou Ku <o.ku@esciencecenter.nl>
meiertgrootes
left a comment
Indeed, as Ou said. A very nice implementation. I have no further comments on the code at this point.
On the topic of data augmentation by using overlapping patches: I agree it seems a good means of getting more training data. However, this comes at a price, in particular when using a masking-based strategy. This is a significant issue for MAE, and it is relevant even with our physics-based masks. By creating overlapping patches, including for regions where we may have data for one year but not for another, we increase the probability that the model focuses on/learns local interpolations at the expense of general representations. This needs to be balanced: augmentation is fine, but it should be used sparingly.
@rogerkuou and @meiertgrootes Thanks for the reviews. I will wait for #39 to be merged first, because there are some conflicts; then I will fix the conflicts in this PR. We also need to implement a train/test/validation strategy and hyper-parameter tuning. I made issue #40, please share your ideas there.
closes #28
This PR:
- changes `torch.optim.Adam` to `torch.optim.AdamW`
- exposes `dropout` in the model
- uses a validation set during training (see explanation 2 below)

Explanations:
1. Our model has a lot of parameters (see the default arguments of the model), so just sampling the whole globe doesn't really give us enough training data. This can lead to high loss on the test and validation sets. One approach is to use overlapping tiles to create more samples, as done in section "2.3 Data augmentation and pre-processing" of the MAESSTRO paper. That's why I added a stride option to the dataset. I also decreased the number of parameters, especially `embed_dim`, in the model in the example notebook. This is something to fix later when building a proper training workflow on larger data on HPC.

2. Another issue was over-fitting. I used a validation set during training, similar to what they did in the MAESSTRO code, but they actually used different years for training and validation (e.g. 2012 vs 2011). Since I work with a small dataset and use stride to create more samples, there is some overlap between the train, test, and validation sets. This is also something to fix later when building a proper training workflow on larger data on HPC.
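To illustrate the stride idea, a self-contained sketch (the function name `extract_patches` and the toy grid are hypothetical, not the dataset's actual API): with a stride smaller than the patch size, overlapping windows multiply the number of training samples.

```python
import numpy as np

# Hypothetical sketch of how a stride option creates overlapping patches;
# the function name and parameters are illustrative only.
def extract_patches(grid, patch_size, stride):
    """Slide a patch_size x patch_size window over a 2D grid with the given stride."""
    patches = []
    h, w = grid.shape
    for i in range(0, h - patch_size + 1, stride):
        for j in range(0, w - patch_size + 1, stride):
            patches.append(grid[i:i + patch_size, j:j + patch_size])
    return np.stack(patches)

grid = np.arange(64).reshape(8, 8)
# Non-overlapping: stride == patch_size gives 4 patches from an 8x8 grid.
print(len(extract_patches(grid, 4, 4)))  # → 4
# Overlapping: stride < patch_size yields more samples (here 9).
print(len(extract_patches(grid, 4, 2)))  # → 9
```

The second call shows the augmentation effect, and also why adjacent samples share pixels, which is the source of the train/validation overlap mentioned above.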