Added llama3.1-70b Benchmarking recipe on A3-Mega nodes #246

Draft
krishnakanthankam-qt wants to merge 8 commits into AI-Hypercomputer:main from krishnakanthankam-qt:main

krishnakanthankam-qt commented Jun 5, 2026

Description

Title

Add Llama 3.1 70B Recipe and Optimized Sequential Benchmarking

Summary

Introduces a high-performance recipe for serving and benchmarking Llama 3.1 70B on A3-Mega GKE node pools.

google-cla Bot commented Jun 5, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

depksingh marked this pull request as draft June 5, 2026 11:14
Comment thread inference/a3mega/llama3.1-70b/README.md Outdated

This recipe supports the following models. Running TRTLLM inference benchmarking on these models are only tested and validated on A3-Mega GKE nodes with certain combination of TP, PP, EP, number of GPU chips, input & output sequence length, precision, etc.

Example model configuration YAML files included in this repo only show a certain combination of parallelism hyperparameters and configs for benchmarking purposes. Input and output length in `/home/akrishnakanth/gpu-recipes/inference/a3mega/llama3.1-70b/trtllm-gke/values.yaml` need to be adjusted according to the model and its configs.

Collaborator

We can remove this.

Comment thread src/launchers/trtllm-launcher.sh Outdated
- rm -rf $engine_dir
- rm -f $dataset_file
+ rm -rf $engine_dir || true
+ rm -f $dataset_file || true

Collaborator

Please remove
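
A note on the `|| true` change, as a minimal sketch rather than the launcher's actual code: `rm -f` and `rm -rf` already exit with status 0 when the target does not exist, so appending `|| true` would only hide other failures, such as permission errors. Whether this is the reviewer's reason is an assumption, and the paths below are hypothetical.

```shell
#!/usr/bin/env sh
# Hypothetical paths for illustration; neither target is ever created.
workdir=$(mktemp -d)
engine_dir="$workdir/engine"
dataset_file="$workdir/dataset.json"

# -f tells rm to ignore nonexistent operands, so both commands succeed.
rm -rf "$engine_dir"
rm_dir_status=$?
rm -f "$dataset_file"
rm_file_status=$?

echo "rm -rf status: $rm_dir_status, rm -f status: $rm_file_status"

# Clean up the (still empty) scratch directory.
rmdir "$workdir"
```

Both statuses are 0 here without `|| true`, which suggests the plain `rm` lines are sufficient for the missing-file case.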

  --backend "pytorch" \
  --kv_cache_free_gpu_mem_fraction $kv_cache_free_gpu_mem_fraction \
- $extra_args $vl_args > $output_file
+ $extra_args $vl_args | tee "$output_file"

Contributor

`| tee`: this change can be reverted to the original.
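
One behavioral difference between `> file` and `| tee file` that may be relevant here, shown as a hedged sketch: with a plain redirect the command's own exit status is preserved, while in a pipeline the shell reports the status of the last command (`tee`) by default. The `fake_benchmark` function is a hypothetical stand-in for the benchmark command, not anything from the launcher.

```shell
#!/usr/bin/env sh
# Hypothetical stand-in for the benchmark: prints a line, then fails with status 3.
fake_benchmark() {
  echo "throughput: 123"
  return 3
}

out=$(mktemp)

# Plain redirect: the status seen by the script is the benchmark's own.
fake_benchmark > "$out"
redirect_status=$?

# Pipe through tee: the pipeline's status is tee's, so the failure is masked.
fake_benchmark | tee "$out" > /dev/null
tee_status=$?

# The output still lands in the file either way.
captured=$(cat "$out")

echo "redirect: $redirect_status, tee: $tee_status, file: $captured"
rm -f "$out"
```

In bash, `set -o pipefail` makes the pipeline report the failing command's status instead, which is one option if live output via `tee` is wanted without losing error detection.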

  --dataset $dataset_file \
  --engine_dir $engine_dir \
- --kv_cache_free_gpu_mem_fraction $kv_cache_free_gpu_mem_fraction $extra_args >$output_file
+ --kv_cache_free_gpu_mem_fraction $kv_cache_free_gpu_mem_fraction $extra_args | tee $output_file

Contributor

`| tee`: this can also be reverted to the original.

--kv_cache_free_gpu_mem_fraction $kv_cache_free_gpu_mem_fraction $extra_args | tee $output_file
fi

- cat $output_file

Contributor

Add this back to the file.

serverArgs:
  max-model-len: 32768
  max-num-seqs: 128
  gpu-memory-utilization: 0.90

Contributor

Please remove `gpu-memory-utilization: 0.90` from here; you are already passing this value from trtllm-configs.

helm install -f values.yaml \
--set workload.benchmarks.experiments[0].isl=128 \
--set workload.benchmarks.experiments[0].osl=128 \
--set workload.benchmarks.experiments[0].num_requests=1000 \

Contributor

Lines 235 to 237 can be removed; we are passing these values from values.yaml. We don't usually hardcode any values in the README.

$REPO_ROOT/src/helm-charts/a3mega/trtllm-inference/single-node
```
> [!NOTE]
> You can modify the benchmark configuration at runtime by changing the values for `isl`, `osl`, and `num_requests` (number of prompts) in the Helm command to test different scenarios.

Contributor

Please check the other recipes to update this line.

===========================================================
DATASET DETAILS
===========================================================
Dataset Path: /ssd/token-norm-dist_llama3.1-70b_128_128_tp4.json

Contributor

Change tp4 to tp8.
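
A mismatch like this can creep in when a sample path is typed by hand. As a sketch only, the dataset path could be derived from the same variables that drive the run, so the `tp` suffix always tracks the configured tensor-parallel size. Whether the launcher actually builds the name this way is an assumption, and every variable name below is hypothetical.

```shell
#!/usr/bin/env sh
# All names and values here are hypothetical, chosen to mirror the sample output.
model_name="llama3.1-70b"
isl=128        # input sequence length
osl=128        # output sequence length
tp_size=8      # tensor-parallel size

# Build the path from the variables so the suffix cannot drift from the config.
dataset_file="/ssd/token-norm-dist_${model_name}_${isl}_${osl}_tp${tp_size}.json"

echo "Dataset Path: $dataset_file"
```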

PYTORCH BACKEND
===========================================================
Model: nvidia/Llama3.1-70b
Model Path: /ssd/nvidia/Llama3.1-70b

Contributor

Correct the model name.

3 participants