We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA ↔ HIP) and assembly-level (Nvidia SASS ↔ AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation.
CASS is the first dataset to enable both source-level and assembly-level translation research across GPU architectures. It comprises 70k semantically aligned CUDA–HIP source pairs together with their corresponding host and device assemblies (Nvidia SASS and AMD RDNA3); a sketch of what one aligned pair looks like follows.
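To make the pairing concrete, here is a minimal illustration of one aligned record. The field names and the vector-add example are assumptions for illustration, not the released schema; the point is that device kernels often survive translation unchanged while host-side runtime calls are renamed (e.g., `cudaMalloc` → `hipMalloc`).

```python
# Hypothetical illustration of one aligned CASS record (field names and the
# vector-add example are assumptions, not the released schema). Device
# kernels often translate unchanged; host-side runtime calls are renamed.
record = {
    "cuda_source": r"""
#include <cuda_runtime.h>
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
int main() {
    float* d_a;
    cudaMalloc(&d_a, 1024 * sizeof(float));  // CUDA runtime call
    /* ... launch vec_add, copy results back, free ... */
    cudaFree(d_a);
}
""",
    "hip_source": r"""
#include <hip/hip_runtime.h>
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
int main() {
    float* d_a;
    hipMalloc(&d_a, 1024 * sizeof(float));   // HIP runtime equivalent
    /* ... launch vec_add, copy results back, free ... */
    hipFree(d_a);
}
""",
}
```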
CASS-Bench is a 40-sample evaluation suite spanning 16 GPU-centric domains, each represented by 1–5 curated prompts with ground-truth execution results. Domains include parallel algorithms, image processing, scientific computing, linear algebra, and deep learning, among others.
Our CASS-7B model significantly outperforms existing tools and LLMs on both source-to-source and assembly-to-assembly translation, as measured by execution-based accuracy on CASS-Bench (a scoring sketch follows the results table).
| Model | Assembly Accuracy (%) | Source-to-Source Accuracy (%) |
|---|---|---|
| ZLUDA | 2.5 | 27.5 |
| Hipify | – | 87.5 |
| GPT-4o | 0.0 | 90.0 |
| Gemini-2.0-Flash | 0.0 | 80.0 |
| Claude-3.7 | 0.0 | 90.0 |
| Qwen2.5-Coder-32B | 25.0 | 85.0 |
| CASS-1.5B | 12.5 | 90.0 |
| CASS-3B | 20.0 | 92.5 |
| **CASS-7B** | **37.5** | **95.0** |

– Hipify operates only at the source level, so no assembly-level result applies.
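Accuracy here is execution-based: a translation counts as correct only when it compiles and reproduces the reference behavior on CASS-Bench. The sketch below shows what such a check can look like; the function name, raw-stdout comparison, and timeout are assumptions of this sketch, and the actual harness may additionally normalize floating-point output or inspect memory state.

```python
import subprocess

def translation_passes(reference_bin: str, translated_bin: str,
                       timeout_s: int = 60) -> bool:
    """Minimal execution-based check (a sketch; names and comparison logic
    are assumptions): the translated binary must exit cleanly and reproduce
    the reference program's stdout."""
    ref = subprocess.run([reference_bin], capture_output=True,
                         text=True, timeout=timeout_s)
    out = subprocess.run([translated_bin], capture_output=True,
                         text=True, timeout=timeout_s)
    return (ref.returncode == 0
            and out.returncode == 0
            and ref.stdout == out.stdout)
```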
Our ablation studies quantify the contribution of each data source and training technique: each row below adds one component on top of the previous configuration, and the Δ column reports the resulting change in assembly accuracy. The RoPE extrapolation step is sketched after the table.
| Experiment | Source Accuracy (%) | Assembly Accuracy (%) | Δ Assembly (%) |
|---|---|---|---|
| Stack subset (baseline) | 87.5 | 17.5 | – |
| + Synthetic | 95.0 | 30.0 | +12.5 |
| + OpenCL | 95.0 | 32.5 | +2.5 |
| + RoPE extrapolation | 95.0 | 37.5 | +5.0 |
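RoPE extrapolation extends the usable context window, which matters here because flattened device assembly is far longer than the corresponding source. This summary does not pin down the variant, so the NTK-aware base scaling below, including the `scale` parameter, is an assumption chosen as one common scheme.

```python
import torch

def rope_inv_freq(head_dim: int, base: float = 10000.0,
                  scale: float = 4.0) -> torch.Tensor:
    """NTK-aware RoPE base scaling (a sketch; the exact extrapolation
    variant used by CASS is an assumption). `scale` is roughly
    target_length / trained_length. Enlarging the base slows the rotation
    frequencies so positions beyond the training window map to angles the
    model has already seen."""
    adjusted_base = base * scale ** (head_dim / (head_dim - 2))
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return 1.0 / (adjusted_base ** exponents)
```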
If you find CASS useful, please cite:

```bibtex
@misc{heakl2025cassnvidiaamdtranspilation,
  title={CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark},
  author={Ahmed Heakl and Sarim Hashmi and Gustavo Bertolo Stahl and Seung Hun Eddie Han and Salman Khan and Abdulrahman Mahmoud},
  year={2025},
  eprint={2505.16968},
  archivePrefix={arXiv},
  primaryClass={cs.AR},
  url={https://arxiv.org/abs/2505.16968},
}
```