
CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

Ahmed Heakl, Sarim Hashmi, Gustavo Bertolo Stahl, Seung Hun Eddie Han, Salman Khan, Abdulrahman Mahmoud
MBZUAI · Australian National University
*Equal contribution
[Figure: CASS pipeline overview]

CASS is the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA ↔ HIP) and assembly-level (Nvidia SASS ↔ AMD RDNA3) translation.

Abstract

We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA ↔ HIP) and assembly-level (Nvidia SASS ↔ AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial models such as GPT-4o and Claude as well as traditional tools like Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation.
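To make the task concrete, here is a minimal sketch of what a source-level CUDA ↔ HIP pair looks like (the vecAdd kernel and launch shape are illustrative, not samples from the dataset). Device code is often syntactically identical across the two stacks; the differences concentrate in headers and the host-side runtime API (cuda* vs. hip*).

```cpp
// Illustrative source-level pair; not drawn from CASS.

// --- file: kernel.cu (CUDA, Nvidia) ---
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
// Host side uses cudaMalloc / cudaMemcpy / cudaFree and launches with
// vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);

// --- file: kernel.hip (HIP, AMD) ---
#include <hip/hip_runtime.h>
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
// Host side swaps in hipMalloc / hipMemcpy / hipFree; HIP also accepts the
// triple-chevron launch syntax, so the kernel call itself is unchanged.
```

The harder problem the paper targets sits one level down: the same kernel compiles to very different Nvidia SASS and AMD RDNA3 instruction streams, which is what the assembly-level models must translate.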

Contributions

  • CASS Dataset: We introduce CASS, the first large-scale dataset for GPU transpilation, containing 70k semantically aligned Nvidia ↔ AMD pairs at both the source (CUDA ↔ HIP) and assembly levels (SASS ↔ RDNA3), covering 16 real-world GPU domains.
  • CASS-Bench: We contribute the first evaluation benchmark for cross-architecture GPU translation, with 40 curated tasks across 16 domains, including functionally verified outputs and aligned CUDA/HIP source and SASS/RDNA3 assembly.
  • CASS Models: We release domain-specialized CASS LLMs trained for cross-architecture code translation. Our 7B model achieves 95% source and 37.5% assembly accuracy, outperforming GPT-4o and Claude (both 0% on assembly) on CASS-Bench. Crucially, 85% of translated assemblies preserve runtime and memory behavior relative to native binaries, confirming semantic and performance fidelity.
  • CASS Dataset Pipeline: We design a scalable pipeline for scraping, synthesizing, transpiling, and compiling CUDA/HIP code into aligned host and device assemblies across Nvidia and AMD GPUs; a sketch of the compilation stage follows this list.
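As referenced in the pipeline bullet above, the compilation stage can be pictured as a small driver around the real toolchain binaries (hipify-perl, nvcc, cuobjdump, hipcc, llvm-objdump). The exact flags, target architectures (sm_80, gfx1100), and file layout below are assumptions for illustration, not the authors' published configuration.

```cpp
// Hedged sketch of the per-sample compilation stage: one CUDA source in,
// aligned device assemblies out. Tool names are real; flags and targets
// are assumptions.
#include <cstdlib>

int main() {
    // 1. Source-level transpilation: CUDA -> HIP.
    std::system("hipify-perl kernel.cu > kernel.hip");

    // 2. Nvidia side: compile to a cubin, then dump SASS device assembly.
    std::system("nvcc -arch=sm_80 -cubin kernel.cu -o kernel.cubin");
    std::system("cuobjdump -sass kernel.cubin > kernel.sass");

    // 3. AMD side: compile a code object for RDNA3, then disassemble.
    std::system("hipcc --genco --offload-arch=gfx1100 kernel.hip -o kernel.hsaco");
    std::system("llvm-objdump -d kernel.hsaco > kernel.rdna3.s");
    return 0;
}
```

Host-side assembly, which the dataset also aligns, would come from compiling the surrounding host C++ separately (e.g., with -S); that step is omitted from this sketch.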

CASS Dataset

CASS is the first dataset to enable source- and assembly-level translation research for GPU architectures, comprising 70k semantically aligned CUDA–HIP source pairs and their corresponding host and device assemblies.

[Figure: CASS dataset composition]

CASS-Bench

CASS-Bench is a 40-sample evaluation suite spanning 16 GPU-centric domains, each represented by 1-5 curated prompts. The benchmark includes parallel algorithms, image processing, scientific computing, linear algebra, deep learning, and more.
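Ground truth in a benchmark like this is execution-based. Below is a minimal sketch of such a functional check, assuming each task records the native CUDA program's stdout as a reference file; the file names and pass/fail protocol are assumptions, not the authors' exact harness.

```cpp
// Hedged sketch: a translation counts as correct only if the translated
// HIP program's execution output matches the native reference output.
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <iostream>

static std::string readFile(const char* path) {
    std::ifstream f(path);
    std::ostringstream ss;
    ss << f.rdbuf();
    return ss.str();
}

int main() {
    // Build and run the model-translated HIP program, capturing stdout.
    std::system("hipcc translated.hip -o translated && ./translated > got.txt");
    // expected.txt holds the output recorded from the native CUDA build.
    bool pass = readFile("got.txt") == readFile("expected.txt");
    std::cout << (pass ? "PASS" : "FAIL") << std::endl;
    return pass ? 0 : 1;
}
```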

[Figure: CASS-Bench domain distribution]

Results

Performance Comparison

Our CASS-7B model significantly outperforms existing tools and LLMs in both source-to-source and assembly-to-assembly translation tasks.

Model               Assembly Acc. (%)   Source-to-Source Acc. (%)
ZLUDA                2.5                27.5
Hipify               —                  87.5
GPT-4o               0.0                90.0
Gemini-2.0-Flash     0.0                80.0
Claude-3.7           0.0                90.0
Qwen2.5-Coder-32B   25.0                85.0
CASS-1.5B           12.5                90.0
CASS-3B             20.0                92.5
CASS-7B             37.5                95.0
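The performance-preservation claim (over 85% of translated code matching native runtime and memory behavior) implies measurement on real hardware. A hedged sketch of how one could take such measurements on the AMD side with the HIP runtime API; the kernel, launch shape, and reporting format are illustrative, not the paper's exact methodology.

```cpp
// Hedged sketch: time a kernel with HIP events and snapshot device memory.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void dummyKernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float* d_x;
    hipMalloc(&d_x, n * sizeof(float));

    size_t freeMem = 0, totalMem = 0;
    hipMemGetInfo(&freeMem, &totalMem);  // device memory snapshot

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);
    hipEventRecord(start, 0);
    dummyKernel<<<(n + 255) / 256, 256>>>(d_x, n);
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);

    float ms = 0.0f;
    hipEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms, free device memory: %zu bytes\n", ms, freeMem);

    hipEventDestroy(start);
    hipEventDestroy(stop);
    hipFree(d_x);
    return 0;
}
```

The same measurement on the Nvidia side would use the matching cudaEvent_t / cudaMemGetInfo calls, letting translated and native runs be compared like for like.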

Ablation Study

Our ablation studies isolate the contribution of each data source and training technique to CASS model performance. Starting from the Stack subset, each row adds one component to the training mix; the Δ column reports the resulting gain in assembly accuracy over the preceding row.

Experiment             Source Acc. (%)   Assembly Acc. (%)   Δ Assembly Acc.
Stack subset           87.5              17.5                —
+ Synthetic            95.0              30.0                +12.5
+ OpenCL               95.0              32.5                +2.5
+ RoPE extrapolation   95.0              37.5                +5.0

BibTeX

@misc{heakl2025cassnvidiaamdtranspilation,
  title={CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark},
  author={Ahmed Heakl and Sarim Hashmi and Gustavo Bertolo Stahl and Seung Hun Eddie Han and Salman Khan and Abdulrahman Mahmoud},
  year={2025},
  eprint={2505.16968},
  archivePrefix={arXiv},
  primaryClass={cs.AR},
  url={https://arxiv.org/abs/2505.16968},
}