We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA ↔ HIP) and assembly-level (Nvidia SASS ↔ AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation.
CASS is the first dataset to enable both source-level and assembly-level translation research across GPU architectures. It comprises 70k semantically aligned CUDA–HIP source pairs together with their corresponding host and device assemblies (Nvidia SASS and AMD RDNA3); a sketch of what one aligned pair looks like follows.
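To make the pairing concrete, here is a minimal illustration of one aligned record. The field names and the vector-add example are assumptions for illustration, not the released schema; the point is that device kernels often survive translation unchanged while host-side runtime calls are renamed (e.g., `cudaMalloc` → `hipMalloc`).

```python
# Hypothetical illustration of one aligned CASS record (field names and the
# vector-add example are assumptions, not the released schema). Device
# kernels often translate unchanged; host-side runtime calls are renamed.
record = {
    "cuda_source": r"""
#include <cuda_runtime.h>
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
int main() {
    float* d_a;
    cudaMalloc(&d_a, 1024 * sizeof(float));  // CUDA runtime call
    /* ... launch vec_add, copy results back, free ... */
    cudaFree(d_a);
}
""",
    "hip_source": r"""
#include <hip/hip_runtime.h>
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
int main() {
    float* d_a;
    hipMalloc(&d_a, 1024 * sizeof(float));   // HIP runtime equivalent
    /* ... launch vec_add, copy results back, free ... */
    hipFree(d_a);
}
""",
}
```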
CASS-Bench is a 40-sample evaluation suite spanning 16 GPU-centric domains, each represented by 1–5 curated prompts with ground-truth execution results. Domains include parallel algorithms, image processing, scientific computing, linear algebra, and deep learning, among others.
Our CASS-7B model significantly outperforms existing tools and LLMs on both source-to-source and assembly-to-assembly translation, as measured by execution-based accuracy on CASS-Bench (a scoring sketch follows the results table).
| Model | Assembly Accuracy (%) | Source-to-Source Accuracy (%) |
|---|---|---|
| ZLUDA | 2.5 | 27.5 |
| Hipify | – | 87.5 |
| GPT-4o | 0.0 | 90.0 |
| Gemini-2.0-Flash | 0.0 | 80.0 |
| Claude-3.7 | 0.0 | 90.0 |
| Qwen2.5-Coder-32B | 25.0 | 85.0 |
| CASS-1.5B | 12.5 | 90.0 |
| CASS-3B | 20.0 | 92.5 |
| **CASS-7B** | **37.5** | **95.0** |

– Hipify operates only at the source level, so no assembly-level result applies.
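Accuracy here is execution-based: a translation counts as correct only when it compiles and reproduces the reference behavior on CASS-Bench. The sketch below shows what such a check can look like; the function name, raw-stdout comparison, and timeout are assumptions of this sketch, and the actual harness may additionally normalize floating-point output or inspect memory state.

```python
import subprocess

def translation_passes(reference_bin: str, translated_bin: str,
                       timeout_s: int = 60) -> bool:
    """Minimal execution-based check (a sketch; names and comparison logic
    are assumptions): the translated binary must exit cleanly and reproduce
    the reference program's stdout."""
    ref = subprocess.run([reference_bin], capture_output=True,
                         text=True, timeout=timeout_s)
    out = subprocess.run([translated_bin], capture_output=True,
                         text=True, timeout=timeout_s)
    return (ref.returncode == 0
            and out.returncode == 0
            and ref.stdout == out.stdout)
```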
Our ablation studies quantify the contribution of each data source and training technique: each row below adds one component on top of the previous configuration, and the Δ column reports the resulting change in assembly accuracy. The RoPE extrapolation step is sketched after the table.
| Experiment | Source Accuracy (%) | Assembly Accuracy (%) | Δ Assembly (%) |
|---|---|---|---|
| Stack subset (baseline) | 87.5 | 17.5 | – |
| + Synthetic | 95.0 | 30.0 | +12.5 |
| + OpenCL | 95.0 | 32.5 | +2.5 |
| + RoPE extrapolation | 95.0 | 37.5 | +5.0 |
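RoPE extrapolation extends the usable context window, which matters here because flattened device assembly is far longer than the corresponding source. This summary does not pin down the variant, so the NTK-aware base scaling below, including the `scale` parameter, is an assumption chosen as one common scheme.

```python
import torch

def rope_inv_freq(head_dim: int, base: float = 10000.0,
                  scale: float = 4.0) -> torch.Tensor:
    """NTK-aware RoPE base scaling (a sketch; the exact extrapolation
    variant used by CASS is an assumption). `scale` is roughly
    target_length / trained_length. Enlarging the base slows the rotation
    frequencies so positions beyond the training window map to angles the
    model has already seen."""
    adjusted_base = base * scale ** (head_dim / (head_dim - 2))
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return 1.0 / (adjusted_base ** exponents)
```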
If you find CASS useful, please cite:

```bibtex
@misc{heakl2025cassnvidiaamdtranspilation,
  title={CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark},
  author={Ahmed Heakl and Sarim Hashmi and Gustavo Bertolo Stahl and Seung Hun Eddie Han and Salman Khan and Abdulrahman Mahmoud},
  year={2025},
  eprint={2505.16968},
  archivePrefix={arXiv},
  primaryClass={cs.AR},
  url={https://arxiv.org/abs/2505.16968},
}
```