We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA ↔ HIP) and assembly-level (Nvidia SASS ↔ AMD RDNA3) translation. The dataset comprises 70k verified code pairs spanning host and device code, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial LLMs such as GPT-4o and Claude as well as existing tools like Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution results. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation.
CASS is the first dataset to enable source- and assembly-level translation research for GPU architectures, comprising 70k semantically aligned CUDA–HIP source pairs and their corresponding host and device assemblies.
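To make the task concrete, here is the kind of semantically aligned source pair the dataset contains. This vector-add example is our own minimal illustration, not a sample drawn from CASS. The CUDA version:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Element-wise vector addition; the device kernel is identical in HIP.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 10;
    float *a, *b, *c;
    cudaMallocManaged((void**)&a, n * sizeof(float));
    cudaMallocManaged((void**)&b, n * sizeof(float));
    cudaMallocManaged((void**)&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    std::printf("c[0] = %.1f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
}
```

And its HIP translation, where the device kernel is unchanged and only the header and the host-side runtime API prefix differ:

```cpp
#include <cstdio>
#include <hip/hip_runtime.h>

// Same kernel; only the host-side runtime calls and header change.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 10;
    float *a, *b, *c;
    hipMallocManaged((void**)&a, n * sizeof(float));
    hipMallocManaged((void**)&b, n * sizeof(float));
    hipMallocManaged((void**)&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    hipDeviceSynchronize();
    std::printf("c[0] = %.1f\n", c[0]);  // expect 3.0
    hipFree(a); hipFree(b); hipFree(c);
}
```

Assembly-level pairs correspond to such sources compiled down to Nvidia SASS and AMD RDNA3, respectively.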
CASS-Bench is a 40-sample evaluation suite spanning 16 GPU-centric domains, each represented by 1–5 curated prompts with ground-truth execution outputs. Domains include parallel algorithms, image processing, scientific computing, linear algebra, deep learning, and more.
On CASS-Bench, our CASS-7B model outperforms existing tools and LLMs on both source-to-source and assembly-to-assembly translation:
| Model | Assembly Accuracy (%) | Source-to-Source Accuracy (%) |
|---|---|---|
| ZLUDA | 2.5 | 27.5 |
| Hipify | N/A | 87.5 |
| GPT-4o | 0.0 | 90.0 |
| Gemini-2.0-Flash | 0.0 | 80.0 |
| Claude-3.7 | 0.0 | 90.0 |
| Qwen2.5-Coder-32B | 25.0 | 85.0 |
| CASS-1.5B | 12.5 | 90.0 |
| CASS-3B | 20.0 | 92.5 |
| CASS-7B | 37.5 | 95.0 |
Our ablation studies isolate the contribution of each data source and training technique to overall performance; the rows are cumulative, with each experiment adding one component on top of the previous configuration:
| Experiment | Source Accuracy (%) | Assembly Accuracy (%) | Δ Assembly |
|---|---|---|---|
| Stack subset | 87.5 | 17.5 | – |
| + Synthetic | 95.0 | 30.0 | +12.5 |
| + OpenCL | 95.0 | 32.5 | +2.5 |
| + RoPE extrapolation | 95.0 | 37.5 | +5.0 |
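The final row, RoPE extrapolation, scales rotary position embeddings so the model can handle sequences longer than those seen in training, which matters for lengthy assembly files. Below is a minimal C++ sketch of one common variant, linear position interpolation; the `scale` and `base` values are illustrative assumptions of ours, not the paper's configuration:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Sketch of rotary position embedding (RoPE) with linear position
// interpolation: positions are divided by `scale` so rotation angles stay
// within the range seen during training. Hypothetical parameter values.
void applyRope(std::vector<float>& q, int pos,
               float scale = 4.0f, float base = 10000.0f) {
    const int d = static_cast<int>(q.size());
    const float p = static_cast<float>(pos) / scale;  // compress positions
    for (int i = 0; i + 1 < d; i += 2) {
        // Per-pair frequency: base^(-i/d), lower dims rotate faster.
        const float theta = p * std::pow(base, -static_cast<float>(i) / d);
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = q[i], x1 = q[i + 1];
        q[i]     = x0 * c - x1 * s;  // rotate each 2-D pair by theta
        q[i + 1] = x0 * s + x1 * c;
    }
}

int main() {
    std::vector<float> q = {1.0f, 0.0f, 1.0f, 0.0f};
    applyRope(q, /*pos=*/128);
    std::printf("%.3f %.3f %.3f %.3f\n", q[0], q[1], q[2], q[3]);
}
```

Dividing positions by `scale` trades positional resolution for reach, letting a fixed-context model attend over longer inputs.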
```bibtex
@misc{heakl2025cassnvidiaamdtranspilation,
  title={CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark},
  author={Ahmed Heakl and Sarim Hashmi and Gustavo Bertolo Stahl and Seung Hun Eddie Han and Salman Khan and Abdulrahman Mahmoud},
  year={2025},
  eprint={2505.16968},
  archivePrefix={arXiv},
  primaryClass={cs.AR},
  url={https://arxiv.org/abs/2505.16968},
}
```