AppForge Bench

Leaderboards

Table 1 · Performance of LLMs on AppForge

Left: Pass@1 · Right: with Compilation Error Feedback

Last updated: 2025-10-06
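| LLMs | Pass@1 | | | | with Compilation Error Feedback | | | |
|---|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | | |
| GPT-5-High | 45.54% | 21.90% | 52.17% | 14.85% | 82.18% | 29.07% | 31.33% | 18.81% |
| Claude-4-Opus | 80.20% | 28.52% | 60.49% | 11.88% | 90.10% | 34.22% | 60.44% | 14.85% |
| Gemini-2.5-Pro | 53.47% | 19.63% | 62.96% | 7.92% | 68.32% | 21.63% | 75.36% | 13.86% |
| Claude-4-Sonnet | 40.59% | 10.35% | 58.54% | 0.99% | 77.23% | 18.36% | 26.92% | 3.96% |
| GPT-4.1 | 6.93% | 2.44% | 28.57% | 0.99% | 74.26% | 1.85% | 94.67% | 0.99% |
| Open-source Models | | | | | | | | |
| Qwen3-Coder | 27.72% | 4.42% | 75.00% | 1.98% | 85.15% | 21.45% | 29.07% | 8.91% |
| DeepSeek-R1 | 14.85% | 1.90% | 73.33% | 0.00% | 44.55% | 12.29% | 62.22% | 4.95% |
| DeepSeek-V3 | 5.94% | 2.23% | 83.33% | 0.99% | 26.73% | 10.40% | 48.15% | 4.95% |
| GLM-4.5 | 24.75% | 8.74% | 72.00% | 4.95% | 44.55% | 10.14% | 75.56% | 4.95% |
| Kimi K2 | 16.83% | 4.95% | 76.47% | 1.98% | 41.58% | 7.76% | 69.05% | 1.98% |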

Notes: Success indicates functionally correct apps. All values are percentages.
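
One way to compare the two settings is to compute how much Success changes for each model once compilation error feedback is added. The sketch below is illustrative only. It assumes that the last of the four values in each group is the Success rate mentioned in the notes, since the table does not show the individual column labels. The names `SUCCESS` and `success_gain` are hypothetical and not part of AppForge.

```python
# Assumption: the fourth value in each group of Table 1 is the Success rate.
# Each entry maps a model to (Pass@1 value, value with compilation error feedback),
# both copied from Table 1 as percentages.
SUCCESS = {
    "GPT-5-High": (14.85, 18.81),
    "Claude-4-Opus": (11.88, 14.85),
    "Gemini-2.5-Pro": (7.92, 13.86),
    "Claude-4-Sonnet": (0.99, 3.96),
    "GPT-4.1": (0.99, 0.99),
    "Qwen3-Coder": (1.98, 8.91),
    "DeepSeek-R1": (0.00, 4.95),
    "DeepSeek-V3": (0.99, 4.95),
    "GLM-4.5": (4.95, 4.95),
    "Kimi K2": (1.98, 1.98),
}


def success_gain(table):
    """Return the change in percentage points from Pass@1 to the feedback setting."""
    return {
        model: round(with_feedback - pass_at_1, 2)
        for model, (pass_at_1, with_feedback) in table.items()
    }


gains = success_gain(SUCCESS)

# Print models from largest to smallest gain.
for model, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{model:<16} {gain:+.2f} pp")
```

Under that assumption, the gain is reported in percentage points rather than as a relative change, so a model that moves from 0.00% to 4.95% is not treated as an unbounded improvement.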