
Leaderboards
Table 1 · Performance of LLMs on AppForge
Left four columns: Pass@1 · Right four columns: with Compilation Error Feedback
Last updated: 2025-10-06
LLMs | Pass@1 | | | | with Compilation Error Feedback | | | |
---|---|---|---|---|---|---|---|---|
Proprietary Models | ||||||||
GPT-5-High | 45.54% | 21.90% | 52.17% | 14.85% | 82.18% | 29.07% | 31.33% | 18.81% |
Claude-4-Opus | 80.20% | 28.52% | 60.49% | 11.88% | 90.10% | 34.22% | 60.44% | 14.85% |
Gemini-2.5-Pro | 53.47% | 19.63% | 62.96% | 7.92% | 68.32% | 21.63% | 75.36% | 13.86% |
Claude-4-Sonnet | 40.59% | 10.35% | 58.54% | 0.99% | 77.23% | 18.36% | 26.92% | 3.96% |
GPT-4.1 | 6.93% | 2.44% | 28.57% | 0.99% | 74.26% | 1.85% | 94.67% | 0.99% |
Open-source Models | ||||||||
Qwen3-Coder | 27.72% | 4.42% | 75.00% | 1.98% | 85.15% | 21.45% | 29.07% | 8.91% |
DeepSeek-R1 | 14.85% | 1.90% | 73.33% | 0.00% | 44.55% | 12.29% | 62.22% | 4.95% |
DeepSeek-V3 | 5.94% | 2.23% | 83.33% | 0.99% | 26.73% | 10.40% | 48.15% | 4.95% |
GLM-4.5 | 24.75% | 8.74% | 72.00% | 4.95% | 44.55% | 10.14% | 75.56% | 4.95% |
Kimi K2 | 16.83% | 4.95% | 76.47% | 1.98% | 41.58% | 7.76% | 69.05% | 1.98% |
Notes: "Success" indicates functionally correct apps. All values are percentages.
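
For readers who want to compare the two settings directly, the following is a minimal sketch of how one row of the table could be split into its Pass@1 half and its feedback half, with the change computed in percentage points. It assumes each row holds exactly eight values, the first four for Pass@1 and the last four for the feedback setting; the helper names `split_row` and `feedback_delta` are hypothetical and not part of AppForge.

```python
def split_row(row: str):
    """Split one leaderboard row into (model, pass_at_1, with_feedback).

    Assumption: the row has a model name followed by eight percentages,
    four for Pass@1 and four for the compilation-error-feedback setting.
    """
    # Drop empty cells produced by the trailing pipe.
    cells = [c.strip() for c in row.split("|") if c.strip()]
    model = cells[0]
    values = [float(c.rstrip("%")) for c in cells[1:]]
    if len(values) != 8:
        raise ValueError(f"expected 8 values, got {len(values)}")
    return model, values[:4], values[4:]


def feedback_delta(row: str):
    """Return (model, per-column change in percentage points)."""
    model, base, with_feedback = split_row(row)
    return model, [round(f - b, 2) for b, f in zip(base, with_feedback)]


# Row copied verbatim from the table above.
row = "GPT-5-High | 45.54% | 21.90% | 52.17% | 14.85% | 82.18% | 29.07% | 31.33% | 18.81% |"
model, delta = feedback_delta(row)
print(model, delta)
```

For the GPT-5-High row, the last column moves from 14.85% to 18.81%, a gain of 3.96 percentage points.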