Leaderboards

Table 1 · Performance of LLMs on AppForge

Left four columns: Pass@1 · Right four columns: with Compilation Error Feedback

Last updated: 2025-10-06

LLMs	Pass@1				with Compilation Error Feedback
Proprietary Models
GPT-5-High	45.54%	21.90%	52.17%	14.85%	82.18%	29.07%	31.33%	18.81%
Claude-4-Opus	80.20%	28.52%	60.49%	11.88%	90.10%	34.22%	60.44%	14.85%
Gemini-2.5-Pro	53.47%	19.63%	62.96%	7.92%	68.32%	21.63%	75.36%	13.86%
Claude-4-Sonnet	40.59%	10.35%	58.54%	0.99%	77.23%	18.36%	26.92%	3.96%
GPT-4.1	6.93%	2.44%	28.57%	0.99%	74.26%	1.85%	94.67%	0.99%
Open-source Models
Qwen3-Coder	27.72%	4.42%	75.00%	1.98%	85.15%	21.45%	29.07%	8.91%
DeepSeek-R1	14.85%	1.90%	73.33%	0.00%	44.55%	12.29%	62.22%	4.95%
DeepSeek-V3	5.94%	2.23%	83.33%	0.99%	26.73%	10.40%	48.15%	4.95%
GLM-4.5	24.75%	8.74%	72.00%	4.95%	44.55%	10.14%	75.56%	4.95%
Kimi K2	16.83%	4.95%	76.47%	1.98%	41.58%	7.76%	69.05%	1.98%

Notes: Success counts only functionally correct apps. All values are percentages.
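
To compare the two settings row by row, a minimal sketch like the following may help. It assumes each data row is tab-separated with the model name followed by eight percentage cells, and that the four cells in the Pass@1 group are in the same column order as the four in the Compilation Error Feedback group. The helper names `parse_row` and `feedback_delta` are hypothetical and not part of AppForge.

```python
from typing import List, Tuple


def parse_row(line: str) -> Tuple[str, List[float], List[float]]:
    """Split one leaderboard row into (model, Pass@1 cells, feedback cells)."""
    cells = line.rstrip("\n").split("\t")
    model = cells[0]
    # Strip the trailing percent sign and convert each cell to a float.
    values = [float(c.strip().rstrip("%")) for c in cells[1:]]
    if len(values) != 8:
        raise ValueError(f"expected 8 percentage cells, got {len(values)}")
    # Assumption: first four cells are Pass@1, last four are with feedback.
    return model, values[:4], values[4:]


def feedback_delta(line: str) -> Tuple[str, List[float]]:
    """Per-column change (in percentage points) from Pass@1 to feedback."""
    model, base, feedback = parse_row(line)
    return model, [round(f - b, 2) for b, f in zip(base, feedback)]


if __name__ == "__main__":
    row = "GPT-5-High\t45.54%\t21.90%\t52.17%\t14.85%\t82.18%\t29.07%\t31.33%\t18.81%"
    print(feedback_delta(row))
```

The deltas are reported by column position only, since the sub-column labels are not shown in the table above.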