AppForge Bench

Analyze Results

We summarize key findings on AppForge-Bench with figures and tables from the paper. Overall, even frontier LLMs achieve only modest success on full Android app development, and compilation feedback improves compile rates without yielding proportional gains in functional correctness.

Key Findings

  • End-to-end development is hard: the best-performing model reaches sub-20% functional success; many generated apps still crash at runtime even after passing tests.
  • Compilation feedback helps compile rate (large jumps for some models), yet Test Pass and Success saturate after a few rounds.
  • Task complexity matters: success decreases as LOC grows; simple apps can be robust with proactive exception handling.
  • Evasion behaviors: some models delete faulty logic merely to pass compilation, harming functionality and increasing fail-to-start cases.

Figures

  • Category Distribution. AppForge tasks cover diverse Android domains to reflect real practice.
  • Iterative Refinement. Compile rate rises rapidly; Test Pass and Success saturate after 2–3 iterations.
  • LOC vs. Metrics. Higher complexity (more LOC) correlates with lower success; shown as rolling means with uncertainty bands.
  • Model Differentiation. AppForge offers stronger inter-model separation than generic coding benchmarks.
  • Compilation Errors. Android resource linking and related issues dominate; refinement shifts the distribution.
  • Defensive Programming. Successful cases show proactive exception handling (e.g., fallback intents for settings navigation).
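The defensive pattern behind such successful cases can be sketched outside the Android SDK: attempt a preferred action, and on failure substitute a fallback instead of letting the app crash. This is a minimal illustration of the pattern only; the class and method names (`FallbackDemo`, `withFallback`) are hypothetical and not taken from the benchmark's generated apps, and the thrown exception merely stands in for Android's `ActivityNotFoundException`.

```java
// Hypothetical sketch of the "fallback intent" defensive pattern:
// try a primary action; if it throws, run a fallback instead of crashing.
import java.util.function.Supplier;

public class FallbackDemo {
    // Returns the primary supplier's result, or the fallback's result
    // if the primary throws a RuntimeException.
    static <T> T withFallback(Supplier<T> primary, Supplier<T> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        // Simulates launching a specific settings screen that may be
        // absent on some devices (cf. ActivityNotFoundException on
        // Android), then falling back to a generic settings screen.
        String screen = withFallback(
            () -> { throw new RuntimeException("activity not found"); },
            () -> "generic-settings");
        System.out.println(screen); // prints "generic-settings"
    }
}
```

On Android the same shape appears as a `try`/`catch` around `startActivity` with a broader fallback `Intent`; apps without this guard account for many of the fail-to-start cases in Table 4.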

Tables

Table 2 · Performance of Coding Agents on AppForge

SWE = mini-SWE-agent; CC = Claude Code.

| Agent | LLM | #File | #LOC | Compile | Test Pass | Success |
|---|---|---|---|---|---|---|
| SWE | Claude-4-Opus | 10.76 | 558.40 | 71.29% | 24.61% | 11.88% |
| SWE | Qwen3-Coder | 8.42 | 430.94 | 88.12% | 22.21% | 6.93% |
| CC | Qwen3-Coder | 5.34 | 280.66 | 76.24% | 14.64% | 6.93% |

Agent frameworks provide modest gains; absolute success remains low.

Table 3 · GPT-5 Reasoning Levels on AppForge

| Level | #File | #LOC | Compile | Test Pass | Success |
|---|---|---|---|---|---|
| Low | 5.91 | 280.91 | 22.77% | 8.41% | 2.97% |
| Medium | 7.61 | 321.96 | 27.72% | 11.11% | 3.96% |
| High | 7.76 | 354.59 | 45.54% | 21.90% | 14.85% |

More reasoning improves every metric, but performance remains far from practical Android development.

Table 4 · Runtime Crash Analysis across LLMs

| Model | Native Crash (w/o Fix) | Native Crash (w/ Fix) | Failed to Start (w/o Fix) | Failed to Start (w/ Fix) |
|---|---|---|---|---|
| GPT-4.1 | 0.0 | 11.0 | 2.0 | 66.0 |
| Claude-Opus | 48.0 | 48.0 | 9.0 | 11.0 |
| Gemini-Pro | 25.0 | 37.0 | 14.0 | 21.0 |
| GPT-5-High | 21.0 | 0.0 | 5.0 | 25.0 |

Evasive “compile-only” fixes often backfire at runtime; many crashes are native.

* Values marked with % are percentages; these web tables are simplified for readability. See the Docs for the full paper.