Evolving a DSPy Program to Solve ARC-AGI-2
Using GEPA to discover program architectures
There's a process running on my MacBook that's been going for over 65 hours. It's evolving a DSPy program to solve ARC-AGI-2, the benchmark designed to measure "general intelligence" in AI systems. The tool doing the evolving is GEPA (Genetic-Pareto), and as I write this, it's still running.
What is ARC-AGI?
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a benchmark created by François Chollet. Each task shows a few input-output grid pairs as training examples, then asks the model to predict the output for a new test input. The grids are small (at most 30x30) and use 10 colors (digits 0-9).
The challenge is that every task requires inferring a novel transformation rule from just a handful of examples. Tasks might involve rotating objects, filling patterns, detecting symmetries, counting colors, or applying multi-step logic. There's no single algorithm that works for all of them. ARC-AGI-2 is the second version of this benchmark, with harder tasks.
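Concretely, a task is just a small JSON object. Here's a toy example (my own illustration, not a real ARC task) where the rule is "reverse each row":

task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 2], [0, 2]], "output": [[2, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]}  # expected answer: [[0, 3], [3, 0]]
    ],
}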
What is GEPA?
GEPA is an optimizer from the DSPy library. Instead of tuning model weights or prompt embeddings, GEPA evolves the text of a program. It works like this:
- Start with a seed candidate (a DSPy program).
- Evaluate the candidate on a validation set.
- Use an LLM to reflect on failures and propose improvements.
- Evaluate the new candidate.
- Keep the best candidates in a Pareto front.
- Repeat until the budget is exhausted.
The key insight is that GEPA doesn't just tune prompts—it can rewrite entire program structures, add new modules, change signatures, or switch between DSPy strategies like ChainOfThought, ProgramOfThought, ReAct, or CodeAct.
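A minimal sketch of that loop, with stand-in functions for GEPA's internals (`evaluate` returns per-task scores and `reflect_and_propose` asks the reflection LM to rewrite the candidate's text; this is my reconstruction, not the library's implementation):

import random

def gepa_loop(seed, valset, budget):
    candidates = [seed]
    scores = [evaluate(seed, valset)]  # scores[i][t]: candidate i on task t
    for _ in range(budget):
        # Pareto front: keep any candidate that is best on at least one task.
        best_per_task = [max(s[t] for s in scores) for t in range(len(valset))]
        front = [i for i, s in enumerate(scores)
                 if any(s[t] == best_per_task[t] for t in range(len(valset)))]
        # Reflect on a front member's failures and propose a rewritten program.
        parent = random.choice(front)
        child = reflect_and_propose(candidates[parent], valset)
        candidates.append(child)
        scores.append(evaluate(child, valset))
    # Return the candidate with the best overall score.
    best = max(range(len(candidates)), key=lambda i: sum(scores[i]))
    return candidates[best]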
The Setup
The experiment uses a custom adapter that wraps ARC-AGI-2 tasks for GEPA. The configuration:
- Training set: 900 tasks (drawn from a 1000-task limit, with validation examples held out from the training data via validation-from-train)
- Task LM: Gemini 3 Flash Preview (for solving tasks)
- Reflection LM: Gemini 3 Pro Preview (for proposing improvements)
- Pass@k: 2 (each task gets two attempts, and a task counts as solved if either is correct; see the sketch after this list)
- Auto budget: "heavy" (18 candidates, many trials)
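The pass@2 scoring amounts to something like this (a minimal sketch, assuming exact grid match; `solver` and `expected` are stand-ins):

def pass_at_2(solver, task, expected):
    # Two independent attempts; the task counts as solved if either matches.
    return float(any(solver(task) == expected for _ in range(2)))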
The seed program is a simple chain-of-thought solver that formats the task as text and asks the model to output JSON grids. From there, GEPA is free to evolve it however it wants.
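A plausible reconstruction of that seed, assuming the standard DSPy pattern (the signature name `SolveARCTask` is mine, not the experiment's):

import dspy

class SolveARCTask(dspy.Signature):
    """Given a task's training pairs and a test input, predict the test output grid."""
    task_description: str = dspy.InputField()
    output_grid: str = dspy.OutputField(desc="JSON list of lists of ints 0-9")

class SeedSolver(dspy.Module):
    def __init__(self):
        super().__init__()
        self.solve = dspy.ChainOfThought(SolveARCTask)

    def forward(self, task_description: str):
        return self.solve(task_description=task_description)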
What the Evolution Discovered
By iteration 2, GEPA had evolved the program from a simple CoT solver into a hybrid code-generation system:
import dspy

# _task_to_text and _execute_code are helpers defined elsewhere in the
# evolved program; the signature is reconstructed from the fields it uses.

class GenerateLogicAndCode(dspy.Signature):
    """Explain the transformation rule, then implement it as Python code."""
    task_description: str = dspy.InputField()
    code: str = dspy.OutputField(desc="Python code implementing the transformation")

class CodeSolver(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(GenerateLogicAndCode)

    def forward(self, task):
        # Generate Python code that implements the transformation
        pred = self.generate(task_description=_task_to_text(task))
        code = getattr(pred, "code", "")
        # Validate the generated code against every training pair
        correct_count = 0
        for pair in task["train"]:
            res = _execute_code(code, pair["input"])
            if res == pair["output"]:
                correct_count += 1
        score = correct_count / len(task["train"])
        return {"code": code, "score": score}
The evolved program tries to generate Python code that perfectly solves the training examples. If the code gets 100% on training, it's applied to test inputs. If not, it falls back to direct prediction via chain-of-thought.
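In sketch form, the control flow looks something like this (my reconstruction; `direct_predict` is a stand-in for the chain-of-thought fallback):

def solve(task):
    result = code_solver(task)  # CodeSolver from the snippet above
    if result["score"] == 1.0:  # code reproduces every training pair
        return [_execute_code(result["code"], t["input"]) for t in task["test"]]
    # Otherwise fall back to direct grid prediction via chain-of-thought.
    return [direct_predict(task, t["input"]) for t in task["test"]]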
This is notable because GEPA discovered on its own that code generation is a better strategy for ARC tasks than pure reasoning. The program validates generated code against training examples before trusting it—a form of self-verification that emerged from the evolutionary process.
The Results So Far
Progress through the first iterations:
- Iteration 0 (seed): 63.6% accuracy on 900 tasks
- Iteration 2: 76.2% accuracy after evolving to hybrid code+CoT
- Pareto front: 80.0% aggregate (best individual task scores across all candidates)
The improvement from 63.6% to 76.2% came from a single structural change: switching from pure chain-of-thought to code generation with validation.
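For clarity, the 80.0% Pareto-front aggregate is computed per task, not per candidate. Reusing the names from the loop sketch above (my reading of the number, not code from the run):

# Credit each task with the best score any candidate achieved on it,
# then average across the validation tasks.
pareto_aggregate = sum(
    max(s[t] for s in scores) for t in range(len(valset))
) / len(valset)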
Still Running
As I write this, the process is still evolving. The run log shows continued iterations with new candidates being proposed and evaluated. The GEPA optimizer is trying different combinations of DSPy modules, adjusting prompts, and exploring the space of possible program structures.
The process has been running for over 65 hours and has made thousands of LLM calls. Each full evaluation runs the candidate program on 900 tasks, and each task may require multiple LLM calls for code generation and direct prediction.
What's Interesting
Program synthesis via evolution. GEPA isn't just optimizing prompts—it's discovering program architectures. The shift from pure reasoning to code generation with self-validation is a meaningful structural change, not just a prompt tweak.
The code generation insight. For ARC tasks, generating executable code that can be verified against training examples is more reliable than asking the model to directly predict outputs. The evolved program exploits this by trying code first and falling back to direct prediction only when code fails.
Test-time compute scaling. The evolved program makes multiple attempts (4 code attempts, 2 direct prediction attempts) and picks the best. This is a form of test-time compute scaling that emerged from the evolutionary process.
Limitations
This is a training set evaluation, not a leaderboard submission. The scores here are on the public training data, not the private test set. Performance on the actual ARC-AGI-2 benchmark would likely be lower.
The experiment is also expensive. Thousands of Gemini API calls over multiple days of runtime add up. GEPA's heavy budget setting is thorough but not cheap.
What's Next
Once the evolution finishes, I'll evaluate the final candidate on the public evaluation set and see how the evolved program generalizes beyond the training tasks. The interesting question is whether GEPA's discovered strategies transfer, or whether they're overfit to the specific tasks it trained on.
For now, the MacBook keeps running.