We evaluate EvoTune on
three combinatorial optimization tasks:
bin packing, the traveling salesman problem, and the flatpack problem.
For each task, the goal is to discover heuristic programs that minimize the
optimality gap (equivalently, maximize the reward score). We compare
against FunSearch, a static-LLM baseline that uses only evolutionary search without training
the LLM. All results are averaged over
ten random seeds to account for variability.
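As a concrete illustration of the metric, here is a minimal sketch of how an optimality gap can be turned into a reward score; the sign convention and the use of a best-known cost as the reference point are our assumptions, not necessarily the paper's exact scoring.

```python
def optimality_gap(heuristic_cost: float, best_known_cost: float) -> float:
    """Relative gap between the heuristic's solution cost and the best known cost."""
    return (heuristic_cost - best_known_cost) / best_known_cost


def reward(heuristic_cost: float, best_known_cost: float) -> float:
    """Minimizing the optimality gap is equivalent to maximizing the negated gap."""
    return -optimality_gap(heuristic_cost, best_known_cost)


# Example: a bin-packing heuristic that uses 103 bins where the lower bound is 100 bins
print(reward(103, 100))  # -0.03, i.e. a 3% optimality gap
```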
Figure 2: Reward score trajectories of the top 50 generated programs
for
FlatPack
(left), Bin Packing (middle), and Traveling Salesman Problem (right).
Shaded areas
denote standard error across 10 seeds. EvoTune
consistently attains higher top-50 rewards towards the end of the sampling budget,
demonstrating its superior search efficiency compared to the baseline.
The figure above shows how the reward score of the top 50 discovered programs improves as
more programs are sampled. Across all evaluated models and tasks, EvoTune accelerates the
discovery of best-performing programs that achieve the highest scores.
Notably, the performance gap between EvoTune and
the baseline tends to widen with larger sampling budgets.
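The top-50 trajectory plotted in Figure 2 can be computed by tracking, after each sampled program, the mean reward of the 50 best programs found so far. A small sketch of that bookkeeping (function and variable names are illustrative):

```python
import heapq


def top_k_trajectory(rewards, k=50):
    """Running mean of the k best rewards seen after each sampled program."""
    best = []        # min-heap holding the k largest rewards observed so far
    trajectory = []
    for r in rewards:
        if len(best) < k:
            heapq.heappush(best, r)
        elif r > best[0]:
            heapq.heapreplace(best, r)  # drop the weakest of the current top k
        trajectory.append(sum(best) / len(best))
    return trajectory
```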
Table 1: Results for Bin Packing (top), Traveling Salesman
Problem (middle), and FlatPack (bottom). We report mean optimality
gaps of the
top 50 programs and standard errors across 10 seeds on validation, perturbed-validation,
and
test sets at three sampling budgets (9.6k, 16k and 22k sampled programs). Across all
tasks and models, EvoTune
consistently
achieves the best results (highlighted in blue).
We evolve programs based on their scores on the validation set of problem instances. To
further
assess the robustness and generalization of the generated programs, we evaluate their
performance on the perturbed-validation and test sets. As shown in the table above, our
method outperforms the baseline across all
tasks and models.
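A minimal sketch of this selection-then-evaluation protocol; the `score` helper, the set names, and the aggregation are placeholders, and the exact procedure behind the table may differ.

```python
def generalization_report(programs, score, val_set, perturbed_val_set, test_set, k=50):
    """score(program, instances) -> mean reward of the program on those instances (task-specific)."""
    # Programs are selected by validation score only; held-out sets measure generalization.
    top_k = sorted(programs, key=lambda p: score(p, val_set), reverse=True)[:k]
    report = {}
    for name, instances in [("validation", val_set),
                            ("perturbed-validation", perturbed_val_set),
                            ("test", test_set)]:
        report[name] = sum(score(p, instances) for p in top_k) / len(top_k)
    return report
```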
Training the LLM can reduce the diversity of its generated outputs, which poses a challenge
since evolutionary search critically depends on maintaining diversity. To address this, we
paid special attention to preserving diversity during training. For example, we found that
using forward KL regularization was effective; full details are provided in the paper.
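As an illustration, here is a minimal sketch of adding a forward KL penalty toward a frozen reference model to the training loss (PyTorch-style; the coefficient, tensor names, and the way the penalty is combined with the task loss are assumptions, not the paper's exact objective):

```python
import torch
import torch.nn.functional as F


def forward_kl_penalty(ref_logits: torch.Tensor, policy_logits: torch.Tensor) -> torch.Tensor:
    """Forward KL D_KL(p_ref || p_policy), averaged over positions.

    Penalizing this direction discourages the trained policy from collapsing
    onto a few high-reward outputs, helping preserve output diversity.
    """
    ref_logp = F.log_softmax(ref_logits, dim=-1)        # frozen reference model
    policy_logp = F.log_softmax(policy_logits, dim=-1)  # model being trained
    kl = (ref_logp.exp() * (ref_logp - policy_logp)).sum(dim=-1)
    return kl.mean()


# total_loss = task_loss + kl_coef * forward_kl_penalty(ref_logits, policy_logits)
```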
To further support diverse exploration, we use an island-based approach in the program
database, where different groups of programs (or “islands”) evolve independently. As shown
in the figure below, this strategy leads to structured clusters of programs that gradually
diverge from the initial seed as the sampling budget increases. Programs within the same
island tend to display greater
semantic similarity than programs from different islands.
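A minimal sketch of such an island-based program database, in the spirit of FunSearch's island model; the class and method names are our own and not the actual implementation.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Island:
    programs: list = field(default_factory=list)  # (score, program_source) pairs

    def add(self, program: str, score: float) -> None:
        self.programs.append((score, program))

    def sample_parents(self, n: int = 2) -> list:
        # Favor higher-scoring programs when building the next prompt.
        ranked = sorted(self.programs, key=lambda x: x[0], reverse=True)
        return [program for _, program in ranked[:n]]


@dataclass
class ProgramDatabase:
    num_islands: int = 10
    islands: list = field(default_factory=list)

    def __post_init__(self):
        self.islands = [Island() for _ in range(self.num_islands)]

    def add(self, program: str, score: float, island_id: int) -> None:
        self.islands[island_id].add(program, score)

    def sample_island(self) -> int:
        # Islands evolve independently: each new prompt is built from a single island.
        return random.randrange(self.num_islands)
```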
Figure 3: t-SNE visualizations of program embeddings produced by
EvoTune across three tasks (bin packing, traveling salesman problem,
flatpack), using representations from a pre-trained NeoBERT encoder. In the top
rows, programs are colored by their assigned island in the program
database; in the bottom rows, coloring reflects the timestep of their discovery. For
each task, programs are taken from the best-performing model and seed.
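A minimal sketch of how such a visualization can be produced: embed each program's source with a pre-trained encoder and project the embeddings with t-SNE. The Hugging Face model id and the mean-pooling step are assumptions; the paper's exact pipeline may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "chandar-lab/NeoBERT"  # assumed model id for the NeoBERT encoder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
encoder = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).eval()


def embed(programs: list[str]) -> torch.Tensor:
    """Mean-pooled token embeddings, one vector per program source string."""
    with torch.no_grad():
        batch = tokenizer(programs, padding=True, truncation=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state   # (batch, seq_len, dim)
        mask = batch["attention_mask"].unsqueeze(-1)  # zero out padding positions
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)


# Project to 2D and color by island id (islands: one integer per program):
# from sklearn.manifold import TSNE
# coords = TSNE(n_components=2, perplexity=30).fit_transform(embed(programs).numpy())
```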