Learning to Optimize

What Happens When Algorithms Start Remembering?


Let’s start with a familiar story. A Traveling Salesman Problem (TSP) instance appears, a solver is invoked, and a solution is returned. End of episode.

Convenient? Yes. But real optimization problems rarely arrive as isolated puzzles. They arrive as families. A logistics company does not solve the routing problem once; it solves routing problems repeatedly, under similar constraints, shaped by the same geography, fleets, and business rules. What changes is the daily demand. What generally stays is the structure.

Once you take that seriously, a natural question follows: if instances come from a distribution, can we extract regularities from past solves and reuse them? Not to replace solvers, but to make them faster, more predictable, and more scalable.

That’s where the conversation between Operations Research and Machine Learning becomes interesting.

This post is structured around three questions:

- How does Operations Research think about optimization, and what do modern solvers actually do?
- How does Machine Learning think about optimization, and what can a learned model realistically offer?
- How do the two fit together in practice, and what can go wrong?

Before we talk about them, we need to see why optimization becomes difficult enough to justify learning in the first place. So we start with complexity.


1) The Complexity Landscape: Why Heuristics Matter

To understand where learning might help, separate two regimes: problems with efficient algorithms, and problems where exact methods face combinatorial explosion in the worst case.

The Easy Case (\(P\)): MST

The Minimum Spanning Tree is a friendly problem. It admits greedy algorithms with provable guarantees. For a graph \(G=(V,E)\), Kruskal’s algorithm builds the MST by repeatedly adding the cheapest edge that doesn’t form a cycle. In such settings, learning is rarely justified on algorithmic grounds (though it may still help under system constraints).
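
To see how simple the easy case really is, here is a minimal Python sketch of Kruskal’s algorithm with a small union-find; the toy graph at the bottom is invented purely for illustration:

```python
# Kruskal's MST: sort edges by weight, add an edge whenever it joins two components.
def kruskal(num_nodes, edges):
    """edges: list of (weight, u, v). Returns the list of MST edges."""
    parent = list(range(num_nodes))

    def find(a):                      # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    mst = []
    for w, u, v in sorted(edges):     # greedy: cheapest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                  # adding this edge creates no cycle
            parent[ru] = rv
            mst.append((u, v, w))
    return mst

# Toy graph on 4 nodes: (weight, u, v)
print(kruskal(4, [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]))
```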

Verdict: learning is not needed to overcome worst-case complexity here.

The Hard Case (NP-hard Optimization): TSP

In TSP, a locally appealing step can be globally damaging. Exact methods must, in general, navigate a search space that grows exponentially with \(n\) (though practical performance depends on structure and relaxations).
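
A brute-force sketch makes the growth concrete: enumerating all tours is \((n-1)!\) work, manageable for eight cities and hopeless for eighty (the coordinates below are random, purely for illustration):

```python
# Brute-force TSP: try every tour. Fine for ~10 cities, hopeless beyond that,
# since the number of distinct tours grows as (n-1)!.
import itertools, math, random

random.seed(0)
n = 8
cities = [(random.random(), random.random()) for _ in range(n)]

def tour_length(order):
    return sum(math.dist(cities[order[i]], cities[order[(i + 1) % n]]) for i in range(n))

# Fix city 0 as the start to avoid counting rotations of the same tour.
best = min(((0,) + perm for perm in itertools.permutations(range(1, n))), key=tour_length)
print("best length:", round(tour_length(best), 3), "out of", math.factorial(n - 1), "tours")
```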

Verdict: this is where learning becomes tempting, not because NP-hardness implies “learnable,” but because repeated, structured instances invite amortization.

Complexity explains why we want help. Geometry explains why help is hard. OR and ML reason in different mathematical spaces.

2) The Classical View: Optimization as Geometry (OR’s World)

In mixed-integer programming (MIP), optimization is a search over a feasible region. A general constrained problem can be written as:

$$ \begin{aligned} \min_{x} \quad & f(x) \\ \text{s.t.} \quad & g_i(x) \le 0, \quad i=1,\dots,m \\ & h_j(x) = 0, \quad j=1,\dots,k \\ & x \in \mathcal{X} \end{aligned} $$

Where:

- \(f(x)\) is the objective to be minimized,
- \(g_i(x) \le 0\) are inequality constraints,
- \(h_j(x) = 0\) are equality constraints,
- \(\mathcal{X}\) captures domain restrictions, such as integrality of some components of \(x\) in a MIP.
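
To make the template concrete, here is a minimal sketch of a tiny facility-location MIP written against the gurobipy API (assuming Gurobi is installed; any modeling layer would do, and the data is invented for illustration):

```python
# A tiny facility-location MIP: which facilities to open, and which facility
# serves each customer, to minimize total cost.
import gurobipy as gp
from gurobipy import GRB

open_cost = [4.0, 3.0, 5.0]                      # cost of opening facility j
serve_cost = [[2.0, 7.0, 5.0],                   # cost of serving customer i
              [6.0, 3.0, 4.0],                   # from facility j (toy numbers)
              [5.0, 4.0, 2.0]]
n_customers, n_facilities = len(serve_cost), len(open_cost)

m = gp.Model("facility_location")
y = m.addVars(n_facilities, vtype=GRB.BINARY, name="open")    # integrality lives in X
x = m.addVars(n_customers, n_facilities, vtype=GRB.BINARY, name="serve")

# Equality constraints (the h's): every customer is served by exactly one facility.
m.addConstrs(x.sum(i, "*") == 1 for i in range(n_customers))
# Inequality constraints (the g's): a customer can only use an open facility.
m.addConstrs(x[i, j] <= y[j] for i in range(n_customers) for j in range(n_facilities))

m.setObjective(
    gp.quicksum(open_cost[j] * y[j] for j in range(n_facilities))
    + gp.quicksum(serve_cost[i][j] * x[i, j]
                  for i in range(n_customers) for j in range(n_facilities)),
    GRB.MINIMIZE,
)
m.optimize()
print("opened:", [j for j in range(n_facilities) if y[j].X > 0.5])
```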

How Solvers (e.g., Gurobi) Work

Modern solvers are best described as branch-and-cut systems (branch-and-bound plus cutting planes, presolve, propagation, and primal heuristics). A simplified view:

- Presolve: simplify the model and tighten variable bounds before search begins.
- Relax: solve the LP relaxation to get a bound on the best achievable objective.
- Branch: if the relaxation is fractional, split on an integer variable and recurse on the subproblems.
- Cut: add valid inequalities that tighten the relaxation without excluding any integer-feasible point.
- Bound and prune: use primal heuristics to find incumbents, and discard subproblems whose bound cannot beat the best incumbent.

Geometric goal (precise): solvers tighten the relaxation toward the convex hull of integer-feasible points (they do not “tighten the convex hull” itself), because tighter relaxations reduce the search tree dramatically.
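
To see the branch-and-bound skeleton in isolation (no cuts, presolve, or heuristics), here is a toy pure-Python sketch for a 0/1 knapsack, a maximization problem where the fractional relaxation gives an upper bound; treat it as a cartoon of the loop above, not as how production solvers are implemented:

```python
# Toy branch-and-bound for 0/1 knapsack: maximize value subject to a weight limit.
# The "relaxation" allows taking a fraction of one item, which yields an upper bound.
values  = [10, 13, 7, 8]      # made-up instance
weights = [4, 6, 3, 5]
capacity = 10

# Sort items by value density once, so the fractional bound is easy to compute.
order = sorted(range(len(values)), key=lambda i: values[i] / weights[i], reverse=True)

def fractional_bound(fixed_value, fixed_weight, free):
    """Upper bound: greedily fill remaining capacity, allowing one fractional item."""
    bound, room = fixed_value, capacity - fixed_weight
    for i in free:
        if weights[i] <= room:
            bound += values[i]; room -= weights[i]
        else:
            return bound + values[i] * room / weights[i]
    return bound

best = 0
def branch(idx, value, weight):
    """Explore items in density order; prune when the bound can't beat the incumbent."""
    global best
    if weight > capacity:
        return
    if idx == len(order):
        best = max(best, value)
        return
    if fractional_bound(value, weight, order[idx:]) <= best:
        return  # prune: even the relaxed subproblem can't improve on the incumbent
    i = order[idx]
    branch(idx + 1, value + values[i], weight + weights[i])  # take item i
    branch(idx + 1, value, weight)                           # skip item i

branch(0, 0, 0)
print("best value:", best)
```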

So OR lives in a world of feasibility and bounds. Now, what happens when ML walks into the room with gradients?

3) The Neural View: Optimization as Function Approximation (ML’s World)

In deep learning, “optimization” usually means fitting the parameters \(\theta\) of a function approximator. We train \(F_\theta\) to map a problem representation \(s\) (e.g., city coordinates, costs, constraints) to an output that matches a target \(y\):

$$ \min_{\theta}\ \mathbb{E}_{(s,y)\sim \mathcal{D}} \left[\mathcal{L}\!\left(F_\theta(s),\,y\right)\right] $$

Stochastic gradient descent (SGD) updates the parameters via:

$$ \theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t), $$

This update follows from first-order optimization: a gradient descent step that locally linearizes the loss and moves parameters in the direction of steepest decrease.

Where:

- \(\theta_t\) are the parameters at step \(t\),
- \(\eta\) is the learning rate (step size),
- \(\nabla_\theta \mathcal{L}(\theta_t)\) is the gradient of the loss with respect to the parameters, typically estimated on a mini-batch (hence “stochastic”).
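
As a minimal sketch of this update rule, here is plain (full-batch) gradient descent on a least-squares loss in NumPy; a real pipeline would use an autodiff framework, mini-batches, and a richer model:

```python
# Gradient descent on a least-squares loss: theta_{t+1} = theta_t - eta * grad.
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=(256, 4))                      # toy "instance representations"
true_theta = np.array([1.0, -2.0, 0.5, 3.0])
y = s @ true_theta + 0.1 * rng.normal(size=256)    # targets with a little noise

theta = np.zeros(4)
eta = 0.05                                         # learning rate
for t in range(500):
    residual = s @ theta - y                       # F_theta(s) - y for a linear model
    grad = 2 * s.T @ residual / len(y)             # gradient of the mean squared error
    theta = theta - eta * grad                     # the update rule from above
print("learned theta:", np.round(theta, 2))
```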

Core distinction: OR optimizes \(x\) for one instance; ML optimizes \(\theta\) to produce good \(x\)'s across many instances.

Bridging the Gap: Learning from a Solver (Imitation Learning)

One practical route is to treat the solver as a teacher. You generate labeled data by solving instances (possibly under time limits), then train a model to imitate either the final solution or intermediate decisions.

Strategy: Behavior Cloning

If the output is a binary decision (e.g., “open facility \(i\)?” in a facility location problem), a common loss is binary cross-entropy:

$$ \mathcal{L}(\theta) = - \frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i) \right] $$

Where:

- \(y_i \in \{0,1\}\) is the solver’s decision for example \(i\) (the “label”),
- \(\hat{y}_i = F_\theta(s_i)\) is the model’s predicted probability of that decision,
- \(N\) is the number of labeled examples.
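
A minimal behavior-cloning sketch in PyTorch, under the assumption that a batch of instances has already been solved offline and per-facility features and solver decisions are stored; every name, shape, and hyperparameter below is illustrative, not a fixed recipe:

```python
# Behavior cloning: fit a classifier to reproduce the solver's open/close decisions.
import torch
import torch.nn as nn

# Illustrative data: one row of features per (instance, facility) pair, and the
# solver's decision (1 = opened, 0 = closed). Here the labels are faked; in
# practice they come from solving instances offline.
features = torch.randn(1024, 8)
labels = (features[:, 0] + features[:, 1] > 0).float().unsqueeze(1)

model = nn.Sequential(
    nn.Linear(8, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),               # output: probability of "open"
)
loss_fn = nn.BCELoss()                            # the binary cross-entropy above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(200):
    opt.zero_grad()
    pred = model(features)
    loss = loss_fn(pred, labels)
    loss.backward()
    opt.step()
print("final BCE:", float(loss))
```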

Important caveat: imitation may fail under distribution shift, and sequential decision imitation can accumulate errors.

The Solution: Hybrid Solver–Learner Systems (augmentation, not replacement)

Because feasibility and optimality guarantees remain difficult for learned models, the industrial standard is augmentation: the model proposes guidance, and the solver enforces correctness. In practice, learning accelerates search while classical optimization preserves feasibility and proofs.
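
One common pattern is to hand the learned prediction to the solver as a warm start (a MIP start, in Gurobi terms): the solver begins from the model’s guess but still enforces feasibility and proves bounds. A hedged sketch, reusing the illustrative `model`, MIP `m`, and open-decision variables `y` from the earlier snippets, and assuming per-facility features `facility_features` have been precomputed:

```python
# Hybrid pattern: the learned model proposes, the exact solver verifies and refines.
import torch

with torch.no_grad():
    proposal = model(facility_features).squeeze(1)   # predicted P(open) per facility

for j in range(n_facilities):
    # Gurobi treats the Start attribute as a hint: it tries to complete it into a
    # feasible incumbent, but feasibility and optimality still come from the solver.
    y[j].Start = 1.0 if proposal[j] > 0.5 else 0.0

m.optimize()   # branch-and-cut still runs; the warm start just gives it a head start
```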

Limitations (What can go wrong, even when the idea is right)

- Distribution shift: a model trained on yesterday’s instance distribution can degrade when the distribution changes.
- Error accumulation: when imitation is applied to sequential decisions, small mistakes compound along the trajectory.
- No guarantees by default: learned outputs may be infeasible or far from optimal unless a solver stays in the loop.
- Label cost: generating training data means solving many instances, possibly under time limits that make the “labels” themselves suboptimal.

Conclusion: A Shared Vocabulary

| Concept | Operations Research (OR) | Machine Learning (ML) |
| --- | --- | --- |
| Goal | Find optimal \(x\) for one instance | Find \(\theta\) that works across a distribution |
| Method | Branch-and-cut, presolve, cutting planes | SGD/Adam; learned scoring/proposals |
| Search space | Discrete / combinatorial | Continuous / differentiable |
| Guarantee | Feasibility; optimality certificate (if solved) | Empirical performance; limited guarantees without strong assumptions |

We are not “teaching networks to solve NP-hardness.” We are teaching them to be useful where solvers are expensive: to propose good candidates, to guide search, and to accelerate repeated decision-making, while keeping correctness in the loop.

Yunus Can Bilge, December 2025