Learning to Optimize

What Happens When Algorithms Start Remembering?


Let’s start with a familiar story. A Traveling Salesman Problem (TSP) instance appears, a solver is invoked, and a solution is returned. End of episode.

Convenient? Yes. But real optimization problems rarely arrive as isolated puzzles. They arrive as families. A logistics company does not solve the routing problem once; it solves routing problems repeatedly, under similar constraints, shaped by the same geography, fleets, and business rules. What changes is the daily demand. What generally stays is the structure.

Once you take that seriously, a natural question follows: if instances come from a distribution, can we extract regularities from past solves and reuse them? Not to replace solvers, but to make them faster, more predictable, and more scalable.

That’s where the conversation between Operations Research and Machine Learning becomes interesting.

This post is structured around three questions:

- How does Operations Research think about optimization, and what do modern solvers actually do?
- How does Machine Learning think about optimization, and what can a learned model realistically offer?
- How do the two fit together in practice, and what can go wrong?

Before we talk about them, we need to see why optimization becomes difficult enough to justify learning in the first place. So we start with complexity.


1) The Complexity Landscape: Why Heuristics Matter

To understand where learning might help, separate two regimes: problems with efficient algorithms, and problems where exact methods face combinatorial explosion in the worst case.

The Easy Case (\(P\)): MST

The Minimum Spanning Tree is a friendly problem. It admits greedy algorithms with provable guarantees. For a graph \(G=(V,E)\), Kruskal’s algorithm builds the MST by repeatedly adding the cheapest edge that doesn’t form a cycle. In such settings, learning is rarely justified on algorithmic grounds (though it may still help under system constraints).
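
To see how simple the easy case really is, here is a minimal Python sketch of Kruskal’s algorithm with a small union-find; the toy graph at the bottom is invented purely for illustration:

```python
# Kruskal's MST: sort edges by weight, add an edge whenever it joins two components.
def kruskal(num_nodes, edges):
    """edges: list of (weight, u, v). Returns the list of MST edges."""
    parent = list(range(num_nodes))

    def find(a):                      # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    mst = []
    for w, u, v in sorted(edges):     # greedy: cheapest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                  # adding this edge creates no cycle
            parent[ru] = rv
            mst.append((u, v, w))
    return mst

# Toy graph on 4 nodes: (weight, u, v)
print(kruskal(4, [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]))
```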

Verdict: learning is not needed to overcome worst-case complexity here.

The Hard Case (NP-hard Optimization): TSP

In TSP, a locally appealing step can be globally damaging. Exact methods must, in general, navigate a search space that grows exponentially with \(n\) (though practical performance depends on structure and relaxations).
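
A brute-force sketch makes the growth concrete: enumerating all tours is \((n-1)!\) work, manageable for eight cities and hopeless for eighty (the coordinates below are random, purely for illustration):

```python
# Brute-force TSP: try every tour. Fine for ~10 cities, hopeless beyond that,
# since the number of distinct tours grows as (n-1)!.
import itertools, math, random

random.seed(0)
n = 8
cities = [(random.random(), random.random()) for _ in range(n)]

def tour_length(order):
    return sum(math.dist(cities[order[i]], cities[order[(i + 1) % n]]) for i in range(n))

# Fix city 0 as the start to avoid counting rotations of the same tour.
best = min(((0,) + perm for perm in itertools.permutations(range(1, n))), key=tour_length)
print("best length:", round(tour_length(best), 3), "out of", math.factorial(n - 1), "tours")
```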

Verdict: this is where learning becomes tempting, not because NP-hardness implies “learnable,” but because repeated, structured instances invite amortization.

Complexity explains why we want help. Geometry explains why help is hard. OR and ML reason in different mathematical spaces.

2) The Classical View: Optimization as Geometry (OR’s World)

In mixed-integer programming (MIP), optimization is a search over a feasible region. A general constrained problem can be written as:

$$ \begin{aligned} \min_{x} \quad & f(x) \\ \text{s.t.} \quad & g_i(x) \le 0, \quad i=1,\dots,m \\ & h_j(x) = 0, \quad j=1,\dots,k \\ & x \in \mathcal{X} \end{aligned} $$

Where:

- \(f(x)\) is the objective to be minimized,
- \(g_i(x) \le 0\) are inequality constraints,
- \(h_j(x) = 0\) are equality constraints,
- \(\mathcal{X}\) captures domain restrictions, such as integrality of some components of \(x\) in a MIP.
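
To make the template concrete, here is a minimal sketch of a tiny facility-location MIP written against the gurobipy API (assuming Gurobi is installed; any modeling layer would do, and the data is invented for illustration):

```python
# A tiny facility-location MIP: which facilities to open, and which facility
# serves each customer, to minimize total cost.
import gurobipy as gp
from gurobipy import GRB

open_cost = [4.0, 3.0, 5.0]                      # cost of opening facility j
serve_cost = [[2.0, 7.0, 5.0],                   # cost of serving customer i
              [6.0, 3.0, 4.0],                   # from facility j (toy numbers)
              [5.0, 4.0, 2.0]]
n_customers, n_facilities = len(serve_cost), len(open_cost)

m = gp.Model("facility_location")
y = m.addVars(n_facilities, vtype=GRB.BINARY, name="open")    # integrality lives in X
x = m.addVars(n_customers, n_facilities, vtype=GRB.BINARY, name="serve")

# Equality constraints (the h's): every customer is served by exactly one facility.
m.addConstrs(x.sum(i, "*") == 1 for i in range(n_customers))
# Inequality constraints (the g's): a customer can only use an open facility.
m.addConstrs(x[i, j] <= y[j] for i in range(n_customers) for j in range(n_facilities))

m.setObjective(
    gp.quicksum(open_cost[j] * y[j] for j in range(n_facilities))
    + gp.quicksum(serve_cost[i][j] * x[i, j]
                  for i in range(n_customers) for j in range(n_facilities)),
    GRB.MINIMIZE,
)
m.optimize()
print("opened:", [j for j in range(n_facilities) if y[j].X > 0.5])
```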

How Solvers (e.g., Gurobi) Work

Modern solvers are best described as branch-and-cut systems (branch-and-bound plus cutting planes, presolve, propagation, and primal heuristics). A simplified view:

- Presolve: simplify the model and tighten variable bounds before search begins.
- Relax: solve the LP relaxation to get a bound on the best achievable objective.
- Branch: if the relaxation is fractional, split on an integer variable and recurse on the subproblems.
- Cut: add valid inequalities that tighten the relaxation without excluding any integer-feasible point.
- Bound and prune: use primal heuristics to find incumbents, and discard subproblems whose bound cannot beat the best incumbent.

Geometric goal (precise): solvers tighten the relaxation toward the convex hull of integer-feasible points (they do not “tighten the convex hull” itself), because tighter relaxations reduce the search tree dramatically.
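
To see the branch-and-bound skeleton in isolation (no cuts, presolve, or heuristics), here is a toy pure-Python sketch for a 0/1 knapsack, a maximization problem where the fractional relaxation gives an upper bound; treat it as a cartoon of the loop above, not as how production solvers are implemented:

```python
# Toy branch-and-bound for 0/1 knapsack: maximize value subject to a weight limit.
# The "relaxation" allows taking a fraction of one item, which yields an upper bound.
values  = [10, 13, 7, 8]      # made-up instance
weights = [4, 6, 3, 5]
capacity = 10

# Sort items by value density once, so the fractional bound is easy to compute.
order = sorted(range(len(values)), key=lambda i: values[i] / weights[i], reverse=True)

def fractional_bound(fixed_value, fixed_weight, free):
    """Upper bound: greedily fill remaining capacity, allowing one fractional item."""
    bound, room = fixed_value, capacity - fixed_weight
    for i in free:
        if weights[i] <= room:
            bound += values[i]; room -= weights[i]
        else:
            return bound + values[i] * room / weights[i]
    return bound

best = 0
def branch(idx, value, weight):
    """Explore items in density order; prune when the bound can't beat the incumbent."""
    global best
    if weight > capacity:
        return
    if idx == len(order):
        best = max(best, value)
        return
    if fractional_bound(value, weight, order[idx:]) <= best:
        return  # prune: even the relaxed subproblem can't improve on the incumbent
    i = order[idx]
    branch(idx + 1, value + values[i], weight + weights[i])  # take item i
    branch(idx + 1, value, weight)                           # skip item i

branch(0, 0, 0)
print("best value:", best)
```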

So OR lives in a world of feasibility and bounds. Now, what happens when ML walks into the room with gradients?

3) The Neural View: Optimization as Function Approximation (ML’s World)

In deep learning, “optimization” usually means fitting the parameters \(\theta\) of a function approximator. We train \(F_\theta\) to map a problem representation \(s\) (e.g., city coordinates, costs, constraints) to an output that matches a target \(y\):

$$ \min_{\theta}\ \mathbb{E}_{(s,y)\sim \mathcal{D}} \left[\mathcal{L}\!\left(F_\theta(s),\,y\right)\right] $$

Stochastic gradient descent (SGD) updates the parameters via:

$$ \theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t), $$

This update follows from first-order optimization: a gradient descent step that locally linearizes the loss and moves parameters in the direction of steepest decrease.

Where:

- \(\theta_t\) are the parameters at step \(t\),
- \(\eta\) is the learning rate (step size),
- \(\nabla_\theta \mathcal{L}(\theta_t)\) is the gradient of the loss with respect to the parameters, typically estimated on a mini-batch (hence “stochastic”).
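
As a minimal sketch of this update rule, here is plain (full-batch) gradient descent on a least-squares loss in NumPy; a real pipeline would use an autodiff framework, mini-batches, and a richer model:

```python
# Gradient descent on a least-squares loss: theta_{t+1} = theta_t - eta * grad.
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=(256, 4))                      # toy "instance representations"
true_theta = np.array([1.0, -2.0, 0.5, 3.0])
y = s @ true_theta + 0.1 * rng.normal(size=256)    # targets with a little noise

theta = np.zeros(4)
eta = 0.05                                         # learning rate
for t in range(500):
    residual = s @ theta - y                       # F_theta(s) - y for a linear model
    grad = 2 * s.T @ residual / len(y)             # gradient of the mean squared error
    theta = theta - eta * grad                     # the update rule from above
print("learned theta:", np.round(theta, 2))
```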

Core distinction: OR optimizes \(x\) for one instance; ML optimizes \(\theta\) to produce good \(x\)'s across many instances.

Bridging the Gap: Learning from a Solver (Imitation Learning)

One practical route is to treat the solver as a teacher. You generate labeled data by solving instances (possibly under time limits), then train a model to imitate either the final solution or intermediate decisions.

Strategy: Behavior Cloning

If the output is a binary decision (e.g., “open facility \(i\)?” in a facility location problem), a common loss is binary cross-entropy:

$$ \mathcal{L}(\theta) = - \frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i) \right] $$

Where:

- \(y_i \in \{0,1\}\) is the solver’s decision for example \(i\) (the “label”),
- \(\hat{y}_i = F_\theta(s_i)\) is the model’s predicted probability of that decision,
- \(N\) is the number of labeled examples.
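
A minimal behavior-cloning sketch in PyTorch, under the assumption that a batch of instances has already been solved offline and per-facility features and solver decisions are stored; every name, shape, and hyperparameter below is illustrative, not a fixed recipe:

```python
# Behavior cloning: fit a classifier to reproduce the solver's open/close decisions.
import torch
import torch.nn as nn

# Illustrative data: one row of features per (instance, facility) pair, and the
# solver's decision (1 = opened, 0 = closed). Here the labels are faked; in
# practice they come from solving instances offline.
features = torch.randn(1024, 8)
labels = (features[:, 0] + features[:, 1] > 0).float().unsqueeze(1)

model = nn.Sequential(
    nn.Linear(8, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),               # output: probability of "open"
)
loss_fn = nn.BCELoss()                            # the binary cross-entropy above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(200):
    opt.zero_grad()
    pred = model(features)
    loss = loss_fn(pred, labels)
    loss.backward()
    opt.step()
print("final BCE:", float(loss))
```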

Important caveat: imitation may fail under distribution shift, and sequential decision imitation can accumulate errors.

The Solution: Hybrid Solver–Learner Systems (augmentation, not replacement)

Because feasibility and optimality guarantees remain difficult for learned models, the industrial standard is augmentation: the model proposes guidance, and the solver enforces correctness. In practice, learning accelerates search while classical optimization preserves feasibility and proofs.
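
One common pattern is to hand the learned prediction to the solver as a warm start (a MIP start, in Gurobi terms): the solver begins from the model’s guess but still enforces feasibility and proves bounds. A hedged sketch, reusing the illustrative `model`, MIP `m`, and open-decision variables `y` from the earlier snippets, and assuming per-facility features `facility_features` have been precomputed:

```python
# Hybrid pattern: the learned model proposes, the exact solver verifies and refines.
import torch

with torch.no_grad():
    proposal = model(facility_features).squeeze(1)   # predicted P(open) per facility

for j in range(n_facilities):
    # Gurobi treats the Start attribute as a hint: it tries to complete it into a
    # feasible incumbent, but feasibility and optimality still come from the solver.
    y[j].Start = 1.0 if proposal[j] > 0.5 else 0.0

m.optimize()   # branch-and-cut still runs; the warm start just gives it a head start
```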

Limitations (What can go wrong, even when the idea is right)

- Distribution shift: a model trained on yesterday’s instance distribution can degrade when the distribution changes.
- Error accumulation: when imitation is applied to sequential decisions, small mistakes compound along the trajectory.
- No guarantees by default: learned outputs may be infeasible or far from optimal unless a solver stays in the loop.
- Label cost: generating training data means solving many instances, possibly under time limits that make the “labels” themselves suboptimal.

Conclusion: A Shared Vocabulary

| Concept | Operations Research (OR) | Machine Learning (ML) |
| --- | --- | --- |
| Goal | Find optimal \(x\) for one instance | Find \(\theta\) that works across a distribution |
| Method | Branch-and-cut, presolve, cutting planes | SGD/Adam; learned scoring/proposals |
| Search space | Discrete / combinatorial | Continuous / differentiable |
| Guarantee | Feasibility; optimality certificate (if solved) | Empirical performance; limited guarantees without strong assumptions |

We are not “teaching networks to solve NP-hardness.” We are teaching them to be useful where solvers are expensive: to propose good candidates, to guide search, and to accelerate repeated decision-making, while keeping correctness in the loop.

Yunus Can Bilge, December 2025