Let’s start with a familiar story. A TSP instance appears, a solver is invoked, and a solution is returned. End of episode.
Convenient? Yes. But real optimization problems rarely arrive as isolated puzzles. They arrive as families. A logistics company does not solve the routing problem once; it solves routing problems repeatedly, under similar constraints, shaped by the same geography, fleets, and business rules. What changes is the daily demand. What generally stays is the structure.
Once you take that seriously, a natural question follows: if instances come from a distribution, can we extract regularities from past solves and reuse them? Not to replace solvers, but to make them faster, more predictable, and more scalable.
That’s where the conversation between Operations Research and Machine Learning becomes interesting.
Before we get to that conversation, we need to see why optimization becomes difficult enough to justify learning in the first place. So we start with complexity.
To understand where learning might help, separate two regimes: problems with efficient algorithms, and problems where exact methods face combinatorial explosion in the worst case.
The Minimum Spanning Tree (MST) is a friendly problem. It admits greedy algorithms with provable guarantees. For a graph \(G=(V,E)\), Kruskal’s algorithm builds the MST by repeatedly adding the cheapest edge that doesn’t form a cycle. In such settings, learning is rarely justified on algorithmic grounds (though it may still help under system constraints).
Verdict: learning is not needed to overcome worst-case complexity here.
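To make the friendly regime concrete, here is a minimal Kruskal sketch in Python (a toy illustration with an assumed edge-list input, not code from the original post): sort edges by weight and use a union-find structure to skip any edge that would close a cycle.

```python
def kruskal_mst(n, edges):
    """Minimal Kruskal's algorithm.

    n: number of vertices (labeled 0..n-1)
    edges: list of (weight, u, v) tuples
    Returns (total_weight, list of chosen edges).
    """
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst, total = [], 0
    for w, u, v in sorted(edges):      # cheapest edge first
        ru, rv = find(u), find(v)
        if ru != rv:                   # skip edges that would form a cycle
            parent[ru] = rv
            mst.append((u, v, w))
            total += w
    return total, mst

# Toy usage: a 4-vertex graph.
edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]
print(kruskal_mst(4, edges))  # greedy choices are provably optimal here
```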
In TSP, a locally appealing step can be globally damaging. Exact methods must, in general, navigate a search space that grows exponentially with \(n\) (though practical performance depends on structure and relaxations).
Verdict: this is where learning becomes tempting, not because NP-hardness implies “learnable,” but because repeated, structured instances invite amortization.
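For contrast, here is a brute-force exact TSP sketch (a toy illustration, not from the post): it enumerates every tour over a small distance matrix, and the number of tours it must check grows factorially with the number of cities.

```python
import itertools
import math

def tsp_brute_force(dist):
    """Exact TSP by enumerating all tours that start and end at city 0.

    dist: symmetric n x n distance matrix.
    Returns (best_length, best_tour). Enumerates (n-1)! tours.
    """
    n = len(dist)
    best_len, best_tour = math.inf, None
    for perm in itertools.permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        length = sum(dist[a][b] for a, b in zip(tour, tour[1:]))
        if length < best_len:
            best_len, best_tour = length, tour
    return best_len, best_tour

# Toy usage: 4 cities means 3! = 6 tours; at n = 20 it would be 19! ≈ 1.2e17.
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 8],
    [10, 4, 8, 0],
]
print(tsp_brute_force(dist))
```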
Complexity explains why we want help. Geometry explains why help is hard. OR and ML reason in different mathematical spaces.
In mixed-integer programming (MIP), optimization is a search over a feasible region. A general constrained problem can be written as:

$$ \begin{aligned} \min_{x} \quad & f(x) \\ \text{s.t.} \quad & g_i(x) \le 0, \quad i=1,\dots,m \\ & h_j(x) = 0, \quad j=1,\dots,k \\ & x \in \mathcal{X} \end{aligned} $$

Where:
- \(f\) is the objective function to minimize,
- \(g_i\) are the inequality constraints and \(h_j\) the equality constraints,
- \(\mathcal{X}\) is the domain, which in MIP includes integrality requirements on some variables (e.g., \(x_j \in \mathbb{Z}\) or \(x_j \in \{0,1\}\)).
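As one concrete instance of this abstract form, here is a tiny integer program modeled with the open-source PuLP package (the package choice and the coefficients are my own illustrative assumptions, not part of the article):

```python
# A toy MIP in the abstract form above, modeled with PuLP (assumes `pulp` is installed).
from pulp import LpProblem, LpMinimize, LpVariable, LpStatus, value

prob = LpProblem("tiny_mip", LpMinimize)

# x in X: two non-negative integer variables.
x = LpVariable("x", lowBound=0, cat="Integer")
y = LpVariable("y", lowBound=0, cat="Integer")

# f(x): the objective.
prob += 3 * x + 2 * y

# Constraints (PuLP accepts them in natural <=/>= form).
prob += x + y >= 4
prob += x <= 3

prob.solve()
print(LpStatus[prob.status], value(x), value(y), value(prob.objective))
```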
Modern solvers are best described as branch-and-cut systems (branch-and-bound plus cutting planes, presolve, propagation, and primal heuristics). A simplified view:
1. Solve a relaxation of the problem (typically the LP relaxation, with integrality dropped).
2. If the relaxed solution already satisfies the integrality requirements, that node is done.
3. Otherwise, branch on a fractional variable to create subproblems, and add cutting planes to tighten the relaxation.
4. Prune any subproblem whose bound cannot beat the best feasible solution found so far.
Geometric goal (precise): solvers tighten the relaxation toward the convex hull of integer-feasible points (they do not “tighten the convex hull” itself), because tighter relaxations reduce the search tree dramatically.
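To make the branch-and-bound skeleton concrete, here is a minimal pure-Python sketch on a 0/1 knapsack, chosen only because it keeps the code short (it is not the article's example): the relaxation supplies a bound, branching splits on include/exclude, and pruning discards nodes whose bound cannot beat the incumbent.

```python
# A minimal branch-and-bound sketch on a 0/1 knapsack (a toy stand-in for the solver
# loop above; real branch-and-cut adds presolve, cuts, propagation, and heuristics).

def fractional_bound(values, weights, capacity, taken_value, taken_weight, j):
    """Upper bound from the relaxation: fill remaining capacity greedily,
    allowing one fractional item. Items are assumed sorted by value/weight."""
    bound, room = taken_value, capacity - taken_weight
    for v, w in zip(values[j:], weights[j:]):
        if w <= room:
            bound, room = bound + v, room - w
        else:
            return bound + v * room / w   # fractional piece: valid for the relaxation only
    return bound

def branch_and_bound_knapsack(values, weights, capacity):
    # Sort items by value density so the greedy relaxation bound is valid.
    order = sorted(range(len(values)), key=lambda i: values[i] / weights[i], reverse=True)
    values = [values[i] for i in order]
    weights = [weights[i] for i in order]

    best = 0
    # Each node: (next item index, value so far, weight so far).
    stack = [(0, 0, 0)]
    while stack:
        j, val, wt = stack.pop()
        if j == len(values):
            best = max(best, val)
            continue
        # Prune: if even the relaxed bound cannot beat the incumbent, stop.
        if fractional_bound(values, weights, capacity, val, wt, j) <= best:
            continue
        # Branch: exclude item j, then include it if it fits.
        stack.append((j + 1, val, wt))
        if wt + weights[j] <= capacity:
            stack.append((j + 1, val + values[j], wt + weights[j]))
    return best

print(branch_and_bound_knapsack([10, 13, 7, 8], [3, 4, 2, 3], 7))  # -> 23
```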
So OR lives in a world of feasibility and bounds. Now, what happens when ML walks into the room with gradients?
In deep learning, “optimization” usually means fitting parameters \(\theta\) of a function approximator. We train \(F_\theta\) to map a problem representation \(s\) (e.g., city coordinates, costs, constraints) to an output \(y\).

$$ \min_{\theta}\ \mathbb{E}_{(s,y)\sim \mathcal{D}} \left[\mathcal{L}\!\left(F_\theta(s),\,y\right)\right] $$

SGD updates parameters via:

$$ \theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t) $$

This update follows from first-order optimization: a gradient descent step that locally linearizes the loss and moves parameters in the direction of steepest decrease.

Where:
- \(\theta_t\) are the parameters at step \(t\),
- \(\eta\) is the learning rate (step size),
- \(\nabla_\theta \mathcal{L}(\theta_t)\) is the gradient of the loss at \(\theta_t\), in practice estimated on a minibatch.
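A minimal NumPy sketch of this update, fitting a one-parameter linear model on synthetic data (the data and model are illustrative assumptions, not from the post):

```python
import numpy as np

# Toy data: y = 3*s + noise; theta is a single slope parameter.
rng = np.random.default_rng(0)
s = rng.normal(size=256)
y = 3.0 * s + 0.1 * rng.normal(size=256)

theta, eta = 0.0, 0.1
for t in range(100):
    # Sample a minibatch, compute the squared-error loss gradient, and step.
    idx = rng.integers(0, len(s), size=32)
    grad = np.mean(2 * (theta * s[idx] - y[idx]) * s[idx])  # d/d(theta) of the MSE
    theta = theta - eta * grad                               # theta_{t+1} = theta_t - eta * grad

print(round(theta, 2))  # close to 3.0
```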
Core distinction: OR optimizes \(x\) for one instance; ML optimizes \(\theta\) to produce good \(x\)'s across many instances.
One practical route is to treat the solver as a teacher. You generate labeled data by solving instances (possibly under time limits), then train a model to imitate either the final solution or intermediate decisions.
If the output is a binary decision (e.g., “open facility \(i\)?” in a facility location problem), a common loss is binary cross entropy:

$$ \mathcal{L}(\theta) = - \frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i) \right] $$

Where:
- \(N\) is the number of labeled decisions,
- \(y_i \in \{0,1\}\) is the solver’s decision (the label),
- \(\hat{y}_i = F_\theta(s_i)\) is the model’s predicted probability of that decision.
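A minimal training sketch for this setup, assuming PyTorch; the feature dimension, the small MLP, and the random tensors are hypothetical stand-ins for solver-labeled facility-location examples:

```python
import torch
import torch.nn as nn

# Hypothetical solver-labeled data: each row describes one facility in one instance;
# the label is 1 if the solver opened that facility, else 0.
features = torch.randn(1024, 16)            # placeholder instance features
labels = (torch.rand(1024) < 0.3).float()   # placeholder "open facility?" labels

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.BCEWithLogitsLoss()            # numerically stable binary cross entropy
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    logits = model(features).squeeze(-1)    # scores before the sigmoid
    loss = loss_fn(logits, labels)          # the cross entropy above, averaged over N
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference time, sigmoid(logits) gives the probability of opening each facility,
# which is handed to a solver as guidance rather than used as-is.
```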
Because feasibility and optimality guarantees remain difficult for learned models, the industrial standard is augmentation: the model proposes guidance, and the solver enforces correctness. In practice, learning accelerates search while classical optimization preserves feasibility and proofs.
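One way this division of labor can look in code; the two callables below (`predict_open_probabilities`, `solve_restricted_mip`) are hypothetical stand-ins for a trained model and an exact solver, so treat this as a sketch of the pattern rather than a real API:

```python
def solve_with_learned_guidance(instance, predict_open_probabilities,
                                solve_restricted_mip, fix_threshold=0.95):
    """Augmentation pattern: the model proposes, the solver disposes.

    predict_open_probabilities: hypothetical trained model, instance -> {facility: prob}.
    solve_restricted_mip: hypothetical exact solver that accepts fixed variables and
        returns a provably feasible solution, or None if the restriction is infeasible.
    """
    probs = predict_open_probabilities(instance)

    # Only fix decisions the model is very confident about; leave the rest to the solver.
    fixed = {f: 1 for f, p in probs.items() if p >= fix_threshold}
    fixed.update({f: 0 for f, p in probs.items() if p <= 1 - fix_threshold})

    solution = solve_restricted_mip(instance, fixed_vars=fixed)
    if solution is None:
        # Fallback: the learned guidance made the problem infeasible, so drop it.
        solution = solve_restricted_mip(instance, fixed_vars={})
    return solution
```

If the restricted problem turns out to be infeasible, the sketch simply falls back to the unguided solve, which preserves correctness at the cost of speed.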
| Concept | Operations Research (OR) | Machine Learning (ML) |
|---|---|---|
| Goal | Find optimal \(x\) for one instance | Find \(\theta\) that works across a distribution |
| Method | Branch-and-cut, presolve, cutting planes | SGD/Adam; learned scoring/proposals |
| Search space | Discrete / combinatorial | Continuous / differentiable |
| Guarantee | Feasibility; optimality certificate (if solved) | Empirical performance; limited guarantees without strong assumptions |
We are not “teaching networks to solve NP-hardness.” We are teaching them to be useful where solvers are expensive: to propose good candidates, to guide search, and to accelerate repeated decision-making, while keeping correctness in the loop.
Yunus Can Bilge, December 2025