MLflow fills the boundary between experiment and model in the ML lifecycle. On the experiment side it remembers “what parameters trained what.” On the model side it holds “which version is production right now.” That boundary is not a problem exclusive to large ML teams. It shows up just as clearly when you’re running a single Logistic Regression model.

This post looks at MLflow itself: what it is, where it is positioned in the ML lifecycle, and which pieces a lightweight team can pick.

ML Lifecycle

ML projects roughly move through four stages.

  1. Experiment — you look at the data, try parameters, train models. You log metrics and go back to try again.
  2. Model — once you have something worth keeping, you declare “this is our model.” A version and a lineage attach to it.
  3. Deployment — you put that model into the serving environment. Rollout, rollback, traffic shifting all live here.
  4. Monitoring — you watch the live model for drift and degradation.

Each stage has its own problem. Experiment struggles with remembering what was tried. Model struggles with agreeing on which one is real right now. Deployment struggles with swapping one thing for another. Monitoring struggles with deciding when to retrain.

MLflow mostly fills the first two. It reaches into deployment and monitoring, but its center of gravity is in experiment and model.

The model_v3_final.pkl Problem

To see why a tool belongs at this boundary, first look at what breaks when you try to manage models with just a filesystem.

It starts simple. You upload model.pkl to S3 and the inference server reads it. Every time training finishes, you overwrite the file.

Then you need to roll back. Yesterday’s version. But the file has already been overwritten. So you start splitting: model_v2.pkl, model_v3.pkl. Before long you get model_v3_final.pkl. Then model_v3_final_really.pkl.

These names leave three things unsolved.

  • Lineage — there is no way to trace model_v3_final.pkl back to the code, the data, and the parameters that produced it. Even with the same code in hand, you cannot reproduce it.
  • Alias — “which model is production right now” gets managed with a filename convention outside the code. Does the inference server read latest.pkl? Does it take a version via env var? Every decision becomes ad hoc.
  • Reproducibility — a few months later you want to repeat an experiment, but nothing remembers the parameters and the code from that run.

To fix these you eventually need a metadata layer on top of the “filename” layer. That is the gap MLflow fills.

The Four Pieces of MLflow

MLflow is four independent components in one package. You pick which ones you use.

Tracking

Tracking records a training session as a run. Parameters, metrics, and artifacts (model files, plots, logs) all attach to the run. Multiple runs group under an experiment.

import mlflow

with mlflow.start_run():
    mlflow.log_param("C", 0.1)
    mlflow.log_metric("val_auc", 0.782)
    mlflow.sklearn.log_model(model, "model")

This chunk is the seed of lineage. Months later you can ask “val_auc was 0.78, what were the parameters?” and have an answer.

Model Registry

If Tracking records how it was trained, Registry records which result you want to declare yours. You promote one of the logged artifacts into a registered model and a version number attaches. v1, v2, v3 stack up automatically.

On top of those versions you can attach an alias. The champion alias is a mutable reference that points to a specific version. When a new version passes validation, you move the champion alias. No code changes, no filename rules, just one alias moving. The “model that production points at” is replaced.

from mlflow import MlflowClient

mlflow.register_model("runs:/<run-id>/model", name="ctr-model")
client = MlflowClient()
client.set_registered_model_alias("ctr-model", "champion", version=7)

Registry clears the entire model_v3_final.pkl problem. Lineage auto-links to runs, aliases replace filename rules, and reproduction is a matter of looking up a run id.

One important constraint: the Registry requires a database backend. File storage alone (./mlruns) does not expose the registry API. Even to start light, you have to stand up at least SQLite, or PostgreSQL / MySQL for real use. Since MLflow 3.7.0 the default backend is SQLite, which lowers the entry barrier a little.

Models

This piece standardizes what “a model file” means. Each framework (sklearn, pytorch, xgboost) gets a flavor, and the same model can be saved under multiple flavors. A saved model can then be loaded through a generic interface without writing framework-specific code.

Models is the portability layer between experiment and deployment. Where Tracking and Registry deal with which model is it, Models deals with how is it serialized.

Projects

An MLproject file plus conda or docker config wraps everything so that “anyone running it gets the same environment.” mlflow run . sets up the environment and runs training.
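For reference, an MLproject file is a short YAML manifest. A hypothetical one for the LR batch might look like this (train.py and conda.yaml are assumed, not real files from this post):

```yaml
name: ctr-training
conda_env: conda.yaml          # or docker_env for a container image
entry_points:
  main:
    parameters:
      C: {type: float, default: 0.1}
    command: "python train.py --C {C}"
```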

This is the least used of the four. Teams with their own batch execution standard usually don’t add an MLproject layer; they keep their own standard.

Lifecycle Mapping

flowchart LR
    E[Experiment] -->|"Tracking<br/>(run, param, metric)"| M[Model]
    M -->|"Registry<br/>(version, alias)"| D[Deployment]
    M -.->|"Models<br/>(flavor)"| D
    P[Projects] -.->|"runtime env"| E
    D --> Mo[Monitoring]
  • Tracking: inside the experiment stage
  • Registry: in the model box between experiment and deployment
  • Models: the portability axis from model into deployment
  • Projects: an optional reproducibility layer on experiment
  • Monitoring is not covered by MLflow directly. It’s another tool’s job.

That diagram shows MLflow’s scope most concisely. Four pieces, each in its own slot, and your project picks which slots to fill.

Choosing Tracking and Registry

Picture a lightweight LR model running in production. The combination you most often see, out of the four pieces, is two: Tracking and Registry.

Why Tracking. The training batch re-runs LR every cycle, and each run has different parameters and validation metrics. You need to trace, later, which run produced which number. The records pile up faster than filenames can describe them. Tracking fills exactly that gap.

Why Registry. Only the models that pass the validation step should become “champion.” The inference server loads that champion. If you manage this with filename conventions, the server ends up polling latest.pkl and you get a race where an unvalidated model reaches production before validation finishes. Aliases remove that race. The actor pulling the deploy trigger and the object being deployed are cleanly separated.

Moving the alias and swapping the inference server pods are two different events. Once the alias moves, a deployment tool (for example, Argo Rollouts) triggers the pod replacement. Rollouts starts new pods; each new pod, on boot, loads the model that champion currently points at and joins the service. MLflow says “which one is champion,” and the deployment tool handles “how to place it into service.”

This separation is the point. MLflow does not need to do everything. It just needs to fill its boundary.

Components Not Used

The Models format comes along for free when you log models through Tracking. You don’t pick it explicitly, but you get its benefits: Registry can hand the model back as runs:/<id>/model precisely because of this format.

Projects is often skipped. If a team already has a stable batch execution standard, adding an MLproject layer is duplication. When a batch runs inside a single framework, the reproducibility win from Projects is small.

Serving is also optional. MLflow ships its own serving endpoint (mlflow models serve), but for a lightweight model, running inference with sklearn directly inside an existing server is often lighter and integrates better with existing infrastructure. Delegating the serving layer to MLflow is rarely justified.

Using two pieces out of four is not “half using” MLflow. Filling only the boundary you need and leaving the rest to other tools is, if anything, closer to how this tool is meant to be used.

Closing

The word was “boundary.” That boundary is where meta-information (when, how, with what, which one is real right now) starts piling up faster than filenames can describe it. MLflow is the lightweight metadata layer at that point. How lightweight depends on you.

It isn’t a tool for large ML teams only. Even running a single LR, the same boundary shows up. When it does, you fill the slots you need.