Experiment Tracking & Model Registry: The Version Control for ML
How production ML teams manage the model lifecycle from experiment to production — MLflow vs Weights & Biases, what metadata a model must carry, promotion workflows with gated approvals, model lineage for debugging, and the rollback mechanism that makes safe deployments possible.
The Problem: ML Without Version Control
Imagine deploying software without Git. No record of what changed, who changed it, or why. No ability to revert to a working state. That's the situation for ML teams without experiment tracking and a model registry.
Typical failures without experiment tracking:
- 'Which experiment produced the model that's currently in production?' Nobody knows — the Jupyter notebook was overwritten.
- 'Production accuracy dropped today. Did we deploy a new model yesterday?' No record. The engineer who trained it is on vacation.
- 'Let's revert to last month's model.' Which one? The weights file in S3 doesn't have metadata. What dataset was it trained on?
- 'Why does the model score user_id=12345 as high-risk?' No feature values were logged at prediction time. Cannot debug.
Experiment tracking solves: 'which code, data, and hyperparameters produced this model, and what were the results?'
Model registry solves: 'which model is in production right now, what's its lineage, and how do I roll it back?'
They are complementary: tracking manages the experiment phase, the registry manages the production lifecycle.
Setting Up Experiment Tracking — What to Log at Each Stage
Log all hyperparameters and data configuration at run start
mlflow.log_params() should capture everything that would cause different model behavior: learning rate, batch size, n_estimators, regularization, loss function, data path (S3 URI + version), feature selection logic, train/val split strategy. If you didn't log it and the experiment produces a great result, you cannot reproduce it.
Log metrics at each meaningful step, not just at the end
Log val_loss, val_auc_pr, and custom business metrics (recall_at_80pct_precision for fraud, NDCG@10 for ranking) at each epoch or K training steps. This enables early stopping analysis, learning curve comparison across runs, and identifying when a run diverged. A single final-epoch metric is nearly useless for debugging.
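A minimal sketch of this pattern, assuming a generic training loop in which train_one_epoch and evaluate are hypothetical helpers standing in for your own code:

import mlflow

with mlflow.start_run(run_name='fraud_model_xgb_v3'):
    for epoch in range(num_epochs):
        train_one_epoch(model, train_loader)
        val_loss, val_auc_pr = evaluate(model, val_loader)
        # Logging with step=epoch produces learning curves that can be
        # overlaid and compared across runs in the tracking UI.
        mlflow.log_metrics({'val_loss': val_loss, 'val_auc_pr': val_auc_pr}, step=epoch)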
Log model artifacts with signature and input example
mlflow.log_model() should include: (1) the model artifact, (2) an infer_signature() output capturing input feature names and dtypes, (3) an input_example from the training set. The signature is your contract — serving infrastructure validates inputs against it. Missing signature = silent serving skew when a feature is renamed or added.
Tag runs for queryability
Set mlflow.set_tags() with: model_type, dataset_version, feature_set_version, experiment_owner. Without tags, finding 'all XGBoost runs on the v3 feature set' requires reading every run's parameters. Tags enable structured queries: mlflow.search_runs(filter_string="tags.model_type = 'XGBoost' and tags.dataset_version = 'v3'").
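A short sketch of the same idea in code (tag values are illustrative):

import mlflow

with mlflow.start_run():
    mlflow.set_tags({
        'model_type': 'XGBoost',
        'dataset_version': 'v3',
        'feature_set_version': 'fs_v3',    # illustrative
        'experiment_owner': 'fraud-team',  # illustrative
    })

# Later: a structured query instead of reading every run's parameters.
runs = mlflow.search_runs(
    filter_string="tags.model_type = 'XGBoost' and tags.dataset_version = 'v3'"
)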
Promote to registry only after validation, with explicit stage transitions
Never promote directly to Production. The lifecycle must be: Staging (automated tests pass: schema validation, latency benchmark, shadow traffic comparison) → Production (manual sign-off from model owner) → Archived (after new Production model is stable for 48+ hours). Keep the previous Production version in Archived for 30 days minimum to enable emergency rollback.
MLflow — The Open-Source Standard
MLflow is the most widely adopted open-source experiment tracking and model registry system. Four components:
MLflow Tracking: Logs parameters (hyperparameters, data paths), metrics (loss curves, AUC at each epoch), artifacts (model weights, confusion matrix plots, feature importance charts). Every training run creates a unique run_id. Runs are organized into experiments (one per model type or project). Access via UI (browser-based) or Python API.
import mlflow
import mlflow.xgboost

# Assumes `model`, `X_train`, `predictions`, and `epoch` come from the
# surrounding training code.
with mlflow.start_run(run_name='fraud_model_xgb_v3') as run:
    mlflow.log_params({'n_estimators': 500, 'learning_rate': 0.05, 'max_depth': 6})
    mlflow.log_metric('val_auc_pr', 0.847, step=epoch)        # per-epoch metric
    mlflow.log_metric('val_recall_at_80pct_precision', 0.63)  # business metric
    mlflow.xgboost.log_model(
        model, 'model',
        signature=mlflow.models.infer_signature(X_train, predictions),
        input_example=X_train[:5],
    )
    run_id = run.info.run_id  # keep for registry promotion later
MLflow Projects: Package ML code in a reproducible format (conda.yaml or Docker). mlflow run executes a project in an isolated environment. Solves 'it works on my machine' problems.
MLflow Models: A standard format for packaging ML models from any framework (PyTorch, sklearn, XGBoost). Includes: the model artifact, a MLmodel file (metadata), a conda.yaml (dependencies), input/output signature (expected feature names and types). The signature is critical: it validates that the serving infrastructure provides exactly the features the model expects.
MLflow Model Registry: Versioned catalog of models. A registered model has named versions (version 1, 2, 3...). Each version has a Stage: None → Staging → Production → Archived. Only one version should be in Production at a time. APIs to transition versions between stages.
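A minimal sketch of that flow, assuming the run_id from the earlier tracking snippet and an illustrative registered-model name (recent MLflow versions also offer aliases, but the classic stage API looks like this):

import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged under the earlier run as a new version.
mv = mlflow.register_model(f'runs:/{run_id}/model', 'fraud_model')

client = MlflowClient()

# Promote to Staging once offline evaluation gates pass.
client.transition_model_version_stage(
    name='fraud_model', version=mv.version, stage='Staging'
)

# After shadow/canary sign-off: promote to Production, archiving the old version.
client.transition_model_version_stage(
    name='fraud_model', version=mv.version, stage='Production',
    archive_existing_versions=True,
)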
Weights & Biases — The Richer Alternative for Research-Heavy Teams
Weights & Biases (W&B) offers richer visualization and team collaboration features than MLflow, at the cost of being a hosted SaaS product (with a self-hosted option).
Where W&B exceeds MLflow:
Media logging: Log images, audio, video, 3D point clouds, and custom visualizations directly in the UI. Essential for computer vision and audio ML teams.
Sweeps: Built-in hyperparameter optimization that automatically runs multiple experiments with different configurations (grid search, Bayesian optimization, random search) and visualizes the hyperparameter-performance landscape.
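A minimal sweep sketch, assuming a train() function that reads hyperparameters from the run config and logs the metric the sweep optimizes (names and values are illustrative):

import wandb

sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val_auc_pr', 'goal': 'maximize'},
    'parameters': {
        'learning_rate': {'min': 0.01, 'max': 0.3},
        'max_depth': {'values': [4, 6, 8]},
    },
}

def train():
    with wandb.init() as run:
        lr = run.config.learning_rate   # values chosen by the sweep
        depth = run.config.max_depth
        # ... train a model with (lr, depth), then report the target metric
        run.log({'val_auc_pr': 0.84})   # placeholder value

sweep_id = wandb.sweep(sweep_config, project='fraud-model')
wandb.agent(sweep_id, function=train, count=20)  # launch 20 trials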
Team features: Comments on runs, run comparison UI, shared dashboards. Better for teams with multiple ML engineers collaborating on the same model.
W&B Artifacts: Versioned data and model artifacts with lineage tracking. Visualize the entire lineage graph: which dataset version + which code commit + which hyperparameters → this model version.
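A sketch of that lineage in code (artifact names and file paths are illustrative):

import wandb

with wandb.init(project='fraud-model', job_type='train') as run:
    # Declaring the input dataset records a lineage edge: dataset version -> this run.
    dataset = run.use_artifact('fraud-training-data:v3')
    data_dir = dataset.download()

    # ... train the model on data_dir ...

    # Logging the trained model as a new artifact version records: this run -> model.
    model_artifact = wandb.Artifact('fraud-model', type='model')
    model_artifact.add_file('model.xgb')  # illustrative local file
    run.log_artifact(model_artifact)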
Practical choice: MLflow for self-hosted, privacy-sensitive environments (enterprise, regulated industries). W&B for research teams, computer vision/NLP teams that need rich media logging, teams that prioritize collaboration UI. Many teams use both: W&B for experiment tracking (better UI), MLflow for model registry (open-source, self-hosted, integrates with Spark/Databricks).
Model Registry Lifecycle — From Training to Archived
What Metadata a Model Must Carry
A model in the registry without complete metadata is a black box. When debugging a production regression, you need to answer: what changed? Without metadata, you can't.
Required metadata for every registered model version:
- Training dataset reference: S3 path or BigQuery table + partition, with the dataset creation timestamp and git hash of the data processing code. 'Trained on s3://ml-data/fraud/training/2024-01-15/v3/' is sufficient. This enables reproducibility.
- Feature schema: the exact list of features, their types, and expected value ranges, e.g. {user_age: float, [18, 100]}, {merchant_category: string, [food, retail, travel, ...]}. The MLflow model signature enforces this at serving time.
- Offline evaluation metrics: AUC-PR, Recall@80%Precision, calibration ECE, and performance on each evaluation slice (by merchant category, geography, transaction size). Stored as key-value pairs on the model version.
- Code version: git commit hash of the training code. 'This model was produced by commit abc1234 of the fraud-model repository.' Reproducibility requires pointing back to the exact code.
- Run ID: link back to the MLflow/W&B experiment run where all training curves, hyperparameters, and intermediate metrics are logged.
- Promoter and timestamp: who promoted this model to Production and when. Audit trail for compliance.
- Serving requirements: expected input format, latency budget, GPU/CPU requirement. 'This model expects a feature vector of dimension 128 at float32. It requires a GPU with > 4GB VRAM for < 10ms P99 latency at batch=1.'
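One way to record this in practice is to attach it to the registered model version as tags and a description. A hedged sketch with the MLflow client (model name, version number, and values are illustrative):

from mlflow.tracking import MlflowClient

client = MlflowClient()

version = '12'  # illustrative model version
for key, value in {
    'training_dataset': 's3://ml-data/fraud/training/2024-01-15/v3/',
    'git_commit': 'abc1234',
    'auc_pr': '0.847',
    'recall_at_80pct_precision': '0.63',
    'promoted_by': 'jane.doe',  # illustrative
}.items():
    client.set_model_version_tag('fraud_model', version, key, value)

client.update_model_version(
    name='fraud_model', version=version,
    description='Expects a 128-dim float32 feature vector; GPU serving, <10ms P99 at batch=1.',
)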
MLflow vs Weights & Biases vs SageMaker — Feature Comparison
| Feature | MLflow | Weights & Biases | SageMaker Experiments |
|---|---|---|---|
| Hosting | Self-hosted (open source) | SaaS (self-hosted option) | AWS managed |
| Model Registry | Yes (built-in) | Yes (W&B Artifacts) | Yes (SageMaker Model Registry) |
| UI Quality | Functional but basic | Rich, collaborative | AWS console (adequate) |
| Hyperparameter sweeps | Manual (use Optuna/Ray Tune) | Built-in (Sweeps) | HP Tuning Jobs |
| Media logging | Basic (images, tables) | Rich (images, video, audio, 3D) | Basic |
| Framework support | All (PyTorch, sklearn, XGBoost, etc.) | All | AWS-native (SageMaker SDK) |
| Artifact lineage | Run-level | Rich graph with dataset versions | Pipeline-level |
| Best for | Self-hosted, enterprise, Databricks integration | Research, CV/NLP teams, team collaboration | AWS-native ML teams |
Reproducibility Is Not Free — What You Must Log
A common mistake: teams log the final model metrics but not enough to reproduce the training run. Two months later, when investigating a regression: 'Can we reproduce model v12 to compare against v15?' If you didn't log enough, the answer is no.
Minimum reproducibility requirements:
- Code: git commit hash (not just branch name — branches change)
- Data: exact dataset path with version/date (not 'latest')
- Environment: Docker image digest OR requirements.txt pinned to exact versions (not 'pandas >= 1.0')
- Hyperparameters: every hyperparameter, including defaults (what you passed to the model constructor)
- Random seeds: torch.manual_seed(42), np.random.seed(42), Python random.seed(42), and any GPU-specific seeds
With these five logged, you can re-run training from scratch and get a model with statistically equivalent performance. If any one of them is missing, you're guessing.
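A small helper along these lines covers the seed requirement (standard library, NumPy, and PyTorch calls; extend with any framework-specific seeds you rely on):

import os
import random

import numpy as np
import torch

def set_global_seeds(seed: int = 42) -> None:
    """Seed every RNG that can affect training."""
    random.seed(seed)                         # Python stdlib
    np.random.seed(seed)                      # NumPy
    torch.manual_seed(seed)                   # PyTorch CPU
    torch.cuda.manual_seed_all(seed)          # PyTorch GPU(s), no-op without CUDA
    os.environ['PYTHONHASHSEED'] = str(seed)  # hash-based ordering in Python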
Interview Answer: Model Registry and Rollback
When asked about model versioning or rollback in an interview:
'We use MLflow Model Registry. Every model goes through: None → Staging (after passing offline evaluation gates) → Production (after shadow + canary deployment passes online metrics). We only have one Production version at a time.
Rollback: the previous Production version is kept in Archived state for 30 days. If production metrics degrade after a new deployment, we promote the Archived version back to Production. This is a load balancer weight change — takes < 2 minutes. The model files are still warm in the serving cluster, so there's no cold start delay.
Audit trail: every transition is logged with who triggered it and when. For regulated industries (finance, healthcare), this is not optional — model risk management requires knowing exactly what model made what decisions and when it was deployed.'
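On the registry side, that rollback is a single stage transition. A hedged sketch with the MLflow stage API (model name and version are illustrative):

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Emergency rollback: return the known-good archived version to Production,
# automatically archiving the regressed version it replaces.
client.transition_model_version_stage(
    name='fraud_model', version='11', stage='Production',
    archive_existing_versions=True,
)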