Project Overview
The ML ecosystem already has great tools — but each owns a single slice of the pipeline and none of them advise: profiling libraries (pandas-profiling) look at the data but have no concept of a target column; experiment trackers (W&B, TensorBoard) record metrics without interpreting them; code assistants (Cursor, Copilot) target syntax but miss ML-specific semantic mistakes. mlcompass fills that gap: a single advisory layer that follows your project from data to production, keeping context at every step.
Every command writes to and reads from a shared project context (.mlcompass/), so by
the time you reach deploy the tool already knows your dataset, your model choice,
your training history, and your evaluation results. That persistent project memory — in the
spirit of .git/ — is what makes mlcompass more than a chat tool.
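As a rough illustration of the shared-context idea, the sketch below persists each command's result in a JSON file under .mlcompass/ and reads it back on the next call. The file name context.json and the helpers load_context and record_step are hypothetical; the actual on-disk layout may differ.

```python
import json
from pathlib import Path

# Hypothetical file name: the real layout of .mlcompass/ is not specified here.
CONTEXT_FILE = "context.json"


def load_context(project_root: Path) -> dict:
    """Return the stored project context, or an empty dict if none exists yet."""
    path = project_root / ".mlcompass" / CONTEXT_FILE
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def record_step(project_root: Path, command: str, result: dict) -> dict:
    """Merge one command's result into the shared context and persist it."""
    context = load_context(project_root)
    context[command] = result
    folder = project_root / ".mlcompass"
    folder.mkdir(parents=True, exist_ok=True)
    (folder / CONTEXT_FILE).write_text(
        json.dumps(context, indent=2), encoding="utf-8"
    )
    return context
```

With this shape, a later step such as deploy could call load_context and find what advise and evaluate recorded, without the user restating it.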
Built as my capstone project; live on PyPI at v0.9.0, MIT-licensed, supporting Python 3.10–3.13. It is built on top of agentlite, my own small Claude agent library.
The Pipeline — From Data to Production
Eleven commands cover every stage of the ML pipeline. Every command except init,
status, and agent keeps a fully deterministic default
path and offers an opt-in --llm flag that adds a Claude-driven interpretation step
on top.
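The split between a deterministic default path and an opt-in LLM step can be sketched as follows. Only the --llm flag name comes from the text above; the parser, the toy report, and the narrate callback are illustrative assumptions, not mlcompass's actual code.

```python
import argparse
from typing import Callable, Optional


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for one command's argument parser.
    parser = argparse.ArgumentParser(prog="evaluate-sketch")
    parser.add_argument(
        "--llm",
        action="store_true",
        help="add an LLM interpretation step on top of the deterministic report",
    )
    return parser


def deterministic_report(values: list[float]) -> dict:
    """Pure computation: the same input always yields the same report."""
    return {"n": len(values), "mean": sum(values) / len(values)}


def run(
    argv: list[str],
    values: list[float],
    narrate: Optional[Callable[[dict], str]] = None,
) -> dict:
    args = build_parser().parse_args(argv)
    report = deterministic_report(values)
    # The LLM step only annotates the report; it never replaces it.
    if args.llm and narrate is not None:
        report = {**report, "interpretation": narrate(report)}
    return report
```

Keeping the interpretation as an extra key means the deterministic fields are identical with and without the flag.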
Headline Contribution — The Anti-Hallucination Contract
This is the research contribution. When an LLM narrates the structured output of a tool,
it can fabricate: a column name absent from the data, an unmeasured value, a number that
contradicts the evidence. When evaluate sees a suspiciously good metric
(AUC > 0.995, R² > 0.999 — the classic signature of data leakage), an
evidence-bound contract kicks in.
The contract has two tiers. Tier A: the enum domains of the
narrator's tool-input schema are generated at call time from the deterministic evidence
dictionary — the model can only pick columns that exist. Tier B: the returned
answer is verified in plain code, trusting no provider (entity soundness, value soundness,
completeness); a violation triggers a corrective retry, and anything still fabricated is
stripped. The result is a provider-independent guarantee: every column and every
number returned is present in, and equal to, the evidence.
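A minimal sketch of the two tiers, under stated assumptions: the evidence is modelled as a {column: {metric: value}} dictionary, the narrator is any callable that takes a schema plus feedback and returns a list of findings, and all function names are hypothetical. The completeness check is omitted for brevity, so this covers only entity and value soundness.

```python
from typing import Any, Callable

Evidence = dict[str, dict[str, float]]
Finding = dict[str, Any]


def build_tool_schema(evidence: Evidence) -> dict[str, Any]:
    """Tier A: the column enum is generated from the evidence at call time."""
    return {
        "type": "object",
        "properties": {
            "findings": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "column": {"type": "string", "enum": sorted(evidence)},
                        "metric": {"type": "string"},
                        "value": {"type": "number"},
                    },
                    "required": ["column", "metric", "value"],
                },
            }
        },
        "required": ["findings"],
    }


def verify(findings: list[Finding], evidence: Evidence) -> tuple[list[Finding], list[str]]:
    """Tier B: plain-code check that every column and value matches the evidence."""
    kept, violations = [], []
    for f in findings:
        col, metric, value = f.get("column"), f.get("metric"), f.get("value")
        if col not in evidence:
            violations.append(f"unknown column: {col!r}")
        elif metric not in evidence[col]:
            violations.append(f"unmeasured metric for {col!r}: {metric!r}")
        elif value != evidence[col][metric]:
            violations.append(f"wrong value for {col!r}.{metric}: {value!r}")
        else:
            kept.append(f)
    return kept, violations


def narrate_with_contract(
    narrator: Callable[[dict[str, Any], list[str]], list[Finding]],
    evidence: Evidence,
    max_retries: int = 1,
) -> list[Finding]:
    schema = build_tool_schema(evidence)
    kept: list[Finding] = []
    feedback: list[str] = []
    for _ in range(max_retries + 1):
        findings = narrator(schema, feedback)
        kept, feedback = verify(findings, evidence)
        if not feedback:
            break  # clean answer, no corrective retry needed
    # Anything that still failed verification has been stripped from `kept`.
    return kept
```

Because verify runs in plain code after the model responds, the guarantee does not depend on whether a given provider actually enforces the enum in the schema.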
Measured live (N = 200 per cell), the bare narrator's fabrication rate ranged from 1% to 100% across six rewordings of the same instruction on identical evidence; the contract held the user-facing rate at 0/200 on every channel of both tasks. An academic paper describing the evidence-bound runtime schema is in preparation / under review.
The Eleven Commands
| Command | When you run it | What you get |
|---|---|---|
| init | Starting a project | A .mlcompass/ folder that tracks decisions |
| advise | You have a CSV, now what? | Models to try, features to derive, pitfalls to avoid |
| audit | Before you press train | Static analysis of the training script (8 AST rules) |
| watch | While training runs | Plateau / overfit / NaN / divergence (log / TB / W&B) |
| compare | After several runs | Side-by-side config + final-metric diff with verdict |
| evaluate | Training done | Metrics, threshold sweep, leakage investigation |
| deploy | Going to production | Model + deps + target-specific checks + checklist |
| status | Any time | Project metadata, active state, decision history |
| agent | "Just do it for me" | LLM router driving the other tools, with memory |
| monitor | Model live, new data | PSI + KS + chi² drift, retrain verdict |
| optimize | A few runs, what's next? | HPO sub-agent: leaderboard, sensitivity, N suggestions |
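To make the PSI entry in the monitor row concrete, here is a minimal pure-Python sketch of the population stability index, PSI = Σ (cᵢ − rᵢ) · ln(cᵢ / rᵢ), where rᵢ and cᵢ are the per-bin shares of the reference and current samples. The equal-width binning, the eps floor for empty bins, and the function name are assumptions for illustration; mlcompass's own binning and thresholds may differ.

```python
import math


def psi(
    reference: list[float],
    current: list[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Population stability index between a reference and a current sample."""
    lo, hi = min(reference), max(reference)
    # Fall back to a unit width when the reference sample is constant.
    width = (hi - lo) / bins or 1.0

    def shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range land in the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

A common rule of thumb reads PSI below 0.1 as stable and above 0.25 as a significant shift; whether the retrain verdict uses those cut-offs is not stated here.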
Use from Claude Desktop / Cursor (MCP)
mlcompass ships a Model Context Protocol server, so any MCP-capable client
(Claude Desktop, Claude Code, Cursor, Continue …) can call its eight tools directly.
The package installed by pip install mlcompass also includes eleven ready-made
Claude Code slash commands: running mlcompass install-slash-commands once turns
/mlc-advise, /mlc-evaluate, /mlc-leak … into one-keystroke
calls.
Self-Driving Agent (CLI)
When you're not in Claude Desktop — CI runs, cron jobs, an ssh session on a GPU box — an agent can drive the same eight tools from the terminal. It asks before mutating by default (agentlite's permission system is first-class) and streams every step to a transcript.
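The ask-before-mutating behaviour can be pictured with a small gate around each tool call. This is a sketch of the general pattern only: gated_call, the approve callback, and the transcript format are hypothetical and are not agentlite's API.

```python
from typing import Callable, Optional


def gated_call(
    tool: Callable[[], object],
    *,
    name: str,
    mutates: bool,
    approve: Callable[[str], bool],
    transcript: list[str],
) -> Optional[object]:
    """Run a tool, asking for approval first if it would change anything."""
    if mutates and not approve(name):
        transcript.append(f"denied: {name}")
        return None
    # Read-only tools, and approved mutating ones, run and are logged.
    transcript.append(f"ran: {name}")
    return tool()
```

In an interactive session approve might prompt on the terminal, while a CI job could pass a callback that checks an allow-list instead.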
Why mlcompass
| | pandas-profiling | W&B / TB | Cursor / Devin | mlcompass |
|---|---|---|---|---|
| Analyzes raw data | ✅ | ❌ | ❌ | ✅ |
| Recommends models + features | ❌ | ❌ | partial | ✅ |
| Audits training scripts | ❌ | ❌ | reactive | ✅ |
| Proactive diagnosis | ❌ | ❌ | reactive | ✅ |
| Persistent project memory | ❌ | per-run | ❌ | ✅ |
| Permission-gated actions | ❌ | ❌ | partial | first-class |
mlcompass doesn't replace any of these — it's the advisor that sits next to all of them.
Tech Stack
- Language: Python 3.10–3.13
- Core: pandas (deterministic analysis) + agentlite (LLM agent layer)
- Analyzers: one pure analyzer per command — pure pandas / pure AST / pure log parser
- Interfaces: CLI + MCP server (8 tools) + 11 Claude Code slash commands
- Agent backends: Anthropic API or Claude Code (Claude Agent SDK)
- Packaging: src/ layout, pyproject.toml, mlcompass on PyPI
- Quality: pytest, ruff, mypy (strict); the anti-hallucination contract is end-to-end tested
- License: MIT