mlcompass

11 CLI commands · 0/200 hallucinations reaching the user · v0.9 live on PyPI · MIT open source

Project Overview

The ML ecosystem already has great tools — but each owns a single slice of the pipeline and none of them advise: profiling libraries (pandas-profiling) look at the data but have no concept of a target column; experiment trackers (W&B, TensorBoard) record metrics without interpreting them; code assistants (Cursor, Copilot) target syntax but miss ML-specific semantic mistakes. mlcompass fills that gap: a single advisory layer that follows your project from data to production, keeping context at every step.

Every command writes to and reads from a shared project context (.mlcompass/), so by the time you reach deploy the tool already knows your dataset, your model choice, your training history, and your evaluation results. That persistent project memory — in the spirit of .git/ — is what makes mlcompass more than a chat tool.
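A minimal sketch of how a shared project context of this kind could work, assuming a single JSON file inside .mlcompass/. The file name context.json, the key names, and both helper functions are hypothetical and are not taken from mlcompass itself; the real on-disk layout may differ.

```python
import json
from pathlib import Path

# Hypothetical file name: the real .mlcompass/ layout may differ.
CONTEXT_FILE = "context.json"


def load_context(project_dir: Path) -> dict:
    """Return the stored project context, or an empty one if none exists yet."""
    path = project_dir / ".mlcompass" / CONTEXT_FILE
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def record_step(project_dir: Path, command: str, result: dict) -> dict:
    """Merge one command's result into the context and persist it."""
    context = load_context(project_dir)
    context[command] = result
    path = project_dir / ".mlcompass" / CONTEXT_FILE
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(context, indent=2), encoding="utf-8")
    return context
```

With something like this in place, a later command can call load_context() and see what earlier commands recorded, which is the property the paragraph above describes.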

mlcompass was built as my capstone project. It is live on PyPI at v0.9.0, MIT-licensed, and supports Python 3.10–3.13. It is built on top of agentlite, my own small Claude agent library.

The Pipeline — From Data to Production

Eleven commands cover every stage of the ML pipeline. Every command except init, status, and agent keeps a fully deterministic default path and offers an opt-in --llm flag that adds a Claude-driven interpretation step on top.

data.csv  train.py    two runs     results.csv  production
   │          │           │             │            │
   ▼          ▼           ▼             ▼            ▼
advise ───► audit ───► compare ───► evaluate ───► deploy
            watch
              │
init · status · agent · monitor · optimize  (at every stage)

Headline Contribution — The Anti-Hallucination Contract

This is the research contribution. When an LLM narrates the structured output of a tool, it can fabricate: a column name absent from the data, an unmeasured value, a number that contradicts the evidence. When evaluate sees a suspiciously good metric (AUC > 0.995, R² > 0.999 — the classic signature of data leakage), an evidence-bound contract kicks in:

$ mlcompass evaluate predictions.csv

⚠  Suspiciously high R² (1.0000)

┌──── 🔬 Leakage investigation — evidence ────┐
│ Candidate leak column: log_price            │
│ ŷ == y match rate: 0.9%                      │
│ log_price        r=+1.0000 (spearman)       │
└─────────────────────────────────────────────┘
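The trigger condition and the match-rate line in the output above can be sketched as follows. The two thresholds are the ones quoted in the text; the function names, the metric keys, and the exact-equality definition of the match rate are assumptions made for illustration, not mlcompass's actual implementation.

```python
# Thresholds quoted in the text; the function and key names are illustrative.
SUSPICIOUS_THRESHOLDS = {"auc": 0.995, "r2": 0.999}


def suspicious_metrics(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that exceed their leakage threshold."""
    return [
        name
        for name, limit in SUSPICIOUS_THRESHOLDS.items()
        if metrics.get(name, float("-inf")) > limit
    ]


def exact_match_rate(y_true: list[float], y_pred: list[float]) -> float:
    """Fraction of rows where the prediction equals the target exactly."""
    if not y_true:
        return 0.0
    matches = sum(1 for y, y_hat in zip(y_true, y_pred) if y == y_hat)
    return matches / len(y_true)
```

A non-empty result from suspicious_metrics() would be the cue to start collecting evidence such as the per-column correlations shown in the box.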

The contract has two tiers. Tier A: the enum domains of the narrator's tool-input schema are generated at call time from the deterministic evidence dictionary — the model can only pick columns that exist. Tier B: the returned answer is verified in plain code, trusting no provider (entity soundness, value soundness, completeness); a violation triggers a corrective retry, and anything still fabricated is stripped. The result is a provider-independent guarantee: every column and every number returned is present in, and equal to, the evidence.
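The two tiers can be illustrated with a short sketch. It assumes the evidence is a dictionary mapping each column name to its measured statistics, and that the narrator returns a list of claims; the claim shape, the field names, and both function names are hypothetical. The completeness check and the corrective retry are omitted for brevity.

```python
import math


def build_tool_schema(evidence: dict[str, dict[str, float]]) -> dict:
    """Tier A: restrict the narrator's column field to columns in the evidence."""
    return {
        "type": "object",
        "properties": {
            "claims": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        # The enum is rebuilt from the evidence on every call.
                        "column": {"type": "string", "enum": sorted(evidence)},
                        "stat": {"type": "string"},
                        "value": {"type": "number"},
                    },
                    "required": ["column", "stat", "value"],
                },
            }
        },
        "required": ["claims"],
    }


def verify_claims(claims: list[dict], evidence: dict[str, dict[str, float]]):
    """Tier B: split claims into those backed by the evidence and those not."""
    kept, rejected = [], []
    for claim in claims:
        stats = evidence.get(claim.get("column"), {})  # entity soundness
        expected = stats.get(claim.get("stat"))
        value = claim.get("value")
        ok = (
            expected is not None
            and isinstance(value, (int, float))
            and math.isclose(value, expected, rel_tol=0.0, abs_tol=1e-9)  # value soundness
        )
        (kept if ok else rejected).append(claim)
    return kept, rejected
```

In this sketch a non-empty rejected list is what would trigger the retry, and only kept would ever be shown to the user. Because verify_claims() is plain code, it behaves the same whichever provider produced the claims.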

Measured live (N = 200 per cell), the bare narrator's fabrication rate ranged from 1% to 100% across six rewordings of the same instruction on identical evidence; the contract held the user-facing rate at 0/200 on every channel of both tasks. An academic paper describing the evidence-bound runtime schema is in preparation / under review.

The Eleven Commands

Command	When you run it	What you get
`init`	Starting a project	A `.mlcompass/` folder that tracks decisions
`advise`	You have a CSV, now what?	Models to try, features to derive, pitfalls to avoid
`audit`	Before you press train	Static analysis of the training script (8 AST rules)
`watch`	While training runs	Plateau / overfit / NaN / divergence (log / TB / W&B)
`compare`	After several runs	Side-by-side config + final-metric diff with verdict
`evaluate`	Training done	Metrics, threshold sweep, leakage investigation
`deploy`	Going to production	Model + deps + target-specific checks + checklist
`status`	Any time	Project metadata, active state, decision history
`agent`	"Just do it for me"	LLM router driving the other tools, with memory
`monitor`	Model live, new data	PSI + KS + chi² drift, retrain verdict
`optimize`	A few runs, what's next?	HPO sub-agent: leaderboard, sensitivity, N suggestions
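For readers unfamiliar with the PSI statistic named in the monitor row, here is a minimal standard-library sketch of the Population Stability Index, PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the reference and current proportions in bin i. The equal-width binning, the bin count, and the 1e-6 floor for empty bins are assumptions of this sketch, not necessarily what mlcompass does.

```python
import math


def psi(reference: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index over equal-width bins of the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    expected, actual = proportions(reference), proportions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))
```

Identical distributions give a PSI of zero, and every term in the sum is non-negative, so the value grows as the current data moves away from the reference.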

Use from Claude Desktop / Cursor (MCP)

mlcompass ships a Model Context Protocol server, so any MCP-capable client (Claude Desktop, Claude Code, Cursor, Continue …) can call its eight tools directly. pip install mlcompass also ships eleven ready-made Claude Code slash commands: a single mlcompass install-slash-commands turns /mlc-advise, /mlc-evaluate, /mlc-leak … into one-keystroke calls.

$ pip install "mlcompass[mcp]"

# claude_desktop_config.json
{
  "mcpServers": {
    "mlcompass": { "command": "mlcompass-mcp" }
  }
}

Self-Driving Agent (CLI)

When you're not in Claude Desktop — CI runs, cron jobs, an ssh session on a GPU box — an agent can drive the same eight tools from the terminal. It asks before mutating by default (agentlite's permission system is first-class) and streams every step to a transcript.

$ pip install "mlcompass[agent]"
$ export ANTHROPIC_API_KEY="sk-ant-..."

$ mlcompass agent "I have data.csv, take me to a model recommendation"

Why mlcompass

	pandas-profiling	W&B / TB	Cursor / Devin	mlcompass
Analyzes raw data	✅	❌	❌	✅
Recommends models + features	❌	❌	partial	✅
Audits training scripts	❌	❌	reactive	✅
Proactive diagnosis	❌	❌	reactive	✅
Persistent project memory	❌	per-run	❌	✅
Permission-gated actions	❌	❌	partial	first-class

mlcompass doesn't replace any of these — it's the advisor that sits next to all of them.

Tech Stack

Language: Python 3.10–3.13
Core: pandas (deterministic analysis) + agentlite (LLM agent layer)
Analyzers: one pure analyzer per command — pure pandas / pure AST / pure log parser
Interfaces: CLI + MCP server (8 tools) + 11 Claude Code slash commands
Agent backends: Anthropic API or Claude Code (Claude Agent SDK)
Packaging: src/ layout, pyproject.toml, mlcompass on PyPI
Quality: pytest, ruff, mypy (strict) — the anti-hallucination contract is end-to-end tested
License: MIT
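
As an illustration of the "pure AST" analyzer style listed under Analyzers, the sketch below flags train_test_split calls that do not pass random_state. This particular rule and the function name are hypothetical; the text does not say which eight rules audit actually implements.

```python
import ast


def find_unseeded_splits(source: str) -> list[int]:
    """Return line numbers of train_test_split calls lacking random_state."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handle both train_test_split(...) and module.train_test_split(...).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "train_test_split":
            continue
        if not any(kw.arg == "random_state" for kw in node.keywords):
            findings.append(node.lineno)
    return sorted(findings)
```

The script is only parsed, never executed, so a check like this can run before training starts and without the script's dependencies installed.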