
mlcompass

An LLM agent that sits next to you through your whole ML pipeline — from data, through training, all the way to deployment — one CLI that keeps context across every step.

  • 11 CLI commands
  • 0/200 hallucinations reaching the user
  • v0.9, live on PyPI
  • MIT, open source

Project Overview

The ML ecosystem already has great tools, but each owns a single slice of the pipeline and none of them advise. Profiling libraries (pandas-profiling) look at the data but have no concept of a target column. Experiment trackers (W&B, TensorBoard) record metrics without interpreting them. Code assistants (Cursor, Copilot) target syntax but miss ML-specific semantic mistakes. mlcompass fills that gap: a single advisory layer that follows your project from data to production, keeping context at every step.

Every command writes to and reads from a shared project context (.mlcompass/), so by the time you reach deploy the tool already knows your dataset, your model choice, your training history, and your evaluation results. That persistent project memory — in the spirit of .git/ — is what makes mlcompass more than a chat tool.
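To make the shared-context idea concrete, here is a minimal sketch of how commands could write to and read from a project folder. Only the .mlcompass/ directory name comes from the description above; the context.json file name, the load_context and record helpers, and the stored keys are assumptions for illustration, not mlcompass's actual layout or API.

```python
import json
from pathlib import Path

CONTEXT_DIR = ".mlcompass"     # directory name taken from the text
CONTEXT_FILE = "context.json"  # hypothetical file name, for illustration only


def load_context(project_root: Path) -> dict:
    """Return the stored context, or an empty dict for a fresh project."""
    path = project_root / CONTEXT_DIR / CONTEXT_FILE
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def record(project_root: Path, command: str, facts: dict) -> dict:
    """Merge one command's findings into the shared context and persist it."""
    context = load_context(project_root)
    context.setdefault(command, {}).update(facts)
    folder = project_root / CONTEXT_DIR
    folder.mkdir(parents=True, exist_ok=True)
    (folder / CONTEXT_FILE).write_text(
        json.dumps(context, indent=2), encoding="utf-8"
    )
    return context
```

In this sketch an early command would call record(root, "advise", {"dataset": "data.csv"}), and a later command would call load_context(root) to see what earlier steps already established.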

mlcompass was built as my capstone project. It is live on PyPI at v0.9.0, MIT-licensed, and supports Python 3.10–3.13. It is built on top of agentlite, my own small Claude agent library.

The Pipeline — From Data to Production

Eleven commands cover every stage of the ML pipeline. Every command except init, status, and agent keeps a fully deterministic default path and offers an opt-in --llm flag that adds a Claude-driven interpretation step on top.

data.csv    train.py    two runs    results.csv   production
    │           │           │            │            │
    ▼           ▼           ▼            ▼            ▼
  advise ───► audit ───► compare ───► evaluate ───► deploy
              watch
                │
  init · status · agent · monitor · optimize (at every stage)

Headline Contribution — The Anti-Hallucination Contract

This is the research contribution. When an LLM narrates the structured output of a tool, it can fabricate: a column name absent from the data, an unmeasured value, a number that contradicts the evidence. When evaluate sees a suspiciously good metric (AUC > 0.995, R² > 0.999 — the classic signature of data leakage), an evidence-bound contract kicks in:

$ mlcompass evaluate predictions.csv
⚠ Suspiciously high R² (1.0000)
┌──── 🔬 Leakage investigation — evidence ────┐
│ Candidate leak column: log_price            │
│ ŷ == y match rate: 0.9%                     │
│ log_price r=+1.0000 (spearman)              │
└─────────────────────────────────────────────┘

The contract has two tiers. Tier A: the enum domains of the narrator's tool-input schema are generated at call time from the deterministic evidence dictionary — the model can only pick columns that exist. Tier B: the returned answer is verified in plain code, trusting no provider (entity soundness, value soundness, completeness); a violation triggers a corrective retry, and anything still fabricated is stripped. The result is a provider-independent guarantee: every column and every number returned is present in, and equal to, the evidence.
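The two tiers can be sketched as follows. This is a simplified illustration under stated assumptions, not the shipped implementation: the evidence is modelled as a flat column-to-value mapping, the narrator is a hypothetical callable that takes a schema plus feedback and returns a list of claims, and "completeness" is read as "every evidence column is reported".

```python
from typing import Callable

Evidence = dict[str, float]  # assumed shape: column name -> measured value
Claim = dict                 # assumed shape: {"column": str, "value": float}


def build_schema(evidence: Evidence) -> dict:
    """Tier A: the enum of allowed columns is generated from the evidence."""
    return {
        "type": "object",
        "properties": {
            "claims": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "column": {"type": "string", "enum": sorted(evidence)},
                        "value": {"type": "number"},
                    },
                    "required": ["column", "value"],
                },
            }
        },
        "required": ["claims"],
    }


def violations(claims: list[Claim], evidence: Evidence) -> list[str]:
    """Tier B: plain-code checks that do not trust the provider."""
    problems = []
    for claim in claims:
        column = claim.get("column")
        if column not in evidence:                      # entity soundness
            problems.append(f"entity: unknown column {column!r}")
        elif claim.get("value") != evidence[column]:    # value soundness
            problems.append(f"value: {column} must equal {evidence[column]}")
    missing = set(evidence) - {claim.get("column") for claim in claims}
    problems += [f"completeness: {c} not reported" for c in sorted(missing)]
    return problems


def narrate(evidence: Evidence,
            narrator: Callable[[dict, list[str]], list[Claim]]) -> list[Claim]:
    """Call the narrator, retry once with feedback, then strip fabrications."""
    schema = build_schema(evidence)
    claims = narrator(schema, [])
    problems = violations(claims, evidence)
    if problems:
        claims = narrator(schema, problems)  # one corrective retry
    return [
        c for c in claims
        if c.get("column") in evidence and c.get("value") == evidence[c["column"]]
    ]
```

The final filter is what makes the guarantee independent of the provider in this sketch: whatever the narrator returns, only claims that are present in and equal to the evidence survive.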

Measured live (N = 200 per cell), the bare narrator fabricated at 1% to 100% across six rewordings of the same instruction on identical evidence; the contract held the user-facing rate at 0/200 on every channel of both tasks. An academic paper describing the evidence-bound runtime schema is in preparation / under review.

The Eleven Commands

| Command | When you run it | What you get |
|---|---|---|
| init | Starting a project | A .mlcompass/ folder that tracks decisions |
| advise | You have a CSV, now what? | Models to try, features to derive, pitfalls to avoid |
| audit | Before you press train | Static analysis of the training script (8 AST rules) |
| watch | While training runs | Plateau / overfit / NaN / divergence (log / TB / W&B) |
| compare | After several runs | Side-by-side config + final-metric diff with verdict |
| evaluate | Training done | Metrics, threshold sweep, leakage investigation |
| deploy | Going to production | Model + deps + target-specific checks + checklist |
| status | Any time | Project metadata, active state, decision history |
| agent | "Just do it for me" | LLM router driving the other tools, with memory |
| monitor | Model live, new data | PSI + KS + chi² drift, retrain verdict |
| optimize | A few runs, what's next? | HPO sub-agent: leaderboard, sensitivity, N suggestions |
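As an illustration of one of the drift statistics named in the monitor row, here is a minimal standard-library sketch of the Population Stability Index (PSI). The equal-width binning on the reference sample, the bin count, and the eps floor for empty bins are assumptions of this sketch; how mlcompass bins and thresholds the statistic is not specified here.

```python
import math


def psi(expected: list[float], actual: list[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference column

    def shares(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for value in sample:
            index = int((value - lo) / width)
            # Values outside the reference range fall into the edge bins.
            counts[min(max(index, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(count / len(sample), eps) for count in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))
```

Identical samples give a PSI of exactly 0, and the value grows as the new sample's mass moves into bins where the reference sample had little or none.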

Use from Claude Desktop / Cursor (MCP)

mlcompass ships a Model Context Protocol server, so any MCP-capable client (Claude Desktop, Claude Code, Cursor, Continue …) can call its eight tools directly. pip install mlcompass also ships eleven ready-made Claude Code slash commands: running mlcompass install-slash-commands once turns /mlc-advise, /mlc-evaluate, /mlc-leak … into one-keystroke calls.

$ pip install "mlcompass[mcp]"

# claude_desktop_config.json
{
  "mcpServers": {
    "mlcompass": { "command": "mlcompass-mcp" }
  }
}

Self-Driving Agent (CLI)

When you're not in Claude Desktop — CI runs, cron jobs, an ssh session on a GPU box — an agent can drive the same eight tools from the terminal. It asks before mutating by default (agentlite's permission system is first-class) and streams every step to a transcript.

$ pip install "mlcompass[agent]"
$ export ANTHROPIC_API_KEY="sk-ant-..."
$ mlcompass agent "I have data.csv, take me to a model recommendation"

Why mlcompass

pandas-profiling W&B / TB Cursor / Devin mlcompass
Analyzes raw data
Recommends models + features (partial)
Audits training scripts (reactive)
Proactive diagnosis (reactive)
Persistent project memory (per-run)
Permission-gated actions (partial, first-class)

mlcompass doesn't replace any of these — it's the advisor that sits next to all of them.

Tech Stack

  • Language: Python 3.10–3.13
  • Core: pandas (deterministic analysis) + agentlite (LLM agent layer)
  • Analyzers: one pure analyzer per command — pure pandas / pure AST / pure log parser
  • Interfaces: CLI + MCP server (8 tools) + 11 Claude Code slash commands
  • Agent backends: Anthropic API or Claude Code (Claude Agent SDK)
  • Packaging: src/ layout, pyproject.toml, mlcompass on PyPI
  • Quality: pytest, ruff, mypy (strict) — the anti-hallucination contract is end-to-end tested
  • License: MIT
