Field Note #3: I let the AI close duplicate issues. Then I turned that off.
FixClaw GitHub issue triage with Claude tool use. Hybrid duplicate detection, three confidence bands, human-in-the-loop by default. What broke when I let it auto-close.
- Subject
- FixClaw — AI GitHub issue triage
- Industry
- Open source / maintainer tools
- Stack
TypeScript backend, Claude tool use, GitHub webhooks, pgvector
The case
Maintainers with 50+ issues a month have a triage problem. New issues queue up unlabeled, duplicates ship as separate threads, the maintainer’s inbox becomes the to-do list. Backlog rots, contributors stop trusting that anyone is reading.
A triage agent helps if it gets three things right: labels, duplicates, and a first-draft reply — and leaves the close-button click to a human.
FixClaw is what I run on my own repos. It watches issues.opened, classifies labels, runs a hybrid duplicate check, and posts a draft reply or needs-info prompt. Maintainer keeps the final click on close.
The numbers
| Metric | Value |
|---|---|
| Time-to-first-label (p50) | < 5 min after issues.opened |
| Label accuracy vs gold standard | 0.87 |
| Duplicate recall (real duplicates caught) | 0.85 |
| Duplicate precision (no false flags) | 0.92 |
| Maintainer time saved (self-reported) | ~6 hours/week on a 60-issue/month repo |
| LLM cost (Claude Sonnet, 60 issues/mo) | ~€4/month |
Eval set: 50 historical issues from 3 repos, manually labeled. Re-eval monthly.
What worked
- Human-in-the-loop by default. The agent proposes — labels, duplicate links, a draft
needs-inforeply. The human clicks. No auto-close in v1. - Hybrid duplicate detection. Pass 1: exact title match (lowercased, whitespace-normalized). Pass 2: vector embedding (nomic-embed-text) over body + title, cosine ≥ 0.85, search across the last 1000 closed issues. Both hit → label
duplicate+ link. Only vector hits → labelpossible-duplicate, queue for review. - Structured tool use, not free-text replies. Claude calls a
propose_triagetool with a typed schema:{labels, duplicates, reply_draft, confidence}. No prose parsing. - Idempotency keys.
sha256(event_id + issue_number + body_hash)written to aprocessedtable before any action. Webhook re-deliveries don’t double-label.
What failed
- Auto-closing high-confidence duplicates. Confidence > 0.95 and label =
duplicatewith verified link — felt safe, was not. One false positive a quarter became a maintainer-vs-bot argument in a public thread. Turned off; only humans close. - Vector-only duplicate detection. Without the exact-title pre-pass, the vector pass surfaced too many “thematically similar” issues. Reverted to hybrid.
- LLM-proposed labels with no component taxonomy. On a monorepo, the agent proposed
bugrepeatedly with no component scope. Addedcomponent:api/component:ui/component:clilabels and a CODEOWNERS routing rule. - A single confidence threshold. “Above 0.5 act, below ignore” gave bad outcomes both directions. Three bands (auto / review / defer) gave the agent room to be uncertain.
The architecture
GitHub webhook (issues.opened / issue_comment.created)
│
▼
Webhook handler (TypeScript)
│ idempotency check on sha256(event_id + issue# + body_hash)
▼
Embed body + title (nomic-embed-text → pgvector)
│
▼
Hybrid duplicate check
│ Pass 1: exact title match
│ Pass 2: cosine ≥ 0.85 over last 1000 closed
▼
Claude (tool use)
│ propose_triage({labels, duplicates, reply_draft, confidence})
▼
Confidence band routing
│ ≥ 0.85 → apply labels + post draft
│ 0.6-0.85 → apply + mention maintainer
│ 0.4-0.6 → internal comment only
│ < 0.4 → no action, manual triage
▼
GitHub API (set labels, post comment, link duplicates)
│
▼
Append-only event log (every action, every confidence)
Next-step checklist
- Define label taxonomy. 6-8 labels + optional
component:*for monorepos. - Collect 50 historical issues, manually label. Your gold standard for evals.
- Embed last 1000 closed issues into pgvector (or Qdrant / SQLite + cosine).
- Set three confidence bands. Don’t ship with one threshold.
- Build the tool schema. Output must match the GitHub API write you intend.
- Ship without auto-close. Flip the switch only after a quarter of clean operation.
- Write idempotency keys to a
processedtable. Webhooks redeliver. - Re-eval monthly. Watch for maintainer override rate — > 5% in a week is a drift alert.
Workflow spec: GitHub Issue Triage AI Starter Kit. Full case study: FixClaw guide. Article: GitHub Issue Management AI.
Does this shape match what you're building?
If you want me to scope a similar system for you — I respond in 24 hours.
Request a scope