#3 engineer ← all field notes

Field Note #3: I let the AI close duplicate issues. Then I turned that off.

FixClaw GitHub issue triage with Claude tool use. Hybrid duplicate detection, three confidence bands, human-in-the-loop by default. What broke when I let it auto-close.

Subject: FixClaw — AI GitHub issue triage
Industry: Open source / maintainer tools
Stack: TypeScript backend, Claude tool use, GitHub webhooks, pgvector

The case

Maintainers with 50+ issues a month have a triage problem. New issues queue up unlabeled, duplicates ship as separate threads, the maintainer’s inbox becomes the to-do list. Backlog rots, contributors stop trusting that anyone is reading.

A triage agent helps if it gets three things right: labels, duplicates, and a first-draft reply — and leaves the close-button click to a human.

FixClaw is what I run on my own repos. It watches issues.opened, classifies labels, runs a hybrid duplicate check, and posts a draft reply or needs-info prompt. Maintainer keeps the final click on close.

The numbers

Metric	Value
Time-to-first-label (p50)	< 5 min after `issues.opened`
Label accuracy vs gold standard	0.87
Duplicate recall (real duplicates caught)	0.85
Duplicate precision (no false flags)	0.92
Maintainer time saved (self-reported)	~6 hours/week on a 60-issue/month repo
LLM cost (Claude Sonnet, 60 issues/mo)	~€4/month

Eval set: 50 historical issues from 3 repos, manually labeled. Re-eval monthly.

What worked

Human-in-the-loop by default. The agent proposes — labels, duplicate links, a draft needs-info reply. The human clicks. No auto-close in v1.
Hybrid duplicate detection. Pass 1: exact title match (lowercased, whitespace-normalized). Pass 2: vector embedding (nomic-embed-text) over body + title, cosine ≥ 0.85, search across the last 1000 closed issues. Both hit → label duplicate + link. Only vector hits → label possible-duplicate, queue for review.
Structured tool use, not free-text replies. Claude calls a propose_triage tool with a typed schema: {labels, duplicates, reply_draft, confidence}. No prose parsing.
Idempotency keys. sha256(event_id + issue_number + body_hash) written to a processed table before any action. Webhook re-deliveries don’t double-label.

What failed

Auto-closing high-confidence duplicates. Confidence > 0.95 and label = duplicate with verified link — felt safe, was not. One false positive a quarter became a maintainer-vs-bot argument in a public thread. Turned off; only humans close.
Vector-only duplicate detection. Without the exact-title pre-pass, the vector pass surfaced too many “thematically similar” issues. Reverted to hybrid.
LLM-proposed labels with no component taxonomy. On a monorepo, the agent proposed bug repeatedly with no component scope. Added component:api / component:ui / component:cli labels and a CODEOWNERS routing rule.
A single confidence threshold. “Above 0.5 act, below ignore” gave bad outcomes both directions. Three bands (auto / review / defer) gave the agent room to be uncertain.

The architecture

GitHub webhook (issues.opened / issue_comment.created)
    │
    ▼
Webhook handler (TypeScript)
    │ idempotency check on sha256(event_id + issue# + body_hash)
    ▼
Embed body + title (nomic-embed-text → pgvector)
    │
    ▼
Hybrid duplicate check
    │ Pass 1: exact title match
    │ Pass 2: cosine ≥ 0.85 over last 1000 closed
    ▼
Claude (tool use)
    │ propose_triage({labels, duplicates, reply_draft, confidence})
    ▼
Confidence band routing
    │ ≥ 0.85  →  apply labels + post draft
    │ 0.6-0.85 →  apply + mention maintainer
    │ 0.4-0.6 →  internal comment only
    │ < 0.4   →  no action, manual triage
    ▼
GitHub API (set labels, post comment, link duplicates)
    │
    ▼
Append-only event log (every action, every confidence)

Next-step checklist

Define label taxonomy. 6-8 labels + optional component:* for monorepos.
Collect 50 historical issues, manually label. Your gold standard for evals.
Embed last 1000 closed issues into pgvector (or Qdrant / SQLite + cosine).
Set three confidence bands. Don’t ship with one threshold.
Build the tool schema. Output must match the GitHub API write you intend.
Ship without auto-close. Flip the switch only after a quarter of clean operation.
Write idempotency keys to a processed table. Webhooks redeliver.
Re-eval monthly. Watch for maintainer override rate — > 5% in a week is a drift alert.

Workflow spec: GitHub Issue Triage AI Starter Kit. Full case study: FixClaw guide. Article: GitHub Issue Management AI.

Does this shape match what you're building?

If you want me to scope a similar system for you — I respond in 24 hours.

Request a scope