Field Note #3: I let the AI close duplicate issues. Then I turned that off.

FixClaw GitHub issue triage with Claude tool use. Hybrid duplicate detection, three confidence bands, human-in-the-loop by default. What broke when I let it auto-close.

Subject
FixClaw — AI GitHub issue triage
Industry
Open source / maintainer tools
Stack
TypeScript backend, Claude tool use, GitHub webhooks, pgvector

The case

Maintainers with 50+ issues a month have a triage problem. New issues queue up unlabeled, duplicates ship as separate threads, the maintainer’s inbox becomes the to-do list. Backlog rots, contributors stop trusting that anyone is reading.

A triage agent helps if it gets three things right: labels, duplicates, and a first-draft reply — and leaves the close-button click to a human.

FixClaw is what I run on my own repos. It watches issues.opened, classifies labels, runs a hybrid duplicate check, and posts a draft reply or needs-info prompt. Maintainer keeps the final click on close.

The numbers

MetricValue
Time-to-first-label (p50)< 5 min after issues.opened
Label accuracy vs gold standard0.87
Duplicate recall (real duplicates caught)0.85
Duplicate precision (no false flags)0.92
Maintainer time saved (self-reported)~6 hours/week on a 60-issue/month repo
LLM cost (Claude Sonnet, 60 issues/mo)~€4/month

Eval set: 50 historical issues from 3 repos, manually labeled. Re-eval monthly.

What worked

  • Human-in-the-loop by default. The agent proposes — labels, duplicate links, a draft needs-info reply. The human clicks. No auto-close in v1.
  • Hybrid duplicate detection. Pass 1: exact title match (lowercased, whitespace-normalized). Pass 2: vector embedding (nomic-embed-text) over body + title, cosine ≥ 0.85, search across the last 1000 closed issues. Both hit → label duplicate + link. Only vector hits → label possible-duplicate, queue for review.
  • Structured tool use, not free-text replies. Claude calls a propose_triage tool with a typed schema: {labels, duplicates, reply_draft, confidence}. No prose parsing.
  • Idempotency keys. sha256(event_id + issue_number + body_hash) written to a processed table before any action. Webhook re-deliveries don’t double-label.

What failed

  • Auto-closing high-confidence duplicates. Confidence > 0.95 and label = duplicate with verified link — felt safe, was not. One false positive a quarter became a maintainer-vs-bot argument in a public thread. Turned off; only humans close.
  • Vector-only duplicate detection. Without the exact-title pre-pass, the vector pass surfaced too many “thematically similar” issues. Reverted to hybrid.
  • LLM-proposed labels with no component taxonomy. On a monorepo, the agent proposed bug repeatedly with no component scope. Added component:api / component:ui / component:cli labels and a CODEOWNERS routing rule.
  • A single confidence threshold. “Above 0.5 act, below ignore” gave bad outcomes both directions. Three bands (auto / review / defer) gave the agent room to be uncertain.

The architecture

GitHub webhook (issues.opened / issue_comment.created)
    │
    ▼
Webhook handler (TypeScript)
    │ idempotency check on sha256(event_id + issue# + body_hash)
    ▼
Embed body + title (nomic-embed-text → pgvector)
    │
    ▼
Hybrid duplicate check
    │ Pass 1: exact title match
    │ Pass 2: cosine ≥ 0.85 over last 1000 closed
    ▼
Claude (tool use)
    │ propose_triage({labels, duplicates, reply_draft, confidence})
    ▼
Confidence band routing
    │ ≥ 0.85  →  apply labels + post draft
    │ 0.6-0.85 →  apply + mention maintainer
    │ 0.4-0.6 →  internal comment only
    │ < 0.4   →  no action, manual triage
    ▼
GitHub API (set labels, post comment, link duplicates)
    │
    ▼
Append-only event log (every action, every confidence)

Next-step checklist

  • Define label taxonomy. 6-8 labels + optional component:* for monorepos.
  • Collect 50 historical issues, manually label. Your gold standard for evals.
  • Embed last 1000 closed issues into pgvector (or Qdrant / SQLite + cosine).
  • Set three confidence bands. Don’t ship with one threshold.
  • Build the tool schema. Output must match the GitHub API write you intend.
  • Ship without auto-close. Flip the switch only after a quarter of clean operation.
  • Write idempotency keys to a processed table. Webhooks redeliver.
  • Re-eval monthly. Watch for maintainer override rate — > 5% in a week is a drift alert.

Workflow spec: GitHub Issue Triage AI Starter Kit. Full case study: FixClaw guide. Article: GitHub Issue Management AI.

Does this shape match what you're building?

If you want me to scope a similar system for you — I respond in 24 hours.

Request a scope