TL;DR: Intercoder agreement (Cohen's κ) measures how consistently two or more coders classify the same data; it is the standard benchmark for evaluating LLM-generated coding in qualitative research.

What it measures

When qualitative data is coded by multiple researchers (or by a human and an LLM), intercoder agreement quantifies how often they agree, adjusted for the agreement expected by chance. Cohen's κ is the standard metric: κ = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the proportion expected by chance. A common interpretation scale (Landis & Koch) is:

  • 0.00–0.20: slight agreement
  • 0.21–0.40: fair
  • 0.41–0.60: moderate
  • 0.61–0.80: substantial
  • 0.81–1.00: almost perfect

In llm-qualitative-research, the “two coders” are typically multiple ChatGPT conversations (same prompt, different sessions); comparing their output assesses whether the LLM codes consistently over time and across accounts.
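
A minimal sketch of the computation, assuming one category label per segment. The `cohens_kappa` helper and the two example runs are hypothetical placeholder data, not output from any actual study:

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders labelling the same segments."""
    assert len(codes_a) == len(codes_b)
    n = len(codes_a)
    # Observed agreement: proportion of segments coded identically.
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement: product of the coders' marginal frequencies per category.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a.keys() | freq_b.keys())
    if p_e == 1.0:  # degenerate case: both coders always use the same single category
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two ChatGPT sessions coding the same 8 interview segments.
run_1 = ["barrier", "motivation", "barrier", "context", "barrier", "motivation", "context", "barrier"]
run_2 = ["barrier", "motivation", "context", "context", "barrier", "barrier", "context", "barrier"]

print(cohens_kappa(run_1, run_2))  # 0.6 -> "moderate" on the scale above
```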

Why it matters for LLM evaluation

LLMs are probabilistic — the same prompt can yield different outputs at different times. Intercoder agreement between ChatGPT runs is therefore a proxy for reliability: if κ is high, the model is coding consistently; if low, the output is too variable to trust.

Bijker et al. (bijker-chatgpt-qca-2024) used this approach systematically: 10 conversations per coding scheme, compared pairwise, to estimate overall and category-specific κ.
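
The pairwise design could be scripted roughly like this; it reuses the `cohens_kappa` helper and placeholder runs from the sketch above and is not Bijker et al.'s actual code:

```python
from itertools import combinations
from statistics import mean

# One coding run per ChatGPT conversation; placeholder data reusing run_1 / run_2.
# In the Bijker et al. design there would be 10 such runs per coding scheme.
runs = [run_1, run_2]

pairwise = {
    (i, j): cohens_kappa(runs[i], runs[j])
    for i, j in combinations(range(len(runs)), 2)
}
print(f"overall kappa (mean of pairwise): {mean(pairwise.values()):.2f}")
```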

Reliability ≠ validity

A critical caveat: high intercoder agreement only proves the LLM is consistent, not that it’s correct. Two coders can agree on wrong answers. Validity — whether the coding accurately captures the underlying construct — requires comparison against a human gold standard, which most current studies (including Bijker et al.) have not done.
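
To make the gap concrete, a hypothetical comparison (the `human_gold` labels are invented, reusing the helper and runs sketched above): run-to-run κ can be respectable while run-to-gold κ is low.

```python
# Invented human gold-standard codes for the same 8 segments.
human_gold = ["context", "motivation", "context", "context", "barrier", "motivation", "barrier", "context"]

print(f"run-to-run kappa (reliability):   {cohens_kappa(run_1, run_2):.2f}")       # ~0.60
print(f"run-to-gold kappa (validity check): {cohens_kappa(run_1, human_gold):.2f}")  # ~0.27
```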

In practice

Low κ in specific categories often signals:

  • Overlapping category definitions (ambiguous boundaries)
  • Insufficient examples or context in category labels
  • Mismatch between the theoretical framework and the actual data

These are diagnostic signals for prompt-engineering refinement, not just failures.
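
One way to locate the weak categories is per-category κ: binarize each category (assigned vs. not assigned) and compute agreement separately. A sketch, again reusing the hypothetical helper and runs from above:

```python
# Per-category agreement: reduce the coding to a binary decision per category.
for cat in sorted(set(run_1) | set(run_2)):
    a = [label == cat for label in run_1]
    b = [label == cat for label in run_2]
    print(f"{cat:12s} kappa = {cohens_kappa(a, b):.2f}")
```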

See also