| TL;DR | Intercoder agreement (Cohen's κ) measures how consistently two or more coders classify the same data — and it's the standard benchmark for evaluating LLM-generated coding in qualitative research. |
|---|---|
## What it measures
When qualitative data is coded by multiple researchers (or by a human and an LLM), intercoder agreement quantifies how often they agree, adjusted for agreement expected by chance. Cohen's κ is the standard statistic: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e the proportion expected by chance. Common interpretive benchmarks:
- κ ≤ 0.20: slight agreement
- 0.21–0.40: fair
- 0.41–0.60: moderate
- 0.61–0.80: substantial
- 0.81–1.00: almost perfect
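Since κ subtracts chance agreement from observed agreement, it can be computed in a few lines. A minimal sketch (for illustration only; in practice a stats library would also handle edge cases and confidence intervals):

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Chance-adjusted agreement between two coders' label sequences."""
    assert len(codes_a) == len(codes_b)
    n = len(codes_a)
    # Observed agreement: fraction of items both coders labeled identically.
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement: expected overlap given each coder's label frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Example: 4/5 raw agreement shrinks to κ ≈ 0.62 after the chance correction.
kappa = cohens_kappa(["pos", "pos", "neg", "neg", "pos"],
                     ["pos", "neg", "neg", "neg", "pos"])
```

Note how κ (≈ 0.62, "substantial") is noticeably lower than the raw 80% agreement: two coders guessing from similar label distributions would already agree fairly often by chance.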
In llm-qualitative-research, the “two coders” are typically multiple ChatGPT conversations (same prompt, different sessions); agreement between them assesses whether the LLM produces consistent output over time and across accounts.
## Why it matters for LLM evaluation
LLMs are probabilistic — the same prompt can yield different outputs at different times. Intercoder agreement between ChatGPT runs is therefore a proxy for reliability: if κ is high, the model is coding consistently; if low, the output is too variable to trust.
Bijker et al. (bijker-chatgpt-qca-2024) applied this approach systematically: ten conversations per coding scheme, compared pairwise to estimate overall and category-specific κ.
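Under this design, κ is computed for every pair of conversations and then averaged. A self-contained sketch of that pairwise scheme (the κ helper is a re-implementation for illustration, not the study's actual code):

```python
from collections import Counter
from itertools import combinations

def cohens_kappa(a, b):
    """Chance-adjusted agreement between two label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    fa, fb = Counter(a), Counter(b)
    p_e = sum(fa[c] * fb.get(c, 0) for c in fa) / n**2
    return (p_o - p_e) / (1 - p_e)

def mean_pairwise_kappa(runs):
    """Average κ over all pairs of coding runs, e.g. 10 ChatGPT sessions
    coding the same data -> 45 pairwise comparisons."""
    kappas = [cohens_kappa(a, b) for a, b in combinations(runs, 2)]
    return sum(kappas) / len(kappas)

# Three hypothetical runs coding the same three items:
runs = [["a", "a", "b"], ["a", "a", "b"], ["a", "b", "b"]]
overall = mean_pairwise_kappa(runs)  # averages κ over the 3 pairs
```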
## Reliability ≠ validity
A critical caveat: high intercoder agreement only proves the LLM is consistent, not that it’s correct. Two coders can agree on wrong answers. Validity — whether the coding accurately captures the underlying construct — requires comparison against a human gold standard, which most current studies (including Bijker et al.) have not done.
## In practice
Low κ in specific categories often signals:
- Overlapping category definitions (ambiguous boundaries)
- Insufficient examples or context in category labels
- Mismatch between the theoretical framework and the actual data
These are diagnostic signals for prompt-engineering refinement, not just failures.
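One way to surface these problem categories is to collapse the coding to one-vs-rest per category and compute a binary κ for each, so low-agreement categories stand out. An illustrative sketch (my own helper, not a method from the cited studies):

```python
from collections import Counter

def binary_kappa(a, b):
    """Cohen's κ for two binary (presence/absence) label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    fa, fb = Counter(a), Counter(b)
    p_e = sum(fa[c] * fb.get(c, 0) for c in fa) / n**2
    # Degenerate case: both coders used a single label for every item.
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

def per_category_kappa(codes_a, codes_b, categories):
    """Collapse multi-category codes to one-vs-rest and compute κ per category,
    flagging categories with ambiguous boundaries."""
    return {
        cat: binary_kappa([c == cat for c in codes_a],
                          [c == cat for c in codes_b])
        for cat in categories
    }

# Hypothetical coding: the two coders disagree only at the x/y boundary,
# so "x" and "y" score lower than "z".
result = per_category_kappa(["x", "x", "y", "z"],
                            ["x", "y", "y", "z"],
                            ["x", "y", "z"])
```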
## See also
- llm-qualitative-research — the context in which intercoder agreement is applied to LLM output
- bijker-chatgpt-qca-2024 — detailed κ results across inductive and deductive approaches
- prompt-engineering — how prompts affect intercoder agreement scores
## What links here
- Ayik et al. (2026) — Human vs. AI: Evaluating TA With ChatGPT, QInsights, ATLAS.ti AI, and MAXQDA AI Assist
- Bennis & Mouwafaq (2025) — Advancing AI-Driven Thematic Analysis: A Comparative Study of Nine Generative Models
- Bijker et al. (2024) — ChatGPT for Automated Qualitative Research: Content Analysis
- Carlsen & Ralund (2022) — Computational Grounded Theory Revisited: From Computer-Led to Computer-Assisted Text Analysis
- Computational Grounded Theory
- Empirical Findings
- Epistemic Flattening
- Epistemology — Stances Across the Literature
- AI in Qualitative Research
- Index
- Jowsey et al. (2025) — Frankenstein, Thematic Analysis and Generative AI: Quality Appraisal Methods
- LLMs for Qualitative Research
- Montrosse-Moorhead (2023) — Evaluation Criteria for Artificial Intelligence
- Nelson (2020) — Computational Grounded Theory: A Methodological Framework
- Nicmanis & Spurrier (2025) — Getting Started with AI-Assisted Qualitative Analysis: An Introductory Guide
- Perkins & Roe (2024) — The Use of Generative AI in Qualitative Analysis: Inductive Thematic Analysis with ChatGPT
- Prescott et al. (2024) — Comparing the Efficacy and Efficiency of Human and Generative AI: Qualitative Thematic Analyses
- Prompt Engineering
- Qualitative AI Methods — A Living Taxonomy
- Reeping et al. (2025) — Interrogating the Use of LLMs in Qualitative Research Using the Q3 Framework
- Sakaguchi et al. (2025) — Evaluating ChatGPT in Qualitative Thematic Analysis in the Japanese Clinical Context
- Validity and Trustworthiness
- Xu (2026) — Doing Thematic Analysis in the Age of Generative AI: Practices, Ethics and Reflexivity
- Yang & Ma (2025) — Artificial Intelligence in Qualitative Analysis: A Practical Guide Using GPT-4 on Substance Use Interview Data