Source
url: https://www.jmir.org/2024/1/e59050
raw: raw/Bijker_jmir-2024-1-e59050.pdf

TL;DR: One of the most methodologically rigorous early benchmarks of ChatGPT for qualitative content analysis. ChatGPT (GPT-3.5 Turbo) achieves substantial to almost perfect intercoder agreement for inductive coding and moderate to substantial agreement for deductive coding; in short, it performs better when categories emerge from the data than when they are imposed from theory.

Problem

Qualitative content analysis is resource-intensive. A researcher coding 537 forum posts through multiple iterative cycles, building and refining a coding scheme, then applying it consistently, can expect the process to take months and to require multiple human coders to establish intercoder reliability. The question Bijker et al. ask is specific and empirically tractable: can ChatGPT perform each distinct phase of this process with sufficient reliability to be useful?

The research sits in a tradition of computer-assisted qualitative analysis stretching back decades, but LLMs introduce something new: the model can receive natural-language instructions and respond in kind, making it potentially accessible to researchers without programming skills. That accessibility claim is what makes the evaluation high-stakes.

Approach

The data are 537 forum posts from people sharing experiences of reducing sugar consumption — messy, naturalistic text of the kind qualitative researchers routinely encounter. The team tested three distinct analytical approaches:

Inductive coding — coding schemes developed directly from the data, with categories and labels emerging through iterative engagement. Ten coding schemes were generated, applied to the full dataset across 100 separate ChatGPT conversations, and the results compared pairwise.

Unconstrained deductive coding — the Theoretical Domains Framework (TDF) provided the theoretical structure, but ChatGPT was instructed to relabel and adapt categories to fit the current data. Another 100 conversations.

Structured deductive coding — data coded directly into the TDF’s predefined matrix without relabeling. Ten conversations.

Prompt engineering was thorough and explicitly iterative. The team developed, tested, and refined prompts over multiple rounds before formal data collection, and the final prompt set is published as a supplementary appendix — an unusual transparency move that makes the work replicable.
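The paper's actual prompts live in that appendix; the sketch below is not them. It is a hypothetical programmatic analogue of one coding conversation, assuming the OpenAI chat API (the study worked in the ChatGPT interface itself) and invented category labels:

```python
# Hypothetical sketch of a single coding "conversation"; the paper's real
# prompts are in its supplementary appendix. Requires the openai package (v1.x).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Invented coding-scheme prompt. Note the example-anchored category labels,
# the style the paper found performed best.
SYSTEM_PROMPT = (
    "You are a qualitative coder. Assign each forum post to exactly one "
    "category:\n"
    "1. Cravings (e.g., 'I can't stop thinking about chocolate')\n"
    "2. Social pressure (e.g., 'my family keeps offering me dessert')\n"
    "3. Strategies (e.g., 'I replaced soda with sparkling water')\n"
    "Reply with the category number only."
)

def code_post(post: str) -> str:
    """Ask the model to code a single forum post."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # reduce run-to-run variation between "conversations"
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": post},
        ],
    )
    return response.choices[0].message.content.strip()
```

Each of the paper's 100 "conversations" corresponds to a fresh chat session; the API analogue is a stateless run of calls like this one per coding scheme.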

AI’s role

ChatGPT is positioned as a second coder, a computational analogue of the human research assistant who applies a coding scheme to data. The ambition is not full automation; human researchers designed the study, engineered the prompts, and interpreted the results. But within the coding and scheme-development phases, ChatGPT operated largely autonomously. The potential time saving is real: 100 ChatGPT conversations replacing what would otherwise be months of work by multiple human coders.

Epistemological stance

The paper operates squarely within a post-positivist framework. Reliability is the primary quality criterion; validity is acknowledged but explicitly set aside as beyond the study’s scope. The research question is “can ChatGPT produce consistent coding?” rather than “does ChatGPT produce accurate or theoretically meaningful coding?” This is a deliberate and defensible methodological choice — reliability is a prerequisite for validity, so establishing reliability first makes sense — but it means the paper’s conclusions are narrower than they might appear.

The use of Cohen’s κ as the evaluation metric reflects a quantitative sensibility applied to a qualitative task. Whether κ is the right measure for evaluating AI-generated qualitative codes is a question the paper does not raise.
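For readers who want to reproduce the agreement statistics, κ between two coding runs is a one-liner; the labels below are invented for illustration:

```python
# Cohen's kappa between two coding runs, treating each run as a "coder".
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
# p_e is the agreement expected by chance from the marginal label frequencies.
from sklearn.metrics import cohen_kappa_score

run_a = ["cravings", "strategies", "social", "cravings", "strategies"]
run_b = ["cravings", "strategies", "social", "social", "strategies"]

print(f"kappa = {cohen_kappa_score(run_a, run_b):.2f}")

# Landis & Koch benchmarks: <0.00 poor, 0.00-0.20 slight, 0.21-0.40 fair,
# 0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.00 almost perfect.
```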

Rigor and trustworthiness

The reliability assessment is unusually thorough: 10 coding schemes per approach, compared across 100 conversations, with both overall κ and category-specific κ reported. The range of κ = 0.72–0.82 for inductive schemes spans substantial to almost perfect agreement on the Landis & Koch scale. For the deductive approaches, κ = 0.58–0.73 is moderate to substantial: acceptable, but considerably less reassuring.

The category-specific results tell a more nuanced story. Some categories within the best inductive scheme reach κ = 0.95; others fall to 0.67. For the deductive approaches, some categories reach 0.87 while others fall to 0.13. The aggregate figures mask wide within-scheme variation.
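The paper does not spell out how its category-specific κ values are computed; one standard approach, assumed in the sketch below, is to binarize each category (assigned vs. not assigned) and score each separately:

```python
# One plausible way to obtain category-specific kappa (an assumption, not
# the paper's documented procedure): binarize per category, then score.
from sklearn.metrics import cohen_kappa_score

def per_category_kappa(run_a: list[str], run_b: list[str],
                       categories: list[str]) -> dict[str, float]:
    """Kappa per category, treating each as a binary assigned/not-assigned code."""
    return {
        cat: cohen_kappa_score(
            [label == cat for label in run_a],
            [label == cat for label in run_b],
        )
        for cat in categories
    }
```

Per-category scoring is what exposes the spread the aggregate figures hide: a scheme can average κ = 0.8 while one of its categories sits near chance.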

The prompt engineering process is well documented. The finding that category labels that include examples outperform terser labels is an empirically grounded insight with practical implications for prompt design.
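Concretely, the contrast is between a bare label and an example-anchored one; the wording below is invented to illustrate the pattern the paper reports:

```python
# Invented labels illustrating the reported contrast: the example-anchored
# style yielded higher agreement in the paper's prompt tests.
SHORT_LABEL = "2. Social pressure"
LABEL_WITH_EXAMPLES = (
    "2. Social pressure - posts about other people making reduction harder "
    "(e.g., 'my family keeps offering me dessert', 'coworkers bring cake')"
)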

Limitations

The paper acknowledges the most important limitation clearly: reliability without validity. High κ tells us ChatGPT is consistent; it does not tell us whether the codes are theoretically meaningful, contextually accurate, or reflective of participant experience. The authors also note that forum data is noisier than interview or survey data, and that GPT-3.5 Turbo’s token limits constrained how much context could be provided per prompt. Newer models with larger context windows may perform differently.
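The token-limit constraint is easy to check ahead of time. A minimal sketch, assuming the tiktoken package and the original gpt-3.5-turbo context window of 4,096 tokens (neither is part of the paper's method):

```python
# Checking whether a prompt plus a batch of posts fits the context window.
# Uses tiktoken; shown purely for illustration, not from the paper.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def fits_context(prompt: str, posts: list[str], limit: int = 4096,
                 reply_budget: int = 500) -> bool:
    """True if prompt + posts leave room for the model's reply."""
    total = len(enc.encode(prompt)) + sum(len(enc.encode(p)) for p in posts)
    return total + reply_budget <= limit
```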

What the paper does not address: the epistemological implications of treating ChatGPT as a coder in the first place. The κ framework assumes that coding is a mechanical task that can be assessed through agreement — an assumption contested by interpretivist and constructionist traditions. The paper’s framing makes sense within its post-positivist scope but should not be read as a general-purpose endorsement of AI coding.

Connections