Source
URL: https://doi.org/10.1007/s11135-025-02066-1
Raw: raw/Yang_10590730.pdf

TL;DR: A practitioner-oriented three-step framework (prompt engineering → reliability assessment → theory-driven analysis) for applying GPT-4 to 60 substance use interviews. The paper’s strongest contributions are its unusually detailed documentation of the iterative prompting process and its demonstration that theory-driven tasks require explicit theoretical priming — without it, GPT-4 produces high-level themes that miss theoretical constructs.

Problem

The qualitative research literature faces a paradox: the methods that produce the richest understanding of human experience (in-depth interviews, iterative coding, theory-informed analysis) are also the most resource-intensive. Scaling them to larger samples without sacrificing interpretive depth requires either more researchers or different tools.

GPT-4 represents a possible third option. But “use GPT-4 for qualitative analysis” is not a method; it is a description of a tool. Yang & Ma’s problem is methodological specification: what are the steps, what decisions must be made at each step, how do you evaluate whether the output is reliable, and what does the researcher need to know to guide the process effectively?

The substance use domain provides a concrete test case: 60 qualitative interviews from a program evaluation, with theoretically important constructs from behavioral science that must be accurately identified in participant language. This tests not just descriptive coding (do themes match?) but theory-driven coding (does GPT-4 correctly identify psychological constructs?).

Approach

The three-step framework is designed to address both reliability and validity concerns:

Step 1: Prompt engineering. The core methodological contribution. Yang & Ma document the iterative prompt development process in more detail than most papers in the corpus — showing how initial prompts, GPT-4’s responses, and revised prompts evolved through multiple rounds. Key finding: effective prompts for qualitative analysis require domain knowledge (understanding the substance use context), method knowledge (understanding what inductive coding means and how to instruct it), and theoretical knowledge (understanding the psychological constructs to be coded). Without all three, the prompts produce output that is coherent but analytically superficial.
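To make the three knowledge types concrete, here is a minimal sketch of what a layered coding prompt and refinement loop might look like, assuming a Python workflow. The prompt wording and the `call_gpt4` helper are illustrative assumptions, not the authors' actual prompts, which are documented in the paper itself.

```python
# Illustrative sketch only: the wording below is invented, not taken
# from Yang & Ma. It layers the knowledge types the paper identifies
# as necessary for effective qualitative-coding prompts.

DOMAIN_CONTEXT = (  # domain knowledge: the substance use setting
    "You are assisting with a program evaluation of a substance use "
    "intervention. Excerpts come from in-depth participant interviews."
)

METHOD_INSTRUCTIONS = (  # method knowledge: what inductive coding means
    "Perform inductive coding: identify meaningful units in the excerpt "
    "and assign short codes grounded in the participant's own language. "
    "Do not impose predefined categories."
)

def build_prompt(excerpt: str) -> str:
    """Assemble context, method instructions, and data into one prompt."""
    return f"{DOMAIN_CONTEXT}\n\n{METHOD_INSTRUCTIONS}\n\nExcerpt:\n{excerpt}\n\nCodes:"

def call_gpt4(prompt: str) -> str:
    """Hypothetical stand-in for a GPT-4 API call (e.g., via an LLM client)."""
    return "placeholder: peer support; relapse trigger"

# The iterative loop the paper documents: run, inspect, revise, rerun.
# The inspection step is done by the researcher, which is where
# qualitative expertise enters the process.
for excerpt in ["<interview segment>"]:
    codes = call_gpt4(build_prompt(excerpt))
    # If the output is coherent but analytically superficial, revise
    # the context or instructions above and repeat.
```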

Step 2: Reliability assessment. GPT-generated codes are systematically compared against researcher-generated codes. The comparison procedure is documented in enough detail to be replicated, which is unusual in the AI-TA literature.
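The paper reports this comparison procedurally rather than statistically (see Limitations). As one illustrative assumption about how such a comparison could be operationalized, a common approach is per-excerpt set overlap between the two coders' code assignments; the data structures below are invented, not the paper's.

```python
# Hypothetical illustration of a systematic code comparison: for each
# excerpt, compare the set of codes GPT-4 assigned with the set the
# researchers assigned. The Jaccard index is one simple overlap measure.

def jaccard(a: set[str], b: set[str]) -> float:
    """|A ∩ B| / |A ∪ B|; 1.0 means identical code sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Toy data, invented for illustration.
gpt_codes = {
    "excerpt_01": {"peer support", "relapse trigger"},
    "excerpt_02": {"treatment access"},
}
researcher_codes = {
    "excerpt_01": {"peer support", "craving"},
    "excerpt_02": {"treatment access", "stigma"},
}

overlaps = {
    eid: jaccard(gpt_codes[eid], researcher_codes[eid])
    for eid in gpt_codes
}
print(overlaps)  # {'excerpt_01': 0.333..., 'excerpt_02': 0.5}
```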

Step 3: Theory-driven thematic analysis. GPT-4’s capacity to apply psychological constructs to participant language is evaluated. This is the methodologically novel contribution: most AI-TA studies test descriptive or inductive coding. Yang & Ma test whether GPT-4 can work with theoretically specified constructs — and find that explicit theoretical priming in the prompt is required for reliable performance. Without priming, GPT-4 produces themes at a level of abstraction that misses the constructs.
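A hedged sketch of what explicit theoretical priming can look like at the prompt level. The constructs named below (self-efficacy, social norms) are generic behavioral-science examples chosen for illustration; they are not necessarily the constructs the paper codes for, and the wording is invented.

```python
# Two prompt variants for the same excerpt. Without construct
# definitions, the paper finds GPT-4 tends to return themes at too
# high a level of abstraction to map onto the theory.

UNPRIMED = (
    "Identify the main themes in this interview excerpt about "
    "substance use recovery."
)

# Explicit theoretical priming: name and define each construct before
# asking the model to code for it. Constructs here are illustrative.
PRIMED = (
    "Code this excerpt using the following behavioral-science "
    "constructs, tagging each relevant passage with the construct it "
    "expresses:\n"
    "- self-efficacy: the participant's belief in their own ability "
    "to abstain or change their behavior\n"
    "- social norms: perceived expectations of peers or family about "
    "substance use\n"
    "If no construct applies, say so rather than inventing a theme."
)

def theory_driven_prompt(excerpt: str, primed: bool = True) -> str:
    instructions = PRIMED if primed else UNPRIMED
    return f"{instructions}\n\nExcerpt:\n{excerpt}"

print(theory_driven_prompt("I didn't think I could stay clean on my own."))
```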

AI’s Role

AI is positioned as a researcher-guided analysis instrument — capable of scaling qualitative analysis but requiring substantial researcher expertise to operate appropriately. The central claim: qualitative expertise is not a nice-to-have but is necessary to guide GPT applications. A researcher who cannot evaluate whether a code correctly captures a psychological construct cannot use GPT-4 to code for psychological constructs.

This is a paradox that the paper does not resolve: the expertise required to guide AI effectively may be comparable to the expertise required to do the analysis without it. The efficiency gain depends on the researcher’s existing competence.

Epistemological Stance

Post-positivist, within a social and behavioral science framework. Reliability assessment (comparison of GPT-generated with researcher-generated codes) is the primary quality criterion. The paper engages with ontological and epistemological concepts briefly (reality as subject- and context-dependent; knowledge as constructed) but does not develop these commitments into methodological implications. The evaluation logic remains concordance-based.

Rigor and Trustworthiness

The paper’s methodological transparency is its strongest feature. The prompting documentation — showing initial prompts, GPT responses, and revised prompts — creates a partial audit trail that allows readers to evaluate the human-AI interaction rather than just its outputs.

The reliability comparison procedure is systematic and documented. The theory-driven analysis section uses explicit criteria for evaluating whether GPT-4 correctly identified psychological constructs, rather than relying on impressionistic judgment.

The three-step structure provides a replicable framework that can be adapted to other contexts. This is rarer than it appears in the AI-TA literature, where many “frameworks” are lists of principles rather than operational procedures.

Limitations

The paper does not report formal reliability statistics (e.g., Cohen's κ or a Jaccard index) for the comparison of GPT-generated and researcher-generated codes. The comparison is documented but not quantified, which makes it difficult to benchmark this study's reliability performance against studies like bijker-chatgpt-qca-2024 or prescott-ai-thematic-analysis-2024.
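For context, this is the kind of calculation such a statistic would involve: Cohen's κ over paired presence/absence judgments for a given code across excerpts. The data below are invented for illustration; the paper itself reports no such figures.

```python
# Standard Cohen's kappa on paired binary decisions (code present /
# absent), computed from scratch. Assumes chance agreement pe < 1.

def cohens_kappa(rater1: list[int], rater2: list[int]) -> float:
    n = len(rater1)
    # Observed agreement: proportion of decisions where raters match.
    po = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected chance agreement, from each rater's marginal rates.
    p1_yes = sum(rater1) / n
    p2_yes = sum(rater2) / n
    pe = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)
    return (po - pe) / (1 - pe)

# Toy example: did each of 8 excerpts receive a given code?
gpt        = [1, 1, 0, 1, 0, 0, 1, 0]
researcher = [1, 0, 0, 1, 0, 1, 1, 0]
print(round(cohens_kappa(gpt, researcher), 2))  # 0.5
```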

The theory-driven analysis section, while methodologically valuable, uses psychological constructs specific to the substance use domain. Whether GPT-4’s performance on theory-driven coding generalizes to other theoretical frameworks (Braun & Clarke’s reflexive TA, grounded theory, discourse analysis) is unknown.

The iterative prompt refinement process, while well-documented, raises reproducibility questions: different researchers with different domain and method knowledge would develop different prompts, producing different outputs. The study documents one path through the methodological space, not a generalizable procedure.

Connections

  • llm-qualitative-research — broader landscape
  • prompt-engineering — the central practical skill this paper teaches in more operational detail than most comparable guides
  • intercoder-agreement — the reliability assessment framework; Yang & Ma’s comparison procedure is well-documented but lacks formal κ calculation
  • bijker-chatgpt-qca-2024 — the reliability benchmark; compare methodological approaches to assessing GPT coding accuracy
  • goyanes-chatgpt-protocol-2025 — parallel protocol paper published in the same journal; compare the six-step and three-step approaches
  • naeem-chatgpt-ta-steps-2025 — parallel step-by-step guide aligned to Braun & Clarke phases
  • validity-trustworthiness — the theory-driven analysis finding has validity implications: high concordance on descriptive themes does not guarantee valid capture of theoretical constructs