| url | https://doi.org/10.1007/s11135-025-02199-3 |
|---|---|
| raw | raw/Goyanes_s11135-025-02199-3.pdf |
TL;DR: A tested six-step protocol for using ChatGPT in thematic analysis of interview data, published in Quality & Quantity. The protocol is structured for replicability and explicitly tested on real interview transcripts. Key finding: ChatGPT reliably identifies thematic patterns, particularly at the initial stages of analysis and with large transcript volumes, but output granularity depends entirely on prompt quality. Human contextual insight and reflexivity cannot be substituted.
Problem
The proliferation of guidance on AI-assisted qualitative analysis has produced recommendations, frameworks, and best practices at varying levels of abstraction. Most are either too general (AI should be used as a “research assistant”) or too narrowly tied to a single study context to be readily adapted. What the field lacks is a structured, tested, replicable protocol that researchers outside the originating team can actually implement.
Goyanes et al. address this specifically for thematic analysis of interview data — the most common qualitative design in social science and communication research. The protocol is designed with communication scholars as an implicit audience (the lead author is from a Department of Communication), making it particularly relevant for the corpus this wiki documents.
The practical problem is also real: qualitative thematic analysis of interview data at any scale above 20–30 interviews generates volumes of transcript text that create genuine time and resource pressures. Protocols that can reliably facilitate the early stages of analysis — without sacrificing validity at the interpretive stage — have direct value for applied communication research.
Approach
The six-step protocol is the paper’s central contribution:
Step 1: Data preparation. Cleaning, formatting, and contextualizing transcripts for ChatGPT input. This includes managing token limits, structuring interview segments, and providing sufficient context for ChatGPT to understand the research topic. This step receives detailed attention because it is where most naive implementations fail: raw, unformatted transcripts fed into ChatGPT produce poor output.
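The paper does not provide code for this step, but the chunking problem it describes can be sketched. The following is a minimal illustration, assuming a crude characters-per-token heuristic (not the authors' method) and a hypothetical `chunk_transcript` helper:

```python
# Hypothetical sketch of Step 1's token-budget problem: group interview turns
# into chunks that fit a model's context window. The 4-chars-per-token ratio
# is a rough approximation; a real implementation would use a tokenizer.
def chunk_transcript(turns, max_tokens=3000, chars_per_token=4):
    """Group (speaker, text) turns into chunks under an approximate token budget."""
    budget = max_tokens * chars_per_token
    chunks, current, size = [], [], 0
    for speaker, text in turns:
        line = f"{speaker}: {text}"
        # Flush the current chunk before it would exceed the budget,
        # but never split a single turn across chunks.
        if current and size + len(line) > budget:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1  # +1 for the joining newline
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk would then be sent to ChatGPT together with the contextual framing (research topic, participant role) that the protocol says raw transcripts lack.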
Step 2: Defining the analysis process. Specifying to ChatGPT the analytical approach (inductive/deductive), scope (what counts as a theme), and research questions. This is essentially prompt framing: without explicit specification, ChatGPT defaults to its generic interpretation of “qualitative analysis,” which may not match the researcher’s needs.
Step 3: Chatbot interaction. The initial prompting round — structured queries to generate first-pass codes and themes. The paper provides illustrative prompt templates, though it acknowledges these are starting points rather than universal scripts.
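The paper's templates are not reproduced here, but the shape of a Step 2/3 prompt — explicit approach, theme definition, and research question, followed by the query — can be sketched. All wording below is illustrative, not the authors' verbatim template:

```python
# Hypothetical prompt builder combining Step 2's specification (approach,
# scope, research question) with Step 3's first-pass coding query.
def build_coding_prompt(research_question, approach, theme_scope, segment):
    """Compose a first-pass coding prompt for one transcript segment."""
    return (
        f"You are assisting with {approach} thematic analysis of interview data.\n"
        f"Research question: {research_question}\n"
        f"For this analysis, a theme is: {theme_scope}\n\n"
        "Identify candidate codes and themes in the segment below. "
        "For each theme, quote the supporting excerpt verbatim.\n\n"
        f"Segment:\n{segment}"
    )
```

Asking for verbatim supporting excerpts is a design choice that makes Step 5's validation checkable against the original data.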
Step 4: Iterative process. Refining prompts based on initial output; running multiple rounds. The protocol explicitly recommends treating the first round as a rough draft and iterating until output meets quality thresholds. This is where prompt-engineering as a practical skill becomes central.
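Step 4's treat-the-first-round-as-a-draft logic can be expressed as a simple loop. This is a sketch of the control flow only; `ask_model`, `accept`, and `refine` are hypothetical callables standing in for the chatbot call, the researcher's quality check, and the prompt revision:

```python
# Hypothetical sketch of Step 4: rerun the prompt, refining it after each
# round, until the output passes the researcher's quality check or the
# round budget is exhausted. The first output is treated as a rough draft.
def iterate_until_accepted(ask_model, prompt, accept, refine, max_rounds=4):
    """Run prompt/review rounds until `accept(output)` is True."""
    output = ask_model(prompt)
    for _ in range(max_rounds - 1):
        if accept(output):
            break
        prompt = refine(prompt, output)  # prompt engineering happens here
        output = ask_model(prompt)
    return output
```

In practice `accept` is a human judgment, not a function, which is why the protocol caps iteration with a quality threshold rather than an automated stopping rule.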
Step 5: Review and validation. Researcher critically reviews output against the original data. This is the step most likely to be skipped under time pressure, and the protocol is explicit that it is non-optional — ChatGPT’s plausible-sounding output must be checked against specific data segments.
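One mechanical slice of Step 5 — confirming that excerpts ChatGPT attributes to the data actually occur in the transcript — can be automated, though the protocol's validation is broader than this. A minimal sketch, with a hypothetical `find_unsupported_quotes` helper:

```python
# Hypothetical Step 5 aid: flag quotes in ChatGPT's output that cannot be
# found verbatim in the source transcript (whitespace- and case-insensitive).
# This catches fabricated excerpts, not misinterpreted ones; judging whether
# a theme adequately represents the data remains a human task.
def find_unsupported_quotes(quotes, transcript):
    """Return the quotes that do not appear verbatim in the transcript."""
    norm = " ".join(transcript.split()).lower()
    return [q for q in quotes if " ".join(q.split()).lower() not in norm]
```

An empty return value means every excerpt is grounded in the data; a non-empty one is a direct signal that the output needs another Step 4 round.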
Step 6: Analysis and interpretation. Final researcher synthesis and meaning-making. This stage is fully human: ChatGPT’s contribution to the analytic record has been incorporated into the researcher’s coding scheme, but the interpretive conclusions are the researcher’s.
The protocol was tested on real interview data (domain not specified in the abstract but consistent with communication research). Results confirmed that ChatGPT significantly facilitated analysis at early stages and with large transcript volumes, while output granularity varied directly with prompt quality.
AI’s Role
AI is positioned as a structured analytic facilitator — specifically useful at the initial coding and theme-identification stages, where volume and pattern detection matter, and less useful at the interpretive synthesis stage, where contextual knowledge and reflexivity are required.
The paper is frank about what ChatGPT cannot do: contextual insight (understanding what something means within a participant’s life context), subtle metaphorical nuances, and reflexive judgment about whether a theme adequately represents the data. These remain exclusively human capacities in this protocol.
Epistemological Stance
Post-positivist / pragmatist, with an applied social science orientation. The quality criterion is protocol reliability — consistent, reproducible outputs that facilitate analysis without introducing systematic distortions. The paper does not engage with interpretivist or constructionist epistemologies.
The communication research tradition the authors work in tends toward post-positivist assumptions about qualitative work (theory-driven, hypothesis-informed, reliability-conscious), which is reflected in the protocol’s emphasis on structured steps, explicit specification, and validation checks.
Rigor and Trustworthiness
The six-step structure imposes rigor through procedural discipline: the protocol’s explicit iteration and validation steps reduce the risk of accepting ChatGPT output uncritically. The paper demonstrates that these steps make a difference — Step 5 (review and validation) catches errors that Steps 3 and 4 produce.
The finding that output granularity depends on prompt quality is empirically demonstrated rather than merely asserted, which adds credibility. The protocol is reproducible in the sense that another researcher following the six steps should produce similar-quality output (though not identical content, since ChatGPT is nondeterministic).
Limitations
The paper does not provide quantitative reliability assessment (κ, Jaccard) comparing protocol-generated themes against a human reference standard. This makes it impossible to place the protocol’s performance on the evidence continuum alongside bijker-chatgpt-qca-2024, prescott-ai-thematic-analysis-2024, or bennis-ai-thematic-analysis-2025.
The six-step structure, while practical, is somewhat mechanical. Researchers working in interpretive or reflexive traditions may find that it fits small-q assumptions without engaging with Big-Q concerns. The protocol would need significant adaptation for xu-ai-thematic-analysis-2026-style reflexive TA or brailas-ai-qualitative-research-2025-style abductive engagement.
The dataset on which the protocol was tested is not described in enough detail to assess how representative it is. Domain, interview length, participant characteristics, and transcript volume all affect how well the protocol works, and these details are not reported.
Connections
- llm-qualitative-research — broader landscape
- prompt-engineering — the central practical skill the protocol depends on and develops; Goyanes’s iterative refinement is consistent with best-practice guidance across the corpus
- naeem-chatgpt-ta-steps-2025 — parallel step-by-step guide aligned to Braun & Clarke’s six phases; compare the protocols for overlapping and diverging design choices
- yang-gpt4-qualitative-guide-2025 — parallel protocol in the same journal with a three-step structure; compare the two
- nguyen-trung-gaita-2025 — GAITA offers a more theoretically developed framework for AI-assisted TA; compare the methodological ambitions
- bijker-chatgpt-qca-2024 — the reliability benchmark; Goyanes’s lack of quantitative assessment is the gap
- validity-trustworthiness — the review-and-validation step is Goyanes’s answer to the validity question; whether it is sufficient is debated
What links here
- AI in Qualitative Research
- Index
- Naeem et al. (2025) — Thematic Analysis and Artificial Intelligence: A Step-by-Step Process for Using ChatGPT
- Nguyen-Trung (2025) — ChatGPT in Thematic Analysis: GAITA and the ACTOR Framework
- Qualitative AI Methods — A Living Taxonomy
- Yang & Ma (2025) — Artificial Intelligence in Qualitative Analysis: A Practical Guide Using GPT-4 on Substance Use Interview Data