Source
url: https://doi.org/10.1177/16094069231201504
raw: raw/hamilton-et-al-2023-exploring-the-use-of-ai-in-qualitative-analysis-a-comparative-study-of-guaranteed-income-data.pdf

TL;DR: One of the first academic comparisons of ChatGPT vs. human qualitative analysis. The headline result is complementarity: human coders identified nuanced affective and experiential themes ChatGPT missed, and ChatGPT surfaced patterns human coders — working within a phenomenological frame — had not foregrounded. The relationship is not competitive but additive. Published within months of ChatGPT’s release, the paper is historically significant as an early empirical document of what AI actually produced when given real qualitative data.

Problem

In the months immediately following ChatGPT’s public release (November 2022), discourse about its implications for research was dominated by concerns about plagiarism and research integrity. Hamilton et al. ask a different question: can ChatGPT supplement human-centered qualitative research tasks? The framing is exploratory and non-adversarial — not “will AI replace qualitative researchers?” but “what happens when we run the same data through both?”

The context is concrete: a guaranteed income pilot program. Seventy-one recipients were interviewed about their experiences. Human researchers coded the interviews using Colaizzi’s descriptive phenomenological method, producing themes grounded in participants’ lived experience. The paper then asks ChatGPT to analyze the same data and compares the outputs.

At the time of writing, this was one of the first academic attempts to directly compare ChatGPT with human qualitative analysis — not a simulation, not a thought experiment, but actual parallel coding of real data from a social justice research project.

Approach

The study is methodologically simple by later standards but historically important. Human coders applied Colaizzi’s phenomenological method — a rigorous, researcher-immersive approach that emphasizes attending to the lived experience of participants as expressed in their words. ChatGPT was then asked to analyze the same interview data.

The comparison was qualitative: themes were compared for overlap and divergence. The paper reports which themes appeared in both analyses (indicating convergence) and which appeared in only one (indicating complementary capture). No reliability metrics (κ, Jaccard) were applied — the paper predates the methodological infrastructure for quantitative AI-TA comparison.
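To make concrete what such a metric would look like, here is a minimal sketch of a Jaccard-style theme-overlap calculation of the kind later AI-TA comparison work applies. The theme labels are invented placeholders, not themes reported by Hamilton et al.

```python
# Hypothetical illustration of a set-overlap reliability metric the paper
# predates. Theme labels below are invented, not taken from the study.

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two theme sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Placeholder theme sets for the human (phenomenological) and AI analyses.
human_themes = {"financial security", "dignity", "stress relief", "future planning"}
ai_themes = {"financial security", "stress relief", "community support"}

overlap = jaccard(human_themes, ai_themes)
print(f"Jaccard overlap: {overlap:.2f}")        # 2 shared / 5 total = 0.40
print("human-only themes:", human_themes - ai_themes)
print("AI-only themes:", ai_themes - human_themes)
```

The two set differences correspond directly to the paper's qualitative categories of "complementary capture": themes only the human coders produced and themes only ChatGPT produced.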

The context for the guaranteed income topic is analytically interesting: the interviews dealt with social justice, economic vulnerability, and lived experience of financial marginalization — data rich in affective and social meaning that phenomenological analysis is designed to capture.

AI’s Role

AI is positioned as a supplementary analyst — an additional perspective that can surface patterns the human coder, immersed in a specific analytical frame, might not foreground. The paper’s most important finding is that the complementarity goes in both directions: humans noticed things ChatGPT missed, and ChatGPT noticed things humans missed.

The “things humans missed” finding is epistemologically provocative. It suggests that the researcher’s analytical frame — in this case, Colaizzi’s phenomenological orientation — is not neutral. It directs attention toward certain patterns and away from others. AI, without a frame, may surface what the frame suppresses. This is the “explicatory via failure” and triangulation argument that later papers develop more formally (anis-french-ai-qualitative-research-2023, bennis-ai-thematic-analysis-2025).

The paper does not claim that ChatGPT’s additional themes are interpretively superior — it claims they are worth attending to as potential oversights or alternative framings.

Epistemological Stance

Implicitly interpretivist, operating within a phenomenological tradition. Colaizzi’s method is explicitly human-centered: the researcher’s immersion in the data and relationship to participants’ words are constitutive of the analysis. Using ChatGPT alongside this method creates a productive tension: the phenomenological analysis emphasizes depth and experiential fidelity; the AI analysis emphasizes breadth and pattern detection.

The paper does not theorize this tension explicitly — it is exploratory and descriptive. But the choice to use ChatGPT alongside (not instead of) a rigorous phenomenological method is implicitly a claim about the relationship between depth and breadth in qualitative analysis.

Rigor and Trustworthiness

By later standards, the evaluation approach is thin: no reliability metrics, no systematic theme-by-theme comparison procedure, no documentation of the prompts used. The comparison rests on researcher judgment: these themes appear in both analyses; these only in the human analysis; these only in ChatGPT's.

This is appropriate for an exploratory study conducted in early 2023, before the field had developed evaluation infrastructure. The paper’s contribution is not methodological precision but empirical documentation of a phenomenon: in a direct comparison, AI and human analysis produce partially overlapping, partially divergent results — and the divergence is as interesting as the overlap.

The paper is also transparent about ChatGPT’s limitations at the time: the model occasionally fabricated information, was inconsistent across runs, and produced themes at a level of abstraction different from the phenomenological human analysis.

Limitations

The GPT version tested is not specified precisely — “ChatGPT” in early 2023 refers to GPT-3.5, which has been substantially outperformed by subsequent versions. The specific reliability and theme-generation failures the paper notes may not characterize current models.

The comparison is not symmetrical: the human analysis used a defined method (Colaizzi) while the AI analysis was unsupervised and unconstrained. A more controlled comparison would specify the same analytic task for both. The divergence may reflect methodological differences as much as capability differences.

The paper does not address reproducibility: would ChatGPT produce the same themes on the same data in a second run? This is a fundamental reliability question that the paper cannot answer with a single-session design.
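The reproducibility check the single-session design cannot perform would be straightforward to specify: run the model several times on the same data and measure pairwise agreement between the resulting theme sets. A minimal sketch, with invented placeholder theme sets standing in for repeated runs:

```python
# Sketch of a multi-run consistency check (not performed in the paper).
# The theme sets below are hypothetical placeholders for repeated runs.
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index between two theme sets; defined as 1.0 for two empty sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

# Three hypothetical runs of the same prompt over the same interview data.
runs = [
    {"financial security", "stress relief", "community support"},
    {"financial security", "stress relief", "employment"},
    {"financial security", "community support", "stress relief"},
]

# Pairwise agreement across all run pairs; low values would indicate
# the run-to-run inconsistency the paper could only note anecdotally.
pairwise = [jaccard(a, b) for a, b in combinations(runs, 2)]
mean_agreement = sum(pairwise) / len(pairwise)
print(f"mean pairwise Jaccard across runs: {mean_agreement:.2f}")  # 0.67 here
```

A mean pairwise agreement well below 1.0 would quantify the inconsistency the paper observes informally, turning "ChatGPT was inconsistent across runs" into a measurable claim.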

Connections