| url | https://doi.org/10.1177/10778004251412874 |
|---|---|
| raw | raw/ayik-et-al-2026-human-vs-ai-evaluating-thematic-analysis-with-chatgpt-qinsights-atlas-ti-ai-and-maxqda-ai-assist.pdf |
TL;DR: The first head-to-head empirical comparison of four AI tool types — a general-purpose chatbot (ChatGPT-4o), a dedicated qualitative AI platform (QInsights), and two AI-enhanced CAQDAS systems (ATLAS.ti AI and MAXQDA AI Assist) — against a validated human-coded thematic analysis of K-12 STEM teacher data. Key finding: no hallucinations with careful prompting; MAXQDA closest to human analysis (3/6 exact matches); ATLAS.ti and ChatGPT exhibit post-positivist epistemological tendencies; QInsights and MAXQDA lean interpretivist. The paper argues that AI tool choice should align with the researcher’s epistemological stance.
What it means
Most comparative studies in this corpus pit ChatGPT against human analysis, or compare different GPT models. This paper takes a different approach: it compares four structurally distinct AI tools representing the dominant archetypes qualitative researchers currently encounter — a general-purpose LLM, a purpose-built qualitative AI platform, and two traditional CAQDAS systems with embedded AI. The human-coded baseline (six themes, validated through a prior study) allows for a structured comparison of not just which tool got which themes, but how each tool’s analytic logic shapes what it sees and how it frames it.
The dataset — 25 K-12 STEM teachers discussing formative assessment with multilingual learners — was chosen for its thematic richness and prior validated analysis; importantly, that analysis was still under manuscript review when the AI analyses were conducted, so the tools could not have been trained on the published version. All analyses used the same five standardized prompts across all tools. The lead researcher (Ayik, who had conducted the original human analysis) led all AI analyses, preserving analytic consistency.
The most methodologically interesting contribution is not the reliability comparison but the epistemological analysis in the final section: the authors read back from observable tool behavior to infer each tool’s implicit epistemological orientation. This is rare in the corpus and connects to debates in epistemology and paulus-marone-qdas-discourse-2024 about what QDAS tools structurally assume about knowledge.
The four tools and their logics
ChatGPT-4o (general-purpose LLM): Flexible, prompt-driven, context-aware. Operates through next-token prediction; generates codes and themes from conversational prompts. Retains memory across sessions — the researchers used a trusted external account to prevent prior context from contaminating the analysis. Result: 1 exact match, 5 partial matches (16.7% / 83.3%). Strengths: Flexibility, speed, produces thematic maps. Weaknesses: Relied on word frequency rather than conceptual depth; treated a single teacher’s repeated utterances across two documents as multiple respondents; produced a low-quality thematic map with truncated labels and missed inter-theme connections.
QInsights (dedicated qualitative AI platform): Corpus-level dialogic questioning — positioned as “QDA without coding” (Friese’s CAAI concept). Themes were generated automatically before research questions were shared, reflecting an inductive/exploratory stance. Result: 1 exact match, 5 partial matches. Strengths: Clearer visual structure; better suited to a conversational analysis mode. Weaknesses: Themes generated before the researcher provided research questions (no explicit research purpose); sub-theme codes largely repeated the sub-theme wording; the thematic map did not correspond to the generated themes. Better for dialogue with the corpus than for end-to-end TA with codes.
ATLAS.ti AI (CAQDAS with embedded AI): Researchers submitted research goals; the tool generated 1,001 codes with 504 quotations in under 2 minutes. Two-feature structure: AI Coding (frequency-based) and Conversational AI (more integrative). Result: 1 exact match, 5 partial matches. The only tool that did not capture multimodality as a theme. The initial 1,001 codes aligned well with human codes but were not grouped into sub-themes; Conversational AI codes largely repeated the sub-themes. The thematic map was structurally unhelpful — it only listed RQs without showing theme relationships.
MAXQDA AI Assist (CAQDAS with assist-and-review model): The standout performer. Required the researcher to engage with the dataset before generating outputs. The chat feature could not process six separate documents simultaneously, so the datasets were combined into one Word file. Result: 3 exact matches, 1 partial match — 4 of 6 themes reproduced (50% exact match rate). Differences were “within expected bounds — comparable to those one might see if another human researcher conducted the analysis.” The thematic map was visually complex but included an explanatory relationship summary identifying inter-theme connections. Closest to human analysis across all dimensions.
Reliability summary
| Tool | Exact matches | Partial matches | Match rate |
|---|---|---|---|
| ChatGPT-4o | 1/6 (16.7%) | 5/6 (83.3%) | Moderate |
| QInsights | 1/6 (16.7%) | 5/6 (83.3%) | Moderate |
| ATLAS.ti AI | 1/6 (16.7%) | 5/6 (83.3%) | Moderate |
| MAXQDA AI Assist | 3/6 (50%) | 1/6 (16.7%) | Highest |
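The table's percentages follow directly from the six-theme human baseline; as a minimal sketch of the arithmetic (counts are from the table above, the helper name is my own):

```python
def match_rates(exact: int, partial: int, total: int = 6) -> dict:
    """Percentage of the human-validated themes matched exactly or partially."""
    return {
        "exact_pct": round(100 * exact / total, 1),
        "partial_pct": round(100 * partial / total, 1),
    }

# Counts from the reliability table above
results = {
    "ChatGPT-4o": match_rates(1, 5),
    "QInsights": match_rates(1, 5),
    "ATLAS.ti AI": match_rates(1, 5),
    "MAXQDA AI Assist": match_rates(3, 1),
}
```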
Hallucinations: None across all four tools — attributed to careful prompting: “Please do not incorporate any external sources or data beyond these.” The contrast with jowsey-frankenstein-ai-ta-2025 (58% fabricated quotes) is striking and underlines the prompt-sensitivity of hallucination risk.
Time: All tools dramatically faster than the 150+ hours required for human analysis. ChatGPT and QInsights: minutes. MAXQDA: longer due to technical limitations (5,000-character constraint for AI Coding on selected passages).
The epistemological analysis
The paper’s most distinctive contribution: reading AI tool behavior as evidence of epistemological orientation.
ATLAS.ti AI and ChatGPT → post-positivist tendencies. Both emphasized frequency and pattern recurrence. ChatGPT treated a single teacher’s repeated comments as multiple data points — a word-count logic, not a conceptual one. ATLAS.ti’s AI Coding feature generated 1,001 codes by emphasizing frequently co-occurring concepts; this efficiency does not demonstrate interpretive or reflexive reasoning. The authors read this as “post-positivist orientation.”
QInsights and MAXQDA → interpretivist tendencies. QInsights generated themes without explicit code production or frequency-based computation, aligning with a dialogic/interpretive stance when unconstrained. MAXQDA’s initial response engaged with the data before producing code suggestions; no frequency-based reasoning was observed. Both foregrounded researcher interaction.
Practical implication: The epistemological stance of the researcher and the chosen AI tool should ideally be congruent. Mixed-methods researchers may gravitate toward ATLAS.ti and ChatGPT (frequency-based); critical qualitative scholars may prefer QInsights and MAXQDA (conversational/dialogic). The choice of tool is a methodological decision with epistemological consequences — not a neutral selection of instruments.
The paper frames this as a “paradigm shift” rather than an “epistemological clash” — a more cautious and empirically grounded version of the optimist position in the contested-claims debate.
Epistemological stance
Pragmatist with interpretivist commitments. The researchers take Braun and Clarke’s (2006) six-phase reflexive TA as the methodological standard, and they engage with the epistemological implications of tool choice explicitly. They cite Jowsey et al. (2025) as raising legitimate concerns, but reframe the question: the issue is not whether AI should participate in TA at all, but how its analytic logic interacts with the epistemological commitments of the specific approach. Their position: AI should complement, not replace, human judgment; tool choice should align with epistemological stance; reflexivity is not automated.
Virginia Braun’s (2025) keynote position is also noted: “a conscientious objection to AI,” arguing that AI-in-TA discourse rests on post-positivist realism aiming to improve efficiency, and that technological opportunity is not methodological improvement. The paper neither endorses nor rejects this view but treats it as part of the methodological landscape.
Limitations
- Single researcher led all analyses — Ayik conducted both the human and AI analyses; this ensures consistency but reduces independence
- Standardized prompts may underuse tool-specific strengths — tools were compared on identical prompts for fairness, but each tool has distinct affordances that standardization constrained
- Version-specific — all analyses in July 2025; AI tools update rapidly; results may differ on later versions
- Non-deterministic outputs — different runs may produce different results; findings are point-in-time snapshots
- Epistemological inferences are behavioral, not architectural — conclusions about post-positivist vs. interpretivist tendencies are based on observed behavior, not knowledge of internal model architectures
- Human baseline is itself an interpretation — the validated human-coded analysis is not a neutral ground truth; it reflects the research team’s lenses
- Data security unresolved — sending confidential qualitative datasets to external services remains a governance concern not resolved by the study
Connections
- Compared to human TA validated baseline connects to: validity-trustworthiness — the reliability/validity distinction; this study is primarily reliability-focused but raises validity questions through epistemological analysis
- No hallucination finding contrasts with: jowsey-frankenstein-ai-ta-2025 — 58% fabrication; the difference is careful prompting constraints
- ATLAS.ti and ChatGPT post-positivist tendency connects to: chatzichristos-ai-positivism-2025 — empirical evidence that AI tools import positivist assumptions into interpretivist disciplines
- QDAS epistemological analysis connects to: paulus-marone-qdas-discourse-2024 — discourse analysis of QDAS marketing; this paper provides behavioral evidence that QDAS platforms operationalize different epistemologies
- Reliability metrics: intercoder-agreement — the exact/partial match framework used here is less quantified than Cohen's κ but analogous
- Tool-epistemology alignment principle connects to: epistemology — Nicmanis & Spurrier’s call to match AI methods to epistemological commitments
- Human-AI roles framework: human-ai-collaboration — the four researcher roles (manager, colleague, teacher, advocate) from Thominet et al. (2024) are introduced as a framing device
- Claim 1 (AI quality) evidence: contested-claims — provides empirical data on performance variability by tool type; partial convergence is the dominant finding across all tools
- QInsights as dialogic tool connects to: friese-caai-framework-2026 — CAAI framework; QInsights appears to be the commercial implementation of the CAAI approach
- Contextual theme performance connects to: sakaguchi-chatgpt-japanese-2025 — nuanced/culturally embedded themes underperformed; this study shows contextually complex themes (teacher stance, intertwined RQs) are missed more than descriptive themes
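The intercoder-agreement connection above can be made concrete: where this study scores exact/partial theme matches, κ-style metrics quantify chance-corrected agreement on code assignments. A minimal Cohen's κ sketch for two coders labeling the same segments (the labels and data below are invented purely for illustration):

```python
from collections import Counter

def cohens_kappa(coder_a: list, coder_b: list) -> float:
    """Cohen's kappa for two coders' categorical labels on the same segments."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: proportion of segments with identical labels
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement: product of each label's marginal proportions
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[l] * freq_b[l] for l in set(coder_a) | set(coder_b)) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical codes for 10 data segments
a = ["assessment", "language", "assessment", "equity", "language",
     "assessment", "equity", "language", "assessment", "equity"]
b = ["assessment", "language", "language", "equity", "language",
     "assessment", "equity", "equity", "assessment", "equity"]
```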
What links here
- Contested Claims
- Empirical Findings
- Epistemology — Stances Across the Literature
- AI in Qualitative Research
- Human-AI Collaboration — Frameworks and Models
- Index
- LLMs for Qualitative Research
- Paulus & Marone (2024) — "In Minutes Instead of Weeks": Discursive Constructions of Generative AI and Qualitative Data Analysis
- Qualitative AI Methods — A Living Taxonomy
- Validity and Trustworthiness