| url | https://doi.org/10.2196/71521 |
|---|---|
| raw | raw/Sakaguichi_jmir-2025-1-e71521.pdf |
TL;DR: ChatGPT-4 achieves >80% agreement with human researchers on descriptive themes in Japanese clinical interview data, but drops to approximately 30% on culturally and emotionally embedded themes such as “fate” and “difficult to answer.” This is the most direct evidence in the corpus that high AI-human concordance in English-language studies cannot be assumed to generalize across languages and cultures.
Problem
The majority of empirical studies on AI-assisted qualitative analysis use English-language data in Western research contexts. The reliability figures they produce — κ = 0.72–0.82 in Bijker et al., Jaccard = 1.00 in Bennis & Mouwafaq — are specific to those conditions. Whether they generalize to non-English languages, non-Western cultural contexts, or research traditions with different epistemological groundings is an open question the English-centric literature has not addressed.
Sakaguchi et al. address this gap through a specifically Japanese clinical context: 30 semi-structured interviews with healthcare providers and patients at urban and rural Japanese hospitals, centered on “sacred moments” in clinical practice. The choice of topic is methodologically significant: sacred moments are by definition culturally embedded, spiritually resonant, and linguistically expressed through Japanese idiom. If AI struggles anywhere, it should struggle here.
The research context — healthcare providers and patients in Japan — also adds a practical dimension: AI-assisted analysis holds particular promise for clinical research settings where qualitative analysis capacity is limited but where culturally sensitive interpretation is essential.
Approach
The study conducts a comparative qualitative analysis:
Human analysis: Reflexive thematic analysis performed by experienced Japanese-language researchers familiar with the clinical and cultural context. The same data were also analyzed using Charmaz’s grounded theory and Pope’s five-step framework to assess consistency across methods.
AI analysis: ChatGPT-4 applied to the same interview transcripts (in Japanese) using iterative prompts. Thematic agreement between AI and human outputs was calculated for each theme.
The two-site design (urban community hospital and rural university hospital) introduces contextual variation that tests whether agreement rates are consistent across settings.
The iterative prompting approach — refining prompts based on initial AI output — is consistent with best-practice guidance in the corpus (bijker-chatgpt-qca-2024, yang-gpt4-qualitative-guide-2025) and means the study’s results reflect optimized, not first-pass, AI performance.
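The paper's per-theme agreement calculation can be sketched in miniature. The snippet below is a hedged illustration, not the authors' actual procedure: it assumes human and AI coding are represented as sets of transcript-segment IDs per theme and measures overlap with Jaccard similarity (the metric used in Bennis & Mouwafaq); the theme names and segment IDs are hypothetical, chosen only to mirror the descriptive-vs-cultural split.

```python
# Hedged sketch of per-theme AI-human agreement via Jaccard similarity.
# Theme names and segment IDs are hypothetical illustrations, not the
# paper's actual coding output.

def jaccard(human: set, ai: set) -> float:
    """Jaccard similarity: |intersection| / |union| (1.0 if both empty)."""
    if not human and not ai:
        return 1.0
    return len(human & ai) / len(human | ai)

# Hypothetical segment IDs coded under each theme by humans vs. ChatGPT-4.
human_codes = {
    "gratitude": {1, 2, 3, 5, 8},   # descriptive theme
    "fate":      {4, 6, 7, 9, 10},  # culturally embedded theme
}
ai_codes = {
    "gratitude": {1, 2, 3, 5},      # near-complete overlap with human coding
    "fate":      {4, 11, 12},       # mostly misses the human coding
}

for theme in human_codes:
    score = jaccard(human_codes[theme], ai_codes[theme])
    print(f"{theme}: {score:.2f}")
```

Reporting one score per theme, as here, is what makes the descriptive/cultural divergence visible at all.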
AI’s Role
AI is positioned as an auxiliary tool with bounded capability — useful for efficient identification of descriptive themes in Japanese-language data, but substantially limited when cultural knowledge, emotional register, and context-dependent linguistic structures are required for interpretation.
The paper neither rejects AI as useless in Japanese clinical contexts nor endorses it wholesale; instead, it specifies precisely where it can and cannot be trusted. This bounded endorsement is more informative than either blanket acceptance or blanket rejection.
Epistemological Stance
Post-positivist, within a clinical research framework. Thematic agreement rates are the primary evidence, and the reference standard is the human expert analysis. The paper does not engage with interpretive or constructionist epistemologies — its evaluation logic is concordance-based.
This is consistent with the clinical medical informatics tradition within which JMIR publishes. The paper is asking whether AI is reliable enough to be useful in clinical research workflows, not whether it can be epistemologically integrated into reflexive qualitative inquiry.
Rigor and Trustworthiness
The rigorous element is the theme-level breakdown: rather than reporting a single aggregate agreement figure, the paper identifies specific themes where AI performance was high (>80%) and specific themes where it collapsed (~30%). This granularity is methodologically important — aggregate figures would obscure the cultural/descriptive split.
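The masking effect is easy to demonstrate with invented numbers. The sketch below assumes hypothetical per-theme agreement rates and segment counts, chosen only to echo the paper's >80% vs. ~30% pattern; a segment-weighted aggregate comes out looking uniformly respectable even though two themes have collapsed.

```python
# Hedged sketch: a single aggregate agreement figure obscures the
# descriptive/cultural split. Rates and segment counts are hypothetical.

themes = {
    # theme: (agreement_rate, n_coded_segments)
    "descriptive A":       (0.85, 40),
    "descriptive B":       (0.82, 35),
    "fate":                (0.30, 10),
    "difficult to answer": (0.28, 8),
}

total = sum(n for _, n in themes.values())
aggregate = sum(rate * n for rate, n in themes.values()) / total
print(f"aggregate agreement: {aggregate:.2f}")
# A reader seeing only the aggregate would conclude "moderately reliable"
# and miss that two themes sit near 30%.
```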
The multiple-method verification (applying both Charmaz’s grounded theory and Pope’s framework alongside reflexive TA) provides triangulation support for the human analysis as a reference standard.
The identification of the mechanism — that Japanese structural and contextual complexity, including honorifics, indirection, and culture-specific concepts like enryo (restraint) and amae (indulgence), creates failure conditions for AI trained primarily on English data — is the paper’s most important theoretical contribution.
Limitations
The 30-interview dataset, while appropriate for qualitative research, is a limited empirical base for claims about AI performance across the full range of Japanese-language qualitative research. The topic (sacred moments) is both culturally rich and relatively abstract — AI performance might differ on more mundane clinical topics.
The paper does not test whether Japanese-language or Japanese-fine-tuned LLMs perform differently on the same data; the failure may therefore be GPT-specific rather than a general limitation of AI in non-English contexts.
The prompting strategy used is described but not published in full detail. Reproducibility of the AI analysis is therefore limited.
Connections
- llm-qualitative-research — broader landscape; this paper extends the evidence base to non-English research contexts
- epistemic-flattening — Sakaguchi operationalizes a specific dimension of this concept: AI’s training on English-dominant data creates systematic blind spots for non-Western cultural meaning
- intercoder-agreement — the agreement-rate methodology used throughout; the 30% vs. 80% split is the most dramatic reliability divergence by theme type in the corpus
- bijker-chatgpt-qca-2024 — high-concordance study for contrast; like bennis-ai-thematic-analysis-2025, it uses English-language data in a Western context; Sakaguchi's data extends and qualifies those findings
- bennis-ai-thematic-analysis-2025 — the Jaccard = 1.00 benchmark; Sakaguchi shows this is not generalizable to Japanese clinical data
- anis-french-ai-qualitative-research-2023 — the equity argument; AI’s non-English failure has equity implications for researchers and communities outside English-language research traditions
- dahal-genai-qualitative-nepal-2024 — parallel evidence from a Global South context; together these papers establish that AI performance is culturally and linguistically bounded
- validity-trustworthiness — validity in culturally specific research requires cultural competence that current English-dominant LLMs lack
What links here
- Ayik et al. (2026) — Human vs. AI: Evaluating TA With ChatGPT, QInsights, ATLAS.ti AI, and MAXQDA AI Assist
- Contested Claims
- Dahal (2024) — How Can Generative AI Enhance or Hinder Qualitative Studies? A Critical Appraisal from South Asia, Nepal
- Dellafiore et al. (2025) — Artificial Intelligence in Qualitative Research: Insights From Experts via Reflexive Thematic Analysis
- Empirical Findings
- Fischer & Biemann (2024) — Exploring Large Language Models for Qualitative Data Analysis
- AI in Qualitative Research
- Index
- Montrosse-Moorhead (2023) — Evaluation Criteria for Artificial Intelligence
- Qualitative AI Methods — A Living Taxonomy
- Validity and Trustworthiness
- Wise et al. (2026) — Why AI is Not the Enemy: Trustworthy AI-in-the-Loop Analysis