Source
url: https://doi.org/10.1177/16094069261435579
raw: raw/wise-et-al-2026-why-ai-is-not-the-enemy-opportunities-to-strengthen-core-commitments-of-qualitative-inquiry-through.pdf

TL;DR: The most technically grounded paper in this corpus. Wise, Gresalfi, and Spencer-Smith (Vanderbilt) introduce “AI-in-the-loop analysis” as a reframing of GenAI’s role — not for efficiency or automation, but for deepening core qualitative commitments. Crucially, they map specific LLM architectural properties (pre-training, attention mechanisms, long context, embeddings, auto-regression) to five epistemological commitments of interpretive qualitative research, then connect these to established trustworthiness criteria. The result is the most operationally rigorous case for AI-assisted qualitative research in the current literature.

What it means

The paper enters the conversation at a specific moment: after Jowsey et al.'s categorical rejection (jowsey-et-al-2025-we-reject) and amidst widespread concern that AI-assisted qualitative research cedes interpretive judgment to machines. Wise et al. accept the epistemological critique of efficiency-focused AI use — they cite paulus-marone-qdas-discourse-2024 approvingly — while arguing that the debate has been poorly framed. The problem is not AI in qualitative research; it is AI used for the wrong reasons and in the wrong ways.

The terminological move is important. They coin “AI-in-the-loop analysis” in deliberate contrast to “human-in-the-loop” (the ML community’s term). Human-in-the-loop describes AI processes where human judgment is occasionally inserted to check or correct AI outputs; the human is incidental to an AI-driven pipeline. AI-in-the-loop inverts this: it describes analytic processes driven by human sensemaking and interpretation, into which computational capabilities are intentionally incorporated. The locus of agency is clarified from the start.

The paper’s epistemological home is Guba and Lincoln’s (1989) relativist paradigm — interpretive, inductive, context-sensitive, committed to thick description. This is not a pragmatist or post-positivist paper. It makes the stronger claim that AI can serve genuinely interpretivist research, not merely that AI works somewhere for some loosely specified kind of qualitative work.

The framework: five commitments and their LLM matches

Wise et al. identify five core commitments of qualitative analysis and then systematically map LLM properties to each:

The five commitments:

  i. Close attention to data details in multiple iterations → layering of meaning
  ii. Immersion in the data with awareness of larger context at all times
  iii. Attention to context surrounding and beyond the data to interpret meaning
  iv. Positionality as analytical resource — subjectivity surfaces and conceptualizes patterns
  v. Multiple researchers’ perspectives in dialogue to generate varied interpretations

The LLM properties and their connections:

Large-scale pre-training — LLMs encode broad socio-cultural knowledge across trillions of tokens. This supports contextually grounded interpretation even when meaning is not explicit in the data (commitment iii) and enables eliciting specific theoretical perspectives or positionalities from the model (commitments iv and v).

Attention mechanisms — LLMs continuously recontextualize all tokens in the model context as new information is added. This enables focused attention to specific details while maintaining awareness of the full corpus (commitment ii) and supports iterative meaning-layering across multiple passes (commitment i).
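The recontextualization point can be made concrete with a toy scaled dot-product attention pass in pure Python (an illustration of the mechanism only, not of any production model): adding one new token to the context changes the attention-weighted representation of every earlier token.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention: every query token mixes
    information from ALL key/value tokens currently in context."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy 2-d "token" vectors; extending the context with one new token
# changes the computed representation of the ORIGINAL tokens too.
tokens = [[1.0, 0.0], [0.0, 1.0]]
before = attend(tokens, tokens, tokens)
extended = tokens + [[1.0, 1.0]]
after = attend(tokens, extended, extended)
```

The same query tokens yield different outputs once the context grows, which is the mechanical basis for the claim that the whole corpus stays analytically "live" as new material is added.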

Rich embeddings — semantic representations capture relational distances between concepts, supporting multi-scale thematic analysis across iterations (commitment i) and multiple analytic perspectives (commitment iv).
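"Relational distance" here is typically measured as cosine similarity in embedding space. A toy sketch with hand-made three-dimensional vectors (real model embeddings have hundreds or thousands of dimensions, and the concept labels below are invented for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hand-made toy "embeddings"; in practice these come from a model.
# Relational distance is what lets nearby codes cluster into themes.
emb = {
    "frustration": [0.9, 0.1, 0.2],
    "anger":       [0.8, 0.2, 0.1],
    "curiosity":   [0.1, 0.9, 0.7],
}

near = cosine(emb["frustration"], emb["anger"])      # semantically close
far = cosine(emb["frustration"], emb["curiosity"])   # semantically distant
```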

Auto-regressive generation with adjustable temperature — LLMs can produce varied responses to the same prompt. Higher temperature settings broaden interpretive range, functioning analogously to multiple researchers with different positionalities (commitments iv and v). This supports exploration rather than convergence.
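The temperature effect reduces to how logits are scaled before the softmax that produces the next-token distribution. A minimal sketch with toy logits (not tied to any particular model or API):

```python
import math

def temperature_softmax(logits, temperature):
    """Convert next-token logits to probabilities. Temperature > 1
    flattens the distribution (more varied sampling, broader
    interpretive range); < 1 sharpens it toward the top token."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
cold = temperature_softmax(logits, 0.2)  # near-deterministic
hot = temperature_softmax(logits, 2.0)   # broadened, exploratory
```

At low temperature the top candidate absorbs nearly all the probability mass; at high temperature the alternatives stay in play, which is the mechanical analogue of admitting multiple plausible readings.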

Long-context capabilities — current models can hold entire research corpora (128K–1M tokens; roughly 300–2500 pages) in context at once. This allows analysis of any specific moment within the full dataset context (commitment ii), without the selective sampling that researchers typically must impose on large corpora.
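The token-to-page figures are easy to sanity-check. Assuming roughly 400 tokens per transcript page (a common rule of thumb, not a number taken from the paper):

```python
# Back-of-envelope check of the context-to-pages conversion,
# assuming ~400 tokens per page of transcript text.
TOKENS_PER_PAGE = 400

def pages(context_tokens, tokens_per_page=TOKENS_PER_PAGE):
    """Approximate page capacity of a given context window."""
    return context_tokens / tokens_per_page

small = pages(128_000)    # ≈ 320 pages
large = pages(1_000_000)  # = 2500 pages
```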

Trustworthiness: connecting to established criteria

The paper’s most distinctive contribution is linking AI-in-the-loop practices to Lincoln and Guba’s trustworthiness criteria:

  • Credibility — establishing full corpus context from the start ensures interpretations are situated in the whole, not idiosyncratic data subsets
  • Dependability — AI can be used to probe temporal stability of themes (“AI-supported temporal audit”)
  • Confirmability — multiple passes with different personas and analytic tasks; AI can systematically surface both confirming and disconfirming instances for emergent conjectures — a more thorough triangulation than is typically feasible manually
  • Transferability — detailed documentation of iterative prompting processes in methods sections
  • Authenticity — AI can be prompted to search specifically for underrepresented voices and experiences not yet surfaced, checking whether certain kinds of participants are being systematically missed

They also propose a new criterion: transparency in analytic decision-making — explicit documentation of what data context was included, which models and parameters were selected, and how iterative prompting was enacted. This directly addresses the concern about AI use appearing “neutral” or “objective.”
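What such transparency documentation could look like in practice, sketched as a minimal machine-readable audit record (the field names, model name, and prompt text are all hypothetical, not prescribed by the paper):

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class AnalysisPass:
    """One documented AI-in-the-loop pass: what data context was
    included, which model and parameters were used, and the exact
    prompt. Field names are illustrative."""
    model: str
    temperature: float
    context_description: str
    prompt: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

log = [
    AnalysisPass(
        model="example-llm-v1",  # placeholder model identifier
        temperature=1.2,
        context_description="full interview corpus, 14 transcripts",
        prompt="As an experienced K-12 science teacher, identify ...",
    )
]

# Serialize the audit trail, e.g. for a methods-section appendix.
audit_json = json.dumps([asdict(p) for p in log], indent=2)
```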

Prompting strategies for qualitative commitments

The paper provides specific, practical prompting techniques tied to the qualitative commitments:

  • Personas — giving the LLM a role (“you are an experienced K-12 science teacher”) to adopt specific positionalities and surface patterns those positionalities would notice (commitment iv)
  • Flipped interaction — having the LLM ask the researcher questions to clarify analytic focus, which supports reflexivity
  • Chain-of-thought — staged prompts that require intermediate reasoning steps before conclusions, keeping analysis grounded in evidence (commitment i)
  • Templates — structured output formats (four-column theme tables including supporting AND challenging excerpts) that make reasoning inspectable
  • Rationale requests — prompting the model to justify each interpretive claim with specific textual evidence (commitment iii)
  • System instructions — consistent analytic posture maintained across an entire session; should be reserved for stable elements, not evolving ones
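Several of these strategies compose naturally in a single prompt. A sketch combining a persona, a chain-of-thought stage, the four-column template, and a rationale request (the wording is illustrative, not quoted from the paper):

```python
def build_analysis_prompt(persona, excerpt, focus):
    """Compose a prompt that stacks several of the strategies above.
    All phrasing is illustrative."""
    return "\n".join([
        f"You are {persona}.",                                   # persona
        f"Analytic focus: {focus}",
        "First, reason step by step about the patterns you "
        "notice before naming any theme.",                       # chain-of-thought
        "Then output a table with columns:",
        "theme | description | supporting excerpts | "
        "challenging excerpts",                                  # template
        "Justify each theme with specific quoted evidence.",     # rationale request
        "---",
        excerpt,
    ])

prompt = build_analysis_prompt(
    persona="an experienced K-12 science teacher",
    excerpt="Student: I don't get why the ice melts faster on metal...",
    focus="how students reason about heat transfer",
)
```

Requiring both supporting and challenging excerpts in the template is what makes the output inspectable against the confirmability criterion above.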

One technical note with methodological implications: they distinguish between direct context inclusion (the entire corpus in the model’s active context) and Retrieval-Augmented Generation (RAG), which selectively retrieves document fragments. RAG is explicitly discouraged for this approach — the model must be informed by the full dataset, not selectively retrieved portions outside researcher control.
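The difference is easy to see in miniature. Below, a toy contrast between direct context inclusion and a keyword-overlap stand-in for RAG retrieval (real RAG systems use vector search; the corpus and question here are invented for illustration):

```python
def full_context_prompt(corpus, question):
    """Direct context inclusion: the ENTIRE corpus travels with the
    question, so nothing is silently dropped."""
    return "\n\n".join(corpus) + "\n\nQuestion: " + question

def rag_prompt(corpus, question, k=2):
    """Toy RAG via keyword overlap: only the top-k fragments reach
    the model -- a selection made outside the researcher's control."""
    q_words = set(question.lower().split())
    scored = sorted(corpus,
                    key=lambda d: -len(q_words & set(d.lower().split())))
    return "\n\n".join(scored[:k]) + "\n\nQuestion: " + question

corpus = [
    "Transcript 1: students debate why metal feels colder than wood.",
    "Transcript 2: teacher introduces conduction with an ice demo.",
    "Transcript 3: students discuss the class garden project.",
]
q = "how do students reason about metal and conduction"
full = full_context_prompt(corpus, q)
partial = rag_prompt(corpus, q, k=2)
```

In the RAG version, Transcript 3 never reaches the model at all; for the paper's purposes that silent exclusion is exactly the problem.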

The WEIRD caveat

The paper is candid about LLM limitations in a way that strengthens rather than undermines the framework. It cites Bender et al. (2021) on WEIRD (Western, Educated, Industrialized, Rich, Democratic) bias in training data, and Hofmann et al. (2024) on AI generating covertly racist decisions based on dialect even after explicit debiasing training. The response is not to abandon AI but to extend the qualitative researcher’s habitual interrogation of positionality to the AI system itself — prompting models to adopt explicitly critical, bias-aware stances and making patterned biases in outputs visible. This frames AI bias not as a disqualifying defect but as an object of the same reflexive scrutiny applied to any analytic tool or researcher subjectivity.

Epistemological stance

Interpretivist/relativist, working explicitly within Guba and Lincoln’s fourth-generation evaluation framework. The authors are not post-positivists arguing for AI as measurement tool; they are interpretivists arguing that AI can be enlisted to deepen the specific practices — multiple iterations, contextual immersion, perspectival diversity — that interpretive research requires but structurally struggles to achieve at scale. The human researcher remains ground truth for interpretation; AI augments the researcher’s capacity for noticing, questioning, and synthesizing.

The paper is also implicitly post-humanist in accepting that the researcher-AI assemblage can produce analysis that neither alone could, without treating this as epistemologically threatening. This puts it in the same neighborhood as brailas-ai-qualitative-research-2025 and de-paoli-reject-rejection-2026, though the Wise et al. framing is more operationally grounded and less philosophically explicit.

Limitations

  • No empirical validation. The classroom science example is hypothetical. The framework is theoretically argued, not empirically tested. Whether the trustworthiness gains from AI-in-the-loop analysis actually materialize in practice is not demonstrated.
  • Assumes researcher technical sophistication. The prompting strategies described require meaningful understanding of LLM capabilities (temperature, model context vs. RAG, embedding space behavior). Most qualitative researchers are not trained in these concepts; the paper assumes a partnership with someone like the third author (a data scientist).
  • Long-context capabilities are rapidly evolving. The paper cites specific context sizes (128K–1M tokens) that were accurate at time of writing but will change. The framework’s arguments are context-size-dependent.
  • The WEIRD bias concern may be harder to operationalize than acknowledged. “Prompting the model to adopt bias-aware stances” may not effectively surface biases that are structurally embedded in training data — the model may be no more able to detect its own biases than a researcher who has internalized hegemonic assumptions.
  • Scope. The paper positions itself as addressing interpretive qualitative research but focuses heavily on a classroom discourse example that is somewhat atypical of the interview-heavy qualitative work in most social science fields.

Connections