Source
url: https://doi.org/10.1002/ev.20566
raw: raw/New Drctns Evaluation - 2023 - Montrosse‐Moorhead - Evaluation criteria for artificial intelligence.pdf

TL;DR: Derives eight criteria domains for evaluating AI use in program evaluation from Teasdale’s Criteria Domains Framework. Three criteria govern AI conceptualization and implementation (purpose alignment, methodological appropriateness, transparency); five govern outcomes (accuracy, credibility, equity, efficiency, ethical integrity). The equity criterion — specifically AI’s differential performance across population subgroups — is the most underemphasized concern in the rest of the corpus.

Problem

Program evaluation is a discipline with well-developed criteria for judging research quality — but at the time of writing, no one had proposed criteria specifically for evaluating AI use in evaluation practice. The AI-in-evaluation literature was emerging rapidly, but without shared criteria for judging whether specific AI applications were appropriate, rigorous, or trustworthy, the field was, as Montrosse-Moorhead puts it, “in the wild, wild west.”

The gap matters for a specific reason: criteria are not optional extras in evaluation — they are constitutive of evaluation practice. “We can’t evaluate without criteria” (Patton, 2021). Without criteria for AI use, researchers cannot make principled judgments about which AI applications are appropriate, reviewers cannot evaluate whether AI-assisted research is rigorous, and practitioners cannot defend their choices.

This same gap exists in the qualitative research literature. AI-TA papers are full of claims that AI-assisted approaches are “reliable,” “valid,” or “appropriate” — but rarely specify the criteria against which these judgments are made, or make them explicit enough for peer scrutiny. Montrosse-Moorhead provides a framework that fills this gap for the evaluation context and can be adapted for qualitative research.

Approach

The paper uses Teasdale’s Criteria Domains Framework as a critical reading lens applied to papers in a special issue on AI in evaluation. Teasdale’s framework provides a process for identifying evaluative criteria from systematic review of existing practice — rather than imposing criteria from outside, it derives them from what the field is already using implicitly.

Eight criteria domains emerge:

Three concern conceptualization and implementation:

1. Purpose alignment. Is AI being used for tasks it can legitimately perform, and are those tasks aligned with the evaluation’s goals? Purpose misalignment — using AI for tasks it performs poorly, or for tasks where AI involvement undermines the evaluation’s aims — is a fundamental rather than a technical failure.

2. Methodological appropriateness. Does AI use fit the evaluation design and data type? AI techniques that work well for survey data may be inappropriate for interview data; models trained on general corpora may be inappropriate for specialized domains. Appropriateness requires matching AI capabilities to research requirements.

3. Transparency. Is AI use documented clearly enough for peer scrutiny? This covers which tools were used, what tasks they performed, how outputs were validated, and what role they played in final interpretations. Without transparency, neither reviewers nor readers can assess any of the other criteria.

Five concern outcomes:

4. Accuracy. Does AI produce correct or valid outputs for the specific task? This is the criterion most often assessed in the empirical literature (κ, Jaccard, theme concordance rates).

5. Credibility. Are outputs trustworthy and backed by evidence of validation beyond the initial benchmark? Credibility is accuracy extended through multiple sources of evidence.

6. Equity. Does AI introduce differential performance across subgroups or populations? This is the most under-addressed criterion in the qualitative research literature. If AI performs better on English-language data (sakaguchi-chatgpt-japanese-2025) or on texts from dominant groups (epistemic-flattening), then AI-assisted analysis may systematically misrepresent marginalized communities’ experiences.

7. Efficiency. Does AI actually save time and resources once quality is held constant? The efficiency gain must be weighed against the validation burden and quality-assurance costs, not just the raw speed of AI processing.

8. Ethical integrity. Are participant rights, privacy, and consent adequately protected? This connects directly to davison-ethics-genai-2024's concerns about data ownership and to brailas-ai-qualitative-research-2025's concerns about re-identification.
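The reliability metrics named under the accuracy criterion (κ, Jaccard) are simple to compute. A minimal sketch with entirely hypothetical human and AI code assignments (the paper reports no code or data of its own):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Expected agreement if both raters labeled items independently at random
    # according to their own marginal label frequencies.
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

def jaccard(set_a, set_b):
    """Jaccard similarity: overlap of two applied code sets."""
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical data: one code per interview excerpt, human vs. AI.
human = ["barrier", "barrier", "support", "support", "barrier", "other"]
ai    = ["barrier", "support", "support", "support", "barrier", "other"]
print(round(cohens_kappa(human, ai), 3))  # → 0.739

# Code-set overlap for a single excerpt carrying multiple codes.
print(round(jaccard({"barrier", "cost"}, {"barrier", "cost", "stigma"}), 3))  # → 0.667
```

Note that κ corrects raw agreement for chance, which is why it satisfies criterion 4 (accuracy) more credibly than a bare percent-agreement or concordance rate.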
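The equity criterion becomes measurable once AI–human agreement is broken out by subgroup. A minimal sketch, again with entirely hypothetical data; the subgroup labels merely echo the language-disparity concern the criterion raises:

```python
# Hypothetical per-excerpt results: (subgroup, ai_label, human_label).
results = [
    ("english",  "barrier", "barrier"),
    ("english",  "support", "support"),
    ("english",  "barrier", "barrier"),
    ("english",  "other",   "support"),
    ("japanese", "barrier", "support"),
    ("japanese", "support", "support"),
    ("japanese", "other",   "barrier"),
    ("japanese", "barrier", "barrier"),
]

def accuracy_by_group(rows):
    """Agreement rate with the human reference, broken out by subgroup."""
    totals, hits = {}, {}
    for group, ai_label, human_label in rows:
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + (ai_label == human_label)
    return {g: hits[g] / totals[g] for g in totals}

rates = accuracy_by_group(results)
disparity = max(rates.values()) - min(rates.values())
print(rates, round(disparity, 2))
```

An aggregate accuracy of 0.625 here would mask a 25-point gap between subgroups, which is exactly the failure mode criterion 6 asks evaluators to check for.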

AI’s Role

AI appears in this paper as the object of systematic criteria-based evaluation — not as a tool being used, but as a practice being judged. The paper’s contribution is not guidance on how to use AI but criteria for evaluating whether its use was appropriate.

Epistemological Stance

Evaluation theory / pragmatist, within the program evaluation tradition. The paper draws on Scriven’s evaluation logic and Teasdale’s criteria framework — both grounded in the pragmatist tradition that holds criteria should be derived from practice rather than imposed from theory. This makes the framework applicable across epistemological traditions, since the criteria are functional rather than paradigmatic.

Rigor and Trustworthiness

The criteria derivation is explicitly grounded in Teasdale’s framework, providing a systematic rather than impressionistic basis for the eight domains. Deriving the criteria from a systematic reading of a special issue on AI in evaluation, rather than from abstract principles, means they reflect what the field is actually grappling with.

The paper explicitly acknowledges that this is “a first step” — more work is needed to deliberate the criteria, establish performance standards for each, and develop tools for applying them. This intellectual honesty about scope is methodologically appropriate.

Limitations

The evaluation context is specific: program evaluation with its own professional norms and stakeholder accountability structures. Some criteria (especially equity, which in evaluation often means ensuring that marginalized communities benefit fairly from evaluation findings) carry nuances that transfer imperfectly to academic qualitative research.

The eight criteria are identified but not operationalized. What constitutes “adequate” purpose alignment? What evidence satisfies the credibility criterion? Without performance standards, the framework identifies what to evaluate without specifying how good is good enough.

The paper was written in 2023, before many of the empirical AI-TA studies were published. The criteria are derived from early AI-in-evaluation papers rather than from the now-substantial evidence base on AI performance in qualitative research.

Connections

  • llm-qualitative-research — the broader context; the framework applies across the AI-assisted research landscape
  • ai-research-ethics — criteria 3 (transparency) and 8 (ethical integrity) directly address the ethics literature
  • epistemic-flattening — the equity criterion (6) operationalizes this risk; AI performance disparities across populations are a measurable equity concern
  • sakaguchi-chatgpt-japanese-2025 — the clearest empirical evidence of criterion 6 failure; AI differential performance by language/culture
  • intercoder-agreement — criteria 4 (accuracy) and 5 (credibility) are what reliability metrics like κ and Jaccard measure
  • validity-trustworthiness — the framework provides a systematic alternative to ad hoc trustworthiness claims
  • jowsey-frankenstein-ai-ta-2025 — empirical evidence of criteria 4 and 5 failure (fabrication) and criterion 7 failure (efficiency gains offset by validation burden)
  • bennis-ai-thematic-analysis-2025 — the high-concordance study; meets criteria 4 and 5 but raises equity criterion concerns about language/cultural universalizability