Source
url: https://aclanthology.org/2024.nlp4dh-1.41
raw: raw/Fischer_2024.nlp4dh-1.41.pdf

TL;DR: A technical NLP evaluation of three open-source LLMs (Llama 3.1, Gemma 2, Mistral NeMo) on QDA-specific tasks, integrated as an opt-in feature into the Discourse Analysis Tool Suite. Promising results for English; weaker for German. Represents a technically grounded alternative to commercial AI tools: open-source, locally deployed, privacy-preserving, with benchmark documentation.

Problem

Most AI-TA papers in this corpus evaluate commercial LLMs (ChatGPT, Bard, Claude) in the hands of social science researchers using conversational interfaces. Fischer & Biemann take a different entry point: they are NLP researchers building AI features into a research platform (the Discourse Analysis Tool Suite, DATS) used by Digital Humanities scholars at their university.

The problems they address are distinct from the methodological debates in the qualitative research literature. Their questions are engineering-oriented: which NLP tasks map onto QDA activities? Which open-source models perform best on a benchmark curated from real DH use cases? How should AI features be integrated into a research platform to preserve researcher control while improving efficiency?

The decision to work with open-source models is also a response to a specific set of concerns: commercial API costs, data privacy (research data sent to external servers), and the opacity of closed-source model architectures. Building AI features around locally deployed open-source models addresses all three.

Approach

Task mapping: Fischer & Biemann identify four NLP tasks embedded in the DATS platform’s QDA workflows (an illustrative prompt sketch follows the list):

  1. Document classification — sorting documents into predefined categories using metadata tags
  2. Information extraction — identifying and pulling relevant text segments, entities, or events
  3. Span classification — labeling sub-document text passages with codes (the most direct analog to qualitative coding)
  4. Text generation — summarizing, paraphrasing, generating memos
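
A minimal sketch of how these four task types could be phrased as prompts to a locally hosted instruction-tuned model. The prompt wording, label placeholders, and model checkpoint are assumptions for illustration; they are not the DATS or paper prompts.

```python
# Illustrative prompt templates for the four QDA task types listed above.
# Wording, labels, and checkpoint are assumptions, not the paper's implementation.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed checkpoint; any local instruct model works
    device_map="auto",
)

PROMPTS = {
    "document_classification": (
        "Assign exactly one of the labels {labels} to the following document:\n{text}\nLabel:"
    ),
    "information_extraction": (
        "List every person, place, and date mentioned in the passage below, one per line:\n{text}"
    ),
    "span_classification": (
        "Using the codebook {labels}, return each passage span that fits a code "
        "as '<span> -> <code>':\n{text}"
    ),
    "text_generation": (
        "Write a three-sentence analytic memo summarizing the passage below:\n{text}"
    ),
}

def run_task(task: str, text: str, labels=None) -> str:
    """Fill the template for one QDA task and query the local model."""
    prompt = PROMPTS[task].format(text=text, labels=labels or [])
    out = generator(prompt, max_new_tokens=256, return_full_text=False)
    return out[0]["generated_text"]
```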

Benchmark construction: Datasets from real DH research projects, covering both English and German, curated to align with actual use cases rather than generic NLP benchmarks. This is methodologically important: generic benchmarks may not reflect the specific demands of QDA (dense conceptual content, variable document length, researcher-defined categories).
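
One way to picture such a curated benchmark is as a collection of task- and language-tagged items drawn from DH projects. The field names and example values below are assumptions for illustration; the paper does not publish its data in this form.

```python
# Hypothetical benchmark-item structure (field names and values are illustrative).
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    task: str      # "document_classification", "information_extraction",
                   # "span_classification", or "text_generation"
    language: str  # "en" or "de"
    text: str      # document or passage from a DH research project
    gold: str      # researcher-defined label, span annotation, or reference text

items = [
    BenchmarkItem("document_classification", "en", "Full document text ...", "assumed-label"),
    BenchmarkItem("span_classification", "de", "Textpassage ...", "assumed-code"),
]
```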

Model evaluation: Three state-of-the-art open-source LLMs evaluated on the benchmark (a minimal local-evaluation sketch follows the list):

  • Llama 3.1 (Meta)
  • Gemma 2 (Google)
  • Mistral NeMo (Mistral AI)
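
A minimal sketch of a per-model, per-language comparison on such a benchmark. The Hugging Face checkpoint IDs are assumed, and the exact-match accuracy metric is a placeholder, not the paper's evaluation protocol.

```python
# Naive per-language scoring loop; checkpoints and metric are assumptions.
from collections import defaultdict
from transformers import pipeline

CHECKPOINTS = {
    "Llama 3.1": "meta-llama/Llama-3.1-8B-Instruct",
    "Gemma 2": "google/gemma-2-9b-it",
    "Mistral NeMo": "mistralai/Mistral-Nemo-Instruct-2407",
}

def evaluate(items):
    """Score each model per language with exact-match accuracy over benchmark items."""
    counts = defaultdict(lambda: [0, 0])  # (model, language) -> [correct, total]
    for name, checkpoint in CHECKPOINTS.items():
        generator = pipeline("text-generation", model=checkpoint, device_map="auto")
        for item in items:
            out = generator(item.text, max_new_tokens=64, return_full_text=False)
            prediction = out[0]["generated_text"].strip()
            key = (name, item.language)
            counts[key][0] += int(prediction == item.gold)
            counts[key][1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}
```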

Integration: Based on findings, an LLM Assistant was implemented as an opt-in feature in the DATS platform, available for English projects. The opt-in design is explicitly principled: it ensures researchers actively choose to engage AI assistance rather than having it embedded as a default workflow.

AI’s Role

AI is positioned as a workflow-enhancing assistant for routine tasks — reducing the labor burden on document classification and information extraction while leaving conceptual coding, interpretation, and memo-writing to researchers. The opt-in implementation reflects this: AI is available when useful, not imposed.

The platform context is important: DATS is built around Grounded Theory-based research (Strauss and Corbin 1990), where coding is iterative, theory-building, and deeply researcher-controlled. AI features that fit within this workflow enhance it without disrupting its epistemological commitments.

Epistemological Stance

NLP / computational linguistics, with implicit post-positivist evaluation logic. The paper evaluates model performance against benchmark datasets — a quantitative evaluation framework appropriate to the NLP conference venue (NLP4DH). It does not engage with qualitative research epistemology or the small-q/Big-Q debate.

This is appropriate given the paper’s scope and venue — it is contributing to the technical literature, not the qualitative methodology literature. The epistemological questions raised by AI integration into DH research are not addressed here; other sources in the corpus address them more directly.

Rigor and Trustworthiness

The benchmark design is the paper’s strongest methodological contribution: using real DH research data rather than generic NLP benchmarks ensures that the evaluation reflects actual use-case demands. A published benchmark of this kind also allows independent replication.

The open-source model selection is principled: all three models are publicly available, allowing researchers to reproduce the evaluation or adapt the benchmark for their own contexts.

The language breakdown (English vs. German) is methodologically important: it independently corroborates the non-English performance gap that sakaguchi-chatgpt-japanese-2025 documented for Japanese. Two papers from different traditions reaching the same finding strengthens the claim.

Limitations

The conference paper format limits depth: benchmark methodology, model evaluation, and integration details are all compressed. The benchmark itself (specific tasks, specific datasets, specific evaluation metrics) is described but not fully documented in the paper. Independent replication would require accessing the platform and benchmark directly.

The evaluation does not address the qualitative validity of AI outputs — whether span classifications or document categorizations reflect genuine interpretive understanding or surface-level pattern matching. This is the same limitation as other technical AI evaluations: performance on NLP benchmarks does not guarantee meaningful qualitative analysis.

The focus on English and German limits generalizability. DH research is conducted in many languages, and the performance gap between English (promising) and German (weaker) suggests that multilingual coverage would require substantially more work.

Connections