TürkResearcher

Turkish Academic Research Agent — LangChain + LangGraph multi-agent LLM grounded in 740K Turkish sources

Project Summary

The English-speaking research community has tools like Elicit and Consensus.app that let researchers ask a question and get an evidence-grounded answer with citations. None of them serve Turkish academic literature. TürkResearcher fills this gap: an open-source (MIT) multi-agent LLM system that takes a Turkish research question, retrieves evidence from a 740K-record Turkish academic corpus, and produces an IEEE-cited Turkish academic answer linking to real YÖK PDFs.

This was the final project for the Large Language Models course at Istanbul Medipol University (Track 1 — Novel Idea).

Architecture

A five-agent LangGraph state machine plus a LiveSearch fallback node. The Critic agent loops back to the Retriever when coverage is insufficient; after two iterations it falls through to LiveSearch (live queries against OpenAlex, Semantic Scholar, and DergiPark) for real-time augmentation.

QUESTION (TR)
      │
      ▼
[ PLANNER ]      → 3-5 sub-questions
      │
      ▼
[ RETRIEVER ]    → multi-query over 740K corpus (cosine, top-30)
      │
      ▼
[ SYNTHESIZER ]  → cluster findings, flag contradictions
      │
      ▼
[ CRITIC ]       → coverage_ok? ──No── RETRIEVER (loop ≤2)
      │                         or LIVE_SEARCH
      │ Yes                     (OpenAlex / SS / DergiPark)
      ▼
[ WRITER ]       → Turkish academic answer + IEEE citations
      │
      ▼
ANSWER (~30 citations, real tez.yok.gov.tr URLs)
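
A minimal sketch of this wiring in LangGraph, assuming a TypedDict state; the node bodies below are stubs (the real agents call the LLM), and only the routing logic mirrors the diagram:

    from typing import TypedDict
    from langgraph.graph import StateGraph, END

    class AgentState(TypedDict):
        question: str
        findings: list
        coverage_ok: bool
        iterations: int
        answer: str

    # Stub nodes; each real agent prompts the LLM and updates the state.
    def planner(state):     return {}
    def retriever(state):   return {"iterations": state["iterations"] + 1}
    def synthesizer(state): return {}
    def critic(state):      return {"coverage_ok": False}
    def live_search(state): return {}
    def writer(state):      return {"answer": "..."}

    def route_after_critic(state: AgentState) -> str:
        if state["coverage_ok"]:
            return "writer"          # coverage sufficient: write the answer
        if state["iterations"] < 2:
            return "retriever"       # loop back, at most twice
        return "live_search"         # fall through to live augmentation

    g = StateGraph(AgentState)
    for name, fn in [("planner", planner), ("retriever", retriever),
                     ("synthesizer", synthesizer), ("critic", critic),
                     ("live_search", live_search), ("writer", writer)]:
        g.add_node(name, fn)
    g.set_entry_point("planner")
    g.add_edge("planner", "retriever")
    g.add_edge("retriever", "synthesizer")
    g.add_edge("synthesizer", "critic")
    g.add_conditional_edges("critic", route_after_critic)
    g.add_edge("live_search", "writer")
    g.add_edge("writer", END)
    app = g.compile()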

Data

The corpus combines two sources of Turkish academic content:

Source                        Records    Collection method
YÖK National Thesis Center    633,998    Hugging Face Hub (CC-BY-4.0) → quality filter
DergiPark journal articles    106,641    OAI-PMH harvest (custom resumable scraper)
Total                         740,639    Single Chroma collection, cosine, mpnet-base-v2 (768-dim)
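
A hedged sketch of the resumable OAI-PMH loop behind a harvest like the DergiPark one; the endpoint URL is passed in, and record parsing is omitted:

    import time
    import requests
    import xml.etree.ElementTree as ET

    NS = "{http://www.openarchives.org/OAI/2.0/}"

    def harvest(base_url, params=None):
        # Start fresh, or resume by passing a saved resumptionToken in params.
        params = params or {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        while True:
            resp = requests.get(base_url, params=params, timeout=60)
            root = ET.fromstring(resp.content)
            yield from root.iter(f"{NS}record")
            token = root.find(f".//{NS}resumptionToken")
            if token is None or not (token.text or "").strip():
                break                          # harvest complete
            # Persisting token.text to disk is what makes the run resumable.
            params = {"verb": "ListRecords", "resumptionToken": token.text}
            time.sleep(1)                      # stay polite to the endpoint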

The index was built on a Colab T4 GPU and released openly on Hugging Face Hub (hakansabunis/tr-academic-research-agent-index, 16 GB).
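
A minimal sketch of the indexing step, assuming records carry id, abstract, and source fields (field and collection names are illustrative, not the project's exact schema):

    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(
        "sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
    client = chromadb.PersistentClient(path="./tr_academic_index")
    col = client.get_or_create_collection(
        name="tr_academic",                     # illustrative name
        metadata={"hnsw:space": "cosine"},      # cosine over 768-dim vectors
    )

    def index_batch(records):
        texts = [r["abstract"] for r in records]
        col.add(
            ids=[r["id"] for r in records],
            documents=texts,
            embeddings=model.encode(texts, normalize_embeddings=True).tolist(),
            metadatas=[{"source": r["source"]} for r in records],
        )

    # Retrieval side: top-30 nearest abstracts for one sub-question.
    q = "diyabet tedavisinde metformin"
    hits = col.query(query_embeddings=model.encode([q]).tolist(), n_results=30)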

Evaluation

A 30-question Turkish benchmark spanning 10 categories (health, education, engineering, law, computer science, business, etc.), scored on four LLM-as-judge metrics: citation accuracy, faithfulness, coverage, and holistic quality.
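
As an illustration, one judge call might look like the sketch below; the rubric wording and 0-1 scale are assumptions, not the project's exact prompts:

    from openai import OpenAI

    judge = OpenAI(base_url="https://api.deepseek.com", api_key="...")

    def judge_faithfulness(answer, sources):
        prompt = (
            "Rate from 0.0 to 1.0 how faithfully the answer below is "
            "supported by the cited sources. Reply with the number only.\n\n"
            f"SOURCES:\n{sources}\n\nANSWER:\n{answer}"
        )
        resp = judge.chat.completions.create(
            model="deepseek-chat",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return float(resp.choices[0].message.content.strip())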

I evaluated two configurations, 633K theses-only versus the full 740K corpus with DergiPark:

Metric               633K    740K    Δ
Citation accuracy    0.60    0.51    −0.10
Faithfulness         0.59    0.49    −0.10
Coverage             0.49    0.47    −0.03
Holistic             2.63    2.40    −0.23
# Citations          30.1    32.8    +2.7

Surprising Finding — "The Corpus-Expansion Paradox"

Naive corpus expansion is not always a free win: under-covered categories (CS, business) improved, while well-covered categories (health, engineering, law) regressed. Three contributing factors:

  1. Abstract length distribution shift — theses average ≈1,600 characters while journal abstracts average ≈500, leaving less surface area to ground claims (see the diagnostic sketch after this list).
  2. Citation inflation — the writer agent emits 2.7 more citations per answer on average, and each extra citation tends to be weakly grounded.
  3. Source mixing without source-aware writing — theses are broad and coherent while journal articles are narrow and empirical, and the writer prompt does not yet distinguish between them.
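
Factor 1 is easy to verify; a quick diagnostic sketch, assuming the same record fields as in the indexing example:

    from statistics import mean, median

    def abstract_length_stats(records):
        # Group abstract lengths by source (thesis vs. journal article).
        by_source = {}
        for r in records:
            by_source.setdefault(r["source"], []).append(len(r["abstract"]))
        return {src: {"n": len(v), "mean": round(mean(v)), "median": median(v)}
                for src, v in by_source.items()}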

This is not a "we built it and it worked" result — it is a real scientific observation about the nuance of multi-source RAG.

Tech Stack

  • Language: Python 3.13
  • Orchestration: LangChain + LangGraph (5+1 agents, conditional routing)
  • Vector store: ChromaDB (cosine, 768-dim)
  • Embedder: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  • LLM: DeepSeek-Chat (OpenAI-compatible API)
  • Live APIs: OpenAlex, Semantic Scholar, DergiPark OAI-PMH
  • Evaluation: 30 questions × 4 metrics (LLM-as-judge), per-category breakdown
  • Reproducibility: all code on GitHub, 16 GB index on Hugging Face

Future Work

The same data is sufficient to train a domain-specific Turkish academic LLM, in three stages:

  1. Custom embedder — SimCSE fine-tuning on Turkish academic text (15-25% retrieval improvement expected; see the sketch after this list)
  2. SFT model — synthesise 100-200K Q&A pairs and QLoRA-fine-tune a Turkish 7B base (TürkResearcher-7B-instruct)
  3. DPO alignment — using eval judgments as preference pairs
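
For stage 1, a hedged sketch of unsupervised SimCSE-style fine-tuning with sentence-transformers; the hyperparameters are illustrative and nothing here has been trained yet:

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer(
        "sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
    abstracts = ["..."]  # Turkish thesis/journal abstracts from the 740K corpus

    # SimCSE: each text is paired with itself; dropout noise creates the
    # positive pair, and other in-batch examples serve as negatives.
    examples = [InputExample(texts=[a, a]) for a in abstracts]
    loader = DataLoader(examples, shuffle=True, batch_size=64)
    loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)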

Links