Project Summary
The English-speaking research community has tools like Elicit and Consensus.app that let researchers ask a question and get an evidence-grounded answer with citations. None of them serve Turkish academic literature. TürkResearcher fills this gap: an open-source (MIT-licensed) multi-agent LLM system that takes a Turkish research question, retrieves evidence from a 740K-record Turkish academic corpus, and produces an IEEE-cited academic answer in Turkish that links to real YÖK PDFs.
This was the final project for the Large Language Models course at Istanbul Medipol University (Track 1 — Novel Idea).
Architecture
A five-agent LangGraph state machine. The Critic agent loops back to the Retriever when coverage is insufficient; after two iterations it falls through to LiveSearch (live queries against OpenAlex, Semantic Scholar, and DergiPark) for real-time augmentation.
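A minimal sketch of that routing logic in LangGraph, with a simplified state and stubbed node bodies; the node names, state fields, and the two-iteration threshold here are illustrative, not copied from the repository:

```python
from typing import List, TypedDict

from langgraph.graph import END, StateGraph


class ResearchState(TypedDict):
    question: str
    chunks: List[str]
    coverage_ok: bool
    iterations: int
    answer: str


# Stub node bodies: the real agents call DeepSeek-Chat and the Chroma index.
def planner(state):     return {}                                       # split the question into sub-questions
def retriever(state):   return {"iterations": state["iterations"] + 1}  # vector search over the 740K corpus
def critic(state):      return {"coverage_ok": False}                   # judge whether chunks cover the sub-questions
def live_search(state): return {}                                       # OpenAlex / Semantic Scholar / DergiPark fallback
def writer(state):      return {"answer": "..."}                        # IEEE-cited Turkish answer


def route_after_critic(state: ResearchState) -> str:
    """Loop back to the retriever while coverage is insufficient;
    after two retrieval iterations, fall through to live search."""
    if state["coverage_ok"]:
        return "writer"
    return "retriever" if state["iterations"] < 2 else "live_search"


graph = StateGraph(ResearchState)
for name, fn in [("planner", planner), ("retriever", retriever), ("critic", critic),
                 ("live_search", live_search), ("writer", writer)]:
    graph.add_node(name, fn)

graph.set_entry_point("planner")
graph.add_edge("planner", "retriever")
graph.add_edge("retriever", "critic")
graph.add_conditional_edges("critic", route_after_critic,
                            {"writer": "writer",
                             "retriever": "retriever",
                             "live_search": "live_search"})
graph.add_edge("live_search", "writer")
graph.add_edge("writer", END)

app = graph.compile()
final_state = app.invoke({"question": "örnek soru",  # "example question"
                          "chunks": [], "coverage_ok": False,
                          "iterations": 0, "answer": ""})
```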
Data
Two combined sources of Turkish academic content:
| Source | Records | Collection method |
|---|---|---|
| YÖK National Thesis Center | 633,998 | Hugging Face Hub (CC-BY-4.0) → quality filter |
| DergiPark journal articles | 106,641 | OAI-PMH harvest (custom resumable scraper) |
| Total | 740,639 | Single Chroma collection, cosine, multilingual mpnet-base-v2 (768-dim) |
The index was built on a Colab T4 GPU and released openly on Hugging Face Hub (hakansabunis/tr-academic-research-agent-index, 16 GB).
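A minimal sketch of querying such an index with the same embedder, assuming the released files sit in a local `chroma/` directory and the collection is named `tr_academic` (both the path and the collection name are assumptions):

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Same 768-dim multilingual embedder the index was built with (cosine space).
embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

client = chromadb.PersistentClient(path="chroma/")   # local copy of the released 16 GB index
collection = client.get_collection("tr_academic")    # collection name is an assumption

# "What is the effect of distance education on student achievement?"
query = "Uzaktan eğitimin öğrenci başarısına etkisi nedir?"

results = collection.query(
    query_embeddings=embedder.encode([query]).tolist(),
    n_results=5,
    include=["documents", "metadatas", "distances"],
)

for doc, meta, dist in zip(results["documents"][0],
                           results["metadatas"][0],
                           results["distances"][0]):
    print(f"{dist:.3f}  {meta.get('title', '?')}  ->  {doc[:100]}")
```

The distance metric is fixed at collection-creation time (Chroma's `hnsw:space` setting), so query-time code only needs to embed with the same model.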
Evaluation
A 30-question Turkish benchmark spanning 10 categories (health, education, engineering, law, computer science, business, etc.). Four LLM-as-judge metrics:
- Citation accuracy — do the cited sources actually support each claim?
- Faithfulness — how grounded is the answer in retrieved chunks?
- Coverage — fraction of sub-questions addressed
- Holistic (1-5) — overall academic quality
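As an illustration, a hedged sketch of what one such judgment can look like against the DeepSeek OpenAI-compatible API; the prompt wording, scale handling, and score parsing are assumptions, not the exact evaluation code:

```python
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible endpoint; key handling here is illustrative.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

def judge_faithfulness(answer: str, chunks: list[str]) -> float:
    """Ask the judge model how well every claim in the answer is grounded
    in the retrieved chunks, returning a score in [0, 1]."""
    context = "\n\n".join(chunks)
    prompt = (
        "You are grading a Turkish academic answer.\n"
        "Rate from 0.0 to 1.0 how well every claim in the ANSWER is supported by the CONTEXT.\n"
        "Reply with a single number only.\n\n"
        f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
    )
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(resp.choices[0].message.content.strip())
```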
I evaluated two configurations: the 633K theses-only index vs. the full 740K index (theses + DergiPark):
| Metric | 633K (theses only) | 740K (theses + DergiPark) | Δ |
|---|---|---|---|
| Citation accuracy | 0.60 | 0.51 | −0.10 |
| Faithfulness | 0.59 | 0.49 | −0.10 |
| Coverage | 0.49 | 0.47 | −0.03 |
| Holistic | 2.63 | 2.40 | −0.23 |
| #Citations | 30.1 | 32.8 | +2.7 |
Surprising Finding — "The Corpus-Expansion Paradox"
Naive corpus expansion is not always a free win: under-covered categories (CS, business) improved, while well-covered categories (health, engineering, law) regressed. Three contributing factors:
- Abstract length distribution shift — theses ≈ 1600 chars, journal abstracts ≈ 500 chars; less surface area to ground claims.
- Citation inflation — the writer agent emits 2.7 additional citations per answer on average; each extra citation is weakly grounded.
- Source-mixing without source-aware writing — theses are broad and coherent, journal articles narrow and empirical, and the writer prompt does not yet distinguish them (see the sketch below).
This is not a "we built it and it worked" result — it is a real scientific observation about the nuance of multi-source RAG.
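One plausible mitigation for the third factor is to make the writer prompt source-aware. Here is a minimal sketch of tagging retrieved chunks by source type before they reach the writer; the `source_type`, `title`, and `text` metadata fields are assumptions, not the current chunk schema:

```python
def format_context(chunks: list[dict]) -> str:
    """Label each retrieved chunk so the writer prompt can treat broad thesis
    abstracts and narrow, empirical journal abstracts differently."""
    # TEZ = thesis, MAKALE = journal article, KAYNAK = generic source.
    labels = {"thesis": "[TEZ]", "journal": "[MAKALE]"}
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        label = labels.get(chunk.get("source_type", ""), "[KAYNAK]")
        parts.append(f"[{i}] {label} {chunk.get('title', '')}\n{chunk.get('text', '')}")
    return "\n\n".join(parts)
```

The writer prompt can then instruct the model to prefer [TEZ] chunks for background claims and [MAKALE] chunks for specific empirical findings.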
Tech Stack
- Language: Python 3.13
- Orchestration: LangChain + LangGraph (5+1 agents, conditional routing)
- Vector store: ChromaDB (cosine, 768-dim)
- Embedder: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
- LLM: DeepSeek-Chat (OpenAI-compatible API)
- Live APIs: OpenAlex, Semantic Scholar, DergiPark OAI-PMH
- Evaluation: 30 questions × 4 metrics (LLM-as-judge), per-category breakdown
- Reproducibility: all code on GitHub, 16 GB index on Hugging Face
Future Work
The same data is sufficient to train a domain-specific Turkish academic LLM in three stages:
- Custom embedder — SimCSE fine-tune on Turkish academic text (15-25% retrieval improvement expected; see the sketch after this list)
- SFT model — synthesise 100-200K Q&A pairs and QLoRA-fine-tune a Turkish 7B base (TürkResearcher-7B-instruct)
- DPO alignment — using eval judgments as preference pairs
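A minimal sketch of what stage one could look like with sentence-transformers: unsupervised SimCSE-style training, where each abstract is paired with itself and encoder dropout provides the two noisy views. The input file, batch size, and output path are illustrative:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Start from the same multilingual base model used for retrieval today.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Unsupervised SimCSE: each abstract is its own positive pair; dropout inside
# the encoder makes the two forward passes differ.
with open("tr_abstracts.txt", encoding="utf-8") as f:   # hypothetical corpus dump
    abstracts = [line.strip() for line in f if line.strip()]
train_examples = [InputExample(texts=[a, a]) for a in abstracts]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="tr-academic-mpnet-simcse",
)
```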