Project Summary
The English-speaking research community has tools like Elicit and Consensus.app that let researchers ask a question and get an evidence-grounded answer with citations. None of them serve Turkish academic literature. TürkResearcher fills this gap: an open-source (MIT-licensed) multi-agent LLM system that takes a Turkish research question, retrieves evidence from a 740K-record Turkish academic corpus, and produces an IEEE-cited academic answer in Turkish that links to real YÖK PDFs.
This was the final project for the Large Language Models course at Istanbul Medipol University (Track 1 — Novel Idea).
Architecture
A five-agent LangGraph state machine. The Critic agent loops back to the Retriever when coverage is insufficient; after two iterations it falls through to LiveSearch (live queries against OpenAlex, Semantic Scholar, and DergiPark) for real-time augmentation.
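A minimal sketch of that routing logic in LangGraph, with a simplified state and stubbed node bodies; the node names, state fields, and the two-iteration threshold here are illustrative, not copied from the repository:

```python
from typing import List, TypedDict

from langgraph.graph import END, StateGraph


class ResearchState(TypedDict):
    question: str
    chunks: List[str]
    coverage_ok: bool
    iterations: int
    answer: str


# Stub node bodies: the real agents call DeepSeek-Chat and the Chroma index.
def planner(state):     return {}                                       # split the question into sub-questions
def retriever(state):   return {"iterations": state["iterations"] + 1}  # vector search over the 740K corpus
def critic(state):      return {"coverage_ok": False}                   # judge whether chunks cover the sub-questions
def live_search(state): return {}                                       # OpenAlex / Semantic Scholar / DergiPark fallback
def writer(state):      return {"answer": "..."}                        # IEEE-cited Turkish answer


def route_after_critic(state: ResearchState) -> str:
    """Loop back to the retriever while coverage is insufficient;
    after two retrieval iterations, fall through to live search."""
    if state["coverage_ok"]:
        return "writer"
    return "retriever" if state["iterations"] < 2 else "live_search"


graph = StateGraph(ResearchState)
for name, fn in [("planner", planner), ("retriever", retriever), ("critic", critic),
                 ("live_search", live_search), ("writer", writer)]:
    graph.add_node(name, fn)

graph.set_entry_point("planner")
graph.add_edge("planner", "retriever")
graph.add_edge("retriever", "critic")
graph.add_conditional_edges("critic", route_after_critic,
                            {"writer": "writer",
                             "retriever": "retriever",
                             "live_search": "live_search"})
graph.add_edge("live_search", "writer")
graph.add_edge("writer", END)

app = graph.compile()
final_state = app.invoke({"question": "örnek soru",  # "example question"
                          "chunks": [], "coverage_ok": False,
                          "iterations": 0, "answer": ""})
```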
Data
Two combined sources of Turkish academic content:
| Source | Records | Collection method |
|---|---|---|
| YÖK National Thesis Center | 633,998 | Hugging Face Hub (CC-BY-4.0) → quality filter |
| DergiPark journal articles | 106,641 | OAI-PMH harvest (custom resumable scraper) |
| Total | 740,639 | Single Chroma collection, cosine, multilingual mpnet-base-v2 (768-dim) |
The index was built on a Colab T4 GPU and released openly on Hugging Face Hub (hakansabunis/tr-academic-research-agent-index, 16 GB).
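A minimal sketch of querying such an index with the same embedder, assuming the released files sit in a local `chroma/` directory and the collection is named `tr_academic` (both the path and the collection name are assumptions):

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Same 768-dim multilingual embedder the index was built with (cosine space).
embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

client = chromadb.PersistentClient(path="chroma/")   # local copy of the released 16 GB index
collection = client.get_collection("tr_academic")    # collection name is an assumption

# "What is the effect of distance education on student achievement?"
query = "Uzaktan eğitimin öğrenci başarısına etkisi nedir?"

results = collection.query(
    query_embeddings=embedder.encode([query]).tolist(),
    n_results=5,
    include=["documents", "metadatas", "distances"],
)

for doc, meta, dist in zip(results["documents"][0],
                           results["metadatas"][0],
                           results["distances"][0]):
    print(f"{dist:.3f}  {meta.get('title', '?')}  ->  {doc[:100]}")
```

The distance metric is fixed at collection-creation time (Chroma's `hnsw:space` setting), so query-time code only needs to embed with the same model.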
Evaluation
A 30-question Turkish benchmark spanning 10 categories (health, education, engineering, law, computer science, business, etc.). Four LLM-as-judge metrics:
- Citation accuracy — do the cited sources actually support each claim?
- Faithfulness — how grounded is the answer in retrieved chunks?
- Coverage — fraction of sub-questions addressed
- Holistic (1-5) — overall academic quality
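As an illustration, a hedged sketch of what one such judgment can look like against the DeepSeek OpenAI-compatible API; the prompt wording, scale handling, and score parsing are assumptions, not the exact evaluation code:

```python
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible endpoint; key handling here is illustrative.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

def judge_faithfulness(answer: str, chunks: list[str]) -> float:
    """Ask the judge model how well every claim in the answer is grounded
    in the retrieved chunks, returning a score in [0, 1]."""
    context = "\n\n".join(chunks)
    prompt = (
        "You are grading a Turkish academic answer.\n"
        "Rate from 0.0 to 1.0 how well every claim in the ANSWER is supported by the CONTEXT.\n"
        "Reply with a single number only.\n\n"
        f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
    )
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(resp.choices[0].message.content.strip())
```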
I evaluated two configurations: the 633K theses-only index vs. the full 740K index (theses + DergiPark):
| Metric | 633K (theses only) | 740K (theses + DergiPark) | Δ |
|---|---|---|---|
| Citation accuracy | 0.60 | 0.51 | −0.10 |
| Faithfulness | 0.59 | 0.49 | −0.10 |
| Coverage | 0.49 | 0.47 | −0.03 |
| Holistic | 2.63 | 2.40 | −0.23 |
| #Citations | 30.1 | 32.8 | +2.7 |
Surprising Finding — "The Corpus-Expansion Paradox"
Naive corpus expansion is not always a free win: under-covered categories (CS, business) improved, while well-covered categories (health, engineering, law) regressed. Three contributing factors:
- Abstract length distribution shift — theses ≈ 1600 chars, journal abstracts ≈ 500 chars; less surface area to ground claims.
- Citation inflation — the writer agent emits 2.7 additional citations per answer on average; each extra citation is weakly grounded.
- Source-mixing without source-aware writing — theses are broad and coherent, journal articles narrow and empirical, and the writer prompt does not yet distinguish them (see the sketch below).
This is not a "we built it and it worked" result — it is a real scientific observation about the nuance of multi-source RAG.
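One plausible mitigation for the third factor is to make the writer prompt source-aware. Here is a minimal sketch of tagging retrieved chunks by source type before they reach the writer; the `source_type`, `title`, and `text` metadata fields are assumptions, not the current chunk schema:

```python
def format_context(chunks: list[dict]) -> str:
    """Label each retrieved chunk so the writer prompt can treat broad thesis
    abstracts and narrow, empirical journal abstracts differently."""
    # TEZ = thesis, MAKALE = journal article, KAYNAK = generic source.
    labels = {"thesis": "[TEZ]", "journal": "[MAKALE]"}
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        label = labels.get(chunk.get("source_type", ""), "[KAYNAK]")
        parts.append(f"[{i}] {label} {chunk.get('title', '')}\n{chunk.get('text', '')}")
    return "\n\n".join(parts)
```

The writer prompt can then instruct the model to prefer [TEZ] chunks for background claims and [MAKALE] chunks for specific empirical findings.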
Tech Stack
- Language: Python 3.13
- Orchestration: LangChain + LangGraph (5+1 agents, conditional routing)
- Vector store: ChromaDB (cosine, 768-dim)
- Embedder: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
- LLM: DeepSeek-Chat (OpenAI-compatible API)
- Live APIs: OpenAlex, Semantic Scholar, DergiPark OAI-PMH
- Evaluation: 30 questions × 4 metrics (LLM-as-judge), per-category breakdown
- Reproducibility: all code on GitHub, 16 GB index on Hugging Face
Future Work
The same data is sufficient to train a domain-specific Turkish academic LLM in three stages:
- Custom embedder — SimCSE fine-tune on Turkish academic text (15-25% retrieval improvement expected; see the sketch after this list)
- SFT model — synthesise 100-200K Q&A pairs and QLoRA-fine-tune a Turkish 7B base (TürkResearcher-7B-instruct)
- DPO alignment — using eval judgments as preference pairs
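A minimal sketch of what stage one could look like with sentence-transformers: unsupervised SimCSE-style training, where each abstract is paired with itself and encoder dropout provides the two noisy views. The input file, batch size, and output path are illustrative:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Start from the same multilingual base model used for retrieval today.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Unsupervised SimCSE: each abstract is its own positive pair; dropout inside
# the encoder makes the two forward passes differ.
with open("tr_abstracts.txt", encoding="utf-8") as f:   # hypothetical corpus dump
    abstracts = [line.strip() for line in f if line.strip()]
train_examples = [InputExample(texts=[a, a]) for a in abstracts]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="tr-academic-mpnet-simcse",
)
```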