This study introduces the Paper Search Model Context Protocol (MCP) server, a unified architecture enabling Large Language Models (LLMs) to autonomously search and retrieve academic literature across heterogeneous platforms (arXiv, PubMed, bioRxiv, medRxiv, Semantic Scholar, and CrossRef). By normalizing diverse metadata into a standardized schema, the system facilitates platform-agnostic tool use. We evaluate this infrastructure by tasking GPT-4.1 and GPT-5.1 with generating grounded literature reviews on AI applications in medicine, cybersecurity, and transportation, measuring performance against retrieved abstracts using ROUGE and BERTScore. Results indicate that GPT-4.1 achieves superior semantic and structural alignment, particularly when grounded via CrossRef (BERTScore F1 = 0.881, ROUGE-L F1 = 0.375), outperforming GPT-5.1 across most metrics. While GPT-5.1 demonstrates higher unigram recall (ROUGE-1 = 0.412 on arXiv), it exhibits lower structural fidelity. These findings validate MCP as a robust integration layer for academic RAG systems and demonstrate that standardized tool interfaces enable precise, quantitative assessment of LLM grounding capabilities.
Copyright © 2026