Construction and Application Exploration of Data Foundations for Domain Large Models

Xie Wenjun; Dong Huanqing; Cao Gaohui

doi:10.13266/j.issn.2095-5472.2026.003


Knowledge Management Forum, 2026, Vol. 11, Issue 1: 24-39. DOI: 10.13266/j.issn.2095-5472.2026.003 CSTR: 32306.14.CN11-6036.2026.003

Special Topic: Pioneering Explorations in AI-Empowered Knowledge Management and Services


Abstract

[Purpose/Significance] Domain large models deployed in specialized, high-risk scenarios need high-quality data support and trustworthy operation. This study examines the concept, key characteristics, and systematic construction path of the data foundation for domain large models. It aims to provide a theoretical framework and methodological support for deploying trustworthy artificial intelligence, and a systematic reference for the standardized construction and governance of such data foundations. [Method/Process] Drawing on a systematic review of the literature, the study clarifies the concept, main characteristics, and constituent elements of the domain large model data foundation. It then applies layered modular architecture theory to build a layered framework that mainly covers the data resource layer, knowledge organization layer, knowledge storage and semantic indexing layer, model layer, algorithm layer, application layer, and governance and operational support layer, and uses it to characterize how data, knowledge, and models operate in coordination. Finally, desk research based on public evidence and cross-case comparison are used to assess the contextual applicability of the framework. [Result/Conclusion] The cross-domain comparison shows that the framework can explain the mechanisms shared by reference systems in different domains along the chain of external knowledge memory access, evidence-constrained generation, and compliance-audit output. On this basis, the study distills reusable methodological elements, such as data contracts, index configuration, and evaluation baselines, thereby validating the applicability and transferability of the framework across domains.
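As a rough illustration of the layered framework and of what a "data contract" between layers might look like, the sketch below lists the seven layers named in the abstract and checks a record against a contract. The abstract does not specify any schema, so the `DataContract` class, its field names, the dataset name, and the sample record are all hypothetical; the numeric values of `Layer` only mirror the order in which the abstract lists the layers.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List, Tuple


class Layer(IntEnum):
    """The seven layers named in the abstract, numbered in listing order only."""
    DATA_RESOURCE = 1
    KNOWLEDGE_ORGANIZATION = 2
    KNOWLEDGE_STORAGE_AND_SEMANTIC_INDEXING = 3
    MODEL = 4
    ALGORITHM = 5
    APPLICATION = 6
    GOVERNANCE_AND_OPERATIONAL_SUPPORT = 7


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract between two layers (illustrative, not from the paper)."""
    dataset: str
    producer: Layer
    consumer: Layer
    required_fields: Tuple[str, ...]

    def missing_fields(self, record: Dict[str, object]) -> List[str]:
        """Return the required fields absent from a record, in contract order."""
        return [name for name in self.required_fields if name not in record]


# Assumed example: the data resource layer hands documents to knowledge organization.
contract = DataContract(
    dataset="example_corpus",
    producer=Layer.DATA_RESOURCE,
    consumer=Layer.KNOWLEDGE_ORGANIZATION,
    required_fields=("doc_id", "source", "license", "text"),
)

record = {"doc_id": "d-001", "source": "example", "text": "sample passage"}
print(contract.missing_fields(record))  # ['license']
```

A check of this kind would sit at a layer boundary, so that records lacking provenance or licensing fields are rejected before they reach indexing or model training; how the paper itself defines data contracts is not stated in the abstract.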

Cite this article

Xie Wenjun, Dong Huanqing, Cao Gaohui. Construction and Application Exploration of Data Foundations for Domain Large Models[J]. Knowledge Management Forum, 2026, 11(1): 24-39. https://doi.org/10.13266/j.issn.2095-5472.2026.003

CLC number: G203
