Construction and Application Exploration of Data Foundations for Domain Large Models

Xie Wenjun, Dong Huanqing, Cao Gaohui

Knowledge Management Forum ›› 2026, Vol. 11 ›› Issue (1): 24-39. DOI: 10.13266/j.issn.2095-5472.2026.003  CSTR: 32306.14.CN11-6036.2026.003
Pioneering Exploration of AI-Empowered Knowledge Management and Services


Abstract

[Purpose/Significance] In response to the practical demand for high-quality data support and trustworthy operation of domain large models in specialized, high-risk application scenarios, this study systematically examines the basic connotation, key characteristics, and construction path of the domain large model data foundation. It aims to provide a theoretical framework and methodological support for deploying trustworthy artificial intelligence in practice, and to offer systematic references for the standardized construction and governance of domain large model data foundations. [Method/Process] Based on a systematic review of the relevant literature, this study clarifies the conceptual connotation, major characteristics, and constituent elements of the domain large model data foundation. It draws on the theory of layered modular architecture to construct a layered framework for the domain large model data foundation, covering the data resource layer, knowledge organization layer, knowledge storage and semantic indexing layer, model layer, algorithm layer, application layer, and governance and operational support layer, so as to characterize the overall mechanism by which data, knowledge, and models operate in coordination. Finally, desk research based on public evidence and cross-case comparison are employed to demonstrate the contextual applicability of the framework. [Result/Conclusion] Cross-domain comparisons show that the proposed framework can explain the common mechanisms underlying external knowledge memory access, evidence-constrained generation, and compliance-audit output chains in reference systems across different domains. On this basis, the study distills reusable methodological elements such as data contracts, index configuration, and evaluation baselines, thereby validating the applicability and transferability of the framework across domains.
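To make the abstract's architectural terms concrete, the following is a minimal illustrative sketch (not taken from the paper, and not the authors' implementation): the seven layers are modeled as an ordered pipeline, and a toy retriever stands in for the "semantic indexing" and "evidence-constrained generation" mechanisms. All names (`Evidence`, `retrieve`, `answer`) and the keyword-overlap scoring are hypothetical simplifications; a real system would use a vector index and a language model.

```python
# Illustrative sketch only: a toy rendering of the layered data-foundation
# stack and an "evidence-constrained" answer step that refuses to respond
# without source-attributed snippets (supporting a compliance-audit trail).
from dataclasses import dataclass

# The seven layers named in the abstract, as an ordered pipeline.
LAYERS = [
    "data resource",
    "knowledge organization",
    "knowledge storage and semantic indexing",
    "model",
    "algorithm",
    "application",
    "governance and operational support",
]

@dataclass
class Evidence:
    source_id: str  # provenance retained for the audit output chain
    text: str

def retrieve(corpus, query, k=2):
    """Toy stand-in for semantic indexing: rank snippets by keyword overlap."""
    terms = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda e: len(terms & set(e.text.lower().split())),
        reverse=True,
    )
    return scored[:k]

def answer(corpus, query):
    """Evidence-constrained generation: answer only from retrieved evidence."""
    terms = set(query.lower().split())
    hits = [e for e in retrieve(corpus, query)
            if terms & set(e.text.lower().split())]
    if not hits:
        return {"answer": None, "citations": []}  # refuse without evidence
    return {"answer": hits[0].text, "citations": [e.source_id for e in hits]}

corpus = [
    Evidence("doc-1", "domain large models need curated data foundations"),
    Evidence("doc-2", "index configuration controls retrieval quality"),
]
result = answer(corpus, "data foundations for domain models")
print(result["citations"])  # every answer carries an auditable source list
```

The design choice worth noting is the refusal branch: an evidence-constrained system returns nothing rather than an unsupported answer, which is what makes the output chain auditable.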

Key words

large language models / domain large models / data foundations / framework construction / application exploration

Cite this article

Xie Wenjun, Dong Huanqing, Cao Gaohui. Construction and Application Exploration of Data Foundations for Domain Large Models[J]. Knowledge Management Forum, 2026, 11(1): 24-39. https://doi.org/10.13266/j.issn.2095-5472.2026.003


Funding

Fundamental Research Funds for the Central Universities project of Central China Normal University, titled “Large Language Model-Driven Technological Forecasting and Policy Guidance for the Semiconductor Industry” (CCNU24JC034)
