Construction and Application Exploration of Data Foundations for Domain Large Models

Xie Wenjun, Dong Huanqing, Cao Gaohui

Knowledge Management Forum ›› 2026, Vol. 11 ›› Issue (1): 24-39. DOI: 10.13266/j.issn.2095-5472.2026.003  CSTR: 32306.14.CN11-6036.2026.003
Pioneering Exploration of AI-Empowered Knowledge Management and Services


Abstract

[Purpose/Significance] In response to the practical demand for high-quality data support and trustworthy operation of domain large models in specialized, high-risk application scenarios, this study systematically examines the basic connotation, key characteristics, and construction path of the domain large model data foundation. It aims to provide a theoretical framework and methodological support for deploying trustworthy artificial intelligence in practice, and to offer systematic references for the standardized construction and governance of domain large model data foundations. [Method/Process] Based on a systematic review of the relevant literature, this study clarifies the conceptual connotation, major characteristics, and constituent elements of the domain large model data foundation. It draws on the theory of layered modular architecture to construct a layered framework for the domain large model data foundation, covering the data resource layer, knowledge organization layer, knowledge storage and semantic indexing layer, model layer, algorithm layer, application layer, and governance and operational support layer, so as to characterize the overall mechanism by which data, knowledge, and models operate in coordination. Finally, desk research based on public evidence and cross-case comparison are employed to demonstrate the contextual applicability of the framework. [Result/Conclusion] Cross-domain comparisons show that the proposed framework can explain the common mechanisms underlying external knowledge memory access, evidence-constrained generation, and compliance-audit output chains in reference systems across different domains. On this basis, the study distills reusable methodological elements such as data contracts, index configuration, and evaluation baselines, thereby validating the applicability and transferability of the framework across domains.
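To make the abstract's architectural terms concrete, the following is a minimal illustrative sketch (not taken from the paper, and not the authors' implementation): the seven layers are modeled as an ordered pipeline, and a toy retriever stands in for the "semantic indexing" and "evidence-constrained generation" mechanisms. All names (`Evidence`, `retrieve`, `answer`) and the keyword-overlap scoring are hypothetical simplifications; a real system would use a vector index and a language model.

```python
# Illustrative sketch only: a toy rendering of the layered data-foundation
# stack and an "evidence-constrained" answer step that refuses to respond
# without source-attributed snippets (supporting a compliance-audit trail).
from dataclasses import dataclass

# The seven layers named in the abstract, as an ordered pipeline.
LAYERS = [
    "data resource",
    "knowledge organization",
    "knowledge storage and semantic indexing",
    "model",
    "algorithm",
    "application",
    "governance and operational support",
]

@dataclass
class Evidence:
    source_id: str  # provenance retained for the audit output chain
    text: str

def retrieve(corpus, query, k=2):
    """Toy stand-in for semantic indexing: rank snippets by keyword overlap."""
    terms = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda e: len(terms & set(e.text.lower().split())),
        reverse=True,
    )
    return scored[:k]

def answer(corpus, query):
    """Evidence-constrained generation: answer only from retrieved evidence."""
    terms = set(query.lower().split())
    hits = [e for e in retrieve(corpus, query)
            if terms & set(e.text.lower().split())]
    if not hits:
        return {"answer": None, "citations": []}  # refuse without evidence
    return {"answer": hits[0].text, "citations": [e.source_id for e in hits]}

corpus = [
    Evidence("doc-1", "domain large models need curated data foundations"),
    Evidence("doc-2", "index configuration controls retrieval quality"),
]
result = answer(corpus, "data foundations for domain models")
print(result["citations"])  # every answer carries an auditable source list
```

The design choice worth noting is the refusal branch: an evidence-constrained system returns nothing rather than an unsupported answer, which is what makes the output chain auditable.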

Key words

large language models / domain large models / data foundations / framework construction / application exploration

Cite this article

Xie Wenjun, Dong Huanqing, Cao Gaohui. Construction and Application Exploration of Data Foundations for Domain Large Models[J]. Knowledge Management Forum, 2026, 11(1): 24-39. https://doi.org/10.13266/j.issn.2095-5472.2026.003


Funding

Fundamental Research Funds for the Central Universities project of Central China Normal University, titled “Large Language Model-Driven Technological Forecasting and Policy Guidance for the Semiconductor Industry” (CCNU24JC034)
