Multi-Granularity Feature Fusion for Named Entity Recognition of Classical Chinese Texts from the Perspective of Digital Humanities

Meng Jiana, Xu Yingao, Zhao Dandan, Li Fengyi, Zhao Di

Knowledge Management Forum, 2024, Vol. 9, Issue (6): 533-546. DOI: 10.13266/j.issn.2095-5472.2024.039  CSTR: 32306.14.CN11-6036.2024.039
Research Article


Abstract

[Purpose/Significance] Using named entity recognition (NER) to mine ancient texts in depth advances the digitization of classical Chinese works and is of great significance for promoting the study of history, strengthening cultural confidence, and carrying forward traditional Chinese culture. [Method/Process] A multi-granularity feature fusion method for named entity recognition in classical Chinese texts was proposed, taking "Zuo Zhuan" as the research corpus and constructing recognition tasks for personal names, place names, temporal expressions, and other entities. First, ancient-character information, part-of-speech (POS) information, and glyph features were fused to strengthen the input feature representation. Then, an auxiliary task that predicts entity heads and tails was added to learn boundary information in classical sentences, while a Transfer interactor was used to heuristically learn the word-formation regularities of classical Chinese entities; contextual information was extracted jointly by a BiLSTM and an IDCNN (Iterated Dilated Convolutional Neural Network). Finally, the learned features were weighted, fused, and fed into a CRF (Conditional Random Field) for entity prediction. [Result/Conclusion] Experimental results show that, compared with the mainstream BERT-BiLSTM-CRF model, the proposed method improves precision, recall, and F1 score by 5.09%, 13.45%, and 9.87%, respectively, and can accurately recognize named entities in ancient texts.
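To make the pipeline described in the abstract concrete, the following PyTorch sketch illustrates the general idea: character, POS, and glyph embeddings are concatenated, contextual features are extracted in parallel by a BiLSTM branch and an iterated dilated CNN branch, the two views are fused with learned weights, and per-token emission scores are produced for a downstream CRF decoder, alongside an auxiliary entity head/tail prediction head. This is a minimal illustration under assumed layer sizes and label schemes, not the authors' implementation; the Transfer interactor and the CRF layer itself (e.g. a pytorch-crf layer over the emissions) are omitted.

import torch
import torch.nn as nn


class MultiGranularityEncoder(nn.Module):
    """Illustrative multi-granularity encoder: char + POS + glyph -> BiLSTM & IDCNN -> weighted fusion."""

    def __init__(self, num_chars, num_pos, num_glyphs, num_tags,
                 char_dim=128, pos_dim=32, glyph_dim=32, hidden=128):
        super().__init__()
        # Multi-granularity input features: character, part-of-speech, glyph.
        self.char_emb = nn.Embedding(num_chars, char_dim)
        self.pos_emb = nn.Embedding(num_pos, pos_dim)
        self.glyph_emb = nn.Embedding(num_glyphs, glyph_dim)
        in_dim = char_dim + pos_dim + glyph_dim

        # BiLSTM branch for long-range context.
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)

        # IDCNN branch: stacked 1-D convolutions with increasing dilation (length-preserving).
        self.idcnn = nn.Sequential(
            nn.Conv1d(in_dim, 2 * hidden, kernel_size=3, padding=1, dilation=1),
            nn.ReLU(),
            nn.Conv1d(2 * hidden, 2 * hidden, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
            nn.Conv1d(2 * hidden, 2 * hidden, kernel_size=3, padding=4, dilation=4),
            nn.ReLU(),
        )

        # Learned scalar weights for fusing the two contextual views.
        self.fusion_logits = nn.Parameter(torch.zeros(2))

        # Emission scores to be decoded by a CRF layer (not shown here).
        self.emission = nn.Linear(2 * hidden, num_tags)
        # Auxiliary head for entity head/tail (boundary) prediction; 3-way label scheme assumed.
        self.boundary = nn.Linear(2 * hidden, 3)

    def forward(self, chars, pos, glyphs):
        x = torch.cat([self.char_emb(chars), self.pos_emb(pos),
                       self.glyph_emb(glyphs)], dim=-1)           # (B, T, in_dim)
        lstm_out, _ = self.bilstm(x)                              # (B, T, 2H)
        cnn_out = self.idcnn(x.transpose(1, 2)).transpose(1, 2)   # (B, T, 2H)
        w = torch.softmax(self.fusion_logits, dim=0)
        fused = w[0] * lstm_out + w[1] * cnn_out                  # weighted feature fusion
        return self.emission(fused), self.boundary(fused)


# Toy usage with assumed vocabulary sizes: batch of 2 sentences, 10 characters each.
model = MultiGranularityEncoder(num_chars=8000, num_pos=30, num_glyphs=500, num_tags=9)
chars = torch.randint(0, 8000, (2, 10))
pos = torch.randint(0, 30, (2, 10))
glyphs = torch.randint(0, 500, (2, 10))
emissions, boundary_logits = model(chars, pos, glyphs)
print(emissions.shape, boundary_logits.shape)  # torch.Size([2, 10, 9]) torch.Size([2, 10, 3])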

Key words

digital humanities / classical Chinese / entity recognition / multi-granularity feature fusion

Cite this article

Export citation
Meng Jiana, Xu Yingao, Zhao Dandan, et al. Multi-Granularity Feature Fusion for Named Entity Recognition of Classical Chinese Texts from the Perspective of Digital Humanities[J]. Knowledge Management Forum, 2024, 9(6): 533-546. https://doi.org/10.13266/j.issn.2095-5472.2024.039
CLC number: TP391.1


Author contributions:

Meng Jiana: designed the research plan and revised the paper;

Xu Yingao: proposed the research idea and wrote the paper;

Zhao Dandan: collected, cleaned, and analyzed the data;

Li Fengyi: designed the experiments and processed the data;

Zhao Di: revised and finalized the paper.

Funding

Humanities and Social Sciences Research Planning Fund of the Ministry of Education, "Research on Knowledge-Graph-Based Intelligent Internet Dissemination of Chinese Culture" (23YJA860010)
Fundamental Research Funds for the Central Universities, "Research on Sentiment Analysis Based on Large Models and Knowledge-Driven Methods" (140250)
