Research on Author Attribution Based on Core Topic Features

Xu Meng, Jing Xie, Chunwang Li

Knowledge Management Forum, 2023, Vol. 8, Issue 5: 351-364.

DOI: 10.13266/j.issn.2095-5472.2023.030
Research article


Abstract

[Purpose/Significance] This study examines how topic features can be used for author attribution of Chinese social media texts. It uses word2vec to compensate for the shortcomings of topic features obtained from a topic model, and further develops strategies to identify and filter the core topics among those features. The aim is to optimize how topic features are used and thereby improve their effectiveness in author attribution. [Methods/Process] First, the LDA topic model is used to extract the academic topics and social topics of the candidate authors. Next, a word2vec-based merging and filtering strategy is used to identify and represent the core topics. Finally, author attribution is performed by combining N-gram features with similarity calculation. [Results/Conclusion] Core topic features have a positive effect on author attribution of researchers' social media texts. The core topic strategies and their application proposed in this study also improve the effectiveness of topic features. When combined with stylistic features, the highest recognition rate reached 83%.
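The pipeline outlined in the abstract (merge similar topics with word vectors, then attribute by N-gram similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the merging rule, so the greedy centroid merge, the `threshold` value, the two-dimensional toy vectors standing in for word2vec embeddings, and the function names `merge_topics` and `attribute` are all assumptions made for illustration. The LDA extraction step is skipped and the topic word lists are hand-written.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def char_ngrams(text, n=2):
    """Count overlapping character n-grams (no word segmentation needed)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def topic_centroid(words, vectors):
    """Mean embedding of the topic words that have a vector."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return {}
    dim = len(known[0])
    return {i: sum(v[i] for v in known) / len(known) for i in range(dim)}


def merge_topics(topics, vectors, threshold=0.8):
    """Hypothetical rule: greedily merge a topic into the first existing
    group whose centroid similarity is at least `threshold`."""
    merged = []
    for words in topics:
        centroid = topic_centroid(words, vectors)
        for group in merged:
            if cosine(centroid, topic_centroid(group, vectors)) >= threshold:
                group.extend(w for w in words if w not in group)
                break
        else:
            merged.append(list(words))
    return merged


def attribute(text, profiles, n=2):
    """Return the candidate whose n-gram profile is most similar to `text`."""
    query = char_ngrams(text, n)
    return max(profiles, key=lambda author: cosine(query, profiles[author]))


# Toy stand-ins for word2vec vectors and LDA topic word lists (assumed data).
toy_vectors = {
    "model": [1.0, 0.0], "training": [0.9, 0.1],
    "network": [0.8, 0.2], "travel": [0.0, 1.0],
}
topics = [["model", "training"], ["network", "model"], ["travel"]]
core_topics = merge_topics(topics, toy_vectors)

# Character-bigram profiles built from one known text per candidate author.
profiles = {
    "author_a": char_ngrams("the model is training on the corpus"),
    "author_b": char_ngrams("we went hiking and camping by the lake"),
}
predicted = attribute("training the model again", profiles)
print(core_topics, predicted)
```

In this toy run the two machine-learning topics collapse into one group while the travel topic stays separate, and the query text is assigned to `author_a`. Character N-grams are used here because they need no word segmentation, which is convenient for Chinese text; a real system would train word2vec on the corpus and tune the threshold empirically.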

Key words

author attribution / topic features / N-gram / scientific research authors / social media text

Cite this article

Xu Meng, Jing Xie, Chunwang Li. Research on Author Attribution Based on Core Topic[J]. Knowledge Management Forum. 2023, 8(5): 351-364 https://doi.org/10.13266/j.issn.2095-5472.2023.030
CLC number: G206


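Author contributions

Xu Meng: conducted the investigation and wrote the paper;

Jing Xie: proposed revisions to the paper and finalized the manuscript;

Chunwang Li: proposed the research topic and the technical approach of the paper.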

