
Research on Author Attribution Based on Core Topic
Xu Meng, Jing Xie, Chunwang Li
Knowledge Management Forum, 2023, Vol. 8, Issue 5: 351-364.
[Purpose/Significance] This study examines the use of topic characteristics in author attribution of Chinese social media texts. Word2vec is used to compensate for the shortcomings of the topic model in obtaining topic characteristics. Strategies are also developed to identify and screen the core topics among the topic characteristics, optimizing their use and thereby improving the effectiveness of topic features in author attribution. [Methods/Process] The study first uses the LDA topic model to extract the academic topics and social topics of the candidate authors, then uses word2vec to develop a merge-and-screen strategy that identifies and represents the core topics, and finally uses N-gram features and similarity calculation to perform author attribution. [Results/Conclusion] The experimental results show that core topic characteristics have a positive effect on author attribution of social media texts. The proposed strategy for identifying and applying core topic characteristics also improves the effectiveness of topic features, and the highest recognition rate reaches 83% when they are combined with stylistic features.
author attribution / topic characteristics / N-gram / scientific research author / social media text
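The final step named in the abstract, N-gram features combined with a similarity calculation, can be illustrated with a minimal sketch. The abstract does not state the N-gram size, the feature weighting, or the similarity measure, so the character bigrams, raw counts, and cosine similarity below are assumptions, and the function names (`char_ngrams`, `cosine`, `attribute`) are hypothetical rather than the authors' implementation. The LDA and word2vec stages are not covered here.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=2):
    """Count overlapping character n-grams, ignoring whitespace.

    n=2 is an assumed setting; the abstract does not give the value used.
    """
    chars = [c for c in text if not c.isspace()]
    return Counter("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(unknown_text, candidate_texts, n=2):
    """Return (best_author, scores) for an unattributed text.

    candidate_texts maps a hypothetical author label to a list of that
    author's known texts; each author profile is the sum of n-gram counts.
    """
    target = char_ngrams(unknown_text, n)
    scores = {}
    for author, texts in candidate_texts.items():
        profile = Counter()
        for text in texts:
            profile.update(char_ngrams(text, n))
        scores[author] = cosine(target, profile)
    return max(scores, key=scores.get), scores
```

Calling `attribute(text, {"author_a": [...], "author_b": [...]})` returns the label whose profile is most similar to `text`, together with the score for every candidate. Character-level N-grams are used in this sketch because they need no word segmentation for Chinese text; a word-level variant would require a segmenter, which is outside this sketch.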
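Xu Meng: conducted the investigation and wrote the paper;
Jing Xie: proposed revisions to the paper and finalized it;
Chunwang Li: proposed the research topic and the technical route of the paper.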