Constructing a Sensitive-Word Lexicon for Online Rumors: The Case of Sina Weibo Rumors

Xia Song, Lin Rongrong, Liu Kan

Knowledge Management Forum, 2019, Vol. 4, Issue (5): 267-275.

DOI: 10.13266/j.issn.2095-5472.2019.028
Special Article

Construction of Sensitive Thesaurus for Network Rumors——Taking the Microblog Rumors as an Example

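Abstract (translated from the Chinese original)

[Purpose/Significance] Online rumors seriously disrupt the normal spread of information on the internet, so identifying them is of real practical importance. The authors construct a Weibo-based sensitive-word lexicon for online rumors in order to improve the precision of rumor identification. [Method/Process] To suit the short texts typical of microblog-style social platforms, the approach first sets aside traditional word segmentation algorithms and designs the LBCP word-extraction algorithm, combining it with position information and an improved TF-IDF weight to extract the seed word set of the lexicon. Near-synonyms of the seed words are then added to the lexicon through a clustering algorithm, and commonly used substitute words are added as well, yielding the final sensitive-word lexicon. [Result/Conclusion] Sensitive-word features are used to judge whether a post is a rumor. On top of the content, user, propagation, and sentiment-analysis features extracted from microblog posts, adding the sensitive-word features clearly improves the rumor recognition rate and gives good recognition results.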
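
The seed-word step combines position information with a TF-IDF weight. The abstract does not give the improved TF-IDF formula, so the following is only a minimal sketch under stated assumptions: posts are already split into candidate terms, terms near the start of a post receive a hypothetical boost (`head_boost`), and the inverse document frequency is computed against a background corpus of ordinary posts. The function names, the boost value, and the toy data are all illustrative and are not taken from the paper.

```python
import math
from collections import Counter


def position_weight(index, length, head_boost=1.5):
    """Hypothetical position weight: terms in the first third of a post count more."""
    return head_boost if index < max(1, length // 3) else 1.0


def seed_scores(rumor_docs, background_docs):
    """Score candidate terms found in rumor posts.

    tf  : position-weighted term frequency summed over the rumor posts
    idf : smoothed inverse document frequency over the background posts
    (an assumed weighting, not the paper's improved TF-IDF)
    """
    n_docs = len(background_docs)
    df = Counter()
    for doc in background_docs:
        df.update(set(doc))  # count each term once per background post

    tf = Counter()
    for doc in rumor_docs:
        for i, term in enumerate(doc):
            tf[term] += position_weight(i, len(doc))

    return {
        term: freq * math.log((1 + n_docs) / (1 + df[term]))
        for term, freq in tf.items()
    }


# Made-up toy data, for illustration only
rumor_docs = [
    ["urgent", "forward", "this", "now"],
    ["urgent", "warning", "poisoned", "milk"],
]
background_docs = [
    ["this", "is", "a", "normal", "post"],
    ["forward", "this", "photo", "now"],
    ["nice", "weather", "this", "morning"],
]
scores = seed_scores(rumor_docs, background_docs)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(term, round(score, 3))
```

Terms that are frequent and prominent in rumor posts but rare in ordinary posts rise to the top, which is the property a seed set needs; a term that appears in every background post scores zero.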

Abstract

[Purpose/significance] Network rumors seriously disrupt the spread of normal information on the internet. This paper constructs a sensitive lexicon for microblog rumors in order to improve the accuracy of network rumor recognition. [Method/process] In view of the short texts typical of microblogs on social networking platforms, this paper focuses on constructing a microblog sensitive thesaurus, which is built through the LBCP algorithm and multi-level word expansion. First, the method extracts words directly with the LBCP algorithm, which considers the cohesion and aggregation of rumor words. Then, starting from the core words, multiple levels of related words are added to obtain the sensitive thesaurus. [Result/conclusion] In addition to text features, user characteristics, propagation characteristics, and sentiment analysis, rumor features based on the sensitive thesaurus are exploited. Experimental results show that the sensitive thesaurus greatly improves the accuracy of microblog rumor recognition.
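
The abstract does not spell out the LBCP algorithm, so the sketch below is not LBCP. It only illustrates the general idea of extracting words directly from raw text without a segmenter: character n-grams are scored by cohesion (how much more often the characters occur together than their parts would suggest) and by the entropy of their neighboring characters (how freely the n-gram combines with its context). The function name `candidate_words`, the thresholds, and the toy strings are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def candidate_words(texts, max_len=4, min_count=2):
    """Return {ngram: (cohesion, boundary_entropy)} for character n-grams.

    A generic segmentation-free extraction sketch, not the paper's LBCP algorithm.
    Probabilities are approximated as count / total number of characters.
    """
    counts = Counter()
    left = defaultdict(Counter)   # characters seen immediately before each n-gram
    right = defaultdict(Counter)  # characters seen immediately after each n-gram
    total = 0
    for text in texts:
        total += len(text)
        for n in range(1, max_len + 1):
            for i in range(len(text) - n + 1):
                gram = text[i:i + n]
                counts[gram] += 1
                if n > 1:
                    if i > 0:
                        left[gram][text[i - 1]] += 1
                    if i + n < len(text):
                        right[gram][text[i + n]] += 1

    def prob(gram):
        return counts[gram] / total

    def entropy(neighbors):
        n = sum(neighbors.values())
        return -sum(c / n * math.log(c / n) for c in neighbors.values()) if n else 0.0

    result = {}
    for gram, count in counts.items():
        if len(gram) < 2 or count < min_count:
            continue
        # Cohesion: weakest log-ratio between the n-gram and any two-part split of it
        cohesion = min(
            math.log(prob(gram) / (prob(gram[:k]) * prob(gram[k:])))
            for k in range(1, len(gram))
        )
        # Boundary entropy: the less varied of the left and right contexts
        boundary = min(entropy(left[gram]), entropy(right[gram]))
        result[gram] = (cohesion, boundary)
    return result


# Made-up toy posts, for illustration only
posts = ["紧急通知请转发", "请转发给家人", "看到后请转发吧"]
for gram, (cohesion, boundary) in sorted(
    candidate_words(posts).items(), key=lambda kv: -kv[1][0]
):
    print(gram, round(cohesion, 3), round(boundary, 3))
```

In practice a candidate would be kept only when both scores exceed thresholds tuned on the corpus; those thresholds are left out here because the source gives no values.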

Keywords

sensitive thesaurus / word embedding / feature space / network rumors

Cite this article

Xia Song, Lin Rongrong, Liu Kan. Construction of Sensitive Thesaurus for Network Rumors: Taking the Microblog Rumors as an Example[J]. Knowledge Management Forum, 2019, 4(5): 267-275. https://doi.org/10.13266/j.issn.2095-5472.2019.028
Chinese Library Classification (CLC) number: G202


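Author contributions:

Xia Song: designed the model, carried out the experiments, and revised the paper;

Lin Rongrong: collected the data, ran the experiments, and wrote the first draft of the paper;

Liu Kan: proposed the research idea, designed the research plan, and revised and finalized the paper.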

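Funding

This work was supported by the National Social Science Fund of China project "Research on Early Prediction of Online Rumors Based on Text Mining" (Grant No. 14BXW033).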
