Research on Gender Prediction of Chinese Social Media Users: Taking Sina Weibo Short Text Content as an Example

Liu Yaqi, Li Dezhi, Wang Ruixue

Knowledge Management Forum, 2021, Vol. 6, Issue 4: 213-227.

DOI: 10.13266/j.issn.2095-5472.2021.021
Section: Academic Exploration

Abstract

[Purpose/significance] While the Internet has developed rapidly, the protection of personal information security has lagged behind. Predicting the gender of social media users makes it possible to provide privacy protection better tailored to users of different genders. [Method/process] Short texts posted by users on the social media platform Sina Weibo were taken as the research object. Linguistic features and topic features were extracted from these texts, and for each user a feature vector was constructed from the linguistic features, the topic features, and the combination of the two. An SVM machine learning algorithm was then used to build a classifier for gender prediction. [Result/conclusion] Experiments show that the linguistic and topic features extracted from Weibo short texts can accurately predict user gender, with substantial improvement on the main evaluation metrics and performance superior to other features used in gender prediction.
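The pipeline described in the abstract (per-user feature vectors built from linguistic features, topic features, or both, followed by an SVM classifier) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the feature values are invented, the `+1`/`-1` label coding is arbitrary, `build_vector` and `train_linear_svm` are hypothetical helper names, and the trainer is a simple hinge-loss subgradient routine standing in for whatever SVM implementation the authors actually used.

```python
# Hypothetical per-user features; all numbers are invented for illustration.
# "linguistic" could hold normalized word-category rates, "topic" a topic distribution.
users = [
    {"linguistic": [1.0, 0.0], "topic": [1.0, 0.0], "label": 1},
    {"linguistic": [0.8, 0.2], "topic": [0.9, 0.1], "label": 1},
    {"linguistic": [0.0, 1.0], "topic": [0.0, 1.0], "label": -1},
    {"linguistic": [0.2, 0.8], "topic": [0.1, 0.9], "label": -1},
]


def build_vector(user, mode):
    """Return the feature vector for one user under one of three feature settings."""
    if mode == "linguistic":
        return list(user["linguistic"])
    if mode == "topic":
        return list(user["topic"])
    if mode == "combined":
        # Combine the two feature sets by simple concatenation.
        return list(user["linguistic"]) + list(user["topic"])
    raise ValueError(f"unknown mode: {mode}")


def score(w, b, x):
    """Linear decision function w.x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=200):
    """Fit a linear SVM (hinge loss + L2 penalty) by per-sample subgradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * score(w, b, x)
            # The L2 penalty shrinks the weights on every step.
            w = [(1.0 - lr * lam) * wj for wj in w]
            if margin < 1.0:
                # The hinge loss is active: move toward classifying x correctly.
                w = [wj + lr * label * xj for wj, xj in zip(w, x)]
                b += lr * label
    return w, b


def predict(w, b, x):
    """Map the decision score to a class label in {+1, -1}."""
    return 1 if score(w, b, x) >= 0 else -1


X = [build_vector(u, "combined") for u in users]
y = [u["label"] for u in users]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)
```

Switching the `mode` argument between `"linguistic"`, `"topic"`, and `"combined"` mirrors the three feature settings the abstract compares. Real use would need features extracted from actual posts and a held-out test set for evaluation.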

Keywords

short text / gender prediction / topic features / linguistic features

Cite this article

Liu Yaqi, Li Dezhi, Wang Ruixue. Research on Gender Prediction of Chinese Social Media Users: Taking Sina Weibo Short Text Content as an Example[J]. Knowledge Management Forum, 2021, 6(4): 213-227. https://doi.org/10.13266/j.issn.2095-5472.2021.021
CLC number: TP391.1


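Author contributions:

Liu Yaqi: experimental design and manuscript revision;

Li Dezhi: data collection, experiments, and drafting of part of the manuscript;

Wang Ruixue: data analysis and drafting of part of the manuscript.

Funding

National Social Science Fund of China, Youth Project "Research on the Use of Personal Information Based on Individual Identification Risk in the Big Data Environment" (14CTQ016)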
