Research on Gender Prediction of Chinese Social Media Users——Taking Sina Weibo Short Text Content as an Example

Liu Yaqi, Li Dezhi, Wang Ruixue

Knowledge Management Forum ›› 2021, Vol. 6 ›› Issue (4) : 213-227. DOI: 10.13266/j.issn.2095-5472.2021.021

Abstract

[Purpose/significance] While the Internet has developed rapidly, the protection of personal information security has lagged behind. Predicting the gender of social media users can help provide better privacy protection for those users. [Method/process] Short texts posted by users on the social media platform Sina Weibo were taken as the research object. Linguistic features and topic features were extracted from the short texts. For each user, feature vectors were constructed from the linguistic features, the topic features, and the combination (superposition) of the two, and a support vector machine (SVM) classifier was then trained for gender prediction. [Result/conclusion] The experiments show that the linguistic features and topic features predict the gender of users accurately and outperform other features used in gender prediction.
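
The pipeline outlined in the abstract (per-user linguistic and topic features, combined into one vector and passed to an SVM) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the feature values are invented placeholders, the +1/-1 label encoding is arbitrary, the function names are hypothetical, and the classifier is a linear SVM trained by plain hinge-loss subgradient descent, since the abstract does not state the kernel or the training procedure.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def combine(linguistic: Sequence[float], topic: Sequence[float]) -> Vector:
    """Concatenate the two feature groups into one per-user vector."""
    return list(linguistic) + list(topic)


def score(w: Vector, b: float, x: Vector) -> float:
    """Linear decision value w.x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(
    X: List[Vector], y: List[int],
    lam: float = 0.01, eta: float = 0.1, epochs: int = 100,
) -> Tuple[Vector, float]:
    """Linear SVM via fixed-step subgradient descent on the
    L2-regularised hinge loss (an assumed, simplified trainer)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * score(w, b, xi) < 1.0
            # The L2 penalty shrinks the weights on every step.
            w = [(1.0 - eta * lam) * wj for wj in w]
            if violated:
                # Hinge-loss subgradient step for a margin violation.
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def predict(w: Vector, b: float, x: Vector) -> int:
    """Return +1 or -1 (arbitrary encoding of the two gender labels)."""
    return 1 if score(w, b, x) >= 0.0 else -1


# Invented placeholder features for four hypothetical users:
# two linguistic features and two topic proportions each.
X = [
    combine([0.8, 0.2], [0.7, 0.3]),
    combine([0.9, 0.1], [0.6, 0.4]),
    combine([0.2, 0.8], [0.3, 0.7]),
    combine([0.1, 0.9], [0.4, 0.6]),
]
y = [1, 1, -1, -1]

w, b = train_linear_svm(X, y)
new_user = combine([0.85, 0.15], [0.65, 0.35])
print("predicted label:", predict(w, b, new_user))
```

To compare the three feature settings the abstract mentions, the same trainer could be run on the linguistic vectors alone, on the topic vectors alone, and on the output of `combine`.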

Key words

short text / gender prediction / topic features / linguistic features

Cite this article

Liu Yaqi, Li Dezhi, Wang Ruixue. Research on Gender Prediction of Chinese Social Media Users——Taking Sina Weibo Short Text Content as an Example[J]. Knowledge Management Forum, 2021, 6(4): 213-227. https://doi.org/10.13266/j.issn.2095-5472.2021.021

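Liu Yaqi: experimental design and paper revision;

Li Dezhi: data collection, experiments, and drafting part of the paper;

Wang Ruixue: data analysis and drafting part of the paper.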