
图书馆海量学术资源自动分类模型研究
Research on Automatic Classification Model of Massive Academic Resources in Library
[目的/意义] 针对用户在图书馆海量数字资源中常常面临获取信息困难的问题,构建一套个性化知识服务系统,认为该系统是图书馆帮助用户摆脱信息超载困境和提升知识服务质量的必然选择。[方法/过程] 通过建立中图法和学科分类法两大知识组织体系的映射模型,基于Hadoop分布式处理平台,提出一种改进TF-IDF+贝叶斯算法构建图书馆海量学术资源自动分类模型,辅助完善图书馆个性化知识服务系统的构建。[结果/结论] 以自中国知网抓取的600万余篇文献作为原始训练语料(语料涵盖75个学科)测试该分类模型的有效性,实验结果证明该模型的分类效率和效果都达到了预期。
[Purpose/significance] In order to solve the problem that users often have difficulty in obtaining information in massive digital resources of library, this paper construct a personalized knowledge service system, which is the inevitable choice of library to help users to get rid of the information overload predicament and improve the quality of knowledge service. [Method/process] Firstly, this paper built a mapping model of Chinese Library Classification(CLC) and subject classification. Then, based on Hadoop distributed processing platform, it proposed to build automatic classification model of massive academic resources in libraries by improving TF-IDF+ Bayesian algorithm, the model can help to construct the personalized knowledge service systems in library. [Result/conclusion] In the experimental part,we collected more than 6 million documents from CNKI as the original training corpus (corpus covers 75 disciplines) to test the effectiveness of the classification model, the experimental result shows that the classification efficiency and effectiveness of the model are achieved.
自动分类 / Hadoop / TF-IDF算法 / 贝叶斯
[1] |
VIKAS K, VIJAYAN K, LATHA P. A comprehensive study of text classification algorithms[C]// Proceedings of 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI).Udupi:IEEE press,2017:1109-1113.
|
[2] |
高元. 面向个性化推荐的海量学术资源分类研究[D].宁波:宁波大学,2017.
|
[3] |
贺鸣,孙建军,成颖.基于朴素贝叶斯的文本分类研究综述[J].情报科学,2016,34(7):147-154.
|
[4] |
KUPERVASSER O. The mysterious optimality of naive bayes: estimation of the probability in the system of “classifiers”[J].Pattern recognition and image analysis,2014,24(1):1-10.
|
[5] |
LEWIS D. Naive (Bayes) at forty: The independence assumption in information retrieval[C]//Proceedings of 10th European Conference on Machine Learning Chemnitz. Berlin: Springer,1998:4-15
|
[6] |
LI Y J, LUO C N, CHUNG S M. Weighted naive bayes for text classification using positive term-class dependency[J].International journal on artificial intelligence tools, 2012,21(1):1250008-1250015.
|
[7] |
邸鹏,段利国.一种新型朴素贝叶斯文本分类算法[J].数据采集与处理,2014,29(1):71-75.
|
[8] |
杜选.基于加权补集的朴素贝叶斯文本分类算法研究[J].计算机应用与软件,2014,31(9):253-255.
|
[9] |
张杰,陈怀新.基于归一化词频贝叶斯模型的文本分类方法[J].计算机工程与设计,2016,37(3):799-802
|
[10] |
艾雰.2010—2016年《中国图书馆分类法》(第5版)研究现状分析[J].图书馆建设,2017(5):39-44,72.
|
[11] |
LI Q, CHEN L. Study on multi-class text classification based on improved SVM[C] //Proceedings of the Eighth International Conference on Intelligent Systems and Knowledge Engineering, Shenzhen:Springer Berlin Heidelberg,2014:519-526.
|
[12] |
ZHANG Y T, WANG GL. An improved TF-IDF approach for text classification[J].Journal of zhejiang university-science a,2005,6(1):49-55.
|
[13] |
KIM S B, RIM H C. Effective Methods for improving naive bayes text classifiers[C] //Proceedings of 7th Pacific Rim international conference on artificial intelligence. Berlin:Springer, 2002: 414-423.
|
[14] |
张玉芳,彭时名,吕佳.基于文本分类TFIDF方法的改进与应用[J].计算机工程,2006(19):76-78.
|
[15] |
苏金树,张博锋,徐昕.基于机器学习的文本分类技术研究进展[J].软件学报,2006(9):1848-1859.
|
杨亚: 数据处理,撰写并修正论文;
易远弘: 学术资源整理,提出论文的修改意见。
/
〈 |
|
〉 |