[1] XU Feng, ZHANG Xuefen, XIN Zhanhong. A Chinese word segmentation scheme based on a deep neural network model[J]. Journal of Harbin Engineering University, 2019, 40(09):1662-1666. [doi:10.11990/jheu.201812073]

A Chinese word segmentation scheme based on a deep neural network model (基于深度神经网络模型的中文分词方案)
Journal of Harbin Engineering University [ISSN:1006-6977/CN:61-1281/TN]

Volume:
40
Issue:
No. 09, 2019
Pages:
1662-1666
Publication date:
2019-09-05

Article Info

Title:
A Chinese word segmentation scheme based on a deep neural network model
Author(s):
XU Feng (许峰)1, ZHANG Xuefen (张雪芬)2, XIN Zhanhong (忻展红)1
1. School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing 100876, China;
2. Smart City College, Beijing Union University, Beijing 100101, China
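Keywords:
Chinese word segmentation; long short-term memory network; encoder-decoder model; word vector; accuracy; F-value
CLC number:
TN911.22
DOI:
10.11990/jheu.201812073
Document code:
A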
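Abstract:
Existing word segmentation algorithms and programs lose performance when segmenting massive volumes of web text. To address this, the paper proposes a Chinese word segmentation scheme based on a deep neural network model. The scheme trains an encoder-decoder model built on long short-term memory (LSTM) networks and uses the trained model to segment text. To further improve segmentation performance, a correction method based on word vectors is proposed to revise the segmentation output of this model. Experiments on a typical microblog corpus show that the proposed model-based segmentation performs considerably better than traditional word segmentation software. After the proposed word-vector correction is applied, accuracy and F-value are slightly better than without correction, which supports the effectiveness of the proposed scheme.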
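The abstract reports results in terms of accuracy and F-value but does not state how they are computed. As a rough illustration only, the sketch below uses the common word-level convention for segmentation evaluation: a predicted word counts as correct when its character span matches a gold-standard word exactly. The function names `word_spans` and `segmentation_prf` and the example sentence are assumptions made for this sketch and are not taken from the paper.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_words, pred_words):
    """Return word-level (precision, recall, F1) for one segmented sentence.

    Assumption: both segmentations cover the same underlying character string.
    """
    gold = word_spans(gold_words)
    pred = word_spans(pred_words)
    correct = len(gold & pred)  # words whose boundaries match exactly
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the prediction merges two gold words into one.
gold = ["我", "喜欢", "自然", "语言", "处理"]
pred = ["我", "喜欢", "自然语言", "处理"]
precision, recall, f1 = segmentation_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this example three of the four predicted words match gold spans, so precision is 3/4 and recall is 3/5. Whether the paper's "accuracy" corresponds to this word-level precision is not stated on this page.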

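Memo

Received: 2018-12-22.
Funding: National Natural Science Foundation of China (61672178).
Author biographies: XU Feng, male, PhD candidate; ZHANG Xuefen, female, associate professor; XIN Zhanhong, male, professor, doctoral supervisor.
Corresponding author: ZHANG Xuefen, E-mail: zhangxuefen@buu.edu.cn.
Last Update: 2019-09-06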