[1] LIU Li, XIAO Yingyuan. A statistical domain terminology extraction method based on word length and grammatical feature[J]. Journal of Harbin Engineering University, 2017, 38(09): 1437-1443. [doi:10.11990/jheu.201605037]
A statistical domain terminology extraction method based on word length and grammatical feature
Journal of Harbin Engineering University [ISSN:1006-6977/CN:61-1281/TN]

Volume:
38
Issue:
2017, No. 09
Pages:
1437-1443
Section:
Publication date:
2017-09-25

Article Info

Title:
A statistical domain terminology extraction method based on word length and grammatical feature
Author(s):
LIU Li (1, 2), XIAO Yingyuan (1, 2)
1. Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology, Tianjin 300384, China;
2. Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin University of Technology, Tianjin 300384, China
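Keywords:
natural language processing; term extraction; support vector machine; term length; grammatical feature; word length ratio; domain relevance; domain consistency
CLC number: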
TP181
DOI:
10.11990/jheu.201605037
Document code:
A
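Abstract:
To address the problem that domain terms containing many characters are incorrectly segmented during domain term extraction, this paper proposes a statistical domain term extraction method based on term length and grammatical features. When candidate terms are extracted by machine learning, the method adds constraint rules based on term length and grammatical features. When statistical methods are used to determine the domain specificity of candidate terms, it takes full account of the word length ratio and uses it as an important weight for judging a term's domain specificity. Experiments show that the proposed method correctly extracts domain terms containing many characters, and that the precision and recall of the extraction results improve over previous methods.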
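
The two stages named in the abstract (constraint rules on candidates, then a length-weighted domain score) can be illustrated with a minimal sketch. This is not the paper's formulation: the abstract gives no formulas, so the rule in `passes_constraints`, the relevance ratio, the definition of `length_ratio`, the weight `alpha`, the part-of-speech tags, and all counts below are assumptions made up for illustration.

```python
from collections import Counter

def passes_constraints(term, pos_tags, max_len=12):
    """Hypothetical rule: 2..max_len characters, ends in a noun ("n"),
    and contains no particle ("u"). The real rules are not given in the abstract."""
    return 2 <= len(term) <= max_len and pos_tags[-1] == "n" and "u" not in pos_tags

def score_terms(candidates, domain_counts, general_counts, alpha=1.0):
    """Score candidates by an assumed domain relevance weighted by length ratio."""
    kept = [term for term, tags in candidates if passes_constraints(term, tags)]
    if not kept:
        return {}
    longest = max(len(term) for term in kept)
    scores = {}
    for term in kept:
        d, g = domain_counts[term], general_counts[term]
        # Assumed relevance: share of occurrences that fall in the domain corpus.
        relevance = d / (d + g) if d + g else 0.0
        # Assumed length ratio: term length relative to the longest kept candidate.
        length_ratio = len(term) / longest
        scores[term] = relevance * (1 + alpha * length_ratio)
    return scores

# Illustrative data only: (term, POS tags of its segments) and corpus counts.
candidates = [
    ("支持向量机", ["v", "n", "n"]),  # "support vector machine", 5 characters
    ("向量", ["n"]),                  # "vector", 2 characters
    ("的方法", ["u", "n"]),           # starts with a particle, so it is filtered out
]
domain_counts = Counter({"支持向量机": 8, "向量": 10})
general_counts = Counter({"支持向量机": 2, "向量": 10})

scores = score_terms(candidates, domain_counts, general_counts)
```

With these made-up counts the longer term ranks above its shorter substring, which is the behavior the abstract attributes to weighting by word length ratio.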

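Memo

Received: 2016-05-12.
Funding: National Natural Science Foundation of China (71501141, 61301140); Tianjin Science and Technology Commissioner Project (15JCTPJC63800).
About the authors: LIU Li (born 1983), male, lecturer, Ph.D.; XIAO Yingyuan (born 1969), male, professor, doctoral supervisor.
Corresponding author: LIU Li, E-mail: llwork@yeah.net.
Last Update: 2017-10-17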