IJCTE 2016 Vol.8(1): 32-35 ISSN: 1793-8201
DOI: 10.7763/IJCTE.2016.V8.1015

Automatic Keyword Extraction for Wikification of East Asian Language Documents

Kensuke Horita, Fuminori Kimura, and Akira Maeda
Abstract—In recent years, research on Wikification, which aims to promote the effective reuse of Wikipedia resources and the understanding of document contents, has been attracting much attention. Wikification is a method that automatically extracts keywords from a document and links each of them to an appropriate Wikipedia article. It consists of two processes: first, keywords are extracted from a document; second, the appropriate Wikipedia article is identified for each keyword. In this paper, we focus on the extraction of keywords from a document for Wikification. Research on Wikification has been conducted for documents in a variety of languages. We focus on East Asian language documents and experiment with Japanese documents. In addition, we plan to perform Wikification not only within the same language but also across languages (e.g., keywords in Japanese documents are linked to appropriate English Wikipedia articles).
Our proposed method consists of two steps. First, we extract nouns from a document using a morphological analysis tool and extract candidate keywords by a method called Top Consecutive Nouns Cohesion (TCNC). TCNC connects consecutive nouns and treats them as one compound word. Second, we rank the extracted candidate keywords using one of two measures of keyword importance: the Dice coefficient or Keyphraseness.
In our experiments on extracting appropriate keywords for Wikification of Japanese documents, our proposed method, especially the combination of TCNC and Keyphraseness, achieved the best results.
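
The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows the consecutive-noun merging that the abstract attributes to TCNC, followed by a ranking by Keyphraseness in its usual definition (the number of Wikipedia articles in which a term appears as link text, divided by the number of articles in which it appears at all). The token list, the `NOUN` tag, the function names, and the count dictionaries are all hypothetical; a real pipeline would obtain tokens from a morphological analyzer and counts from a Wikipedia dump. The Dice coefficient variant is not shown because the abstract does not say how it is applied.

```python
from typing import Dict, List, Tuple

# (surface form, part-of-speech tag); the tag names are an assumption,
# a real morphological analyzer would supply its own tag set.
Token = Tuple[str, str]


def merge_consecutive_nouns(tokens: List[Token], noun_tag: str = "NOUN") -> List[str]:
    """Join each run of adjacent nouns into one compound candidate keyword."""
    candidates: List[str] = []
    run: List[str] = []
    for surface, pos in tokens:
        if pos == noun_tag:
            run.append(surface)
        elif run:
            # Japanese is written without spaces, so the run is joined directly.
            candidates.append("".join(run))
            run = []
    if run:
        candidates.append("".join(run))
    return candidates


def keyphraseness(term: str, link_docs: Dict[str, int], all_docs: Dict[str, int]) -> float:
    """Articles where the term is link text / articles where the term appears."""
    total = all_docs.get(term, 0)
    if total == 0:
        return 0.0
    return link_docs.get(term, 0) / total


def rank_candidates(
    candidates: List[str], link_docs: Dict[str, int], all_docs: Dict[str, int]
) -> List[str]:
    """Deduplicate candidates and sort them by descending Keyphraseness."""
    unique = list(dict.fromkeys(candidates))
    return sorted(
        unique,
        key=lambda term: keyphraseness(term, link_docs, all_docs),
        reverse=True,
    )


# Hypothetical analyzer output for a short Japanese phrase.
tokens = [
    ("自然", "NOUN"), ("言語", "NOUN"), ("処理", "NOUN"),
    ("を", "PART"), ("研究", "NOUN"), ("する", "VERB"),
]
# Hypothetical Wikipedia counts, made up for illustration only.
all_docs = {"自然言語処理": 10, "研究": 100}
link_docs = {"自然言語処理": 8, "研究": 5}

candidates = merge_consecutive_nouns(tokens)
ranked = rank_candidates(candidates, link_docs, all_docs)
print(candidates)  # ['自然言語処理', '研究']
print(ranked)      # ['自然言語処理', '研究']
```

In this toy input the three adjacent nouns become one compound candidate, which is the behavior the abstract describes for TCNC, and the compound ranks first because it is used as link text in a larger share of the articles that contain it.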

Index Terms—Wikipedia, wikification, keyword extraction, compound word.

K. Horita is with the Graduate School of Information Science and Engineering, Ritsumeikan University, Shiga, Japan (e-mail: is0038ep@ed.ritsumei.ac.jp).
F. Kimura is with Kinugasa Research Organization, Ritsumeikan University, Kyoto, Japan (e-mail: fkimura@is.ritsumei.ac.jp).
A. Maeda is with the College of Information Science and Engineering, Ritsumeikan University, Shiga, Japan (e-mail: amaeda@is.ritsumei.ac.jp).


Cite: Kensuke Horita, Fuminori Kimura, and Akira Maeda, "Automatic Keyword Extraction for Wikification of East Asian Language Documents," International Journal of Computer Theory and Engineering, vol. 8, no. 1, pp. 32-35, 2016.
