IJCTE 2016 Vol.8(1): 32-35 ISSN: 1793-8201
DOI: 10.7763/IJCTE.2016.V8.1015

Automatic Keyword Extraction for Wikification of East Asian Language Documents

Kensuke Horita, Fuminori Kimura, and Akira Maeda
Abstract—In recent years, research on Wikification, which aims to promote the effective reuse of Wikipedia resources and the understanding of document contents, has attracted much attention. Wikification is a method that automatically extracts keywords from a document and links each of them to an appropriate Wikipedia article. It consists of two processes: first, keywords are extracted from a document; second, the appropriate Wikipedia article is identified for each keyword. In this paper, we focus on the first process, the extraction of keywords from a document for Wikification. Research on Wikification has been conducted for documents in a variety of languages. We focus on East Asian language documents and experiment with Japanese documents. In addition, we plan to perform Wikification not only within a single language but also across languages (e.g., keywords in Japanese documents are linked to appropriate English Wikipedia articles).
Our proposed method consists of two steps. First, we extract nouns from a document using a morphological analysis tool and extract candidate keywords by a method called Top Consecutive Nouns Cohesion (TCNC). TCNC connects consecutive nouns and treats them as one compound word. Second, we rank the extracted candidate keywords using one of two measures of keyword importance: the Dice coefficient or Keyphraseness.
In our experiments on extracting appropriate keywords for Wikification of Japanese documents, our proposed method, especially the combination of TCNC and Keyphraseness, achieved the best results.
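
The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the document has already been tokenized and part-of-speech tagged by some morphological analyzer, that nouns carry the hypothetical tag "NOUN", and that Keyphraseness follows its commonly used definition: the number of Wikipedia articles in which a term appears as link text divided by the number of articles in which it appears at all. All function names and counts are made up for illustration. The Dice coefficient is left out because the abstract does not say which quantities it compares.

```python
def merge_consecutive_nouns(tagged_tokens):
    """Join each run of consecutive nouns into one compound candidate.

    tagged_tokens: list of (surface, pos) pairs; "NOUN" is an assumed tag name.
    Surfaces are concatenated without spaces, as in Japanese text.
    """
    candidates, run = [], []
    for surface, pos in tagged_tokens:
        if pos == "NOUN":
            run.append(surface)
        elif run:
            candidates.append("".join(run))
            run = []
    if run:  # flush a noun run that ends the document
        candidates.append("".join(run))
    return candidates


def keyphraseness(term, linked_doc_count, doc_count):
    """Share of articles containing `term` in which it is used as link text."""
    total = doc_count.get(term, 0)
    return linked_doc_count.get(term, 0) / total if total else 0.0


def rank_candidates(candidates, linked_doc_count, doc_count):
    """Deduplicate candidates and sort them by descending Keyphraseness."""
    unique = list(dict.fromkeys(candidates))
    return sorted(
        unique,
        key=lambda t: keyphraseness(t, linked_doc_count, doc_count),
        reverse=True,
    )


# Hypothetical tagged input and hypothetical Wikipedia counts.
tokens = [("自然", "NOUN"), ("言語", "NOUN"), ("処理", "NOUN"),
          ("の", "PART"), ("研究", "NOUN")]
doc_count = {"自然言語処理": 100, "研究": 1000}
linked_doc_count = {"自然言語処理": 60, "研究": 50}

candidates = merge_consecutive_nouns(tokens)
print(rank_candidates(candidates, linked_doc_count, doc_count))
```

In this toy input the three leading nouns are merged into a single compound candidate, which then outranks the common single noun because it is used as link text in a larger share of the articles that contain it.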

Index Terms—Wikipedia, wikification, keyword extraction, compound word.

K. Horita is with the Graduate School of Information Science and Engineering, Ritsumeikan University, Shiga, Japan (e-mail: is0038ep@ed.ritsumei.ac.jp).
F. Kimura is with Kinugasa Research Organization, Ritsumeikan University, Kyoto, Japan (e-mail: fkimura@is.ritsumei.ac.jp).
A. Maeda is with the College of Information Science and Engineering, Ritsumeikan University, Shiga, Japan (e-mail: amaeda@is.ritsumei.ac.jp).


Cite: Kensuke Horita, Fuminori Kimura, and Akira Maeda, "Automatic Keyword Extraction for Wikification of East Asian Language Documents," International Journal of Computer Theory and Engineering, vol. 8, no. 1, pp. 32-35, 2016.
