IJCTE 2017 Vol.9(6): 427-432 ISSN: 1793-8201
DOI: 10.7763/IJCTE.2017.V9.1180

Automatic Web Page Categorization Using Machine Learning and Educational-Based Corpus

Patrick Dave P. Woogue, Gabriel Andrew A. Pineda, and Christian V. Maderazo
Abstract—The Internet is a powerful instrument that contains hundreds to thousands of resources. These resources need to be categorized so that the contents of the Web can be better organized. This research aims to build a corpus that is representative of predefined educational categories, and it experiments with seven different algorithms for categorizing web pages by educational domain. Many studies on web page categorization have already been conducted, but they are based on a general set of categories; this research focuses primarily on a predefined set of categories that are closely related to educational domains. Using machine learning, the classifier analyzes what a web page is about and determines its category. The study also compares the different classifiers used. As a result, the system can assign a web page to a particular educational domain, and schools can use it to determine the categories of web pages frequently requested by students. The linear SVM classifier was also able to build a lexicon for the different categories, and the top words for each category were then determined using this lexicon.

Index Terms—Corpus, decision trees, k-nearest neighbor, linear support vector machine, logistic regression, machine learning, multinomial naïve Bayes, multilayer perceptron, natural language processing, web page categorization.
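
To make the categorization task concrete, the following is a minimal sketch of one of the listed algorithms, multinomial naïve Bayes, written from scratch with add-one smoothing. It is not the authors' implementation: the toy corpus, the three category names, and the function names (tokenize, train, predict) are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus of (page text, educational category) pairs.
# These samples and category names are assumptions, not data from the paper.
TRAIN = [
    ("algebra equation solve variable polynomial", "mathematics"),
    ("geometry triangle angle theorem proof", "mathematics"),
    ("cell biology organism dna protein", "science"),
    ("physics force energy motion experiment", "science"),
    ("novel poem author literature essay", "language"),
    ("grammar verb noun sentence essay", "language"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(samples):
    """Count documents per category and word occurrences per category."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        doc_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the category with the highest log-posterior for the text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: fraction of training pages in this category.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothed log-likelihood of the word.
            count = word_counts[label][tok] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict(model, "Solve the equation for the variable"))  # prints: mathematics
```

In a realistic setting the text would be extracted from fetched web pages and the corpus would be far larger; the sketch only shows how word counts per category drive the assignment of a page to an educational domain.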

Patrick Dave P. Woogue, Gabriel Andrew A. Pineda, and Christian V. Maderazo are with the University of San Carlos, Philippines (e-mail: patwoogue@gmail.com).


Cite: Patrick Dave P. Woogue, Gabriel Andrew A. Pineda, and Christian V. Maderazo, "Automatic Web Page Categorization Using Machine Learning and Educational-Based Corpus," International Journal of Computer Theory and Engineering, vol. 9, no. 6, pp. 427-432, 2017.
