
IJCTE 2017 Vol.9(6): 427-432 ISSN: 1793-8201
DOI: 10.7763/IJCTE.2017.V9.1180

Automatic Web Page Categorization Using Machine Learning and Educational-Based Corpus

Patrick Dave P. Woogue, Gabriel Andrew A. Pineda, and Christian V. Maderazo
Abstract—The Internet is a powerful instrument that contains hundreds to thousands of resources. These resources need to be grouped into categories so that the contents of the Web are better organized. This research aims to build a corpus that is representative of predefined educational categories. The study experiments with seven different algorithms for categorizing web pages by educational domain. Many studies on web categorization have already been conducted, but they are based on a general set of categories. This research focuses primarily on a predefined set of categories that are closely related to educational domains. Using machine learning, the classifier analyzes what a web page is about and determines its category. The study also compares the different classifiers used. As a result, the system can assign a web page to a particular educational domain and can be used by schools to determine the categories of web pages frequently requested by students. Linear SVM was also used to build a lexicon for the different categories, and the top words for each category were then determined from this lexicon.

Index Terms—Corpus, decision trees, k-nearest neighbor, linear support vector machine, logistic regression, machine learning, multinomial naïve bayes, multilayer perceptron, natural language processing, web page categorization.
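The categorization step described in the abstract can be illustrated with a minimal sketch. It uses multinomial naive Bayes, one of the classifiers named in the index terms, written with the Python standard library only. The toy corpus, the category names, the class `MultinomialNB` and its methods are all hypothetical and chosen for illustration; they are not the authors' corpus or implementation. The `top_words` method is only a rough analogue of the per-category lexicon, which the paper derives from Linear SVM rather than from naive Bayes.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus of (page text, educational category) pairs.
# This is illustrative data, not the corpus built in the paper.
TRAIN = [
    ("algebra equation geometry theorem proof", "mathematics"),
    ("calculus derivative integral equation", "mathematics"),
    ("cell biology organism dna protein", "biology"),
    ("dna gene evolution organism species", "biology"),
    ("war empire revolution century treaty", "history"),
    ("ancient empire dynasty century king", "history"),
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class MultinomialNB:
    """Multinomial naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = sorted({w for wc in self.word_counts.values() for w in wc})
        self.totals = {c: sum(wc.values()) for c, wc in self.word_counts.items()}
        return self

    def log_likelihood(self, word, label):
        """Smoothed log P(word | label)."""
        num = self.word_counts[label][word] + self.alpha
        den = self.totals[label] + self.alpha * len(self.vocab)
        return math.log(num / den)

    def predict(self, text):
        """Return the category with the highest posterior log score."""
        n_docs = sum(self.class_counts.values())
        known = set(self.vocab)
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / n_docs)  # log prior
            for word in tokenize(text):
                if word in known:  # words never seen in training are skipped
                    score += self.log_likelihood(word, label)
            scores[label] = score
        return max(scores, key=scores.get)

    def top_words(self, label, k=3):
        """Rank words by how much more likely they are in `label`
        than in the closest competing category (ties broken alphabetically)."""
        others = [c for c in self.class_counts if c != label]

        def distinctiveness(word):
            own = self.log_likelihood(word, label)
            return own - max(self.log_likelihood(word, c) for c in others)

        return sorted(self.vocab, key=lambda w: (-distinctiveness(w), w))[:k]


docs, labels = zip(*TRAIN)
model = MultinomialNB().fit(docs, labels)
print(model.predict("dna of an organism"))
print(model.top_words("biology", k=2))
```

A real system would first extract the visible text from each web page and would be trained on a much larger labeled corpus; comparing several classifiers, as the study does, would mean swapping out the model while keeping the same tokenization and evaluation data.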

Patrick Dave P. Woogue, Gabriel Andrew A. Pineda, and Christian V. Maderazo are with the University of San Carlos, Philippines (e-mail: patwoogue@gmail.com).


Cite: Patrick Dave P. Woogue, Gabriel Andrew A. Pineda, and Christian V. Maderazo, "Automatic Web Page Categorization Using Machine Learning and Educational-Based Corpus," International Journal of Computer Theory and Engineering, vol. 9, no. 6, pp. 427-432, 2017.
