Abstract—The Internet is a powerful instrument that hosts a vast number of resources. Organizing Web content better requires sorting these resources into well-defined categories. This research aims to build a corpus that is representative of predefined educational categories, and it experiments with seven different algorithms for categorizing web pages by educational domain. Many studies on web page categorization have already been conducted, but they are based on a general set of categories; this research focuses on a predefined set of categories closely related to educational domains. Using machine learning, a classifier analyzes the content of a web page and determines its category. The study also compares the different classifiers used. As a result, the system can assign a web page to a particular educational domain and can be used by schools to determine the categories of web pages frequently requested by students. The linear SVM was also used to build a lexicon for the different categories, from which the top words for each category were determined.
Index Terms—Corpus, decision trees, k-nearest neighbor, linear support vector machine, logistic regression, machine learning, multinomial naïve bayes, multilayer perceptron, natural language processing, web page categorization.
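As an illustration of the kind of categorization the abstract describes, the following is a minimal sketch of one of the listed classifiers, multinomial naïve Bayes, written in plain Python. It is not the authors' implementation: the category names ("mathematics", "biology"), the toy training pages, and the helper names `tokenize`, `MultinomialNaiveBayes`, and `top_words` are all assumptions made for this example. The `top_words` helper ranks words by raw frequency only, which is a simple stand-in and not the linear SVM lexicon reported in the paper.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class MultinomialNaiveBayes:
    """Bag-of-words multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(documents, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        self.total_docs = len(labels)
        return self

    def _log_score(self, tokens, label):
        counts = self.word_counts[label]
        denominator = sum(counts.values()) + len(self.vocabulary)
        # Start from the log prior of the category.
        score = math.log(self.class_doc_counts[label] / self.total_docs)
        for token in tokens:
            if token in self.vocabulary:  # skip words never seen in training
                score += math.log((counts[token] + 1) / denominator)
        return score

    def predict(self, text):
        tokens = tokenize(text)
        return max(self.class_doc_counts,
                   key=lambda label: self._log_score(tokens, label))

    def top_words(self, label, n=3):
        """Most frequent training words for a category (frequency only)."""
        return [word for word, _ in self.word_counts[label].most_common(n)]


# Hypothetical toy corpus: (page text, educational category).
train_pages = [
    ("algebra equation geometry theorem proof", "mathematics"),
    ("calculus derivative integral equation", "mathematics"),
    ("cell organism genetics evolution species", "biology"),
    ("plant cell photosynthesis organism", "biology"),
]
texts, categories = zip(*train_pages)
model = MultinomialNaiveBayes().fit(texts, categories)

print(model.predict("solve the equation using a derivative"))
print(model.predict("genetics of a plant cell"))
print(model.top_words("mathematics", 1))
```

A real system would train on text extracted from crawled web pages and would likely rely on an established machine learning library; the sketch only shows the shape of the train-then-predict workflow.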
Patrick Dave P. Woogue, Gabriel Andrew A. Pineda, and Christian V. Maderazo are with the University of San Carlos, Philippines (e-mail: patwoogue@gmail.com).
Cite: Patrick Dave P. Woogue, Gabriel Andrew A. Pineda, and Christian V. Maderazo, "Automatic Web Page Categorization Using Machine Learning and Educational-Based Corpus," International Journal of Computer Theory and Engineering, vol. 9, no. 6, pp. 427-432, 2017.