Automated Exploring News related to Crisis based on Hierarchical Classification Analysis

Jiunn-Liang Guo

ABSTRACT

  Automatic text classification is a critical task in organizing large-scale textual datasets, particularly for exploring crisis-related events in news media. Recent research tends to adopt non-hierarchical, or topologically “flat”, classification techniques for organizing unstructured corpora. However, a “flat” scheme is not sufficient to manage large volumes of text, and most methods neglect the hierarchical relations among classes. In this paper, we propose a hierarchical scheme to investigate three well-known classification methods, Naïve Bayes, k-Nearest Neighbor, and SVM, by exploiting a competition-based approach over a top-down tree structure along with various weighting techniques. As a result, we can identify the most suitable classifier at each level of the hierarchy while accounting for the differing discriminatory power of diverse compositions of bibliographic information. The corpus contains 6,271 of the 804,414 news stories in a standard text-mining dataset, Reuters Corpus Volume 1 version 2 (RCV1-v2). The results show that the strategy enables the best classifier to be selected at each level and yields promising classification performance. Moreover, we find that SVM still outperforms its competitors in most contexts within the hierarchical classification structure.

KEYWORDS: Hierarchical Text Classification; Feature Selection; SVM; KNN; Naïve Bayes
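
The competition-based, top-down selection described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that "competition" means choosing, at each internal node of the class tree, the candidate with the highest accuracy on held-out documents. For brevity it uses only two simplified stand-ins (a multinomial Naïve Bayes and a 1-nearest-neighbor on token overlap) and omits SVM and the weighting techniques. All class names, function names, and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict
import math


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())

        def log_score(label):
            counts = self.word_counts[label]
            total = sum(counts.values()) + len(self.vocab)
            prior = math.log(self.class_counts[label] / n_docs)
            return prior + sum(math.log((counts[w] + 1) / total) for w in doc.split())

        return max(sorted(self.class_counts), key=log_score)


class NearestNeighbor:
    """1-nearest-neighbor using Jaccard similarity between token sets."""

    def fit(self, docs, labels):
        self.examples = [(set(d.split()), l) for d, l in zip(docs, labels)]
        return self

    def predict(self, doc):
        tokens = set(doc.split())

        def similarity(example):
            union = tokens | example[0]
            return len(tokens & example[0]) / len(union) if union else 0.0

        return max(self.examples, key=similarity)[1]


def accuracy(model, docs, labels):
    return sum(model.predict(d) == l for d, l in zip(docs, labels)) / len(docs)


def select_best(candidates, train, valid):
    """Fit every candidate; keep the one with the highest validation accuracy."""
    scored = [(name, factory().fit(*train)) for name, factory in candidates.items()]
    return max(scored, key=lambda pair: accuracy(pair[1], *valid))


def train_hierarchy(candidates, train, valid, depth=0):
    """Pick one classifier per internal node; labels are tuple paths."""
    (docs, paths), (vdocs, vpaths) = train, valid
    heads = [p[depth] for p in paths]
    vheads = [p[depth] for p in vpaths]
    name, model = select_best(candidates, (docs, heads), (vdocs, vheads))
    node = {"winner": name, "model": model, "children": {}}
    for head in set(heads):
        def deeper(ds, ps):
            return [(d, p) for d, p in zip(ds, ps)
                    if p[depth] == head and len(p) > depth + 1]
        sub, vsub = deeper(docs, paths), deeper(vdocs, vpaths)
        if sub and vsub:  # recurse only where both splits have documents
            node["children"][head] = train_hierarchy(
                candidates, tuple(zip(*sub)), tuple(zip(*vsub)), depth + 1)
    return node


def predict_path(node, doc):
    """Route a document from the root downward, one decision per level."""
    path = []
    while node is not None:
        label = node["model"].predict(doc)
        path.append(label)
        node = node["children"].get(label)
    return tuple(path)


# Toy two-level data, invented purely for illustration (not from RCV1-v2).
train = (
    ["flood river water rising", "flood rain water damage",
     "earthquake tremor building collapse", "earthquake magnitude tremor damage",
     "bank default credit market", "bank credit loan crisis",
     "strike workers wage union", "workers union strike factory"],
    [("disaster", "flood"), ("disaster", "flood"),
     ("disaster", "earthquake"), ("disaster", "earthquake"),
     ("economy", "finance"), ("economy", "finance"),
     ("economy", "labour"), ("economy", "labour")],
)
valid = (
    ["river flood water", "tremor earthquake collapse",
     "credit bank loan", "union strike wage"],
    [("disaster", "flood"), ("disaster", "earthquake"),
     ("economy", "finance"), ("economy", "labour")],
)
candidates = {"naive_bayes": NaiveBayes, "nearest_neighbor": NearestNeighbor}
tree = train_hierarchy(candidates, train, valid)
```

Calling `predict_path(tree, "flood water rising river")` routes the document through the root classifier and then through the classifier chosen for the predicted branch. Each node's `"winner"` entry records which candidate was selected there, so different levels may end up with different classifiers.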
