INfORmER: Identifying local News articles related to an Organisational Entity and its Remit

Nowadays, the general public can access a vast amount of information [4] and articles on the internet, which can lead to an ‘information overload’ problem [1]. Several organisations must go through all local online newspapers on a daily basis to check whether any articles are relevant to them in some way. This is a very time-consuming and repetitive job [2] that, when performed manually, is prone to human error and intra-individual variability. For this reason, there is an ever-growing need for a reliable and efficient article recommender system that takes over the tedious job of going through local news articles and automatically selects the ones that are relevant to the organisation.

In this research, we investigate similar article recommender systems and develop a classification-based system that helps users select articles from local newspapers without having to read numerous and possibly lengthy articles. The result is INfORmER, a system that uses a wrapper induction algorithm to scrape local newspapers. Using several pre-processing techniques, such as random oversampling [3] and word weighting (empowerment), combined with a hybrid ensemble classifier, the system evaluates which articles to recommend. Multiple classifiers, such as KNN and SVM, were tested with numerous pre-processing techniques, such as stemming and stop word removal, and these techniques were combined to create 6 different pre-processing sets. Around 17,000 tests were performed to find which combination of classifiers and pre-processing sets gave the best results. The use of multiple classifiers in a system was also evaluated: experiments were run with different numbers of classifiers to find the optimal number of classifiers and their best combination. INfORmER also provides the option to automatically send an email with the articles it deems relevant. The system employs a hybrid ensemble classifier technique which uses an ensemble classifier with union voting and another ensemble with majority voting and cosine similarity techniques. These recommendation candidates are then combined through a majority voting classifier.
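The voting scheme described above can be illustrated with a minimal sketch. It is not the INfORmER implementation: it assumes one possible reading of the design, in which a union-voting ensemble, a majority-voting ensemble and a cosine similarity check each produce one recommendation candidate, and a final majority vote combines the three. The function names, the boolean vote representation and the `cosine_threshold` value are all hypothetical.

```python
from typing import Sequence


def union_vote(votes: Sequence[bool]) -> bool:
    """Union voting: recommend if at least one classifier votes 'relevant'."""
    return any(votes)


def majority_vote(votes: Sequence[bool]) -> bool:
    """Majority voting: recommend if strictly more than half vote 'relevant'."""
    return sum(votes) * 2 > len(votes)


def hybrid_recommend(
    union_votes: Sequence[bool],
    majority_votes: Sequence[bool],
    cosine_score: float,
    cosine_threshold: float = 0.5,  # hypothetical cut-off, not from the source
) -> bool:
    """Combine three recommendation candidates with a final majority vote.

    Assumed structure: one candidate per component (union ensemble,
    majority ensemble, cosine similarity check).
    """
    candidates = [
        union_vote(union_votes),
        majority_vote(majority_votes),
        cosine_score >= cosine_threshold,
    ]
    return majority_vote(candidates)


if __name__ == "__main__":
    # Union ensemble says yes, majority ensemble says no, cosine check says yes.
    print(hybrid_recommend([False, True], [True, False, False], 0.7))
```

Under this reading, union voting favours recall (a single positive vote is enough), while the final majority vote requires agreement from at least two of the three components before an article is recommended.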

A daily averaged dataset was compiled for the final classifier evaluation. The articles from this dataset were given to a human classifier day by day to better understand how the proposed system would behave in a real-world scenario. INfORmER achieved an F1-score of 59.65% on the daily averaged dataset. The best result obtained with the traditional cosine similarity technique was 38.88%, meaning that INfORmER improved the F1-score by roughly 20 percentage points over the cosine similarity technique. Finally, a user study was carried out in which the human annotator was given a set of articles that she had deemed irrelevant but that the hybrid ensemble classifier had deemed relevant. From this user study, it was concluded that the hybrid ensemble classifier not only recommended the majority of the relevant articles, but also recommended articles which the human annotator had initially marked as irrelevant but which turned out to be relevant.
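As a rough illustration of how such a comparison can be computed, the sketch below calculates an F1-score per day and averages the results. It assumes that "daily averaged" means the per-day scores are averaged; the helper names and the example counts are hypothetical and are not taken from the study. Only the two reported scores, 59.65% and 38.88%, come from the text.

```python
from typing import Iterable, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # no correct recommendations, so precision and recall are 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def daily_averaged_f1(daily_counts: Iterable[Tuple[int, int, int]]) -> float:
    """Average of per-day F1-scores (assumed meaning of 'daily averaged')."""
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in daily_counts]
    return sum(scores) / len(scores)


# Gap between the two reported scores, in percentage points.
INFORMER_F1 = 59.65
COSINE_F1 = 38.88
gap_points = round(INFORMER_F1 - COSINE_F1, 2)

if __name__ == "__main__":
    # Hypothetical (tp, fp, fn) counts for two days, for illustration only.
    print(daily_averaged_f1([(3, 1, 1), (0, 2, 2)]))
    print(gap_points)
```

The reported gap is an absolute difference between two F1-scores, which is why it is best read in percentage points rather than as a relative improvement.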

Figure 1. High level design of INfORmER.

References

[1] P. R. Y. Jiang, H. Zhan, and Q. Zhuang. Application research on personalized recommendation in distance education. In 2010 International Conference on Computer Application and System Modelling (ICCASM 2010), volume 13, pages V13–357–V13–360, Oct 2010.

[2] Y. Jiang, Q. Shen, J. Fan, and X. Zhang. The classification for e-government document based on SVM. In 2010 International Conference on Web Information Systems and Mining, volume 2, pages 257–260, Oct 2010.

[3] A. Moreo, A. Esuli, and F. Sebastiani. Distributional random oversampling for imbalanced text classification. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’16, pages 805–808, New York, NY, USA, 2016. ACM.

[4] G. Song, S. Sun, and W. Fan. Applying user interest on item-based recommender system. In 2012 Fifth International Joint Conference on Computational Sciences and Optimization, pages 635–638, June 2012.

Student: Dylan Agius
Supervisor: Dr Joel Azzopardi
Co-Supervisor: Mr Gavril Flores
Course: B.Sc. IT (Hons.) Artificial Intelligence