Removing Spam from Web Corpora Through Supervised Learning and Semi-manual Classification of Web Sites

Warning

This publication doesn't include Faculty of Sports Studies. It includes Faculty of Informatics. Official publication website can be found on muni.cz.

Authors	SUCHOMEL Vít
Year of publication	2020
Type	Article in Proceedings
Conference	Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020
MU Faculty or unit	Faculty of Informatics
Citation
Web	PDF ve sborníku Domovská stránka workshopu
Keywords	web corpora; web spam; supervised learning
Description	Internet spam is a major issue hindering the usefulness of web corpora. Unlike traditional text corpora collected from trustworthy sources, the content of web based corpora has to be cleaned. In this paper, two experiments of non-text removal based on supervised learning are presented. First, an improvement of corpus based language analyses of selected words achieved by a supervised classifier is shown on an English web corpus. Then, a semi-manual approach of obtaining samples of non-text web pages in Estonian is introduced. This strategy makes the supervised learning process more efficient. The result spam classifiers are tuned for high recall at the cost of precision to remove as much non-text as possible. The evaluation shows the classifiers reached the recall of 71 % and 97 % for English and Estonian web corpus, respectively. A technique for avoiding spammed web sites by measuring the distance of web pages from trustworthy sites is studied too.
Related projects:	LINDAT/CLARIAH-CZ - Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy