Website Properties in Relation to the Quality of Text Extracted for Web Corpora

Authors

SUCHOMEL Vít, KRAUS Jan

Year of publication 2021
Type Article in Proceedings
Conference Recent Advances in Slavonic Natural Language Processing (RASLAN 2021)
MU Faculty or unit

Faculty of Informatics

Keywords Web crawling; Web spam; Text corpus; Text processing
Description In this paper we examine how two properties of a website relate to the quality of the text extracted from it when crawling the web to build large web corpora. A manual classification of text quality for 18 thousand websites in 21 European languages was used to test our assumption that certain web domain properties can identify potential sources of poor-quality content. The first property is the distance of a web domain from the seed domains in a web crawl. The second is the length of the website name. Although our previous work recommended these properties as a way to identify good-quality websites, here we show that text quality differs only slightly among text-rich web domains with different seed distances or name lengths. This conclusion holds for post-crawling text processing when the web crawl starts from a large number of seed domains.
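
As an illustration of the two properties named in the description, the following is a minimal sketch, not the authors' implementation. It assumes that seed distance means the smallest number of link hops from any seed domain in a domain-level link graph, and that name length means the character count of the host name; the record states neither definition. The function names and the example domains are hypothetical.

```python
from collections import deque


def seed_distances(links, seeds):
    """Multi-source breadth-first search over a domain-level link graph.

    links: dict mapping a domain to the domains it links to (assumed input).
    seeds: iterable of seed domains, each at distance 0.
    Returns a dict mapping every reachable domain to its minimum hop count
    from any seed.
    """
    dist = {seed: 0 for seed in seeds}
    queue = deque(dist)  # iterating the dict drops duplicate seeds
    while queue:
        domain = queue.popleft()
        for target in links.get(domain, ()):
            if target not in dist:
                dist[target] = dist[domain] + 1
                queue.append(target)
    return dist


def name_length(domain):
    """Length of the website name, assumed here to be the host name's character count."""
    return len(domain)


if __name__ == "__main__":
    # Hypothetical toy graph, for illustration only.
    links = {
        "seed.example": ["a.example", "b.example"],
        "a.example": ["c.example"],
        "b.example": ["c.example", "seed.example"],
        "c.example": ["d.example"],
    }
    for domain, hops in seed_distances(links, ["seed.example"]).items():
        print(domain, hops, name_length(domain))
```

With both values computed per domain, the manually assigned text quality labels could then be grouped by seed distance or by name length to compare the share of good and bad sources in each group.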
