chared: Character Encoding Detection with a Known Language
Authors | |
---|---|
Year of publication | 2011 |
Type | Article in Proceedings |
Conference | RASLAN 2011 |
MU Faculty or unit | |
Citation | |
Web | https://nlp.fi.muni.cz/raslan/2011/paper16.pdf |
Field | Informatics |
Keywords | character encoding; character encoding detection; charset; Unicode |
Description | chared is a system that detects the character encoding of a text document, provided the language of the document is known. The system supports a wide range of languages and the most commonly used character encodings. We explain the details of the algorithm, describe the process of creating models for various languages, and present the results of an evaluation on a collection of Web pages. |