Finding Terms in Corpora for Many Languages with the Sketch Engine
Authors | |
---|---|
Year of publication | 2014 |
Type | Article in Proceedings |
Conference | Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics |
MU Faculty or unit | |
Citation | |
www | Full text of the result |
Field | Informatics |
Keywords | terminology; terms; corpora; sketch engine |
Description | Term candidates for a domain, in a language, can be found by • taking a corpus for the domain, and a reference corpus for the language • identifying the grammatical shape of a term in the language • tokenising, lemmatising and POS-tagging both corpora • identifying (and counting) the items in each corpus which match the grammatical shape • for each item in the domain corpus, comparing its frequency with its frequency in the reference corpus. Then, the items with the highest frequency in the domain corpus in comparison to the reference corpus will be the top term candidates. None of the steps above are unusual or innovative for NLP (see, e.g., Aker et al., 2013; Gojun et al., 2012). However, it is far from trivial to implement them all, for numerous languages, in an environment that makes it easy for non-programmers to find the terms in a domain. This is what we have done in the Sketch Engine (Kilgarriff et al., 2004), and will demonstrate. In this abstract we describe how we addressed each of the stages above. |
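
The steps listed in the description can be illustrated with a small Python sketch. It is not the Sketch Engine implementation. It assumes the corpora are already tokenised, lemmatised and POS-tagged into `(lemma, tag)` pairs. It uses a hypothetical one-letter tagset (`A` adjective, `N` noun, anything else ignored) and a hypothetical term shape of optional adjectives followed by one or more nouns. The frequency comparison is a smoothed ratio of per-million frequencies; both the ratio and the `smoothing` constant are assumptions of this sketch, since the abstract only says that the two frequencies are compared.

```python
import re
from collections import Counter

# Hypothetical term shape over one-letter tags: optional adjectives, then nouns.
TERM_SHAPE = re.compile(r"A*N+")


def count_candidates(tagged, max_len=3):
    """Count lemma n-grams (up to max_len tokens) whose tags match TERM_SHAPE."""
    counts = Counter()
    for start in range(len(tagged)):
        for length in range(1, max_len + 1):
            window = tagged[start:start + length]
            if len(window) < length:
                break  # ran past the end of the corpus
            tags = "".join(tag for _, tag in window)
            if TERM_SHAPE.fullmatch(tags):
                counts[" ".join(lemma for lemma, _ in window)] += 1
    return counts


def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size


def rank_terms(domain, reference, smoothing=1.0):
    """Rank domain-corpus candidates by smoothed frequency ratio (assumed score)."""
    dom_counts = count_candidates(domain)
    ref_counts = count_candidates(reference)
    scores = {}
    for item, freq in dom_counts.items():
        dom_fpm = per_million(freq, len(domain))
        # Counter returns 0 for items absent from the reference corpus.
        ref_fpm = per_million(ref_counts[item], len(reference))
        scores[item] = (dom_fpm + smoothing) / (ref_fpm + smoothing)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpora, invented purely for illustration.
domain = [("neural", "A"), ("network", "N"), ("learn", "V"),
          ("the", "X"), ("neural", "A"), ("network", "N")]
reference = [("the", "X"), ("network", "N"), ("be", "V"),
             ("big", "A"), ("house", "N"), ("the", "X")]

ranked = rank_terms(domain, reference)
for term, score in ranked:
    print(f"{term}\t{score:.2f}")
```

In this toy data "neural network" never occurs in the reference corpus, so it outranks the bare noun "network", which occurs in both. The smoothing constant keeps the ratio defined for items missing from the reference corpus.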