Bilingual Lexicon Induction From Comparable and Parallel Data: A Comparative Analysis

Authors

DENISOVÁ Michaela, RYCHLÝ Pavel

Year of publication 2024
Type Article in Proceedings
Conference International Conference on Text, Speech, and Dialogue
MU Faculty or unit

Faculty of Informatics

Citation
www Preprint version
DOI http://dx.doi.org/10.1007/978-3-031-70563-2_3
Keywords bilingual lexicon induction; cross-lingual word embeddings; neural machine translation systems
Description Bilingual lexicon induction (BLI) from comparable data has become a common way of evaluating cross-lingual word embeddings (CWEs). These models have drawn much attention, mainly due to their availability for rare and low-resource language pairs. An alternative is offered by systems exploiting parallel data, such as popular neural machine translation systems (NMTSs), which are effective and yield state-of-the-art results. Despite the significant advancements in NMTSs, their effectiveness in the BLI task compared to models using comparable data remains underexplored. In this paper, we provide a comparative study of NMTS and CWE models evaluated on the BLI task and report results across three diverse language pairs: a distant pair (Estonian-English), a close pair (Estonian-Finnish), and a pair with different scripts (Estonian-Russian). Our study reveals the differences, strengths, and limitations of both approaches. We show that while NMTSs achieve impressive results for languages with a large amount of training data available, CWEs emerge as the better option when fewer resources are available.
