PyData Yerevan 2022

EENLP: Cross-lingual Eastern European NLP Index
08-12, 17:00–17:15 (Asia/Yerevan), 113W PAB

We present a broad index of existing datasets (90+) and models (60+) for Eastern European languages. Furthermore, to support the evaluation of commonsense reasoning tasks, we compile and publish cross-lingual datasets for five such tasks and provide evaluation results for several existing multilingual models.

  1. We present an index of existing Eastern European natural language datasets (90+) and models (60+) in our GitHub repository.
  2. For five semantic tasks, we hand-crafted cross-lingual datasets by converting datasets from various sources into a common format, which makes cross-lingual model evaluation faster and more convenient. Because the source datasets are released under various licenses, we publish scripts that compile our datasets automatically in the same GitHub repository.
  3. We run zero-shot cross-lingual transfer experiments on these datasets with several existing cross-lingual models to establish performance baselines, and we publish detailed results of these experiments in our Wandb project.
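
The workflow in points 2 and 3 can be illustrated with a minimal sketch: records from differently structured sources are converted into one common format, and a single predictor is then scored separately for each language. Everything in the sketch is an assumption made for illustration and is not taken from the EENLP repository: the `Example` schema, the two source layouts, the binary classification task, and `toy_predict`, which stands in for a multilingual model fine-tuned on a source language only.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Example:
    """Hypothetical common record format shared by all languages."""
    lang: str
    text: str
    label: int


def from_source_a(rows: Iterable[dict], lang: str) -> List[Example]:
    # Hypothetical source A stores dicts with string labels.
    mapping = {"neg": 0, "pos": 1}
    return [Example(lang, r["sentence"], mapping[r["label"]]) for r in rows]


def from_source_b(rows: Iterable[tuple], lang: str) -> List[Example]:
    # Hypothetical source B stores (integer label, text) tuples.
    return [Example(lang, text, int(label)) for label, text in rows]


def accuracy_by_language(
    examples: Iterable[Example], predict: Callable[[str], int]
) -> Dict[str, float]:
    # Count correct predictions and totals separately for each language.
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ex in examples:
        total[ex.lang] += 1
        correct[ex.lang] += int(predict(ex.text) == ex.label)
    return {lang: correct[lang] / total[lang] for lang in total}


def toy_predict(text: str) -> int:
    # Stand-in for a real model: predicts class 1 if the text contains "!".
    return 1 if "!" in text else 0


# Two toy sources in different layouts, merged into the common format.
dataset = (
    from_source_a(
        [{"sentence": "Super!", "label": "pos"},
         {"sentence": "Źle.", "label": "neg"}],
        lang="pl",
    )
    + from_source_b([(1, "Skvělé!"), (0, "Špatné!")], lang="cs")
)

scores = accuracy_by_language(dataset, toy_predict)
for lang, acc in scores.items():
    print(f"{lang}: accuracy = {acc:.2f}")
```

Because every source is mapped to the same record type, the evaluation loop does not need to know where a record came from, which is what makes adding another language or another model inexpensive.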

Prior Knowledge Expected

No previous knowledge expected

Andrey Manoshin received a specialist degree in control engineering from NRNU MEPhI (Moscow) in 2022. He worked for a few years as a Data Scientist on OCR and NLP tasks, and since late 2021 he has been working at Yandex.Research. His interests include, but are not limited to, reinforcement learning, robotics, meta-learning, and representation learning.