Proceedings 2021

Aleksandrova Polina, Mokhova Anna, Nikolaenkova Maria
Matching semantic sketches to predicates in context using the BERT model
Modern language models encode extensive information about the meanings and compatibility of words. One way to represent such lexical information, explored in this study, is the construction of semantic sketches. This paper presents a solution to the task of predicting a predicate from its most frequent actants and circonstants using the BERT neural network; the solution showed the best quality metrics in the Dialogue Evaluation SemSketches competition. We analyze several approaches to this task and ways to improve them based on the peculiarities of the architecture and the linguistic nature of the data. Testing of the selected methods showed that the most successful tool for determining the semantic sketch of a predicate is the Conversational RuBERT model combined with a search for synonyms of the target verbs in the training data. Other promising ways to improve the quality of matching a predicate to its semantic sketch include using contextualized embeddings to take context into account and fine-tuning the models used.
Anastasyev Daniil
Annotated Span Normalization as a Sequence Labelling Task
In this paper, we describe a way to perform span normalization as a sequence labelling task. Our model predicts the modifications that should be applied to the span tokens to normalize them. This prediction is performed via sequence labelling, which means that each token is normalized independently. Despite the simplicity of the approach, we show that it can lead to state-of-the-art results. We compare different pretraining schemas in application to this task. We show that the best quality can be achieved when the normalizer is trained on top of a BERT-based morpho-syntactic parser’s representations. Moreover, we propose some additional features useful in the task and show that auxiliary morpho-syntactic losses can help the model. Furthermore, we show that the model compares favourably with other contestant models of the RuNormAS competition.
Arefyev Nikolay, Fedoseev Maksim, Protasov Vitaly, Homskiy Daniil, Davletov Adis, Panchenko Alexander
DeepMistake: Which Senses are Hard to Distinguish for a Word-in-Context Model
In this paper, we describe our solution to the Lexical Semantic Change Detection (LSCD) problem. It is based on a Word-in-Context (WiC) model that detects whether two occurrences of a particular word carry the same meaning. We propose and compare several WiC architectures and training schemes, as well as different ways to convert WiC predictions into final word scores estimating the degree of semantic change. We participated in the RuShiftEval LSCD competition for the Russian language, where our model achieved the second-best result during the competition. In post-evaluation experiments we improved the WiC model and managed to outperform the best system. An important part of this paper is a detailed error analysis in which we study the discrepancies between WiC predictions and human annotations and their effect on the LSCD results.
Arefyev Nikolay, Bykov Dmitriy
An Interpretable Approach to Lexical Semantic Change Detection with Lexical Substitution
In this paper we propose a new Word Sense Induction (WSI) method and apply it to construct a solution for the RuShiftEval shared task on Lexical Semantic Change Detection (LSCD) for the Russian language. Our WSI algorithm based on lexical substitution achieves state-of-the-art performance for the Russian language on the RUSSE2018 dataset. However, our LSCD system based on it has shown poor performance in the shared task. We have studied mathematical properties of the COMPARE score employed in the task for measuring the degree of semantic change, as well as the discrepancies between this score and our WSI predictions. We have found that our method can detect aspects of semantic change to which the COMPARE metric is not sensitive, such as the appearance or disappearance of a rare word sense. An important property of our method is its interpretability, which we exploit to perform a detailed error analysis.
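The insensitivity to rare senses mentioned above can be illustrated with a toy calculation. The sketch below assumes that a COMPARE-style score behaves like an average over cross-period usage pairs, and it replaces graded human relatedness judgments with a binary "same sense" proxy. Both simplifications, the function name and the sense distributions are illustrative assumptions and are not taken from the paper.

```python
from typing import Dict


def expected_same_sense_rate(old: Dict[str, float], new: Dict[str, float]) -> float:
    """Probability that a random usage from the old period and a random
    usage from the new period carry the same sense (binary proxy for an
    averaged pairwise relatedness score)."""
    return sum(p * new.get(sense, 0.0) for sense, p in old.items())


# Hypothetical sense distributions for one word in two periods.
stable = expected_same_sense_rate({"a": 1.0}, {"a": 1.0})
rare_new_sense = expected_same_sense_rate({"a": 1.0}, {"a": 0.95, "b": 0.05})
dominant_shift = expected_same_sense_rate({"a": 1.0}, {"a": 0.3, "b": 0.7})

print(f"stable word:         {stable:.2f}")
print(f"rare new sense:      {rare_new_sense:.2f}")
print(f"dominant new sense:  {dominant_shift:.2f}")
```

Under these assumptions a new sense covering 5% of usages moves the averaged score only slightly, whereas a sense inventory comparison would register it as a change.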
Bakhteev Oleg, Kuznetsova Rita, Khazov Andrey, Ogaltsov Aleksandr, Safin Kamil, Gorlenko Tatyana, Suvorova Marina, Ivahnenko Andrey, Botov Pavel, Chekhovich Yury, Mottl Vadim
Near-duplicate handwritten document detection without text recognition
The paper presents a novel method for near-duplicate detection in handwritten document collections of school essays. The large number of online resources with available academic essays currently makes it possible to cheat and reuse them during high school final exams. Despite the importance of the problem, at the moment there is no automatic method for near-duplicate detection for handwritten documents, such as school essays. The school essay is represented as a sequence of scanned images of handwritten essay text. Despite advances in recognition of handwritten printed text, the use of these methods for the current task is a challenge. The proposed method of near-duplicate detection does not require detailed text markup, which makes it possible to use it in a large number of information extraction tasks in a zero-shot regime, i.e. without any specific resources written in the processed language. The paper presents a method based on series analysis. The image is segmented into words. The text is characterized by a sequence of features that are invariant to the author’s writing style: normalized lengths of the segmented words. These features can be used for both handwritten and machine-readable texts. The computational experiment is conducted on the IAM dataset of English handwritten texts and a dataset of real images of handwritten school essays.
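A minimal sketch of the feature idea described above: word widths obtained from segmentation are normalized so that the overall handwriting scale cancels out, and a suspect fragment is compared with every same-length window of a source text. The normalization by the mean width and the sliding-window mean absolute difference are assumptions chosen for illustration; the abstract does not specify the actual series-analysis procedure. All names and numbers are hypothetical.

```python
from typing import List, Sequence


def normalized_lengths(word_widths: Sequence[float]) -> List[float]:
    """Divide each word width by the mean width of the sequence."""
    if not word_widths:
        return []
    mean_width = sum(word_widths) / len(word_widths)
    return [w / mean_width for w in word_widths]


def best_window_distance(query: Sequence[float], target: Sequence[float]) -> float:
    """Smallest mean absolute difference between the normalized query and
    any normalized window of the same length in the target."""
    n = len(query)
    if n == 0 or n > len(target):
        raise ValueError("query must be non-empty and no longer than target")
    q = normalized_lengths(query)
    best = float("inf")
    for start in range(len(target) - n + 1):
        window = normalized_lengths(target[start:start + n])
        dist = sum(abs(a - b) for a, b in zip(q, window)) / n
        best = min(best, dist)
    return best


# Hypothetical word widths (in pixels) of a source text.
source = [30, 80, 50, 120, 40, 90, 60]
# The same four words copied by hand in writing that is 1.5 times larger.
copied = [75, 180, 60, 135]
# A fragment with an unrelated rhythm of long and short words.
unrelated = [100, 20, 100, 20]

print(f"copied fragment:    {best_window_distance(copied, source):.3f}")
print(f"unrelated fragment: {best_window_distance(unrelated, source):.3f}")
```

Because each window is normalized on its own, a uniformly larger or smaller hand produces the same feature sequence, which is the style invariance the abstract refers to.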
Baranov A.N., Dobrovol’skij D.O.
Idiomaticity of a Text as a Matter of the Individual Style: A Quantitative Approach
The paper suggests one way to formally define the degree of idiomaticity of a given text. Text idiomaticity is understood as the density of idiom use per text unit. In the proposed approach, the degree of idiomaticity is assessed as the ratio of the total number of idioms to the volume of the text in which they occur. The corpus experiment allows us to conclude that the degree of idiomaticity varies significantly among the most important prose writers of the second half of the 19th century. Thus, the degree of idiomaticity of a text turns out to be an essential factor of individual style.
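The ratio described above can be written down directly. In this sketch the text volume is measured in tokens and the result is reported per 1000 tokens; the unit, the function name and the counts are assumptions for illustration and are not data from the paper.

```python
def idiomaticity_degree(idiom_count: int, text_volume: int) -> float:
    """Ratio of the number of idioms to the text volume (here: tokens)."""
    if text_volume <= 0:
        raise ValueError("text_volume must be positive")
    return idiom_count / text_volume


# Hypothetical (idiom count, token count) pairs for two authors.
texts = {"author_a": (120, 60000), "author_b": (45, 90000)}
degrees = {name: idiomaticity_degree(i, v) for name, (i, v) in texts.items()}

for name, degree in degrees.items():
    print(f"{name}: {degree * 1000:.2f} idioms per 1000 tokens")
```

Normalizing by volume is what makes texts of different length comparable: the second hypothetical author has a lower degree despite the longer text.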
Bazhukov M.O., Chubarova L.I., Slioussar N.A., Toldova S. Yu.
The order of objects in Russian: a corpus study
The paper presents the results of a corpus study of the order of direct and indirect objects in ditransitive constructions in Russian (like Petya dal Mashe yabloko ‘Petya gave Masha an apple’ or Petya dal yabloko Mashe ‘Petya gave an apple to Masha’). This topic has been widely discussed in the literature, but previous hypotheses have been based on individual examples and have never been tested on corpus data. Based on earlier research, we have selected parameters that affect the order of the objects, such as the length, depth, animacy and role of individual verbs and statistically tested their real effect on two subsamples: with a dative indirect object and with a prepositional one.
Belikov V.I., Dubyaga A.O., Rvanova L.Yu., Selegey V.P.
Corpus-Based Regional Lexicography: Principles, Methods and Preliminary Results
The paper sums up the results of the long-running project “Languages of Russian Cities” (Языки Русских Городов, ЯРГ) on collecting and studying regional vocabulary, which, unfortunately, was for a number of reasons never “finalized” in the form of academic publications. A substantial body of regional material (about 4,000 items) was collected and systematized; on this basis the paper considers a typology of regional differences and introduces and discusses the notion of a regional norm. Particular attention is paid to the reliability and methodology of computational regional corpus studies, including automatic text classification and author profiling. Together with this publication, a “reincarnation” of the ЯРГ project returns to the pool of open lexicographic resources, now based on a unified portal for differential sociolinguistic research that includes the internet corpus ГИКРЯ and the interactive dictionary ЯГеЛь (Языки Городов и Людей, “Languages of Cities and People”).
Belkova Lubov
Influence of speech breathing after physical activity on intonational-pausal segmentation of speech
This study raises the problem of the difference between normal and forced (deep) speech breathing. The aim of this work was to study the intonational-pausal segmentation of speech in normal and forced breathing after physical activity. The results of the study show that in the process of reading, the structure of the text determines the organization of breathing, and the breathing rate and respiration depth have an impact on the intonational-pausal segmentation of speech, as well as on the duration and quantity of intonation pauses.
Bernasconi Beatrice, Noseda Valentina
Examining the role of linguistic context in aspectual competition: a statistical study
This paper presents the results of a quantitative study on verbal aspect in modern Russian. Adopting a corpus-based approach, we investigate the phenomenon known as ‘aspectual competition’, which can take place when the imperfective aspect (ipf) is used instead of the perfective (pf) to designate a single and complete event in the past. In particular, we investigate the interaction between the choice of aspect and co-textual factors in overlapping situations. The study focuses on one aspectual pair, namely pokupat’ (ipf) - kupit’ (pf), ‘to buy’. The work consists of two parts: in Phase 1 data were collected from the spoken subcorpus of the Russian National Corpus and the web corpus RuTenTen11, annotated for several morpho-syntactic factors, and then examined. In Phase 2 a questionnaire was submitted to native speakers in order to collect more empirical evidence on aspect choice and verify the results obtained from the corpus study. In both phases, statistical methods were used to analyse the data. Results show that the aspect of the target verb mainly interacts with two factors: the presence of a contiguous verb in the linguistic context and the presence of an object modifier.
Bogdanova-Beglarian N. V., Blinova O. V., Sherstinova T. Ju., Troshchenkova E. V., Gorbunova D. A., Zajdes K. D., Popova T. I., Sulimova T. S.
Pragmatic Markers of Russian Everyday Speech: Quantitative Data
The article summarizes the results of a large research project dedicated to the investigation of pragmatic markers (PM) in Russian everyday speech. Pragmatic markers are essential in spontaneous spoken discourse; thus, quantitative data on their usage are necessary for solving both theoretical and practical issues related to the study of spoken communication. New results were obtained from two speech corpora: “One Day of Speech” (ORD; mostly dialogues; the annotated subcorpus contains 321 504 tokens) and “Balanced Annotated Text Library” (SAT; monologues; the annotated subcorpus includes 50 128 tokens). Statistical data were calculated for PM in dialogic and monologic speech. We identify pragmatic markers common to both types of speech (e. g., hesitative markers like vot, tam, tak), as well as PM that are most typical of monologues (e. g., boundary markers like znachit, nu, vot, vs’o) or dialogues (e. g., ‘xeno’-markers such as takoi, grit and metacommunicative markers like vidish’, (ja) ne znaju). Special attention is given to the usage of pragmatic markers in different communicative situations.
Boguslavsky Igor, Dikonov Vyacheslav, Inshakova Evgeniya, Iomdin Leonid, Lazursky Alexandre, Rygaev Ivan, Timoshenko Svetlana, Frolova Tatyana
Semantic Representations in Computational and Theoretical Linguistics: the Potential for Mutual Enrichment
Research in semantics is actively conducted in both theoretical and computational linguistics, but the formulation of tasks, the objectives and the results of semantic research in the two communities usually differ considerably. As a step towards reducing this gap and increasing the awareness of theoretical linguists about what computational linguists are doing, we examine meaning representation approaches in computational linguistics and contrast them with how this is done within one of the best-known theoretical approaches – the Meaning ⇔ Text Theory.
Boguslavsky I. M., Iomdin L. L.
Semantic features and valency properties of the Russian verb podoždat’ ‘wait’
The paper presents a detailed account of the semantics of the Russian perfective verb подождать (≈ ‘wait some time’), which belongs to the family of words centered around the verb ждать ‘wait’. The verb, much like the whole family, has a set of unique and non-trivial semantic properties that have not so far been adequately represented either in traditional and computer dictionaries of the Russian language or in scientific descriptions. The main features of this verb include its peculiar morphological and semantic relationship with the dominant word of the family, the verb ждать, as well as a ramified valency frame, characterized by rarely occurring means of implementing semantic valencies and unusual conditions of cooccurrence.
Bolshakova E.I., Sapin A.S.
Building Dataset and Morpheme Segmentation Model for Russian Word Forms
The paper describes a way to generate a dataset of Russian word forms, which is needed to build an appropriate neural model for morpheme segmentation of word forms. The developed generation procedure produces word forms segmented into morphs that are classified by morpheme types. It is based on an existing dataset of segmented lemmas and additional dictionary data, as well as a fine-grained classification of Russian inflectional paradigms, which makes it possible to correctly process word forms with alternating consonants and fleeting vowels in endings. The resulting representative dataset (more than 1.6 million word forms) was used to develop a neural model for morpheme segmentation of word forms with classification of segmented morphs. The experiments have shown that in detecting morph boundaries the model is comparable in quality to the best segmentation models for lemmas (98% F-measure), slightly outperforming them in word-level classification accuracy (with a score of 91%).
Chuikova Oksana
On (non-)compatibility of genitive partitive and imperfective in Russian: a corpus study
The paper presents the results of a study of the use of the genitive case with partitive semantics as a means of direct object marking with imperfective verbs in Russian. The genitive partitive is traditionally claimed to be compatible with perfective verbs and, as an exception, with imperfective verbs used as substitutes for perfective verbs in neutralization contexts. The analysis of data from the Russian National Corpus and the Russian-language Internet shows that the use of the genitive partitive with imperfective verbs is neither rare nor marginal. The level of compatibility of the genitive with imperfective aspectual correlates of prefixed perfective verbs depends on the imperfectivability level and frequency. The use of the genitive partitive is sensitive to the semantics of the imperfective; however, it covers a broader range of phenomena than is traditionally assumed. Although the use of the genitive partitive is mostly restricted to neutralization contexts such as iterativity and historical present, a number of gradual achievement imperfective verbs with progressive semantics, as well as verbs that refer to constant situations, are compatible with the genitive partitive.
Dementieva Daryna, Moskovskiy Daniil, Logacheva Varvara, Dale David, Kozlova Olga, Semenov Nikita, Panchenko Alexander
Methods for Detoxification of Texts for the Russian Language
We introduce the first study of detoxification of Russian texts to combat offensive language in social media. While much work has been done in this field for the English language, the task has not yet been addressed for Russian. We test two types of models – an unsupervised approach based on the BERT architecture that performs local corrections and a supervised approach based on a pretrained GPT-2 language model – and compare them with several baselines. In addition, we describe an evaluation setup providing training datasets and metrics for automatic evaluation. The results show that the tested approaches can be successfully used for detoxification, although there is room for improvement.
Dmitrieva Anna, Laposhina Antonina, Lebedeva Mariia
A Quantitative Study of Simplification Strategies in Adapted Texts for L2 Learners of Russian
There has been growing interest in the topic of Russian text adaptation, both in theoretical aspects of intralingual translation into Simple and Plain Russian, and in practical tasks like automatic text simplification. Therefore, it is important to study the characteristics that make an adapted text more accessible. In this paper, we aim to investigate the strategies that human experts employ when simplifying texts, particularly when the texts are being adapted for learners of Russian as a foreign language (RaaFL). The main data source for this research is the RuAdapt parallel corpus, which consists of Russian literary texts adapted for learners of RaaFL and the original versions of these texts. We study the changes that occur during the adaptation process on the lexical, morphological, and syntactic levels, and compare them to the methods usually described in methodological recommendations for teaching RaaFL.
Emelyanov Anton, Shliazhko Oleg, Katricheva Nadezhda, Shavrina Tatiana
Using RuGPT3-XL Model for RuNormAS competition
The paper presents a fine-tuning methodology of the RuGPT3-XL (Generative Pretrained Transformer-3 for Russian) language model for the normalization of text spans task. The solution is presented in a competition for two tasks: Normalization of Named Entities (Named entities) and Normalization of a wider class of text spans, including the normalization of different parts of speech (Generic spans). The best solution has achieved 0.9645 accuracy on the Generic spans task and 0.9575 on the Named entities task.
Fedorova Olga
Oculomotor everyday communication: How to pick a good metric
This paper contributes to the research field of bimodal linguistics, which explores two modalities involved in everyday communication – vocal and kinetic. When exploring almost any scientific phenomenon, one addresses two opposite issues: individual differences, on the one hand, and general patterns, on the other. We have focused on the individual differences and proposed a “portrait” approach to communication. We are faced with the difficult task of finding a good metric for analyzing the oculomotor behavior of people in everyday communication. In previous papers, starting from [14], the authors were looking for oculomotor patterns, but their results depend critically on the metric used. In this paper, we compared the most common metrics and showed that individual differences carry much more weight than general patterns. We then identified four coefficients that determine these individual differences: kaside, kvip, kchain, and dur75. By comparing these Core Oculomotor Portraits, we were able to make the individual differences clearer. However, a fact is a fact: there are far more individual differences than general patterns in our Narrators’ behavior. The proposed coefficients, in our opinion, clearly show (and even explain and predict) the observed individual differences.
Fenogenova Alena
Text Simplification with Autoregressive Models
Text Simplification is the task of reducing the complexity of the vocabulary and sentence structure of a text while retaining its original meaning, with the goal of improving readability and understanding. We explore the capability of autoregressive models such as RuGPT3 (Generative Pre-trained Transformer 3 for Russian) to generate high-quality simplified sentences. Within the shared task RuSimpleSentEval we present our solution based on different usages of RuGPT3 models. The following setups are described: 1) few-shot unsupervised generation with the RuGPTs models; 2) the effect of the size of the training dataset on the downstream performance of the fine-tuned model; 3) three inference strategies; 4) the downstream transfer and post-processing procedure using pretrained paraphrasers for Russian. This paper presents the second-place solution on the public leaderboard and the fifth-place solution on the private leaderboard. The proposed method is comparable with recent state-of-the-art approaches. Additionally, we analyze the performance and discuss the flaws of RuGPTs generation.
Fenogenova Alena, Shavrina Tatiana, Kukushkin Alexandr, Tikhonova Maria, Emelyanov Anton, Malykh Valentin, Mikhailov Vladislav, Shevelev Denis, Artemova Ekaterina
Russian SuperGLUE 1.1: Revising the Lessons not Learned by Russian NLP-models
In the last year, new neural architectures and multilingual pre-trained models have been released for Russian, which led to performance evaluation problems across a range of language understanding tasks. This paper presents Russian SuperGLUE 1.1, an updated benchmark styled after GLUE for Russian NLP models. The new version includes a number of technical, user experience and methodological improvements, including fixes of the benchmark vulnerabilities unresolved in the previous version: novel and improved tests for understanding the meaning of a word in context (RUSSE) along with reading comprehension and common sense reasoning (DaNetQA, RuCoS, MuSeRC). Together with the release of the updated datasets, we improve the benchmark toolkit based on the jiant framework for consistent training and evaluation of NLP models of various architectures, which now supports the most recent models for Russian. Finally, we provide the integration of Russian SuperGLUE with MOROCCO (MOdel ResOurCe COmparison), a framework for industrial evaluation of open-source models, in which the models are evaluated according to the weighted average metric over all tasks, the inference speed, and the amount of RAM occupied.
Fishcheva I. N., Goloviznina V. S., Kotelnikov E. V.
Traditional Machine Learning and Deep Learning Models for Argumentation Mining in Russian Texts
Argumentation mining is a field of computational linguistics that is devoted to extracting from texts and classifying arguments and relations between them, as well as constructing an argumentative structure. A significant obstacle to research in this area for the Russian language is the lack of annotated Russian-language text corpora. This article explores the possibility of improving the quality of argumentation mining using the extension of the Russian-language version of the Argumentative Microtext Corpus (ArgMicro) based on the machine translation of the Persuasive Essays Corpus (PersEssays). To make it possible to use these two corpora combined, we propose a Joint Argument Annotation Scheme based on the schemes used in ArgMicro and PersEssays. We solve the problem of classifying argumentative discourse units (ADUs) into two classes – “pro” (“for”) and “opp” (“against”) using traditional machine learning techniques (SVM, Bagging and XGBoost) and a deep neural network (BERT model). An ensemble of XGBoost and BERT models was proposed, which showed the highest performance of ADUs classification for both corpora.
Galeev Farit, Leushina Marina, Ivanov Vladimir
ruBTS: Russian Sentence Simplification Using Back-translation
Automatic text simplification is a crucial task that reduces text complexity while preserving meaning. This paper presents our solution to the Russian Sentence Simplification Shared Task (RSSE) based on a back-translation technique. We show that applying a simple back-translation approach to sentence simplification can give results competitive with other methods without fine-tuning or training.
Golubev Anton, Loukachevitch Natalia
Transfer Learning for Improving Results on Russian Sentiment Datasets
In this study, we test a transfer learning approach on Russian sentiment benchmark datasets using an additional training sample created with a distant supervision technique. We compare several variants of combining the additional data with the benchmark training samples. The best results were achieved using a three-step approach of sequential training on general, thematic and original training samples. For most datasets, the results improved by more than 3% over the current state-of-the-art methods. The BERT-NLI model, which treats the sentiment classification problem as a natural language inference task, reached the human level of sentiment analysis on one of the datasets.
Golubkova Ekaterina, Trubochkin Alexander
A Corpus-Based Model of the English Phrasal Verb Construction: Attraction
The article investigates the semantics of English phrasal verbs (PhVs), which are viewed as lexico-grammatical constructions. Triangulation of introspective, cognitive and corpus methods of analysis allows us to identify the semantic dimensions that characterize the semantic pattern of the PhV-construction. The construction reveals features of attraction, drawing in new verbs provided the action or motion event is identical. Depending on the strength of attraction between the verb and the particle, a new verb may be accepted to fill the corresponding slot of the construction, which gives rise to a new phrasal verb. This allows us to categorise PhVs according to the attraction level and spot their PhV-patterns in corpus data.
Gusev Ilya, Smurov Ivan
Russian News Clustering and Headline Selection Shared Task
This paper presents the results of the Russian News Clustering and Headline Selection shared task. As a part of it, we propose the tasks of Russian news event detection, headline selection, and headline generation. These tasks are accompanied by datasets and baselines. The presented datasets for event detection and headline selection are the first public Russian datasets for their tasks. The headline generation dataset is based on clustering and provides multiple reference headlines for every cluster, unlike the previous datasets. Finally, the approaches proposed by the shared task participants are reported and analyzed.
Iazykova Tatyana, Bystrova Olga, Kapelyushnik Denis, Kutuzov Andrey
Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian SuperGLUE Tasks
Leaderboards like SuperGLUE are seen as important incentives for active development of NLP, since they provide standard benchmarks for fair comparison of modern language models. They have driven the world’s best engineering teams as well as their resources to collaborate and solve a set of tasks for general language understanding. Their performance scores are often claimed to be close to or even higher than human performance. These results encouraged more thorough analysis of whether the benchmark datasets featured any statistical cues that machine learning based language models can exploit. For English datasets, it was shown that they often contain annotation artifacts. This allows solving certain tasks with very simple rules and achieving competitive rankings. In this paper, a similar analysis was done for the Russian SuperGLUE (RSG), a recently published benchmark set and leaderboard for Russian natural language understanding. We show that its test datasets are vulnerable to shallow heuristics. Approaches based on simple rules often outperform or come close to the results of well-known pre-trained language models like GPT-3 or BERT. It is likely (as the simplest explanation) that a significant part of the SOTA models’ performance in the RSG leaderboard is due to exploiting these shallow heuristics, which has nothing in common with real language understanding. We provide a set of recommendations on how to improve these datasets, making the RSG leaderboard even more representative of the real progress in Russian NLU.
Ilina Daria, Kononenko Irina, Sidorova Elena
On Developing a Web Resource to Study Argumentation in Popular Science Discourse
This paper discusses the experience of developing a web resource intended to study argumentation in popular science discourse. This type of argumentation is, on the one hand, the main means of achieving a communicative goal and, on the other hand, often not expressed in explicit form. The web resource is built around a corpus of 2256 articles, distributed over 13 subcorpora. The annotation model, which is based on the ontology of argumentation and D. Walton's argumentation schemes for presumptive reasoning, underlies the argument annotation of the corpus. The distinctive features of the argument annotation model are the introduction of weighting characteristics into text markup through assessing the persuasiveness of the argumentation, as well as visual highlighting of argumentative indicators. The paper considers a scenario of argument annotation of texts, which allows constructing an argumentative graph based on the typical reasoning schemes. The scenario includes a number of procedures that enable the annotator to check the quality of the text markup and assess the persuasiveness of the argumentation. The authors have annotated 162 texts using the developed web resource and, as a result, identified the most frequent schemes of argumentation (Example Inference, Cause to Effect Inference, Expert Opinion Inference), as well as described some specific indicators of frequent schemes. Based on these outcomes, the authors listed the indicators of the most frequent schemes of argumentation and made some recommendations for annotators about identifying the main thesis.
Inkova Olga
Defining discourse relations: Supracorpora database of connectives
The research is focused on definitions of discourse relations, a topic that is currently little-studied. The paper gives a brief overview of existing solutions for discourse relations definitions: Rhetorical Structure Theory (RST), Segmented Discourse Representation Theory (SDRT), Penn Discourse Treebank (PDTB), and Cognitive approach to Coherence Relations. The author shows criteria used to define a discourse relation, or, in case of a narrower definition, a logical-semantic relation, in these approaches and outlines the shortcomings of the described definitions. The author also describes the principles used to build the classification and the definitions of logical-semantic relations (LSR) in the Supracorpora Database of connectives (SDB). The classification is based on four basic semantic operations upon which rests every LSR's definition: implication, location on the chronological scale, comparison, correlation between specific and general or an element and a set. The classification consistently distinguishes the levels at which the LSR can be established: propositional, illocutionary, and metalinguistic. Each LSR is defined on the basis of these two criteria. Thus, for example, for the LSR of alternative based on the comparison operation, one has the choice between the LSR of propositional, illocutionary and metalinguistic alternative (We will go to the mountains or to the sea vs. Put the gun away, or are you scared? vs. The symbol of the year or, simply speaking, cutie-pie). In case of LSRs based on implication or comparison, the polarity criterion is added, distinguishing whether the LSR is established between p and q or their negative correlates ¬ p and ¬ q are also to be taken into account in order to obtain a correct interpretation (cf. well-known descriptions of how the Russian conjunction no ‘but’ functions). In addition, semantic and pragmatic characteristics of the context are also considered in the classification. 
For example, in the case of the LSR of specification and generalization, the semantic correlation between p and q (together with their intensional and extensional interpretations) is taken into account. Several definitions of LSRs and corresponding examples are provided. Thus, the LSR of extensional specification is defined as follows: based on the operation of correlation between the general and the particular; established at the propositional level; X contains a generalized notion or state of things p; Y contains a more particular q-notion, limiting p-extensional. The LSR of intensional specification is defined as follows: based on the operation of correlation between the general and the particular; established at the metalinguistic level; X contains a generalized concept or state of things p; Y contains a more particular q-notion, limiting p-intensional. The definitions used in the SDB make it possible to evaluate, on the basis of the proposed criteria, the semantic closeness of relations and to increase the level of consistency in the work of experts and annotators. That in turn increases the value of the annotated material, and therefore its reliability.
Inkova O., Nuriev V.
Divergent translation of connectives in human and machine translations
The paper is focused on divergent ways of conveying discourse relations in translation. For data collection, we used the supracorpora database of connectives storing parallel texts from the Russian-French subcorpus of the Russian National Corpus. These data show which logical-semantic relations tend to be translated in divergent ways, i.e. by means other than connectives (exclusion in its various gradations, propositional concomitance and substitution, with the share of divergent translations ranging from 30% to 50%). Also, such data help define what causes divergent ways of translation to be used. The causes may be as follows: (a) the lack of an adequate equivalent of a given connective in the target language; (b) differences in the syntactic structure of the source and target languages; (c) usage differences; (d) contextually determined use of divergent translation. If there is a prototypical indicator of logical-semantic relations (i.e. a connective) in the source text, it also occurs in translation in more than 90% of cases. The data on human translations are then compared with those on machine translations, which shows that the machine translation system also tends to keep a connective if there is one in the source text (it occurs in almost 98% of cases). However, there are cases where the machine translation system has difficulties processing a multiword connective (failing to perceive it as a whole) or a polyfunctional unit (failing to tell a connective from a non-connective) and thus uses divergent ways to translate it. Some causes of divergently translating connectives are likely to be the same for human and machine translations. These are differences in the syntactic structure of languages and usage differences. Further research on divergent means of conveying discourse relations will make it possible to draw a sharper borderline between explicitly expressed and implicit discourse relations.
The data collected from annotated corpora (both monolingual and multilingual and parallel) will help determine what the divergent ways of expressing logical-semantic relations are and how frequently they are used. The research results can be used both in automatic text processing and automatic text generation. Also, the data on divergent translations of discourse relations can serve to improve the machine translation quality.
Ivanov V., Solovyev V.
The Relation of Categories of Concreteness and Specificity: Russian Data
The categories of concreteness and specificity are important for understanding the mechanisms of information representation and processing in the human brain. These two categories are quite close, but still different. A method for quantifying the degree of correlation of these categories for English has recently been proposed. This paper presents a similar study for Russian. Ratings from the Concreteness/Abstractness Dictionary (RDCA) are taken as a measure of the words’ concreteness. The degree of a word’s specificity is estimated by its location in the RuThes thesaurus. The paper presents a comparison with the English data and shows the similarity of the results for Russian and English.
Karpov Dmitry, Burtsev Mikhail
Data pseudo-labeling while adapting BERT for multitask approaches
Nowadays, BERT models have found wide use in the NLP field. However, standard BERT architecture training can be stifled by the lack of labels for different tasks when a multitask setting is treated as a one-task multilabel setting. For every example, we have labels from this example’s source task but not from other tasks. This article addresses this issue, exploring eight different data pseudo-labeling approaches in the GLUE 4-task setting. These approaches do not require changes in samples or model architecture. One of the presented techniques exceeds the results of the original article on RTE by 6.2% and falls behind the original article on QQP, MNLI, and SST only by 0.5-1.2%. This technique also outperforms the other pseudo-labeling approaches explored in the article by 0.5-2% on average if we consider similar tasks. However, for tasks that are dissimilar to each other, a different proposed approach yields the best results.
Kazakov Roman, Lyashevskaya Olga
Adjunct role labeling for Russian
The task of semantic role labeling usually focuses on identifying and classifying the core, obligatory arguments of the predicate. The adjuncts of Time, Location, etc. (non-core, modifier arguments) are considered to be on the periphery of the task [30] and even its easy part [44], despite the fact that they are highly integrated into the clause structure and may non-trivially interact with the meaning of the verb [4, 32]. In this paper, we present experiments on labeling the adjunct roles of LOCATION, TIME, MANNER, DEGREE, REASON, and PURPOSE, based on the manually annotated Adjuncts-FrameBank data set. The results show an average F1-score of 0.94 on the gold adjunct phrase annotations using the word2vec representations of adjuncts, word2vec representations of predicates, and the morphosyntactic marking of adjuncts. Our findings generally corroborate the theoretical hypothesis on the structural and semantic autonomy and lexico-morphosyntactic specialization of adjuncts. Yet, a more complicated organization of their network is revealed, pointing to the diversity of adjuncts in terms of their distribution and behavior.
Kazartsev Evgeny, Zemskova Tatiana
A New Electronic System for Comparative Analysis of Verse and Prose
This paper will focus on the development of a new computational system, Prosimetron, which enables comparative statistical studies of the rhythm of verse and prose in different languages (currently 10 languages are operative, with the possibility of adding more). The results of the analysis can be used not only for studying the processes for the genesis, expansion, and modification of various versification systems, but also for commenting on and interpreting the verse rhythm in different national poetic traditions in comparison with their foreign sources and language prosody. In addition, the possibility to model various processes of poetic speech generation and to analyze rhythmic vocabularies of prose allows hypotheses about the cognitive mechanisms of verse generation. This system operates in a semiautomatic mode and, by minimizing errors and enabling the processing of large amounts of data, provides a unique tool for computer research on the rhythm of different modes of speech.
Khaustov S. V., Gorlova N. E., Kalmykov A. V., Kabaev A. S.
BERT for Russian news clustering
This paper provides the results of participation in the Russian News Clustering task within Dialogue Evaluation 2021. News clustering is a common task in the industry, and its purpose is to group news by events. We propose two methods based on BERT for news clustering, one of which shows competitive results in the Dialogue 2021 evaluation. The first method uses supervised representation learning. The second one reduces the problem to binary classification.
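One common way to reduce clustering to binary classification can be sketched as follows. This is only an illustration under stated assumptions, not the authors' method: the abstract does not say how pairwise decisions are turned into clusters, so the connected-components step is an assumption, and `same_event` is a hypothetical word-overlap stand-in for a trained BERT pair classifier.

```python
from itertools import combinations


def same_event(a: str, b: str) -> bool:
    """Hypothetical stand-in for a trained pair classifier.

    Returns True when the word overlap (Jaccard) of two headlines is at least 0.5.
    """
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return bool(union) and len(wa & wb) / len(union) >= 0.5


def cluster(headlines):
    """Group headline indices: any pair judged 'same event' ends up in one cluster."""
    parent = list(range(len(headlines)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Assumption: positive pairwise decisions are merged as connected components.
    for i, j in combinations(range(len(headlines)), 2):
        if same_event(headlines[i], headlines[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(headlines)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


headlines = [
    "storm hits coast city",
    "storm hits coast town",
    "team wins cup final",
]
clusters = cluster(headlines)
```

With the toy headlines above, the first two items share three of five distinct words and are merged, while the third stays in its own cluster.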
Klyachko Elena, Grebenkin Daniil, Nosenko Daria, Serikov Oleg
LowResourceEval-2021: a shared task on speech processing for low-resource languages
This paper describes the results of the first shared task on speech processing for low-resource languages of Russia. Speech processing tasks are notoriously data-consuming. The aim of the shared task was to evaluate the performance of state-of-the-art models on low-resource language data as well as to draw the attention of experts to field linguistics data (using Lingovodoc project data). The tasks included language identification and IPA transcription, with three teams participating in them. The paper also provides a description of the datasets as well as an analysis of the participants’ solutions. The datasets created as a result of the shared task can be used in other tasks to enhance speech processing and help develop modern NLP tools for both speech communities and field linguists.
Knyazev S. V., Pronina M. K.
The intonation of yes and no in an archaic Russian dialect
The present paper analyzes the intonation of the pragmatic particles da "yes" and net "no" found in the spontaneous dialogue speech corpus of a Northern Russian dialect, in which each word bears a pitch accent. The intonation that marks such particles sounds unusual to speakers of Standard Russian and is perceived by them as blunt and impolite. The main aim was to find a consistent pattern explaining the distribution of falling and rising pitch accents on such particles in a dialect of Vaduga (Arkhangelsk region). We tested three hypotheses that can account for this distribution: (a) semantic explanation (the type of pitch accent depends on the semantics of the particle itself); (b) communicative explanation (it depends on the communicative function of the preceding utterance, that is, whether it is a question or not); (c) phonetic explanation (it depends on the pitch accent of the preceding utterance). A total of 240 utterances from 3 speakers were analyzed. Results showed that the semantics of the particle is not a relevant factor, while the communicative type and the pitch accent of the preceding utterance are significant predictors of the pitch accent that marks the particle, with the latter explaining the data better. We propose that when analyzing the intonation of a dialect, semantic interpretation of the intonational constructions of the standard dialect should not be taken into account. Moreover, we suggest that a new approach to collecting prosodic data from elderly people while controlling for pragmatic context is needed.
Korotaev Nikolay
Parenthetical constructions in Russian spoken discourse: Basic types and prosodic features
The paper discusses the notion of parentheticals in Russian spoken discourse. Using data from two prosodically annotated corpora — “Stories about presents and skiing” and “Russian Pear Chats & Stories” — I advocate for a discourse-oriented approach to parenthetical constructions. I define a parenthetical construction as consisting of three elements: the left context, the parenthetical unit, and the right context. Each element constitutes a separate discourse unit and is thus prosodically autonomous. I rely on the notion of projection [Auer 2005] to account for the discourse relationships between these three components. When the speaker pronounces the left context, she projects a continuation that is to be realized in the right context, while the parenthetical unit provides a digressive discourse step. Typically (around 50% in my data), parentheticals are anchored to their left contexts and are pronounced with a falling or level pitch accent. Noted deviations from this prototype include free parentheticals, parenthetical uses of vot, and parentheticals pronounced with a rising pitch accent. Furthermore, I explore two prosodic features frequently associated with parentheticals, namely, increased articulation rate and pitch range narrowing. I show that, while both these tendencies are statistically significant, the latter has a larger effect size than the former.
Korzun V. A., Dimov I. N., Zharkov A. А.
Audio and Text-Driven approach for Conversational Gestures Generation
This paper describes FineMotion’s gesture generating system entry for the GENEA Challenge 2020. We start by using simple baselines and expand them by using context and combining both audio and textual features. Among the participating systems, our entry attained the highest median score in the human-likeness evaluation and the second highest median score in appropriateness.
Kotelnikov Evgeniy
Current Landscape of the Russian Sentiment Corpora
Currently, there are more than a dozen Russian-language corpora for sentiment analysis, differing in the source of the texts, domain, size, number and ratio of sentiment classes, and annotation method. This work examines publicly available Russian-language corpora and presents their qualitative and quantitative characteristics, which make it possible to get an idea of the current landscape of the corpora for sentiment analysis. A ranking of corpora by annotation quality is proposed, which can be useful when choosing corpora for training and testing. The influence of the training dataset on the performance of sentiment analysis is investigated based on the use of the deep neural network model BERT. The experiments with review corpora allow us to conclude that on average the quality of models increases with an increase in the number of training corpora. For the first time, quality scores were obtained for the corpus of reviews of ROMIP seminars based on the BERT model. Also, the study proposes the task of building a universal model for sentiment analysis.
Koziuk Evgenia, Badryzlova Yulia
‘No way!’ Discourse formulae of disagreement in Russian and English: a comparative study
The study explores the discourse formulae (DFs) of disagreement in Russian and English belonging to the subclasses of refusal and prohibition. Starting with a subset of six Russian target DFs, we establish their English equivalents using corpus analysis. We also define the typical speech acts to which the DFs in both languages react, and design model contexts that exemplify these types of speech acts. We use the model contexts as stimuli in our Russian and English surveys, where we look at the preferences of native speakers in their choice of DFs across the speech acts. We use the survey data to establish the pragmatic function of each DF (i.e. refusal or prohibition, or both) and its potential in each subclass (strong, medium, or weak). For each DF, we also identify the types of speech acts to which it reacts most readily. We compare the results of our analysis to the lexicographic description of the target DFs as presented in the Russian-English Dictionary of Idioms.
Kustova Galina
The types of infinitive constructions with predicatives (according to the Russian National Corpus)
The paper considers constructions of the type «predicative + infinitive». For the first time, a class of interpretive infinitive constructions (opposed to emotional reactions) is introduced. For emotional reactions, the predicative and the infinitive refer to the same subject, and infinitives of verbs of perception, mental activity, and speech are typical of them: It hurts / scares to see how forests are dying (‘X sees, X is scared’) → It hurts that forests are dying. For interpretive constructions, the subjects of the predicative and the infinitive do not coincide: It is heartless to separate the mother from the children – ‘X separates, Y evaluates such an act as heartless’. The infinitives of perceptual and mental verbs in such a construction are either not used, or they denote a kind of action: It is tactless to listen to private conversations.
Letuchiy Alexander, Nikishina Elena
Animacy in the use of anaphoric and demonstrative pronouns in Russian and French
The article focuses on the role of animacy in Russian and French pronominal systems. Although animacy is a grammatical category only in Russian, while in French it is not reflected in the behavior of nouns, it turns out that some animacy-based restrictions on the use of anaphoric and demonstrative pronouns are common to the two languages. We address syntactic restrictions that affect the following types of uses: (i) use of anaphoric pronouns in copular constructions; (ii) repetition of anaphoric pronouns for the sake of clearness and / or emphasis; (iii) deictic use of anaphoric pronouns; (iv) anaphoric use of demonstrative pronouns. In all four cases, except, perhaps, the fourth one, pronouns tend to have an animate referent, while inanimate ones are more problematic. We conclude that these restrictions mainly result from the fact that animate objects have a greater discourse importance and more often become the main subject of the discourse than inanimate ones. At the same time, the degree of strictness of the restrictions sometimes differs between the two languages: for instance, demonstrative pronouns in the anaphoric use tend to have an animate antecedent in Russian, while for French, this tendency is weaker.
Levontina Irina
The semantic component ‘scale’ in the meaning of a discourse particle uzh
The modal particle uzh is perhaps the most difficult Russian discourse word to describe since its semantics is highly elusive. The existing descriptions are rather abstract and poorly correlate with various cases of usage of uzh. Besides, they do not take into consideration several crucial components of this particle’s meaning. For instance, in phrases like Uzh ya-to znayu (‘I do know’) one can notice a hugely important component of meaning - the idea of a scale. One can say Ya-to etot sekret znayu, a vot drugim nevdomek (‘I do know the secret, whereas others have no idea about it’), and in this example, uzh would be irrelevant. Uzh ya-to eto znayu presupposes that others probably know it too, but it’s me who knows it for sure. This very idea of a scale and poles together with the idea of the exceedance of expectations (which is also important for the meaning of uzh) constitutes the semantic contribution that this particle makes. Moreover, uzh partly smooths the opposition between the central and other elements of a multitude, because it does not exclude them from consideration, it just gives emphasis to that one. The aim of this research is to examine those types of uzh usage, where the idea of a scale is most clearly actualized. Probably, if we understand how the significant components of this particle’s meaning function, we will get closer to the development of a complete picture of its usage. For example, the idea of a scale within the meaning of uzh is expressed in the context of a special question (Zachem uzh tak zlo? ‘Why so mean?’). In an argument uzh often implies that the speaker was almost ready to back down, but not to this extent - like in a famous poem by Daniil Kharms called «Liar» (1930). The idea of a scale is vividly realized in the context of an implicit (Gde uzh mne!, ‘How can I…’) or explicit negation. 
It is especially interesting to pay attention to the peculiar effects of the combination of uzh with comparative forms (luchshe uzh, ‘it would be better...’). The usage of uzh in standard word combinations raz uzh, esli uzh, togda uzh has its restrictions, also connected with the idea of a scale. The development of a modal meaning in a temporal word, which brings the transformation of a timeline into a scale of expectations or possibilities, is quite typical.
Magomedova V. D., Slioussar N. A.
Gender and Case in Russian Nouns Denoting Professions and Social Roles
In the present paper, we analyzed a group of Russian nouns denoting professions and social roles. Historically, these nouns were masculine; in modern Russian, they can also be used with feminine agreement, but only nominative forms are regarded as normative (e.g. etot / eta vrač ‘thisM/F doctor’). We showed that oblique case feminine forms occur naturally using the Web-as-corpus approach and conducted three experimental studies. We discovered that offline rating and online processing of such forms depends on their case. Firstly, this is a unique example of the properties of the form influencing the properties of the lexeme. Secondly, the fact that all oblique forms are regarded as marginal and that locative was found to be significantly worse than other oblique cases points to a deep connection between grammatical gender and inflectional classes and to the crucial role of affix syncretism in morphological processing. This presents a challenge for different approaches in theoretical morphology.
Michurina Mariia, Ivoylova Alexandra, Kopylov Nikolay, Selegey Daniil
Morphological annotation of social media corpora with reference to its reliability for linguistic research
This paper presents the results of a study devoted to the applicability of SOTA methods for morphological corpus annotation (based on GramEval2020) to analytical sociolinguistic research. The study shows that statistically successful technologies of morphosyntactic annotation create a number of problems for researchers if they are used for such purposes on their own, i.e. without any linguistic knowledge. In this paper, methods for improving the reliability of morphological annotation, successfully implemented in GICR, are presented.
Mityushin Leonid, Iomdin Leonid
Experiments on human incremental parsing of English
Experiments have been carried out in which human subjects incrementally constructed dependency trees of English sentences. The subjects were successively presented with growing initial segments of a sentence, and had to draw syntactic links between the last word of the segment and the previous words. They were also shown a fixed number of lookahead words following the last word of the segment. The results of the experiments show that lookahead of 1 or 2 words is sufficient for confident incremental parsing of English declarative sentences.
Mustajoki Arto, Cherkunova Natalia, Sherstinova Tatiana
Communication Failures in Everyday Conversations: a Case Study Based on the “Retrospective Commenting Method”
The paper deals with communication failures in everyday spoken discourse. The spontaneous character of oral speech is its basic property and becomes a prerequisite for the appearance of such a phenomenon as communicative failures. By communicative failures, we mean speech situations when the recipient of a speech message does not understand it correctly, i.e., in the way the speaker intended. The purpose of this pilot study is 1) to assess the total number of communication failures that a person experiences during a single day and 2) to determine the dependence of communication failure frequency on the communication settings and conditions. The main result of the study is a qualitative and quantitative assessment of communication failures during a subject’s day. The research relies on a special experiment based on 24-hour monitoring of the subject’s speech and his subsequent retrospective commentary on all recorded data. Such an approach allows one to reduce the subjectivity inherent in much linguistic work. The research continues a series of studies devoted to the effectiveness of spoken communication and is important not only for understanding the fundamental processes of speech perception but also for the development of artificial intelligence systems involving human-computer speech dialogue and for speech technologies of the next generation.
Orzhenovskii Mikhail
RuSimScore: unsupervised scoring function for Russian sentence simplification quality
We propose an unsupervised complex scoring function (RuSimScore) to measure the simplification quality of Russian sentences, and a model for text simplification based on this function. The function makes it possible to score simplicity and preservation of the original meaning. First, we filtered a noisy parallel corpus (machine-translated WikiLarge) and extracted good simplification examples. After that, a pretrained language model was fine-tuned on these examples. We generate multiple outputs from the language model and select the best one according to the scoring function. The weights in the scoring function can be adjusted to balance between better content preservation and getting simpler sentences (controllable simplification).
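The candidate selection step can be illustrated with a minimal sketch. It is not the RuSimScore function itself: `simplicity` and `preservation` are hypothetical toy components (sentence length and word overlap), and the weight names are assumptions chosen only to show how adjusting the weights shifts the choice between simpler output and better content preservation.

```python
def simplicity(candidate: str) -> float:
    """Toy simplicity component (assumption): shorter sentences score higher."""
    return 1.0 / (1.0 + len(candidate.split()))


def preservation(source: str, candidate: str) -> float:
    """Toy meaning-preservation component (assumption): Jaccard word overlap."""
    s, c = set(source.lower().split()), set(candidate.lower().split())
    union = s | c
    return len(s & c) / len(union) if union else 0.0


def score(source: str, candidate: str,
          w_simple: float = 0.5, w_preserve: float = 0.5) -> float:
    """Weighted sum of the two components; the weights control the trade-off."""
    return (w_simple * simplicity(candidate)
            + w_preserve * preservation(source, candidate))


def select_best(source: str, candidates, **weights) -> str:
    """Pick the highest-scoring candidate among several generated outputs."""
    return max(candidates, key=lambda c: score(source, c, **weights))


source = "the cat sat on the mat near the door"
candidates = ["the cat sat on the mat", "a cat sat"]

# Favour content preservation: the longer, more faithful candidate wins.
faithful = select_best(source, candidates, w_simple=0.1, w_preserve=0.9)
# Favour simplicity: the shorter candidate wins.
simple = select_best(source, candidates, w_simple=0.9, w_preserve=0.1)
```

In a real system the candidates would come from the fine-tuned language model and the components from the actual scoring function; only the weighted selection logic is shown here.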
Pivovarova Lidia, Kutuzov Andrey
RuShiftEval: a shared task on semantic shift detection for Russian
We present the first shared task on diachronic word meaning change detection for Russian. The participating systems were provided with three sub-corpora of the Russian National Corpus (corresponding to the pre-Soviet, Soviet and post-Soviet periods, respectively) and a set of approximately one hundred Russian nouns. The task was to rank those nouns according to the degrees of their meaning change between periods. Although RuShiftEval is in many respects similar to the previous tasks organized for other languages, we introduced several novel decisions that allow for using novel methods. First, our manually annotated semantic change dataset is split into more than two time periods. Second, this is the first shared task on word meaning change which provided a training set. The shared task received submissions from 14 teams. The results of RuShiftEval show that a training set could be utilized for word meaning shift detection: the four top-performing systems trained or fine-tuned their methods on the training set. Results also suggest that using linguistic knowledge could improve performance on this task. Finally, this is the first time that contextualized embedding architectures (XLM-R, BERT and ELMo) clearly outperform their static counterparts in the semantic change detection task.
Podlesskaya V. I., Pozhilov Ju. M.
Semantics, Grammar and Prosody of parentheticals introduced by the Subordinator kak ‘as’
Based on data from the multimedia subcorpus of the Russian National Corpus, the paper addresses syntactic, semantic and prosodic features of a particular type of quotation with the reporting frame headed by the subordinator kak ‘as’ (kak skazal mne staryj rab pered tavernoj…). Our data show mixed evidence regarding the parenthetical status of the construction. On the one hand, typically for parentheticals, its function is clearly pragmatized, since it expresses the speaker’s attitude towards the quote. On the other hand, typical parentheticals have only a loose syntactic connection with their “host”, while the kak-phrase is introduced by the subordinator and has the form of a standard adverbial clause. Furthermore, while typical parentheticals are characterized by grammatical and prosodic reduction, grammatical and prosodic restrictions operating in the kak-phrase are optional and context (e.g., word order) sensitive. The kind of data we present supports the approach to parenthesis that doesn’t favor either/or decisions, but rather is based on multifactorial analysis that considers the whole range of possible parameters and isolates their observed language-specific clusters.
Ponomareva Maria, Petrova Maria, Detkova Julia, Serikov Oleg, Yarova Maria
SemSketches-2021: experimenting with the machine processing of the pilot semantic sketches corpus
The paper deals with elaborating different approaches to the machine processing of semantic sketches. It presents the pilot open corpus of semantic sketches. Different aspects of creating the sketches are discussed, as well as the tasks that the sketches can help to solve. Special attention is paid to the creation of the machine processing tools for the corpus. For this purpose, the SemSketches-2021 Shared Task was organized. The participants were given the anonymous sketches and a set of contexts containing the necessary predicates. During the Task, one had to assign the proper contexts to the corresponding sketches.
Pugachev Leonid, Burtsev Mikhail
Short Text Clustering with Transformers
Recent techniques for the task of short text clustering often rely on word embeddings as a transfer learning component. This paper shows that sentence vector representations from Transformers in conjunction with different clustering methods can be successfully applied to address the task. Furthermore, we demonstrate that the algorithm of enhancement of clustering via iterative classification can further improve initial clustering performance with classifiers based on pre-trained Transformer language models.
Rachinskiy Maxim, Arefyev Nikolay
Zero-shot Cross-lingual Transfer of a Gloss Language Model for Semantic Change Detection
Consulting word definitions from a dictionary is a familiar way for a human to find out which senses a particular word has. We hypothesize that a system that can select a proper definition for a particular word occurrence can also naturally solve the Semantic Change Detection (SCD) task. To verify our hypothesis, we followed an approach previously proposed for Word Sense Disambiguation (WSD) and trained a system that embeds word definitions and word occurrences into the same vector space. In this space, the embedding of the most appropriate definition has the largest dot product with a contextualized word embedding. The system is trained on an English WSD corpus. To make it work for the Russian language, we replaced BERT with the multilingual XLM-R language model and exploited its zero-shot cross-lingual transferability. Despite not fine-tuning the encoder model on any Russian data, this system achieves second place in the competition, and likely works for any of the one hundred other languages XLM-R was pre-trained on, though the performance may vary. We then measure the impact of such WSD pre-training and show that this procedure is crucial for our results. Since our model was trained to choose a proper definition for a word, we propose an algorithm for the interpretation and visualization of the semantic changes through time. By employing additional labeled data in Russian and training a simple regression model that converts the distances between output contextualized embeddings into more human-like scores of sense similarity between word occurrences, we further improve our results and achieve first place in the competition.
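The definition selection rule described in this abstract (the most appropriate definition has the largest dot product with the contextualized word embedding) can be sketched in a few lines. This is a toy illustration only: the vectors are made-up values standing in for encoder outputs, and `select_definition` is a hypothetical helper name, not part of the authors' system.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_definition(occurrence_vec, definition_vecs):
    """Return the index of the definition with the largest dot product."""
    scores = [dot(occurrence_vec, d) for d in definition_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up 3-dimensional embeddings (assumption: real ones come from the encoder).
occurrence = [0.9, 0.1, 0.0]
definitions = [
    [0.0, 1.0, 0.0],  # sense 0
    [1.0, 0.0, 0.2],  # sense 1
    [0.1, 0.1, 1.0],  # sense 2
]
best = select_definition(occurrence, definitions)
```

Here sense 1 is selected because its dot product with the occurrence vector (0.9) exceeds those of the other two senses.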
Rudneva E.A.
Switching to Work in an Inclusivity Workshop: Multimodal Analysis of Interaction
The study focuses on switching from talk to work in an “inclusivity workshop” for people with mental disabilities. Work activities and conversation about general topics can be approached from the perspective of multiactivity and considered courses of actions intertwined in social interaction. The order of activities is negotiated among participants using both linguistic and non-linguistic means. The data are extracts of video recordings containing a participant getting others to do things. The paper provides multimodal analysis of 6 cases of an instructor getting an autistic participant to switch to work, which occurred within a 17-minute conversation about animals. In the data, the autistic participant never provides a second-pair response to a directive. In 5 out of 6 cases analysed in the paper he fulfils the action to different extents, demonstrating various degrees of involvement. Getting the autistic person to switch to work is more effective when suggesting actions one by one, through concrete embodied actions, and when orienting to phases of the ongoing talk. The study highlights differences between autistic and non-autistic participants switching from one course of actions to another. Considering goals of an inclusivity workshop, success of switching to work can be also determined by the opportunities for the smooth conversation.
Ryzhova Anastasiia, Ryzhova Daria, Sochenkov Ilya
Detection of Semantic Changes in Russian Nouns with Distributional Models and Grammatical Features
The paper presents the models detecting the degree of semantic change in Russian nouns developed by the team aryzhova within the RuShiftEval competition of the Dialogue 2021 conference. We base our algorithms mostly on unsupervised distributional models and additionally test a model that uses vectors representing morphological preferences of the words in question. The best results are obtained by the model built on the ELMo architecture with a small window, while the quality of performance of the “grammatical” model is comparable to that of the models based on much more sophisticated algorithms.
Sakhovskiy Andrey, Izhevskaya Alexandra, Pestova Alena, Tutubalina Elena, Malykh Valentin, Smurov Ivan, Artemova Ekaterina
RuSimpleSentEval-2021 Shared Task: Evaluating Sentence Simplification
This report presents the results from the RuSimpleSentEval Shared Task conducted as a part of the Dialogue 2021 evaluation campaign. For the RSSE Shared Task, devoted to sentence simplification in Russian, a new middle-scale dataset is created from scratch. It comprises more than 3000 sentences sampled from popular Wikipedia pages. Each sentence is aligned with 2.2 simplified modifications, on average. The Shared Task implies sequence-to-sequence approaches: given an input complex sentence, a system should provide its simplified version. A popular sentence simplification measure, SARI, is used to evaluate the system’s performance. Fourteen teams participated in the Shared Task, submitting almost 350 runs involving different sentence simplification strategies. The Shared Task was conducted in two phases, with the public test phase allowing an unlimited number of submissions and the brief private test phase accepting one submission only. The post-evaluation phase remains open even after the end of private testing. The RSSE Shared Task has achieved its objective by providing a common ground for evaluating state-of-the-art models. We hope that the research community will benefit from the presented evaluation campaign.
Shatilov A. A., Rey A. I.
Sentence simplification with ruGPT3
This paper describes our solution for the RuSimpleSentEval shared task on sentence simplification held together with the Dialogue 2021 conference. Our approach was to filter the provided dataset, finetune the pretrained ruGPT3 model on it and select generated simple candidates based on cosine similarity and ROUGE-L with a complex sentence as an input. The system achieved SARI 38.49 and took third place in the competition. We have reviewed and analyzed examples of simplified sentences produced by the model. The analysis showed that the sentences produced by the system lose the original meaning of the input sentence in about half of the cases.
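The two measures named in this abstract for scoring generated candidates against the complex input sentence can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity is computed here over bag-of-words counts as a stand-in for whatever sentence representation the authors used, tokenization is plain whitespace splitting, and the abstract does not say how the two scores are combined, so the sketch only reports them.

```python
from collections import Counter
from math import sqrt


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def cosine_bow(a, b):
    """Cosine similarity of bag-of-words count vectors (an assumed stand-in)."""
    ca, cb = Counter(a), Counter(b)
    num = sum(ca[t] * cb[t] for t in ca)
    den = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return num / den if den else 0.0


def score_candidates(source, candidates):
    """Return (candidate, cosine, ROUGE-L F1) for each candidate string."""
    src = source.lower().split()
    scored = []
    for cand in candidates:
        tokens = cand.lower().split()
        scored.append((cand, cosine_bow(src, tokens), rouge_l_f1(src, tokens)))
    return scored
```

A selection rule (for example, thresholds on both scores) would be applied on top of `score_candidates`; which rule the authors used is not stated in the abstract.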
Shmelev Alexei
The Russian particle zhe in the light of parallel corpora
This paper deals with the Russian particle zhe and its use in Russian translations from English and demonstrates the possibilities of “one-focus analysis” in contrastive studies based on parallel corpora. It correlates the explications of zhe given in earlier studies (it makes special reference to the Active Dictionary of Russian) with the stimuli to translation, that is, fragments of the original English text that might cause the appearance of zhe in a Russian translation as a reaction to those stimuli. The study sought to validate, disprove or improve the semantic analysis of zhe made without recourse to electronic corpora. The analysis of the stimuli that have led Russian translators to use the particle zhe reveals important characteristics of this word. It turns out that the Russian particle zhe is often pragmatically obligatory, as its absence would violate the idiomatic nature of the utterance and change its illocutionary force. It is often the case that if a translator had given a word-for-word translation, that is, one without the particle, they would convey the precise meaning, but the translation would be inadequate: the wrong implicature would appear. On the other hand, when they add the particle, they may impart new shades of meaning which the original text did not contain.
Shulginov V.A., Mustafin R. Zh., Tillabaeva A.A.
Automatic Detection of Implicit Aggression in Russian Social Media Comments
This article studies the characteristics of implicit and explicit types of aggression in the comments of a Russian social network by means of machine learning. As it is hypothesized that the expression of aggression depends on local norms, the dataset contains comments collected from a single social media community. These comments were divided into three classes: polite communication, implicit aggression, and explicit aggression. Trying different combinations of data preprocessing, we discovered that lemmatization and replacing emojis with placeholders contribute to better results. We tested several models (Naive Bayes, Logistic Regression, Linear Classifiers with SGD Training, Random Forest, XGBoost, RuBERT) and compared their results. The study describes the misclassifications and compares the keywords of each class of comments. The results can be helpful for improving algorithms for the detection of implicit aggression.
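One of the preprocessing steps mentioned in this abstract, replacing emojis with placeholders, might look roughly like the sketch below. The placeholder token `<emoji>` and the Unicode ranges are assumptions chosen for illustration (the ranges cover common pictographs but not every emoji), and lemmatization is not shown because it requires a morphological analyzer.

```python
import re

# Assumed ranges: main pictograph blocks plus miscellaneous symbols/dingbats.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]+")


def replace_emojis(text, placeholder=" <emoji> "):
    """Replace each run of emoji characters with a single placeholder token
    and normalize whitespace."""
    replaced = EMOJI_RE.sub(placeholder, text)
    return " ".join(replaced.split())


sample = replace_emojis("nu ty daesh \U0001F602\U0001F602 pravda")
```

Collapsing a run of emojis into one token is a design choice of this sketch; a classifier could instead benefit from keeping the count or the emoji identity.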
Skrebtsova Tatiana, Grebennikov Alexander, Sherstinova Tatiana
The Dynamics of Vocabulary in Russian Prose (Based on Frequency Dictionaries of the Corpus of Russian Short Stories 1900-1930)
The paper presents the results of a study that is part of a large-scale project aimed at studying the changes that took place in the Russian language during the first three decades of the 20th century. In the history of Russia, this period was marked by stormy events that led to a radical change in the state system and the formation of a new society. To quantify the scale of the changes that occurred in the language as a result of these dramatic events, it is necessary to analyze a representative volume of linguistic data and to compare different chronological periods using quantitative methods. The research was carried out on the data of an annotated sample from the Corpus of Russian Short Stories of 1900-1930, which contains texts by 300 Russian writers. All the texts in the Corpus are divided into three time frames: 1) the pre-war period (1900-1913), 2) the war and revolutionary years (1914-1922) and 3) the early Soviet period (1923-1930). The frequency distribution of significant vocabulary over time was analyzed, which made it possible to identify the main tendencies in the change of the frequencies of individual words and lexical groups from one historical period to another and to correlate them with the previously identified dynamics of literary themes. The technique used makes it possible to trace the influence of large-scale political changes on the vocabulary of the literary language, to note the peculiarities and tendencies of the writers' worldview in a certain historical period, and to significantly supplement the analysis of the dynamics of literary themes in fiction.
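The kind of comparison described in this abstract, word frequencies tracked across the three periods, can be sketched as below. The normalization to items per million (ipm) is a common convention assumed here, not something the abstract specifies, and the toy lemma lists are invented purely to make the example runnable.

```python
from collections import Counter


def ipm(tokens):
    """Relative frequency of each lemma in items per million tokens."""
    total = len(tokens)
    if total == 0:
        return {}
    return {w: c * 1_000_000 / total for w, c in Counter(tokens).items()}


def frequency_by_period(corpus):
    """Map each period label to its ipm dictionary."""
    return {period: ipm(tokens) for period, tokens in corpus.items()}


def trajectory(word, freqs, periods):
    """Frequency of one lemma across the periods, in the given order."""
    return [freqs[p].get(word, 0.0) for p in periods]


# Invented toy data; real input would be lemmatized texts of each period.
periods = ["1900-1913", "1914-1922", "1923-1930"]
toy_corpus = {
    "1900-1913": ["dom", "bog", "dom", "chai"],
    "1914-1922": ["vojna", "dom", "vojna", "front"],
    "1923-1930": ["tovarishch", "zavod", "dom", "zavod"],
}
freqs = frequency_by_period(toy_corpus)
dom_trend = trajectory("dom", freqs, periods)
```

Normalizing by period size matters because the three sub-corpora need not contain the same number of tokens.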
Stenger Irina, Avgustinova Tania
On Slavic cognate recognition in context
This study contributes to a better understanding of reading intercomprehension as manifested in the intelligibility of East and South Slavic languages to Russian native speakers in contextualized cognate recognition experiments using Belarusian, Ukrainian, and Bulgarian stimuli. While the results mostly confirm the expected mutual intelligibility effects, we also register apparent processing difficulties in some of the cases. In search of an explanation, we examine the correlation of the experimentally obtained intercomprehension scores with various linguistic factors, which contribute to cognate intelligibility in a context, considering common predictors of intercomprehension associated with (i) morphology and orthography, (ii) lexis, and (iii) syntax.
Tatevosov S. G., Kisseleva X. L.
What have I seen? On the meaning and distribution of an experiential discourse marker
The paper explores the discourse marker ja vižu (lit. ‘I see’) and its cross-linguistic counterparts. We argue that it presents its scope proposition as the product of abduction, a logical inference that derives the optimal explanation for the observed state of affairs. This view is supported by a set of observations suggesting that restrictions on the distribution of ja vižu are mostly derivable as restrictions on abductive reasoning, which involve informativeness, likelihood and parsimony considerations.
Tikhomirov M. M., Loukachevitch N. V.
Meta-Embeddings in Taxonomy Enrichment Task
In this paper we consider the taxonomy enrichment task based on a recently released dataset, called Diachronic wordnets, created on the basis of English and Russian wordnets. We apply meta-embedding approaches, which combine several source embeddings, to the hypernym prediction of novel words and show that they obtain the best results for this task compared to other methods based on different principles. When combined with automatically extracted features from the Wiktionary online dictionary, the joint approach improves the results further.
Vatolin A. S., Smirnova E. Y., Shkarin S. S.
Russian News Similarity Detection with SBERT: pre-training and fine-tuning
Computation of text similarity is one of the most challenging tasks in NLP as it implies understanding of semantics beyond the meaning of individual words (tokens). Due to the lack of labelled data, this task is often accomplished by means of unsupervised methods such as clustering. Within the DE2021 shared task “Russian News Clustering and Headline Selection”, we propose a method of building robust text embeddings based on the Sentence Transformers architecture, pretrained on a large dataset of in-domain data and then fine-tuned on a small dataset of paraphrases leveraging GlobalMultiheadPooling.
Velichko A.N., Karpov A.A.
Automatic Detection of Deceptive and Truthful Paralinguistic Information in Speech using Two-Level Machine Learning Model
In this work, we present a novel approach to one of the computational paralinguistic tasks: automatic detection of deceptive and truthful information in human speech. This task belongs to the aspects of destructive behaviour and was first presented at the International INTERSPEECH Computational Paralinguistics Challenge ComParE in 2016. The need for a contactless method for deception detection follows from the fact that existing contact-based approaches such as polygraphs and lie detectors have multiple restrictions, which significantly limit their usage. For both training and testing of the proposed models we used two English-language corpora (Deceptive Speech Database and Real-Life Trial Deception Detection Dataset). We extracted three sets of acoustic features from those audio samples using the openSMILE toolkit. The proposed approach includes preprocessing of the extracted acoustic features using methods for data augmentation and dimensionality reduction of the feature space. We obtained 1680 speech utterances and a 986-dimensional informative feature vector for each utterance. The main part of the proposed approach is a two-level recognition model, where the first level includes three gradient boosting models (Catboost, XGBoost and LightGBM). The second level consists of a logistic regression-based model for the final prediction of truthfulness or deceptiveness that takes into account the predictions from the first level. Using this approach, we have achieved a classification result of F-score = 85.6%. The proposed approach can be used both independently and as a component of multimodal systems for detection of deceptive and truthful utterances in speech, as well as in systems for detection of destructive behaviour.
Voropaev Pavel, Sopilnyak Olga
Transformers for Headline Selection for Russian News Clusters
In this paper, we explore various multilingual and Russian pre-trained transformer-based models for the Dialogue Evaluation 2021 shared task on headline selection. Our experiments show that the combined approach is superior to individual multilingual and monolingual models. We present an analysis of a number of ways to obtain sentence embeddings and learn a ranking model on top of them. We achieve the result of 87.28% and 86.60% accuracy for the public and private test sets respectively.
Yanko Tatiana
The prosody of spoken dialogue
This paper is aimed at establishing the parameters of dialogic communication expressed through Russian prosody. The linguistic and extra-linguistic constituents of dialogue are analyzed. These are: the illocutionary meanings that generate speech acts characteristic of dialogic communication; the discourse links that combine the successive speech acts of one interlocutor if his/her current contribution to the dialogue is not limited to a single speech act; the prosodic characteristics of genre typical for a concrete type of communication (a friendly talk, an exam, a press conference, a scientific presentation, or an interrogation). The proposed taxonomy is based on the analysis of a small working corpus of spoken dialogues from the Russian National Corpus (Multimodal sub-corpus Murko), the annotated database Spokencorpora.ru, video-hosting Youtube.com, films, scientific conferences, and press conferences. The computer system Praat is used to analyze the sound data. The paper is illustrated with tracings of sound records.
Zalizniak Anna
Russian discourse markers vidimo and po-vidimomu (‘apparently’): synchronic and diachronical semantics
The article analyzes the meaning of the Russian discursive words vidimo and po-vidimomu (‘apparently’), and reconstructs the ways of their semantic evolution over the past two centuries. It is shown that the meaning of an inference made by the speaker on the basis of some data, which is the only one for both words in the modern language, arose in different ways. The semantic evolution of both words includes the replacement of the meaning of visual perception with the meaning of epistemic evaluation and the acquisition of egocentric semantics. The word vidimo initially served as a marker of a true visual impression; the word po-vidimomu, which initially included an interpretative component, acquired the meaning of a potentially false judgment, which was subsequently lost. The research is based on texts included in the Russian National Corpus (www.ruscorpora.ru).
Zanchi Chiara, Luraghi Silvia, Biagetti Erica
Linking the Ancient Greek WordNet to the Homeric Dependency Lexicon
The Ancient Greek WordNet is a new resource that is being developed at the Universities of Pavia and Exeter, based on the Princeton WordNet. The Princeton WordNet provides sentence frames for verb senses, but this type of information is lacking in most WordNets of other languages. In fact, exporting sentence frames from English to other languages is not a trivial task, as sentence frames depend on the syntax of individual languages. In addition, the information provided by the Princeton WordNet is not corpus-based but relies on native speakers’ knowledge. This type of information is not available for dead languages, which are by definition corpus languages. In this paper, we show how sentence frames can be extracted from morpho-syntactically parsed corpora by linking an existing dependency lexicon of Homeric verbs (HoDeL) to verbs in the Ancient Greek WordNet. Given its features, HoDeL allows automatically extracting all subcategorization frames available for each verb along with information concerning their frequency as well as semantic information regarding the possible arguments occurring in specific frames. In the paper, we show our method to automatically link the two resources and compare some of the resulting sentence frames with the English sentence frames in the Princeton WordNet.
Zimmerling Anton
Russian predicatives and the ontology of states
Based on the frequency dictionary of Russian predicatives, I measure the volume of the lexical class of non-agreeing predicatives licensing the productive dative-predicative sentence pattern, where the predicative assigns dative case to its animate subject. The tested vocabulary includes 422 elements. Their frequency rates are derived from the main corpus of the RNC using an approximation: the number of hits in the context “predicative + dative subject in 1Sg” in the window {-1; 1}. I argue that the Russian dative-predicative construction has an invariant meaning of internal state, i.e. a spatiotemporal stative situation with a priority argument. However, most predicatives licensing dative-predicative structures in Russian also express external states, i.e. spatiotemporal stative situations without a priority argument, if used without an overt referential dative subject. This can be proved both for words denoting physical sensations, cf. X-y kholodno ‘X is cold’ vs kholodno ‘It is cold’, and for some words denoting affections, cf. tosklivo ‘dreary’, ‘sad’, X-y tosklivo ‘X feels sad’ vs zdes’ tosklivo ‘It’s dreary here’. The shift from internal state to external state is licensed in Russian. If a lexical item has regular uses in the dative-predicative structure, it generally can express the meaning of external state outside this structure. The reverse is false: if a lexical item has regular uses as an external state, cf. vetreno ‘windy’, pyl’no ‘dusty’, it can only have infrequent side uses with a dative subject. This asymmetry is confirmed by the corpus data. I check an additional list of words with the meaning of external state, measure their frequency rate in the context “predicative + dative subject in 1Sg” in the window {-1; 1} and compare them to standard dative predicatives.
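The approximation used in this abstract, counting hits of a predicative with a 1Sg dative subject in the window {-1; 1}, can be illustrated with a toy counter over a token list. This is a simplified sketch: it matches the transliterated pronoun form mne by exact string comparison, whereas an actual RNC query would rely on lexico-grammatical annotation; the function name and the sample tokens are hypothetical.

```python
def count_dative_hits(tokens, predicative, dative="mne"):
    """Count occurrences of `predicative` that have the 1Sg dative pronoun
    immediately to the left or to the right (window {-1; 1})."""
    hits = 0
    for i, tok in enumerate(tokens):
        if tok != predicative:
            continue
        left = tokens[i - 1] if i > 0 else None
        right = tokens[i + 1] if i + 1 < len(tokens) else None
        if left == dative or right == dative:
            hits += 1
    return hits


# Invented transliterated sample, tokenized by whitespace.
sample = "mne kholodno , a zdes' vetreno . kholodno mne bylo".split()
kholodno_hits = count_dative_hits(sample, "kholodno")
vetreno_hits = count_dative_hits(sample, "vetreno")
```

Comparing such counts for standard dative predicatives and for external-state words is the kind of contrast the abstract describes; the numbers here come from the invented sample only.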