Proceedings 2020

Contents (SCOPUS)


Anastasyev D.G.
Exploring pretrained models for joint morpho-syntactic parsing of Russian
In this paper, we build a joint morpho-syntactic parser for Russian. We describe a method to train a joint model that is significantly faster than, and as accurate as, a traditional pipeline of models. We explore various ways to encode word-level information and how they affect the parser’s performance. To this end, we use character-level word embeddings learned from scratch and grammeme embeddings, which have shown state-of-the-art results on similar tasks for Russian in the past. We compare them with pretrained contextualized word embeddings, such as ELMo and BERT, which are known to have led to breakthroughs in a variety of tasks for English. We show that using them can significantly improve parsing quality.
Arefyev N. V., Fedoseev M. V., Kabanov A. V., Zizov V. S.
Word2vec Not Dead: Predicting Hypernyms of Co-Hyponyms Is Better Than Reading Definitions
Expert-built lexical resources are known to provide high-quality information at the cost of low coverage. This property limits their applicability in modern NLP applications. Building descriptions of lexical-semantic relations manually in sufficient volume requires a huge amount of qualified human labour. However, once an initial version of a taxonomy has been built, automatic or semi-automatic taxonomy enrichment systems can greatly reduce the required effort. We propose and experiment with two approaches to taxonomy enrichment, one utilizing information from word usages and another from word definitions, as well as a combination of them. The first method retrieves co-hyponyms for the target word from distributional semantic models (word2vec) or language models (XLM-R), then looks for hypernyms of these co-hyponyms in the taxonomy. The second method tries to extract hypernyms directly from Wiktionary definitions. The proposed methods were evaluated on the Dialogue-2020 shared task on taxonomy enrichment. We found that predicting hypernyms of co-hyponyms achieves better results in this task. The combination of both methods improves results further and is among the 3 best-performing systems for verbs. An important part of the work is a detailed qualitative and error analysis of the proposed methods, which provides interesting observations of their behaviour and ideas for future work.
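The first method in the abstract above (retrieve co-hyponyms, then look up their hypernyms) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy embedding vectors, the toy taxonomy, the function name `predict_hypernyms`, and the similarity-weighted voting are not taken from the paper, which uses word2vec or XLM-R and a real taxonomy.

```python
import math
from collections import Counter

# Toy embeddings (hypothetical values, for illustration only).
EMBEDDINGS = {
    "spaniel": [0.90, 0.10, 0.00],
    "poodle":  [0.85, 0.15, 0.05],
    "terrier": [0.80, 0.20, 0.00],
    "sparrow": [0.10, 0.90, 0.10],
    "hammer":  [0.00, 0.10, 0.95],
}

# Toy taxonomy (hypothetical): word already in the taxonomy -> its hypernyms.
TAXONOMY = {
    "poodle":  ["dog"],
    "terrier": ["dog", "hunting dog"],
    "sparrow": ["bird"],
    "hammer":  ["tool"],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_hypernyms(target, k=2, top_n=3):
    """Rank hypernyms of the k nearest taxonomy words (assumed co-hyponyms)."""
    target_vec = EMBEDDINGS[target]
    neighbours = sorted(
        ((cosine(target_vec, EMBEDDINGS[w]), w)
         for w in TAXONOMY if w != target and w in EMBEDDINGS),
        reverse=True,
    )[:k]
    votes = Counter()
    for similarity, word in neighbours:
        for hypernym in TAXONOMY[word]:
            votes[hypernym] += similarity  # similarity-weighted vote
    return [hypernym for hypernym, _ in votes.most_common(top_n)]


print(predict_hypernyms("spaniel"))
```

In this toy setup the two nearest neighbours of "spaniel" are "poodle" and "terrier", so their shared hypernym receives the highest vote.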
Badryzlova Yu. G.
Exploring Semantic Concreteness and Abstractness for Metaphor Identification and Beyond
The paper presents a method for computing indexes of semantic concreteness and abstractness in two languages (Russian and English). These indexes are used in metaphor identification experiments in both languages; the results are either comparable to or surpass previous work and the baselines. We analyze the obtained indexes of concreteness and abstractness to see how they align with linguistic intuitions about the corresponding semantic categories. The results of the analysis may have broader implications for computational studies of the semantics of concreteness and abstractness.
Баранов А.Н., Добровольский Д.О.
Dynamics of Style in 19th-Century Russian Written Language: A Corpus Experiment
We examine the hypothesis that the distribution of discourse words reflects trends in the development of the style of written language in the 19th century. We present and discuss the results of an experiment on the Russian National Corpus that studies the frequency of discourse words with the semantics of epistemic modality: конечно ‘of course’, разумеется ‘certainly’, по-видимому ‘apparently’, как кажется ‘as it seems’, казалось бы ‘it would seem’, наверно ‘probably’, вероятно ‘likely’, пожалуй ‘perhaps’, действительно ‘indeed’, and others. We show that the frequencies of this group of expressions increase in the second half of the 19th century. A similar trend is also observed for some syntactic constructions with the same semantics: (я) думаю, что… ‘(I) think that…’; (я) считаю, что… ‘(I) believe that…’; (мне) кажется, что… ‘it seems (to me) that…’. The identified pattern is treated as a discursive practice in the changing style of fiction, which consisted in expanding the modus part of the utterance compared with the earlier period. The discursive practice of modus expansion was characteristic only of a group of innovative writers (above all F. M. Dostoevsky, M. E. Saltykov-Shchedrin, L. N. Tolstoy, I. A. Goncharov, A. F. Pisemsky, P. I. Melnikov-Pechersky, N. S. Leskov and I. S. Turgenev), who nevertheless, owing to their talent, public significance and the volume of texts they published, had a substantial influence on the language of literary prose. The task of studying the dynamics of literary style is to identify and describe the set of discursive practices that shape written discourse as such.
Беликов В., Селегей В., Селегей Д.
The Internet Corpus as a Tool for Linguistic Research: Differentiality, Author Attribution, Thematic Biases (or Corpora We So Want to Believe)
The paper addresses the reliability of search results in internet corpora, using the ГИКРЯ corpus (General Internet Corpus of Russian) as an example. Several years of using the corpus for linguistic research have given us food for thought and led to some conclusions. We consider problems common to all internet corpora: the importance of accounting for sociolinguistic variation, the influence of falsely attributed texts, thematic biases under non-thematic classification, and the prospects and shortcomings of new methods of aggregating search results.
Blinova O. V., Tarasov N. A., Modina V. V., Blekanov I. S.
Modeling Lemma Frequency Bands for Lexical Complexity Assessment of Russian Texts
The paper is devoted to the problem of modeling general-language frequency using data from large Russian corpora. Our goal is to develop a methodology for building a consolidated frequency list that can later be used for assessing the lexical complexity of Russian texts. We compared 4 frequency lists derived from 4 corpora (Russian National Corpus, ruTenTen11, Araneum Russicum III Maximum, Taiga). First, we applied rank correlation analysis. Second, we used the measures “coverage” and “enrichment”. Third, we applied the measure “sum of minimal frequencies”. We found significant differences between the compared frequency lists both in ranking and in relative frequencies. The “coverage” measure showed that the frequency lists are by no means interchangeable. Therefore, none of the corpora in question can be excluded when compiling a consolidated frequency list. For a more detailed comparison of frequency lists across frequency bands, the ranked frequency list based on RNC data was divided into 4 equal parts. Then 4 random samples (containing 20 lemmas from each quartile) were formed. Because the ipm (instances per million) measure takes a wide range of values, relative frequency values are difficult to interpret. In addition, there are no reliable thresholds separating high-frequency, mid-frequency, and low-frequency lemmas. Meanwhile, to assess the lexical complexity of texts, it is useful to have a convenient way of distributing lemmas with certain frequencies over the bands of the frequency list. Therefore, we decided to assign lemmas “Zipf values”, which makes the frequency data interpretable because the range of the measure is small. The result of our work will be a publicly accessible reference resource called “Frequentator”, which will provide interpretable information about the frequency of Russian words.
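The conversion from raw counts to ipm and then to a Zipf value can be sketched as follows. This assumes the commonly used definition of the Zipf scale, log10 of the frequency per million words plus 3; the abstract does not state the exact formula or any band thresholds, and the function names and counts below are hypothetical.

```python
import math


def ipm(count, corpus_size):
    """Relative frequency in instances per million tokens."""
    return count * 1_000_000 / corpus_size


def zipf_value(count, corpus_size):
    """Zipf value under the assumed definition: log10(ipm) + 3."""
    return math.log10(ipm(count, corpus_size)) + 3.0


# Hypothetical counts in a hypothetical corpus of one billion tokens.
for lemma_count in (1_000, 100_000, 10_000_000):
    print(lemma_count, ipm(lemma_count, 1_000_000_000),
          round(zipf_value(lemma_count, 1_000_000_000), 2))
```

Under this definition a lemma occurring once per million tokens gets the value 3, and each tenfold increase in frequency adds 1, which is why the resulting range is much narrower than that of ipm.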
Boguslavsky I. M., Dikonov V. G., Frolova T. I., Iomdin L. L., Lazursky A. V., Timoshenko S. P., Rygaev I. P.
Full-Fledged Semantic Analysis as A Tool for Resolving Triangle-Copa Social Scenarios
Text interpretation often requires common sense knowledge and reasoning. A convenient tool for developing methods of common sense reasoning is a special set of challenge problems whose interpretation requires sophisticated reasoning. An interesting example is a recently published data set called Triangle Choice of Plausible Alternatives (Triangle-COPA), which contains 100 multiple-choice problems that test the interpretation of social scenarios. Each problem includes a statement and two alternatives. The task is to identify the more plausible alternative. For processing Triangle-COPA data we use SemETAP, a general-purpose semantic analyzer. We implement the full scenario of NL understanding, starting from NL texts rather than from manually composed simplified logical formulas, which is a common practice in logic-based approaches to common sense reasoning. We produce Enhanced Semantic Structures of the statement and both alternatives and check which alternative manifests more semantic agreement with the statement in terms of inferences.
Bolshina A. S., Loukachevitch N. V.
Generating Training Data for Word Sense Disambiguation in Russian
The best approaches in Word Sense Disambiguation (WSD) are supervised and rely on large amounts of hand-labelled data, which are not always available and are costly to create. For the Russian language there is no sense-tagged resource of a size sufficient to train supervised word sense disambiguation algorithms. In our work we describe an approach used to create an automatically labelled collection based on monosemous relatives (related unambiguous entries). The main contribution of our work is that we extracted monosemous relatives that can be located at relatively long distances from a target ambiguous word and ranked them according to a measure of similarity to the target sense. The selected candidates are then used to extract training samples from a news corpus. We evaluated word sense disambiguation models based on nearest neighbor classification over BERT and ELMo embeddings. Our work relies on the Russian wordnet RuWordNet.
Bocharov V. V., de Chalendar G.
The Russian Language Pipeline in the LIMA Multilingual Analyzer
This paper describes the implementation of Russian text processing in the LIMA analyzer and our participation in the GramEval-2020 competition. LIMA is a modular text processing system that includes statistical and rule-based components. Russian text processing is implemented with statistical models based on deep neural networks and includes tokenization, morphological analysis, lemmatization and dependency tree construction. Morphological and syntactic analysis follow the Universal Dependencies guidelines.
Budennaya E. V., Evdokimova A. A., Nikolaeva Ju. V., Sukhova N. V.
Referential Phenomena in Speaker’s Kinetic Channels
The article addresses the relation of referential expressions and co-occurring kinetic phenomena (hand and head gestures) on the material of the RUPEX multimodal corpus. The results reflect significant differences in how individual movements and gestures are aligned with two major types of reference (full NPs vs. reduced expressions). It was initially assumed that full NPs are more often accompanied by a gesture. Our data support this hypothesis not only through the material of hand gestures, but also through head movements. Moreover, full NPs are more likely to be accompanied by downward movements in both manual and cephalic channels, as well as by metadiscourse gestures, in comparison to reduced referential units (personal and demonstrative pronouns). In addition, pronouns are more likely to be aligned with pointing hand gestures and zero reference is often accompanied by descriptive hand gestures. However, the kinetic behavior of the interlocutors is determined by a variety of factors, including the topic of the conversation, which predisposes to certain types of gestures and the relative position of the interlocutors.
Чернова Д.А., Алексеева С.В., Слюсарь Н.А.
What Errors Teach Us: Difficulties in Processing Words That Are Frequently Misspelled
Even if we know how to spell, we often see words misspelled by other people, especially nowadays, when we constantly read unedited texts on social media and in personal messages. In this paper, we present two experiments showing that the incidence of orthographic errors reduces the quality of lexical representations in the mental lexicon: even if one knows how to spell a word, repeated exposure to incorrect spellings blurs its orthographic representation and weakens the connection between form and meaning. As a result, it is more difficult to judge whether the word is spelled correctly, and, more surprisingly, it takes more time to read the word even when there are no errors. We show that, when all other factors are balanced, the effect of misspellings is more pronounced for words with lower frequency. We compare our results with the only previous study addressing the influence of misspellings on the processing of correctly spelled words, which was conducted on English data. It may be interesting to explore this issue from a cross-linguistic perspective. In this study, we turn to Russian, which differs from English in having a more transparent orthography. Much larger corpora of unedited texts are available for English than for Russian, but, using a different way to estimate the incidence of misspellings, we obtained similar results and could also make some novel generalizations. In Experiment 1 we selected 44 words that are frequently misspelled; they were presented in two conditions (with or without spelling errors) and distributed across two experimental lists. For every word, participants were asked to determine whether it was spelled correctly. The frequency of the word and the relative frequency of its misspelled occurrences significantly influenced the number of incorrect responses: not only does it take longer to read frequently misspelled words, it is also more difficult to decide whether they are spelled correctly.
In Experiment 2 we selected 30 words from the materials of Experiment 1 and, for every selected word, found a counterpart that is matched for length and frequency but is rarely misspelled due to its orthographic transparency. We used a lexical decision task, presenting these 60 words in the correct spelling, as well as 60 nonwords. The data were analysed with linear mixed models (LMMs). First, the word type factor was significant: it takes more time to recognize a frequently misspelled word, which replicates the results obtained for English. Second, the interaction between the word type factor and the frequency factor was significant: the effect of misspellings was more pronounced for words of lower frequency. We can conclude that high-frequency words have more robust representations that resist blurring more efficiently than low-frequency ones. Finally, we conducted a separate analysis showing that the number of incorrect responses in Experiment 1 correlates with RTs in Experiment 2. Thus, whether we consciously try to find an error or simply read words, orthographic representations blurred by exposure to frequent misspellings make the task more difficult.
Чуйкова О.Ю.
On the Peculiarities of Secondary Imperfectivation of Verbs with the Prefix po- in Russian
The paper deals with a number of characteristics of the secondary imperfectivation of po-perfectives in Russian. The study is based on the analysis of the level of imperfectivability of Russian perfective verbs with the prefix po- compared to a number of other prefixed perfective verb groups (e. g. the verbs with such perfectivizing prefixes as na-, za-, etc.) according to the Dictionary of Russian Language, the Russian National Corpus and the Russian-language Internet (Runet). It is shown that the discussed perfective verb group is specific as a whole as well as with respect to its subgroups, i. e., deperfective perfective verbs and morphologically marked Aktionsarten. Po-perfectives demonstrate a low average imperfectivability in comparison to the corresponding figures for other prefixed verb groups. For the subgroup of deperfective verbs (formed from perfective stems), the level of imperfectivability is also unusually low. The delimitative Aktionsart shows a higher imperfectivability than other morphologically marked Aktionsarten do. Possible explanations for the peculiarities of imperfectivability of po-perfectives confirm rather than contradict the hypothesis about the regularity of secondary imperfectivation in Russian.
Davletov A. A., Gordeev D. I., Rey A. I., Arefyev N. V.
RENERSANS: Relation Extraction and Named Entity Recognition as Sequence Annotation
In this work we present our system for the RuREBus shared task held in conjunction with the Dialogue 2020 conference. The task consisted of 3 subtasks: named entity recognition, relation extraction with provided named entity tags, and end-to-end relation extraction. Our system took first and second place in the first and second subtasks respectively. For the third subtask we submitted our solution only in the post-evaluation phase; however, it was among the top 2 best-performing systems. The systems for all tasks are based on Transformer models. Relation extraction was solved as a sequence labelling problem. We also used joint learning of named entity recognition and relation extraction.
Dale D. S.
A Simple Solution for The Taxonomy Enrichment Task: Discovering Hypernyms Using Nearest Neighbor Search
In this paper, we present the system we used in the Taxonomy Enrichment for the Russian Language evaluation campaign. The goal of this challenge is to predict hypernyms for the words not included in the taxonomy. Our approach was to generate and score candidate hypernyms by word embedding similarity of the input words and concepts already in the taxonomy. Despite being very simple, our system was ranked first on the verbs track.
Derbanosov R., Bakhanova M.
Stability of Topic Modeling Via Modality Regularization
Probabilistic topic modeling is a tool for statistical text analysis that can give us information about the inner structure of a large corpus of documents. The most popular models, Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation, produce topics in the form of discrete distributions over the set of all words of the corpus. They build topics using an iterative algorithm that starts from some random initialization and optimizes a loss function. One of the main problems of topic modeling is sensitivity to random initialization, that is, producing significantly different solutions from different initial points. Several studies have shown that side information about documents may improve the overall quality of a topic model. In this paper, we consider the use of additional information in the context of the stability problem. We represent auxiliary information as an additional modality and use the BigARTM library to perform experiments on several text collections. We show that using side information as an additional modality improves topic stability without a significant loss of model quality.
Деткова Ю., Новицкий В., Петрова М., Селегей В.
Differential Semantic Sketches for Russian Internet Corpora
The paper describes a new type of aggregated corpus output, semantic sketches, which has received a trial implementation on one of the subcorpora of ГИКРЯ. Semantic sketches are a natural extension of the idea of corpus sketches to the analysis of collocability in terms of semantic relations and semantic classes. The qualifying attribute “differential” means that sketches can be additionally parameterized by metatextual characteristics. Naturally, building such sketches requires semantic annotation of the corpus, for which this work used partial semantic parses produced by Compreno. The paper gives examples of the sketches built and assesses the strengths and problems of corpus statistics of this kind.
Dyachkov V. V., Khomchenkova I. A., Pleshak P. S., Stoynova N. M.
Annotating and Exploring Code-Switching in Four Corpora of Minority Languages of Russia
This paper describes code-switching with Russian in four spoken corpora of minority languages of Russia: two Uralic ones (Hill Mari and Moksha) and two Tungusic ones (Nanai and Ulch). All narrators are bilinguals, fluent both in the indigenous language (IL) and in Russian; all the corpora are comparable in size and genres (small field collections of spontaneous oral texts, produced under the instruction to speak IL); the languages are comparable in structural (dis)similarity with Russian. The only difference concerns language dominance and the degree of language shift across the communities. The aim of the paper is to capture how the degree of language shift influences the strategy of code-switching attested in each of the corpora using a minimal additional annotation of code-switching. We added to each corpus a uniform annotation of code-switching of two types: first, a simple semi-automatic word-by-word language annotation (IL vs. Russian); second, a manual annotation of structural code-switching types (for smaller sub-corpora). We compared several macro-parameters of code-switching by applying some existing simple measures of code-switching to the data of annotation 1. Then we compared the rates of different structural types of code-switching, based on annotation 2. On the one hand, the results of the study verify and enhance the existing generalizations on how language shift influences code-switching strategies; on the other hand, they show that even a very simple annotation of code-switching integrated into an existing collection of field records is very informative for code-switching studies.
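To make concrete what a corpus-level measure over word-by-word language tags can look like, here is a minimal sketch. The abstract does not name the measures the authors applied, so the two shown here are assumptions: the M-index in its commonly cited form, (1 - Σ p_j²) / ((k - 1) · Σ p_j²), and the share of adjacent token pairs whose language differs. The tag sequence is invented.

```python
from collections import Counter


def m_index(tags):
    """M-index over language tags: 0 = monolingual, 1 = perfectly balanced.

    Note: k is taken here as the number of languages observed in `tags`.
    """
    counts = Counter(tags)
    k = len(counts)
    if k < 2:
        return 0.0
    n = len(tags)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return (1 - sum_sq) / ((k - 1) * sum_sq)


def switch_rate(tags):
    """Share of adjacent token pairs where the language changes."""
    if len(tags) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(tags, tags[1:]) if a != b)
    return switches / (len(tags) - 1)


# Hypothetical word-by-word annotation: IL = indigenous language, RU = Russian.
tags = ["IL", "IL", "RU", "IL", "RU", "RU"]
print(m_index(tags), switch_rate(tags))
```

The first number reflects how evenly the two languages are represented, the second how often speakers alternate between them; both need only the simple annotation of the first type.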
Evseev D. A., Arkhipov M. Yu.
SPARQL Query Generation for Complex Question Answering with BERT and BiLSTM-Based Model
In this paper we describe a question answering system for answering complex questions over the Wikidata knowledge base. Unlike simple questions, which require extraction of a single fact from the knowledge base, complex questions are based on more than one triplet and need logical or comparative reasoning. The proposed question answering system translates a natural language question into a query in the SPARQL language, the execution of which gives an answer. The system includes models that determine the SPARQL query template corresponding to the question and then fill the slots in the template with entities, relations and numerical values. For entity detection we use a BERT-based sequence labelling model. Ranking of candidate relations is performed in two steps with BiLSTM and BERT-based models. The proposed models are the first solution for the LC-QUAD2.0 dataset. The system is capable of answering complex questions that involve comparative or boolean reasoning.
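The template-and-slots step described above can be illustrated with a small sketch. The templates, the template names and the `fill_template` helper are hypothetical and are not taken from the paper; in the described system the template is chosen by a model and the slot values come from the entity detection and relation ranking models, whereas here they are simply passed in by hand. The Wikidata identifiers are used only as example slot values.

```python
from string import Template

# Hypothetical query templates keyed by an invented template name.
TEMPLATES = {
    "simple": Template(
        "SELECT ?answer WHERE { wd:$entity wdt:$relation ?answer . }"
    ),
    "count": Template(
        "SELECT (COUNT(?x) AS ?answer) WHERE { wd:$entity wdt:$relation ?x . }"
    ),
    "boolean": Template(
        "ASK WHERE { wd:$entity wdt:$relation wd:$value . }"
    ),
}


def fill_template(template_name, **slots):
    """Fill the slots of the chosen template; raises KeyError if one is missing."""
    return TEMPLATES[template_name].substitute(**slots)


# Example slot values: Q142 (France), P36 (capital), Q90 (Paris).
print(fill_template("simple", entity="Q142", relation="P36"))
print(fill_template("boolean", entity="Q142", relation="P36", value="Q90"))
```

Keeping the query skeleton fixed and predicting only the template identifier and the slot fillers reduces query generation to a classification step plus a few ranking steps.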
Эндресен A. A., Жукова В.А., Мордашова Д.Д., Рахилина Е.В., Ляшевская О.Н.
The Russian Constructicon: A New Linguistic Resource, Its Design and Key Characteristics
We present a new open-access electronic resource named the Russian Constructicon that offers a searchable database of Russian constructions accompanied by descriptions of their properties and illustrated with corpus examples. The project was carried out over the period 2016–2020 and at present contains an inventory of over 2200 multi-word constructions of Contemporary Standard Russian. We prioritize “partially schematic” constructions that lie between the two extremes of fully compositional syntactic sequences on the one hand and fully idiomatic (phraseological) expressions on the other hand. Constructions of this type are difficult to account for in terms of either lexicon or grammar alone, and are often underrepresented in reference works of Russian. A typical construction in our database contains a fixed part (anchor words) and an open slot that can be filled with a restricted set of lexemes. In this paper we first focus on key characteristics of this resource that make it different from existing constructicons of other languages. Second, we describe how the new interface will be designed and how it will serve the needs of both linguists and L2 learners of Russian. In particular, we discuss various search possibilities relevant for different users and those parameters that are available for specifying the retrieval output. An example of an entry is given to show how the information about each construction is structured and presented. Third, we provide an overview of our multi-level semantic classification of constructions. We argue that our system of semantic and syntactic tags subdivides our items into meaningful classes and smaller groups and eventually facilitates the identification of constructional families and clusters. This methodology works well in turning the initial list of constructions as unrelated units into a structured network and makes it possible to refine and expand the collected inventory of constructions in a systematic way.
Eremeev M. A., Vorontsov K. V.
Quantile-Based Approach to Estimating Cognitive Text Complexity
This paper introduces an approach to measuring the cognitive complexity of texts on various language levels. While standard readability indices are based on the linear combination of primary statistics, our general approach allows us to estimate complexity on morphological, lexical, syntactic, and discursive levels. Each model is defined by the tokens for the specific language level and the complexity function of a single token. We then use the reference collection of moderately complex texts and the quantile-based approach to spot the abnormally rare tokens. The proposed supervised ensemble, based on the ElasticNet model, incorporates models from all language levels. Having collected a labeled dataset through crowdsourcing, consisting of pairs of articles from the Russian Wikipedia, we consider several models and ensembles and compare them to common baselines. Suggested models are flexible due to the freedom in choosing the reference collection. The described experiments confirm the competitiveness of the proposed approach, as the ensembles demonstrate the best target metric value.
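The idea of spotting abnormally rare tokens against a reference collection can be sketched as follows. This is only one possible reading of the abstract: the nearest-rank quantile, the use of word length as the per-token complexity function, the share of above-threshold tokens as the text score, and all names and data are assumptions made for illustration.

```python
import math


def quantile(values, q):
    """Empirical quantile of `values` using the nearest-rank rule."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[index]


def text_complexity(tokens, token_score, reference_tokens, q=0.9):
    """Share of tokens whose score exceeds the q-quantile of the reference."""
    threshold = quantile([token_score(t) for t in reference_tokens], q)
    if not tokens:
        return 0.0
    hard = [t for t in tokens if token_score(t) > threshold]
    return len(hard) / len(tokens)


# Hypothetical reference collection and a toy complexity function (word length).
reference = ["a", "to", "cat", "dog", "house", "run", "the", "of", "in", "tree"]
text = ["the", "cat", "photosynthesizes", "in", "sunlight"]
print(text_complexity(text, len, reference))
```

Swapping `token_score` and the tokenization (morphemes, lemmas, syntactic patterns) would give one such model per language level, and changing `reference` changes what counts as unusually complex.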
Feldman D. G., Sadekova T. R., Vorontsov K. V.
Combining Facts, Semantic Roles and Sentiment Lexicon in A Generative Model for Opinion Mining
Opinion mining is a popular task that is applied, for example, to determine news polarisation and to identify product review classes. Our task is unsupervised clustering of opinionated texts, in particular news on political events. Many papers that tackle this issue use generative models based on lexical features. Our goal is to determine the entities defining an opinion among lexical, syntactic and semantic features as well as their compositions. More specifically, we test the hypothesis that an opinion is determined by the composition of the mentioned facts (SPO triples), the semantic roles of the words and the sentiment lexicon used in it. In this paper we formalise this task and show that using a composition of the above features provides the best quality when clustering opinionated texts. To test this hypothesis we gathered and labelled two corpora of news on political events and proposed a set of unsupervised algorithms for extracting the features.
Fenogenova A. S., Tikhonova M. I., Filipetskaya D. V., Mironenko F., Tabisheva A. O.D.
Event2mind For Russian: Understanding Emotions and Intents in Texts. Corpus and Model for Evaluation
The paper provides a comprehensive overview of a Russian-language corpus for the commonsense inference task. Namely, we construct event phrases that cover a wide range of everyday situations, labelled with the intents and reactions of the main participant of the event and the emotions of other people involved. The dataset consists of two parts: a crowdsourced corpus of 6,756 examples from Russian sources and a part of the original corpus translated into Russian (23,409 examples). In addition, we use the collected data to train the event2mind model for the Russian language. The paper presents a careful description of the best Russian model and the results of the conducted experiments.
Гончаров А.А., Инькова О.Ю.
Implicit Logical-Semantic Relations and a Method for Finding Them in Parallel Texts
One of the main characteristics of logical-semantic relations (LSRs) between two fragments of a text is that these relations can be either explicit (expressed by some marker, e. g. a connective) or implicit (derived from the interrelation of these fragments’ semantics). Since implicit LSRs do not have any marker, it is difficult to find them in a text (whether automatically or not). In this paper, approaches to analysing implicit LSRs are compared, an original definition for them is offered and differences between implicit LSRs and LSRs expressed by non-prototypical means are described. A method is proposed to identify implicit LSRs using a parallel corpus and a supracorpora database of connectives. Based on the well-known statement that LSRs can be made explicit by adding connectives in the translation, it is argued here that through selecting pairs in which fragments where a connective is used to express an LSR in the translation correspond to those containing any of the translation stimuli standard for this connective in the source language, it is possible to get an array of contexts in which this LSR is implicit in the source text (or expressed by means other than connectives). This method is then applied to study the French causal connectives car, parce que and puisque using a Russian-French parallel corpus. The corpus data are analysed to obtain information about LSRs, particularly about cases where the causal LSR in Russian is implicit, as well as about the use of causal connectives in French. These results are used to show that the proposed method makes it possible to quickly create a representative array of contexts with implicit LSRs, which can be useful both in text analysis and in machine learning.
Горбова Е.В.
Aspectual Triplets of the Russian Verb in Diachrony (Based on the Russian National Corpus)
The paper deals with the so-called aspectual triplets of the Russian verb. Based on the data from the Russian National Corpus, it proposes a diachronic method to study triplets as well as a two-component model of the Russian aspect as an alternative to the traditional word-based classification model. The first component of the model is a morphological mechanism of the imperfectivizing suffixation of prefixed verbs that is inflectional (ras-kry-t’PFV — ras-kry-va-t’IPFV2 ‘disclose, reveal’), but has a limited scope of action (prefixed verbs only). The second component of the model is the actionality (lexical aspect) with a maximal scope. Related to the verb class as a whole, it is especially crucial for non-prefixed simplexes. Actionality enables the functioning and perfective / imperfective characterization of simplexes which do not fall under the inflectional grammatical aspect.
Gordeev D. I., Davletov A. A., Rey A. I., Akzhigitova G. R., Geymbukh G. A.
Relation Extraction Dataset for The Russian
There are few existing relation extraction datasets for the Russian language and they contain a rather small number of examples. Thus, we decided to create a new Ontonotes-based named entities and relation extraction sentence-level dataset called RURED. The dataset contains more than 500 annotated texts and more than 5,000 labelled relations. We also publish baseline models for relation extraction and named entity recognition trained on the dataset. Our models achieve 0.85 for named entity recognition and 0.78 for relation extraction in F1-score.
Ilvovsky D. A., Galitsky B. A.
Dialogue Management Using Extended Discourse Trees
In this paper we learn how to manage a dialogue relying on the discourse structure of its utterances. We consider two complementary approaches to dialogue management based on discourse text analysis to extend the abilities of an interactive information retrieval-based chat bot.
Инькова О.Ю.
A Quantitative Method for Analysing Connectives: A “Portrait” of the Russian Conjunction ILI in the Supracorpora Database of Connectives
The functional properties of the conjunction ili ‘or’ are quite well studied and discussed in grammars and a number of specific studies. However, they have not been subjected to multivariate quantitative analysis. The paper proposes this kind of analysis, carried out according to six parameters: i) logical-semantic relation expressed by the conjunction, ii) syntactic structure of the text fragment introduced by it, iii) position of the conjunction in this text fragment, iv) order of the text fragments connected by it, v) status of ili in the context (cf. its use as a particle mentioned in MAS), vi) disposition of the elements that make up the multiword connectives (cf. ili ... ili prosto ‘or ... or just’). The analysis of the formal variants of ili, carried out using the Supracorpora database of connectives, made it possible to formulate the conclusions that follow and to confirm them with quantitative data. i) Always occupying the initial position in the text fragment introduced by it, ili is used mainly as a connective. ii) The most typical order of text fragments for formal variants with ili is the p CNT q scheme. iii) By its syntactic characteristics ili is directly opposite to i ‘and’: ili is used in written texts in most cases for a non-predicative coordination. iv) Formal variants with ili express the relation of alternative at all three semantic levels (propositional, illocutive, metalinguistic), as well as the relations of substitution, correction, and negative alternative, but with a clear predominance of propositional alternative. v) Ili forms multiword and two- and multicomponent connectives; their composition varies depending on the relation expressed by them.
Inshakova E. S., Sizov V. G.
An Experimental Rule-Based Parser for Russian Employing the NLP Resources of The ETAP System
This paper presents a rule-based dependency parser for Russian based on a bottom-up approach. Its rules are partially rewritten ETAP syntagms, organized into groups that constitute a single pipeline. We demonstrate that such an organization enhances the performance of our parser relative to that of the ETAP system and enables it to successfully process long phrases (more specifically, heavy nominal and prepositional phrases at the current experimental stage of our work).
Иомдин Б.Л., Иомдин Л.Л.
The Valency Structure of Some Speech Predicates: New Findings
The paper examines the valency frames of a number of Russian verbal predicates whose meaning includes a speech act and, at some stage of semantic decomposition, negation, such as vozrazhat’ ‘object’, vozmushchat’sja ‘be indignant’, izvinjat’sja ‘apologize’, and others. We hypothesize that the valency frames of such predicates include a pair of propositional valencies that are clearly opposed to each other: (1) the stimulus valency, which expresses a state of affairs, and (2) the reaction valency, which introduces the speech act performed by the subject in response to this state of affairs and offering an explanation for it. For example, in the sentence Ivan izvinilsja, chto ne prishel na moj den’ rozhdenija ‘Ivan apologized for not coming to my birthday party’, the clause introduced by the conjunction chto ‘that’ expresses the state of affairs, whereas in the sentence Ivan izvinilsja, chto plokho sebja chuvstvoval ‘Ivan apologized, saying that he had been feeling unwell’, such a clause conveys Ivan’s verbal reaction to the state of affairs (for example, his absence from my birthday party) that prompts him to explain this absence. It is shown that these valencies cannot be adequately described in terms of a single semantic role of content. The authors also propose a generalization of this phenomenon, comparing it with other types of valency pairs, and put forward the hypothesis that there exist predicates with two valency centers.
Ivanin V. A., Artemova E. L., Batura T. V., Ivanov V. V., Sarkisyan V. V., Tutubalina E. V., Smurov I. M.
RUREBUS-2020 Shared Task: Russian Relation Extraction for Business
In this paper, we present a shared task on core information extraction problems: named entity recognition and relation extraction. In contrast to popular shared tasks on related problems, we try to move away from strictly academic rigor and instead model a business case. As a source of textual data we chose a corpus of Russian strategic documents, which we annotated according to our own annotation scheme. To speed up the annotation process, we exploited various active learning techniques. In total we ended up with more than two hundred annotated documents, which allowed us to create a high-quality data set in a short time. The shared task consisted of three tracks, devoted to 1) named entity recognition, 2) relation extraction and 3) joint named entity recognition and relation extraction. We provided participants with the annotated texts as well as a set of unannotated texts, which could have been used in any way to improve solutions. In the paper we overview and compare the solutions submitted by the shared task participants. We release both raw and annotated corpora along with annotation guidelines, evaluation scripts and results at
Kononenko I. S., Sidorova E. A., Akhmadeeva I. R.
Comparative Analysis of Rhetorical and Argumentative Structures in The Study of Popular Science Discourse
The proposed work is part of an ongoing research project aimed at creating a discourse-annotated corpus of popular science texts written in Russian. Annotation is carried out within the framework of a multi-level model of discourse, which considers the text from the perspective of genre, rhetorical and argumentative organization. We conduct a comparative study of the rhetorical and argument annotations, discuss their similarities and differences on the segment and structural levels and illustrate them with examples of standard schemes of reasoning described in D. Walton’s theory of structured argumentation: “Argument from Expert Opinion”, “Argument from Example”, and “Argument from Cause to Effect”. Special attention is paid to discourse markers registered during annotation as key indicators of discourse structure. We report the results of an experiment with argument indicator patterns, based on the list of rhetorical markers and aimed at the extraction of “from Expert Opinion” arguments.
Konovalov V. P., Gulyaev P. A., Sorokin A. A., Kuratov Y. M., Burtsev M. S.
Exploring the Bert Cross-Lingual Transfer for Reading Comprehension
Multilingual BERT has been shown to generalize well in a zero-shot cross-lingual setting. This generalization was measured on POS and NER tasks. We explore the cross-language transferability of multilingual BERT on the reading comprehension task. We compare different modes of training a question-answering model for a non-English language using both English and language-specific data. We demonstrate that the model based on multilingual BERT is slightly behind the monolingual BERT-based model on Russian data; however, it achieves results comparable to the language-specific variant on Chinese. We also show that training jointly on English data and an additional 10,000 monolingual samples allows it to reach performance comparable to that of a model trained on monolingual data only.
Korotaev N. A., Podlesskaya V. I., Smirnova K. V., Fedorova O. V.
Disfluencies in Russian Spoken Monologues: A Distributional Analysis
The paper addresses the overall distribution of speech disfluencies in Russian spoken monologic discourse: based on corpus data, we investigate qualitatively and quantitatively how disfluencies of different types group (or do not group) with each other and how isolated disfluencies and their sequences alternate with periods of fluent speech in the course of speech production. Self-repairs, filled and silent pauses, and instances of hesitation lengthening were annotated in a subcorpus of the “Russian Pears Chats and Stories” (RUPEX). A distribution-oriented typology of disfluencies was proposed that distinguishes between isolated disfluencies, disfluency clusters, and quasi-clusters. We claim that disfluency tokens tend to cluster, as isolated occurrences are significantly less frequent in our data than would be expected based on the relative frequency of tokens. This finding contradicts previous studies that treated disfluency clusters as a more marginal phenomenon, and emphasizes the importance of a distributional, rather than merely structural, approach to annotating disfluencies. Furthermore, individual types of disfluency tokens demonstrate significantly different distributional patterns. Compared to other types, self-repairs occur more often in isolation, while words with hesitation lengthening appear predominantly in clusters, and filled pauses most often group with silent pauses to form quasi-clusters.
Korzun V. A.
R-BERT for Relationship Extraction on Russian Business Documents
This paper presents the results of our participation in the Russian Relation Extraction for Business shared task (RuREBus) within Dialogue Evaluation 2020. Our team took first place among 5 other teams in the Relation Extraction with Named Entities task. The experiments showed that the best model is based on R-BERT. R-BERT has achieved significant results in comparison with models based on convolutional or recurrent neural networks on the SemEval-2010 Task 8 relation dataset. In order to adapt this model to the RuREBus task we also added some modifications, such as negative sampling. In addition, we tested other models for the Relation Extraction and Named Entity Recognition tasks.
Kunilovskaya M., Kutuzov A., Plum A.
Taxonomy Enrichment for Russian: SYNSET Classification Outperforms Linear Hyponym-Hypernym Projections
We present the description of our system that was ranked third in the noun sub-track of the Taxonomy Enrichment for the Russian Language shared task offered by Dialogue Evaluation 2020. Our best-performing system appears against the backdrop of other methods and their combinations attempted, and its results argue in favour of Occam’s razor for this task. A simple supervised classifier was trained on static distributional embeddings of hyponym words as features and their numeric hypernym synset identifiers from the taxonomy as class labels. It outperformed more complicated approaches based on learning linear projections from hyponym embeddings to hypernym embeddings and returning synset identifiers for the nearest neighbours of the predicted vectors. Training specially tailored word embeddings for ruWordNet multi-word expressions proved to be one of the key factors for both approaches.
Кустова Г.И.
Semantic Effects of Tense in Parenthetical Constructions with Mental Verbs
The paper treats parenthetical constructions with a verb of opinion (kak ja dumaju ‘as I think’) as the result of reducing the main clause: Ja dumaju, chto priglashenie prislal professor Uiler ‘I think that the invitation was sent by Professor Wheeler’ → Priglashenie, kak ja dumaju, prislal professor Uiler ‘The invitation, as I think, was sent by Professor Wheeler’. It is shown that the tense of the mental verb affects the interpretation of the sentence. In the present tense, kak ja dumaju introduces an assumption with a neutral status: Eto proizojdet, kak ja dumaju, v samom blizhajshem budushchem ‘This will happen, as I think, in the very near future’ [Ju. Semenov] = ‘it is unknown whether P or not-P’; in the past tense, kak ja dumal ‘as I thought’ introduces a wrong assumption: Djadja, kotoryj, kak ja dumal, davno zabyl o podarennykh chasakh, vosprinjal etu novost’ boleznenno ‘My uncle, who, as I thought, had long forgotten about the watch given as a present, took this news badly’ [F. Iskander] = ‘I thought he had forgotten, but in fact he had not’.
Kutuzov A., Fomin V., Mikhailov V., Rodina J.
SHIFTRY: Web Service for Diachronic Analysis of Russian News
We present the ShiftRy web service. It helps to analyze temporal changes in the usage of words in news texts from Russian mass media. For that, we employ diachronic word embedding models trained on large Russian news corpora from 2010 up to 2019. Users can explore the usage history of any given query word, or browse lists of words ranked by the degree of their semantic drift between any pair of years. Visualizations of the words’ trajectories through time are provided. Importantly, users can obtain corpus examples with the query word before and after the semantic shift (if any). The aim of ShiftRy is to ease the task of studying word history over short time spans, and the influence of social and political events on word usage. The service will be updated with new data yearly.
Kuvshinova T.
Sentence Compression for Russian: Dataset and Baselines
Sentence compression is the task of removing redundant information from a sentence while preserving its original meaning. In this paper, we approach deletion-based sentence compression for the Russian language. We use the data from the plagiarism detection corpus (ParaPlag) to create a corpus for sentence compression in Russian of almost 3,000 pairs of sentences. We align source sentences and their compressions using the Needleman-Wunsch algorithm and perform human evaluation of the corpus for readability and informativeness. Then we use a bidirectional LSTM, a typical baseline for the problem, to solve the sentence compression task for Russian. We also experiment with RuBert and Bert-multilingual. For the latter, we use transfer learning, first pretraining the model on English data, which improves performance. We conduct human evaluation of readability and informativeness and perform error analysis for the models. We achieve an F-measure of 74.8%, readability of 3.88 and informativeness of 3.47 (out of 5) on test data. We also implement a post-hoc syntax-based evaluator, which can detect some of the wrong compressions, increasing the overall quality of the system. We provide the data and baseline results for future studies.
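The alignment step mentioned in this abstract can be illustrated with a minimal sketch of Needleman-Wunsch global alignment over word tokens. This is only an illustration under stated assumptions: the scoring scheme (match +1, mismatch -1, gap -1), the function names, and the English example sentence are hypothetical and are not taken from the paper.

```python
def needleman_wunsch(source, target, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists.

    Returns a list of (source_token, target_token) pairs, where None marks a gap.
    The scores are illustrative assumptions, not the paper's settings.
    """
    n, m = len(source), len(target)
    # score[i][j] = best score for aligning source[:i] with target[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if source[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two tokens
                score[i - 1][j] + gap,       # source token has no counterpart
                score[i][j - 1] + gap,       # target token has no counterpart
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if source[i - 1] == target[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                pairs.append((source[i - 1], target[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((source[i - 1], None))
            i -= 1
        else:
            pairs.append((None, target[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def deletion_labels(source, compression):
    """Label each source token: 1 if kept in the compression, 0 if deleted."""
    return [
        1 if tgt == src else 0
        for src, tgt in needleman_wunsch(source, compression)
        if src is not None
    ]


source = "the cat that I saw sat on the mat".split()
compression = "the cat sat on the mat".split()
print(deletion_labels(source, compression))  # [1, 1, 0, 0, 0, 1, 1, 1, 1]
```

Labels of this kind are the usual training targets for a deletion-based sequence tagger; whether the paper derives its labels in exactly this way is not stated in the abstract.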
Левонтина И.Б.
“Understatement” and Sarcasm: Lexicalization of a Rhetorical Device
Understatement is a rhetorical device, based on making a statement weaker than it could be made in a given situation (i. e. underrating, less confident, presented as unimportant). In modern Russian, especially in colloquial speech, an extremely popular rhetorical figure is a combination of understatement and sarcasm; recently, several new ways of forming this figure have appeared: na minutochku, esli chto, nichego chto..? [Eto na minutochku moya professiya; Eto, esli chto, moya professiya; A nichego, chto eto moya professiya?] ([literally This is my profession, for a minute; This is my profession, just in case; Doesn’t it mean anything that this is my profession?]). For some language units, the corresponding meaning is partially or completely lexicalized. So, na minutochku and na sekundochku do not initially possess a “degrading” sense (if it is not really about time, meaning that you need a tiny bit of time for something); they are always used sarcastically. That said, as opposed to na minutochku and na sekundochku, other word forms (na minutu, na minutku, na sekundu, na mig, na mgnovenie) are not used this way. Thus, here we have a completely lexicalized figure of speech. In general, sarcasm is extremely difficult to formalize. Therefore, detection of linguistic manifestations of sarcasm appears to be extremely valuable.
Loukachevitch N. V., Rusnachenko N. L.
Sentiment Frames for Attitude Extraction in Russian
Texts can convey several types of inter-related information concerning opinions and attitudes. Such information includes the author’s attitude towards mentioned entities, attitudes of the entities towards each other, and positive and negative effects on the entities in the described situations. In this paper, we describe the lexicon RuSentiFrames for Russian, where predicate words and expressions are collected and linked to so-called sentiment frames conveying several types of presupposed information on attitudes and effects. We applied the created frames in the task of extracting attitudes from a large news collection.
Lyashevskaya O. N., Shavrina T. O., Trofimov I. V., Vlasova N. A.
GRAMEVAL 2020 Shared Task: Russian Full Morphology and Universal Dependencies Parsing
The paper presents the results of GramEval 2020, a shared task on Russian morphological and syntactic processing. The objective is to process Russian texts starting from provided tokens to parts of speech (pos), grammatical features, lemmas, and labeled dependency trees. To encourage the multi-domain processing, five genres of Modern Russian are selected as test data: news, social media and electronic communication, wiki-texts, fiction, poetry; Middle Russian texts are used as the sixth test set. The data annotation follows the Universal Dependencies scheme. Unlike in many similar tasks, the collection of existing resources, the annotation of which is not perfectly harmonized, is provided for training, so the variability in annotations is a further source of difficulties. The main metric is the average accuracy of pos, features, and lemma tagging, and LAS. In this report, the organizers of GramEval 2020 overview the task, training and test data, evaluation methodology, submission routine, and participating systems. The approaches proposed by the participating systems and their results are reported and analyzed.
Malykh V., Cherniavskii D., Valukov A.
Summary Construction Strategies for Headline Generation in The Russian
In the modern world, texts are plentiful in a person’s everyday life: news articles, blogs, social networks. These texts can be long; for example, the typical length of a New York Times news article is more than 700 words [13]. Reading even one article can take significant time, which raises the question of how to shorten it. To address this issue, techniques of extractive and, later, abstractive text summarization were proposed, i.e. the generation of a short text summary from a longer original text. Most abstractive and some extractive summary generation strategies share a problem: they need a training set, which can take time and labour to create, like the CNN/DailyMail dataset initially presented in [5] and compiled for the text summarization task in [10]. To overcome this issue, a separate task of headline generation for news documents was introduced, since news documents are plentiful and can be used with ease. The headline generation task can be considered a two-stage task. In the first stage, a summary of the article body is constructed, and in the second stage, the headline is generated from the constructed summary. In this work, we concentrate on the headline generation task for the Russian language, comparing summary construction techniques. This work is organized as follows: related work, dataset and metrics description, base models description, summary strategies, experiments, and conclusion.
Nikishina I., Logacheva V., Panchenko A., Loukachevitch N.
RUSSE’2020: Findings of The First Taxonomy Enrichment Task for The Russian Language
This paper describes the results of the first shared task on taxonomy enrichment for the Russian language. The participants were asked to extend an existing taxonomy with previously unseen words: for each new word their systems should provide a ranked list of possible (candidate) hypernyms. In comparison to the previous tasks for other languages, our competition has a more realistic task setting: new words were provided without definitions. Instead, we provided a textual corpus where these new terms occurred. For this evaluation campaign, we developed a new evaluation dataset based on unpublished RuWordNet data. The shared task features two tracks: “nouns” and “verbs”. 16 teams participated in the task, demonstrating high results, with more than half of them outperforming the provided baseline.
Оленикова А.В., Федорова О.В.
Co-constructed Syntax in Dialogues with People Who Stutter
Dialogue implies a high degree of coordination between the interlocutors, which makes possible the existence of co-constructed turns used by speakers for various purposes. One of the reasons for them to appear is difficulties in articulation experienced by one of the participants and prompting the other participant interested in achieving the communicative goal to increase their own contribution to the dialogue. In conversations with people who stutter, co-constructions are more common than in conversations between people who have no diagnosed speech disorders; among them completions prevail, because one of the interlocutors more often spells out uncompleted constructions. The study of stuttering from a linguistic perspective is of considerable interest, since it provides an opportunity to study dialogue as a process including cooperation between participants. During this collaborative process one interlocutor’s contribution affects the contribution of another and can trigger non-standard turn-taking techniques.
Pimonova E., Durandin O., Malafeev A.
Doc2vec Or Better Interpretability? A Method Study for Authorship Attribution
In this work, we perform a method study for the problem of authorship attribution in Russian and English. The datasets used consist of 324 works written in Russian and 207 works in English. We propose a set of text representation models that reflect various linguistic phenomena, in particular, morphological and syntactic ones. One distinctive feature of the proposed models is that they are interpretable. These models are used individually and in combination against a Doc2Vec baseline. For Russian, some of our models outperform Doc2Vec, but this does not happen in the case of English, for various reasons. However, the proposed models can also be used together with Doc2Vec, dramatically improving its performance: by 16.79% in the case of Russian and by 7.2% for English. Additionally, we experiment with two different methods for separating texts into blocks of K sentences (contiguous and bootstrapped) and perform parameter tuning of K. Finally, we conduct a feature importance analysis and show which linguistic markers of author style are the most pertinent for Russian, for English and for both languages. All code used in this work is made freely available to the community.
Пиперски А.Ч.
The Russian Language and Corpus Diversity
The paper surveys how the best-known corpus resources for the study of Russian are used. Taking linguistic publications from 2019 as an example, it shows that Russian linguistics does not make sufficiently active use of the opportunities offered to researchers by the wide variety of available corpora. Examples demonstrate the benefits that various “non-classical” corpora can bring to studies of phenomena at different levels of language: in morphology and syntax, in word formation and the lexicon (in particular, in the study of substandard language phenomena), and in the domain of constructions. The advantages and disadvantages of individual corpora are discussed in terms of their interface and ease of use for various purposes.
Подлесская В.И.
«А тот Перовской не дал всласть поспать»: Prosody and Grammar of Anaphoric TOT in the Mirror of Corpus Data
Based on data from the Russian National Corpus and the General Internet-Corpus of Russian, the paper addresses syntactic, semantic and prosodic features of constructions with the demonstrative TOT used as an anaphor. These constructions have gained some attention in earlier studies [Paducheva 2016], [Berger, Weiss 1987], [Kibrik 2011], [Podlesskaya 2001], but their analysis (a) covered primarily their prototypical uses; and (b) was based on written data. The data from informal, esp. spoken, discourse show, however, that the actual use of these constructions may deviate considerably from the known prototype. The paper aims at bridging this gap. I claim (i) that the function of TOT is to temporarily promote a referent from a less privileged discourse status to a more privileged one; and (ii) that TOT can be analyzed on a par with switch-reference devices in the languages where the latter are grammatically marked (e. g. on verb forms). The following parameters of TOT-constructions are discussed: syntactic and semantic roles of TOT and of its antecedent in their respective clauses, linear and structural distances between TOT and its antecedent, and animacy of the maintained referent. Special attention is paid to the information structure of the TOT construction: I give structural and prosodic evidence that TOT never has a rhematic status. The revealed actual distribution of TOT (a) adds to our understanding of cross-linguistic variation of anaphoric functions of demonstratives; and, hopefully, (b) may contribute to further developing computational approaches to coreference and anaphora resolution for Russian, e. g. by improving datasets necessary for this task.
Shaheen Z., Wohlgenannt G., Zaity B., Mouromtsev D., Pak V.
Russian Natural Language Generation: Creation of A Language Modeling Dataset and Evaluation with Modern Neural Architectures
Generating coherent, grammatically correct, and meaningful text is very challenging; however, it is crucial to many modern NLP systems. So far, research has mostly focused on the English language; for other languages, both standardized datasets and experiments with state-of-the-art models are rare. In this work, we i) provide a novel reference dataset for Russian language modeling, and ii) experiment with popular modern methods for text generation, namely variational autoencoders and generative adversarial networks, which we trained on the new dataset. We evaluate the generated text using metrics such as perplexity, grammatical correctness and lexical diversity.
Шмелев А.Д.
Language-Specific Words in the Mirror of Translation: TOSKA
This paper presents a semantic analysis of the most language-specific Russian word for ‘sadness’, namely, toska. The analysis is based on the hypothesis that one may regard translation equivalents and paraphrases of a linguistic unit extracted from real translated texts as a source of information about its semantics. The appearance of language-specific words in translated texts may be even more useful for studying their semantics. It turns out that toska is not all that rare in Russian translated texts. The study of the incentives that lead Russian translators to use the word toska often reveals important aspects of the semantics of this word. Stimuli for the appearance of toska in translations into Russian vary greatly. In general, when the original describes some bad feelings, the word toska appears if the original speaks of a subject’s unsatisfied desire, a desire that may be vague and not well understood and that usually cannot be satisfied. In addition, the subject often feels lonely.
Sorokin A. A., Smurov I. M., Kirianov D. P.
Tagging and Parsing of Multidomain Collections
In this paper we describe our submission to the GramEval 2020 competition on morphological tagging, lemmatization and dependency parsing. Our model uses biaffine attention over BERT representations. The main feature of our work is the extensive use of language model, tagger and parser fine-tuning on several distinct genres and the implementation of a genre classifier. To deal with dataset idiosyncrasies we also extensively apply handwritten rules. Our model took second place in overall model performance, scoring 90.8 on the aggregate measure over all 4 tasks.
Stenger I., Avgustinova T.
Visual vs. Auditory Perception of Bulgarian Stimuli by Russian Native Speakers
This study contributes to a better understanding of receptive multilingualism by determining similarities and differences in successful processing of written and spoken cognate words in an unknown but (closely) related language. We investigate two Slavic languages with regard to their mutual intelligibility. The current focus is on the recognition of isolated Bulgarian words by Russian native speakers in a cognate guessing task, considering both written and audio stimuli. The experimentally obtained intercomprehension scores show a generally high degree of intelligibility of Bulgarian cognates to Russian subjects, as well as processing difficulties in case of visual vs. auditory perception. In search of an explanation, we examine the linguistic factors that can contribute to various degrees of written and spoken word intelligibility. The intercomprehension scores obtained in the online word translation experiments are correlated with (i) the identical and mismatched correspondences on the orthographic and phonetic level, (ii) the word length of the stimuli, and (iii) the frequency of Russian cognates. Additionally we validate two measuring methods: the Levenshtein distance and the word adaptation surprisal as potential predictors of the word intelligibility in reading and oral intercomprehension.
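As an illustration of the first measuring method named in this abstract, here is a minimal sketch of Levenshtein distance between a stimulus and its cognate. The normalization by the length of the longer word and the example word pair are assumptions made for illustration; the abstract does not specify how the authors normalize the distance or which items they use.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def normalized_levenshtein(a, b):
    """Distance divided by the longer word's length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Hypothetical cognate pair: Bulgarian "мляко" vs. Russian "молоко" ('milk').
print(levenshtein("мляко", "молоко"))  # 2
```

A lower normalized distance would be expected to go with higher intelligibility; the word adaptation surprisal measure mentioned alongside it is not sketched here.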
Tarasov D., Matveeva T., Galiullina N.
An Empirical Investigation of Language Model based Reverse Turing Test as a Tool for Knowledge and Skills Assessment
Automating the assessment of a person’s skills is an important area of study in artificial intelligence and natural language processing. In this work we conduct an empirical study of the recently proposed Reverse Turing Test for Knowledge Assessment approach, a fully automated, domain-agnostic method of knowledge assessment that can operate without the involvement of a human assessor. Our study involved 53 participants and three different knowledge domains. We conclude that this method can reliably differentiate between expertise levels and therefore can be a compelling alternative to human grading and multiple-choice tests in many domains.
Татевосов С.Г., Киселева К.Л.
The Semantics of OBRATNO: Returning to an Interrupted State
This paper explores the meaning and distribution of obratno, one of the Russian repetitive and restitutive morphemes. We identify three essential characteristics of obratno: obligatoriness of the restitutive reading, narrow scope with respect to indefinites, and incompatibility with eventuality descriptions that entail a result state in the sense of [Kratzer 2000]. We argue that like garden-variety repetitive and restitutive morphemes (e.g., Russian opjat’), obratno denotes a partial identity function with a presupposition. Unlike such morphemes, however, the presuppositional content of obratno involves a return to the same state in which an entity had been before. We capture this characteristic relying on Landman’s [2008] notion of cross-temporal identity of eventualities and the derivative notion of a cross-temporal substate. This makes the repetitive reading of obratno unavailable, forces identity of the holders of a state, deriving the narrow scope effect, and guarantees that obratno is only compatible with target state descriptions.
Tikhomirov M. M., Loukachevitch N. V., Sirotina A. Yu., Dobrov B. V.
Pretraining and Augmentation in Named Entity Recognition Task for Cybersecurity Domain in Russian
The paper presents the results of applying the BERT representation model to the named entity recognition task for the cybersecurity domain in Russian. Several variants of the model were investigated. The best results were obtained using the BERT model trained on the target collection of information security texts. This model achieved results 15 percentage points higher in macro-F1 than those of CRF, the best method in previous experiments for the same task and data. We also explored a new form of data augmentation for the task of named entity recognition.
Tikhomirov M. M., Loukachevitch N. V., Parkhomenko E. A.
Combined Approach to Hypernym Detection for Thesaurus Enrichment
This paper describes a combined approach to the hypernym detection task. The approach combines the following techniques: distributional semantics, rule-based patterns, and modern neural networks (BERT). An important feature of our solution is that hypernyms are extracted only from a single text collection provided by the organizers. The described approach obtained the fourth result on the private nouns track. We found that the use of rule-based patterns can significantly improve the results. Also, using the BERT model as an additional factor always helps to improve the performance.
Toldova S., Davydova T., Kobozeva M., Pisarevskaya D.
Discourse Features of Blogs in Subcorpus of Russian Ru-RSTreebank
The paper presents a corpus study of discourse features in a corpus of blogs. It is based on the data of Ru-RSTreebank annotated within the framework of Rhetorical Structure Theory [Mann, Thompson 1988]. The Ru-RSTreebank represents the genres of news and popular science, scientific papers, and blog texts. The blog subcorpus covers such topics as travelling, cosmetics, sports and health, psychology, IT and tech and some others. Blog texts constitute a specific genre as they combine properties of written and spoken discourse. The purpose of the paper is to investigate discourse features of blogs in comparison with other genres. We analyze the variation in the distribution of rhetorical relations among genres, and single out the differences in the usage of discourse connectives. Furthermore, we check the distribution of other discourse features reported in different studies for spoken discourse and for social media in the Ru-RSTreebank blogs subcorpus. The general frequency analysis and the experiments on applying a RandomForest classifier to genre recognition have shown that the most important rhetorical relations specific to blogs are Evaluation and Contrast, and that there is a tendency to use shorter discourse units and not to express discourse relations overtly via subordinating conjunctions.
Yadrintsev V. V., Ryzhova A. A., Sochenkov I. V.
Distributional Models and Auxiliary Methods for Determining the Hypernyms of Words in Russian
This paper describes our participation in RUSSE’2020, the first shared task on Automatic Taxonomy Construction for the Russian language. The goal of this task is the following: input words (neologisms that are not yet included in the taxonomy) need to be associated with the appropriate hypernyms from an existing taxonomy. For example, for the input word “duck”, participants are expected to provide a list of ten hypernym synsets to which the word can most likely be attributed, such as “animal,” “bird” and so on. An input word can have one, two, or more “parents” at the same time. In this article we try to answer the following question: what results can be achieved using only “raw” vectors from distributional models without additional training? The article presents the results for several pre-trained models that are based on the fastText, ELMo, and BERT algorithms. An out-of-vocabulary analysis was also performed for the models under consideration. Taking into account all public scores from the leaderboards, our results correspond to the following places in the ranking: 3rd place on public nouns, 2nd on private nouns, 4th on public verbs, and 4th on private verbs.
Янко Т.Е.
Наречие ДАВНО по данным звучащего корпуса
During the last twenty years, the Russian adverb davno ‘long ago, for a long time’ has been widely discussed in the literature. It was recognized that the unique parameter of davno is its inability to be the theme of a sentence. Moreover, if davno functions in the context of aspectual forms relating to the past, it can only be the rheme. In the context of aspectual verbal forms relating to the past but preserving the connection with the moment of speech, davno can be either the rheme proper or a component of the rheme. A classic example of an aspectual verb form referring to the past is the general factual meaning of the imperfective aspect. At present, spoken data corpora can shed light on the analysis of communicative structure, since the prosodic structure of spoken speech provides straightforward access to the communicative structure. Novel parameters of davno are as follows. 1) Whereas davno is traditionally recognized as a word of rhematic polarity, it can nevertheless function as a component of the theme in the context of attributive clauses and constructions (Davno soglasovannyj visit dolzhen byl sostojatjsja v aprele ‘A visit planned long ago was to take place in April’). 2) The general factual meaning of the imperfective aspect, contrary to what was assumed before, is not an absolute prerequisite for davno to function as the rheme. The spoken corpus showed that in the context of negation and in the context of verbs of speech, the general factual allows davno to function as a component of the rheme but not as the rheme proper (Ja davno tebja ne videl ‘I have not seen you for a long time’; My davno govorili, chto nasha zadacha — eto borjba s terrorismom ‘We have been insisting for a long time that our main goal is the struggle against terrorism’).
3) A specific type of question with initial davno (as well as with other adverbs denoting a considerable quantity, like chasto ‘often’, mnogo ‘much’, and daleko ‘far away’) is singled out. Such questions cannot be unambiguously classified either as yes-no questions or as wh-questions (I davno vy zdesj stoite? ‘And how long have you been standing here?’). A description of the unique prosody of such questions is given. 4) In the context of discourse continuity, davno acquires rising prosody, which is in fact uncharacteristic of a word that is unable to be the theme (Xotel eto sdelat’ davno, no teperj sdelaju tochno ‘I wished to do it long ago, but now I will do it for sure’). The rising tone is accounted for by the meaning of continuity, which has the same prosody as the theme. 5) In the constructions kogda-to davno ‘once upon a time’, ochenj davno ‘very long ago’, davno-davno ‘very long ago’, davnym-davno ‘very long ago’, dovoljno davno ‘quite long ago’, and ne tak davno ‘not so long ago’, davno loses its rhematic polarity. The parameters of davno are exemplified by spoken fragments taken from the Multimodal corpus of the Russian National Corpus and from a minor working collection of Russian speech recordings specifically set up for this investigation. The software program Praat was used to analyze the sound data.
Zalizniak Anna A.
Russian KAK BY: Semantics, Pragmatics, Diachrony
The article considers the semantics of the Russian word kak by. It demonstrates that there are three main types of use of this word that are relevant for the modern Russian language: 1) as an approximation indicator, i.e. the marker of an approximative, indirect or metaphorical use of the linguistic unit it introduces (cf. lёd na reke sluzhil kak by mostom ‘ice on the river served as a kind of bridge’; on kak by veduschij specialist v dannoj oblasti ‘he is sort of a leading specialist in this field’); 2) as an indicator of epistemic indefiniteness (cf. infljatsii kak by net ‘there is <kak by> no inflation’); 3) as an illocutionary operator (“illocutionary mitigator”), mitigating the illocutionary force of the assertive speech act (cf. Ja kak by ispolnitel’nyj director kompanii ‘I am <kak by> the chief executive officer of the company’, uttered by the actual CEO of the company). We suggest that the initial meaning of kak by is that of a marker of descriptive indefiniteness (in an outdated use after verbs of fuzzy perception), which has served as a source both for the approximation meaning, which is the main function of this word in contemporary Russian, and for that of epistemic indefiniteness. In its function as an “illocutionary mitigator”, which emerged at the very end of the 20th century in the course of pragmaticalisation, the word kak by belongs to the class of discourse markers that ensure the success of a communicative act. The study was based on the Russian National Corpus, including its oral and parallel subcorpora.
Zimmerling A. V.
Zero Forms in Morphological Paradigms: the Verb “Be” in Russian
This paper offers a corpus analysis of the Russian verb быть ‘be’, which has an abnormal present tense paradigm including a zero form ØBE.PRES and the overt forms естьBE.PRES and сутьBE.PRES, which do not discriminate person and number and are distributed syntactically. I discuss different approaches to the grammar of быть and argue that Apresjan’s model, which recognizes ØBE.PRES, естьBE.PRES and сутьBE.PRES as parts of one and the same lemma, is superior to alternative models splitting быть into two lemmas representing the copula vs. the content verb ‘be’. The peripheral status of overt present BE-forms compared with ØBE.PRES in the Russian National Corpus (RNC) is confirmed by three measures: 1) dispersion of texts where a BE-form occurs; 2) uneven coverage in different persons and numbers; 3) ratio of copular uses vs. content verb uses. The 1st and 2nd person present tense BE-forms attested in the RNC are internal borrowings from Old Russian and Old Church Slavonic, while естьBE.PRES and сутьBE.PRES are inherited 3rd person elements which take over 1st and 2nd person uses. The historical 3Pl суть is redundant in a system where a more frequent 3rd person form есть is licensed in the plural: it survives among a minority of speakers either as an optional 3Pl copula in formal discourse or as an emphatic copula in oral discourse. The form естьBE.PRES occurs in all persons and numbers both as a content verb and as a copula but is underrepresented as a 3Pl copula: this gap is filled by ØBE.PRES. The frequency of the zero copula ØBE.PRES can be measured in corpora without syntactic annotation on the basis of the systemic proportion between present vs. past tense uses of быть and on the basis of approximation samples for contexts where overt copulas alternate with ØBE.PRES.
Zinina A. A., Zaidelman L. Y., Kotov A. A., Arinkin N. A.
The Perception of Robot’s Emotional Gestures and Speech by Children Solving a Spatial Puzzle
The emotional behavior of a companion robot is important for human-robot interaction in training tasks. We examined the influence of the robot's emotional gestures and emotional speech on its perception by primary school students (N=52, male and female, mean age 9.8) jointly solving the spatial Tangram puzzle with the robot. It was shown that emotional gestures make a significant contribution to the attractiveness of the robot for the child. It was also found that test subjects prefer the robot with emotional gestures and speech over the robot with neutral gesture and speech behavior. The study also analyzed the communicative behavior of the children and identified communicative signs typical of the start of interaction with the robot, of monitoring the game, and of difficult situations. We described typical mistakes that children make when assembling a puzzle together with the robot.