Collection 2022

A
Abrosimov K.I., Mosyagina A.G.
Sodner for Russian nested named entity recognition
The article describes the solution for Russian nested named entity recognition that we presented at the RuNNE competition. The solution is based on the Sodner model, which predicts named entities in a text as a graph. During the competition we improved the training dataset and annotated an additional corpus containing entities of the few-shot classes. After several experiments with different model parameters, we obtained high macro F1 and few-shot F1 scores of 74.08 and 64.41, respectively.
Alibaeva K., Loukachevitch N.V.
Analyzing COVID-related Stance and Arguments using BERT-based Natural Language Inference
In this paper we present our approach to stance detection and premise classification in COVID-related messages, developed for the RuArg-2022 evaluation. The methods are based on the so-called NLI setting (natural language inference) of BERT-based text classification (Sun et al., 2019), in which the input of a model includes two sentences: a target sentence and a conclusion (for example, positive to masks). We also translate Russian messages into English, which allows us to leverage a COVID-trained BERT model. Besides, we use additional marking techniques for targeted entities. Our approach achieved the best results on both RuArg-2022 tasks. We also studied the contribution of the marking techniques across the datasets, tasks, models and languages of the RuArg evaluation. We found that "<A:ASPECT> keyword </A:ASPECT>" marking gave the highest average increase over the corresponding basic methods.
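As a rough illustration of the NLI setting and aspect marking described above (a sketch with hypothetical function names and label strings, not the authors' code):

```python
# Sketch of the NLI-style pair construction: sentence A is the message,
# sentence B is the hypothesis ("conclusion") about the target aspect.
# Function names and label strings here are illustrative assumptions.
def make_nli_pair(text, aspect, stance):
    hypothesis = f"{stance} to {aspect}"  # e.g. "positive to masks"
    return (text, hypothesis)

def mark_aspect(text, keyword, aspect):
    # "<A:ASPECT> keyword </A:ASPECT>" marking of the targeted entity
    return text.replace(keyword, f"<A:{aspect}> {keyword} </A:{aspect}>")
```

A BERT-style model would then encode the two sentences as a single `[CLS] A [SEP] B [SEP]` input for classification.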
Apresjan V.Yu., Shmelev A.D.
Russian iterative adverbs: notes toward a lexicographic portrait
The paper is a corpus study of Russian frequency adverbs chasto ‘frequently’, zachastuju ‘often’, redko ‘rarely’, izredka ‘rarely’, etc. In Russian lexicographic tradition, frequency adverbs either lack separate entries and are explained via references to their adjectival counterparts or are treated exclusively as denotations of intervals between events. As our study demonstrates, this covers only a small fraction of their actual corpus usage. Many frequency adverbs can quantify over subjects, and thus resemble classical quantifiers such as ‘many’ or ‘few’. Even when frequency adverbs quantify over predicates, they mostly refer not to intervals between events, but merely to their number. In some cases, they quantify over aspects of events, expressed by adjectives. There are also other important properties of Russian frequency adverbs missed by the dictionaries yet revealed by corpus analysis. Most frequency adverbs have a strong preference for topic or focus position, as motivated by their semantics. Some adverbs are preferable in generalized contexts, while others refer to specific events. Certain adverbs describe violations of the norm or undesirable events. Different adverbs quantify over different time periods: while some require a long time period, others may focus on very short stretches of time.
Artemova E.L., Zmeev M., Loukachevitch N., Rozhkov I., Batura T., Ivanov V., Tutubalina E.
RuNNE-2022 Shared Task: Recognizing Nested Named Entities
The RuNNE Shared Task approaches the problem of nested named entity recognition. The annotation schema is designed in such a way that an entity may partially overlap with or even be nested in another entity. This way, the named entity "The Yermolova Theatre" of type ORGANIZATION houses another entity "Yermolova" of type PERSON. We adopt the Russian NEREL dataset (Loukachevitch et al., 2021) for the RuNNE Shared Task. NEREL comprises news texts written in Russian and collected from the Wikinews portal. The annotation schema includes 29 entity types. The nestedness of named entities in NEREL reaches up to six levels. The RuNNE Shared Task explores two setups. (i) In the general setup all entities occur with more or less the same frequency. (ii) In the few-shot setup the majority of entity types occur often in the training set, while some entity types have lower frequency and are thus challenging to recognize; in the test set the frequency of all entity types is even. This paper reports on the results of the RuNNE Shared Task. Overall, the shared task received 156 submissions from nine teams. Half of the submissions outperform a straightforward BERT-based baseline in both setups. This paper overviews the shared task setup and discusses the submitted systems, uncovering meaningful insights for the problem of nested NER. Links to the evaluation platform and the data from the shared task are available in our GitHub repository.
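The overlapping annotation can be sketched as plain character spans; the offsets below follow the "Yermolova Theatre" example from the abstract, while the helper function is an illustrative assumption:

```python
# Nested entities as (start, end, type) character spans; nesting is allowed.
text = "The Yermolova Theatre"
entities = [(0, 21, "ORGANIZATION"), (4, 13, "PERSON")]  # "Yermolova" inside

def is_nested(inner, outer):
    # True when inner lies fully within outer's span
    return outer[0] <= inner[0] and inner[1] <= outer[1] and inner != outer
```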
B
Baranov A.N.
Corpus experiment in forensic linguistic analysis
The talk considers current trends in litigation over the protection of honor, dignity, and business reputation, which lead courts to invoke the category of "abuse of rights". Linguistic criteria are proposed that make it possible to identify linguistic indicators of abuse of rights in a text. One such criterion is a corpus experiment, in which the frequency of markers of evaluation, opinion, supposition, probability, and the like is compared with the frequency of these forms in a representative corpus of the Russian language. The corpus-experiment criterion is supplemented by a semantic criterion and a metatext criterion.
Boguslavsky I.M., Vilinbakhova E.L.
Proper names in the scope of metalinguistic negation
Metalinguistic negation is characterized by the fact that its scope covers not a proposition but the way it is expressed in language. We consider constructions in which a proper name falls within the scope of metalinguistic negation. Based on data from the Russian National Corpus, the following conclusions can be drawn. First, the main function of the construction is to correct the information encoded in the way the name bearer is presented, rather than simply to fix an error in conveying the name. Second, the prevalence of temporal markers as additional elements of the construction indicates that the metalinguistic construction is used to signal a change of referent over time. Further, the corpus data show that speakers use the construction to correct toward a more official form of the name more often than the reverse, which apparently suggests that an insufficient level of formality is censured in communication more than an excessive one. It was also found that cases of speakers correcting themselves outnumber cases of correcting another speaker, which points to a wider range of uses of the construction than claimed in previous studies. Finally, it was shown that the metalinguistic predicate of naming can be realized not only lexically but can also form part of the meaning of syntactic constructions.
Bolshakova E.I., Telegina A.D.
Refining Criteria of Paronymy for Building Computer Dictionaries of Russian Paronyms
Paronyms are words that have some similarity in sound and spelling but differ in meaning and usage (e.g., sensitive − sensible, излишек – излишество). In morphologically rich languages like Russian, paronymy is a rather frequent phenomenon and one of the sources of speech difficulties. However, the known dictionaries of Russian paronyms are not complete enough to help language learning or to support automatic correction of paronymy errors, and they do not provide a precise definition of paronymy, which is necessary for constructing more extensive computer dictionaries. Aiming to clarify the concept of paronymy and to refine the previously proposed formal affix criterion of paronymy, we performed a statistical study of paronyms taken from two printed dictionaries of Russian paronyms. Formal and semantic similarity of paronym pairs were numerically estimated across various dimensions: proximity in affixes, in sound, and in word meaning (the latter with the aid of neural models of distributional semantics and an extensive base of Russian word combinations). Based on the results of the study, refined criteria of paronymy and thresholds were proposed, which can be useful for automatically constructing computer dictionaries of Russian paronyms, as well as for supplementing them with diagnostic contexts.
Bondarenko I.
Contrastive fine-tuning to improve generalization in deep NER
A novel algorithm of two-stage fine-tuning of a BERT-based language model for more effective named entity recognition is proposed. The first stage is based on training BERT as a Siamese network using a special contrastive loss function, and the second stage consists of fine-tuning the NER model as a "traditional" sequence tagger. Inclusion of the contrastive first stage makes it possible to construct a high-level feature space at the output of BERT with more compact representations of different named entity classes. Experiments have shown that this fine-tuning scheme improves the generalization ability of named entity recognition models fine-tuned from various pre-trained BERT models. The source code is available under an Apache 2.0 license and hosted on GitHub: https://github.com/bond005/runne_contrastive_ner.
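The contrastive first stage can be illustrated with the classic pairwise contrastive loss (a generic sketch, not necessarily the exact loss function used in the paper):

```python
import math

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    # Euclidean distance between two token/entity embeddings
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(emb_a, emb_b)))
    if same_class:
        return d ** 2                 # pull same-class pairs together
    return max(0.0, margin - d) ** 2  # push different classes apart, up to margin
```

Minimizing this loss over pairs of entity representations yields the more compact per-class clusters mentioned above.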
Buyanov I., Sochenkov I.
The dataset for presuicidal signals detection in text and its analysis
The paper presents a dataset for presuicidal signal detection in Russian posts from social media. To the best of our knowledge, it is the first dataset of this type for the language. We develop a collection methodology and conduct a linguistic analysis of the resulting dataset. We also build a classification baseline with machine learning models to solve the detection task.
C
Chistova E., Smirnov I.
Discourse-aware text classification for argument mining
We show that using the rhetorical structure automatically generated by the discourse parser is beneficial for paragraph-level argument mining in Russian. First, we improve the structure awareness of the current RST discourse parser for Russian by employing the recent top-down approach for unlabeled tree construction on a paragraph level. Then we demonstrate the utility of this parser in two classification argument mining subtasks of the RuARG-2022 shared task. Our approach leverages a structured LSTM module to compute a text representation that reflects the composition of discourse units in the rhetorical structure. We show that: (i) the inclusion of discourse analysis improves paragraph-level text classification; (ii) a novel TreeLSTM-based approach performs well for the computation of the complex text hidden representation using both a language model and an end-to-end RST parser; (iii) structures predicted by the proposed RST parser reflect the argumentative structures in texts in Russian.
Chuprina A.O.
Russian Verbal Affixation in Mental Lexicon: Priming Study and Its Online Replication With True and Stem-Modified Relative Prime
While suffixed and prefixed words share common lexical features with their base word in the mental lexicon, the two derivational processes have their own properties. Whether their differences are reflected in the mental storage of the group of relative words or not is one of the topical psycholinguistic questions. My experimental results indicate that memory representations of the derivatives differ: while between the stem and the suffixed relative, the relationship is closer and based on transparency of the derived meaning, the relationship between the stem and the prefixed derivative is rather formal. The results also signal that the decompositional route is not a preferred strategy in complex verb processing for a Russian speaker. I base this conclusion on the results of two in-person experiments and their online replicas. Additionally, the data suggest that lexical organization is modified through the aspectual information of family members. These findings need to be taken into account in future studies, both within psycholinguistic and computational fields, using verbal material of the Russian language.
D
Dementieva D., Logacheva V., Nikishina I., Fenogenova A., Dale D., Krotova I., Semenov N., Shavrina T., Panchenko A.
RUSSE-2022: Findings of the First Russian Detoxification Shared Task Based on Parallel Corpora
Text detoxification is the task of rewriting a toxic text into a neutral text while preserving its original content. It has a wide range of applications, e.g. moderating the output of neural chatbots or suggesting less emotional versions of posts on social networks. This paper provides a description of the RUSSE-2022 competition of detoxification methods for the Russian language. This is the first competition which features (i) parallel training data and (ii) manual evaluation. We describe the setup of the competition and the solutions of the participating teams, and analyse their performance. In addition, the large-scale evaluation allows us to analyse the performance of automatic evaluation metrics.
Dobrovol'skij D.O., Zalizniak Anna A.
Evidentiality and epistemic evaluation in the meaning of the German verbs sollen and wollen (based on data from a German-Russian parallel corpus)
Based on an analysis of the semantics of the German modal verbs sollen and wollen and their translation equivalents identified in the German-Russian parallel subcorpus of the Russian National Corpus, the article demonstrates that the categories of evidentiality and epistemic modality should be regarded as having independent status, and that these two linguistic meanings can be expressed simultaneously. Possible combinations of the types of evidential and epistemic meaning components expressed by these German verbs are considered. We propose refining the classification of types of indirect evidentiality by introducing a third, intermediate type: reportative-inferential evidentiality (an inference made by the speaker on the basis of interpreting another person's utterance). Turning to the parallel corpus made it possible, on the one hand, to distinguish the types of evidential meanings expressed by the German verbs in question and, on the other, to clarify the semantics and reveal the potential polysemy of the Russian units serving as their translation equivalents.
Dobrovolskii V., Michurina M., Ivoylova A.
RuCoCo: a new Russian corpus with coreference annotation
We present a new corpus with coreference annotation, Russian Coreference Corpus (RuCoCo). The goal of RuCoCo is to obtain a large number of annotated texts while maintaining high inter-annotator agreement. RuCoCo contains news texts in Russian, part of which were annotated from scratch, and for the rest the machine-generated annotations were refined by human annotators. The size of our corpus is one million words and around 150,000 mentions. We make the corpus publicly available.
Duryagin P.V.
Prosody and polysemy in Russian discourse formulas
The article describes a first attempt to apply the methods of experimental phonetics to the prosody of polysemous Russian discourse formulas. One of the most frequent units of this type, the discourse formula да ну, was chosen as the object of study. Analysis of F0 contours showed that this idiomatic unit can be realized with two tonal configurations: a falling one, which can be identified as IK-2 in the system of intonation constructions, and a rising one, which differs substantially from IK-3 and probably represents a single "high" tonal target combined with an irregularly truncated low boundary tone. The uses of these configurations are distributed differently depending on the pragmatic shades of meaning set by the dialogic context. In addition, according to the experimental data, the pragmatic shades of meaning of this discourse formula can be marked by vowel duration. When expressing surprise, the participants used longer vowels than when rejecting new information, while да ну expressing distrust occupies an intermediate position and is characterized by a lengthened pre-stressed vowel combined with a short stressed one.
E
Evdokimova A., Nikolaeva Ju., Budennaya E.
Motion verbs in multimodal communication
The article explores correlations between motion verbs and head and hand gestures using the RUPEX corpus. The verbs are divided into four groups based on their meanings. Monological and dialogical parts of the recordings are compared, along with the speaker's role and the viewpoint in gestures. The pilot analysis of motion verbs in the multimodal corpus showed that the relationships between verb type, non-verbal behavior and speaker's role depend on a complex set of factors and manifest themselves in different ways in different channels. In the verbal channel no direct relationship between the semantic type of the verb and the speaker's role was detected; however, the narrators and commentators who had seen the film used more affectional vocabulary than the retellers, while the latter tended to use more vector-prefixed verbs. In the manual channel, prefixes or their absence do not influence the use of hand gestures. Transitive verbs denoting manipulations of various items are more likely to be illustrated by depictive gestures. Predictably, motion verbs in the strict sense are more prone to be supported by observer-viewpoint (O-VPT) gestures, while verbs of manipulation are usually used with C-VPT gestures. In the cephalic channel, motion verbs in the strict sense (relocation of a character) are usually illustrated by O-VPT depictive gestures, and manipulation verbs are more likely supported by pantomime C-VPT gestures, similar to the manual channel. In some head gestures the viewpoints are combined. If the verb is repeated by the same or another speaker, the gestures differ in both the manual and the cephalic channel. Cephalic gesture clusters on motion verbs have a mostly depictive function, which may be considered a gestural illustration.
Evseev D.A., Nagovitsin M., Kuznetsov D.
Controllable Multi-attribute Dialog Generation with PALs and Grounding Knowledge
Today, neural language models are commonly employed to generate natural-sounding responses in dialog systems. The main issue limiting wide adoption of neural generation is the poor predictability of responses in terms of content, as well as of dialog attributes such as dialog acts and sentiment. In this paper we propose a method based on projected attention layers (PALs) for controllable multi-attribute knowledge-grounded dialog generation. We compared a number of methods for training and blending representations produced by PALs combined with the DialoGPT base model. The results of our experiments demonstrate that separate pre-training of PAL branches for different attributes, followed by transfer and fine-tuning of a dense blending layer, gives the highest accuracy of control of a generated response with fewer trainable parameters per attribute. Furthermore, we applied our approach for controllable multi-attribute generation with grounding knowledge to the Blenderbot model. Our solution outperforms the baseline Blenderbot and the CRAYON model in control accuracy of dialog acts and sentiment on DailyDialog, and demonstrates comparable overall quality of dialog generation given grounding knowledge on Wizard of Wikipedia.
Evseev D.A.
Lightweight and accurate system for entity extraction and linking
Entity extraction and linking components in dialogue assistants should meet the requirements of low resource consumption and high accuracy. In this paper we present a lightweight system which extracts entity mentions from text and finds the corresponding Wikidata ids and Wikipedia page links. Entity extraction and linking is performed in the following steps: extraction of entity substrings from the text, retrieval of candidate entities from the Wikidata knowledge base, and entity disambiguation. Entity extraction is based on a RoBERTa-tiny model for token classification. Extracted substrings are classified into 42 fine-grained tags for filtering of candidate entities. Candidate entities are ranked by the number of connections they have in the Wikidata knowledge graph with other candidate entities in the text. The proposed system outperforms other lightweight solutions, such as REL and OpenTapioca, on WNED-WIKI. The system supports easy addition of new Wikidata entities to the database and the use of other knowledge bases for entity linking.
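The disambiguation step, ranking candidates by their connections in the knowledge graph, can be sketched as follows; the toy graph and Wikidata ids are illustrative, not taken from the system:

```python
def rank_candidates(candidates, context_entities, graph):
    # Score each candidate Wikidata id by how many other entities mentioned
    # in the same text it is connected to in the knowledge graph.
    def connections(qid):
        return sum(1 for other in context_entities
                   if other != qid and other in graph.get(qid, set()))
    return sorted(candidates, key=connections, reverse=True)

# Toy adjacency sets standing in for Wikidata edges
graph = {"Q1": {"Q5", "Q7"}, "Q2": {"Q9"}}
```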
F
Fedorova O.V.
"The Pear Film" live: cognitive features of real-time reporting
This study investigated whether there is a relationship between verbal working memory span and speech production in the genre of live reporting. Sixteen students of Lomonosov Moscow State University took part in the experiment. The participants' working memory span was assessed with the Speaking Span test [12]. W. Chafe's "Pear Film" [2] was used as the stimulus material. Three aspects of speech production were evaluated: continuity of the report, speech rate, and lexical diversity. Statistical analysis showed that, as expected, working memory span correlates positively with speech rate and lexical diversity, but, contrary to expectations, negatively with the continuity of the report.
Fishcheva I., Osadchiy D., Bochenina K., Kotelnikov E.
Argumentative Text Generation in Economic Domain
The development of large and super-large language models, such as GPT-3, T5, Switch Transformer, ERNIE, etc., has significantly improved the performance of text generation. One of the important research directions in this area is the generation of texts with arguments. The solution of this problem can be used in business meetings, political debates, dialogue systems, and for the preparation of student essays. One of the main domains for these applications is the economic sphere. The key problem of argumentative text generation for the Russian language is the lack of annotated argumentation corpora. In this paper, we use translated versions of the Argumentative Microtext, Persuasive Essays and UKP Sentential corpora to fine-tune the RuBERT model. This model is then used to annotate a corpus of economic news with argumentation. The annotated corpus is in turn employed to fine-tune the ruGPT-3 model, which generates argumentative texts. The results show that this approach improves the accuracy of argument generation by more than 20 percentage points (63.2% vs. 42.5%) compared to the original ruGPT-3 model.
G
Goloviznina V.S., Kotelnikov E.V.
Automatic Summarization of Russian Texts: Comparison of Extractive and Abstractive Methods
This paper investigates the problem of creating summaries of Russian-language texts based on extractive (TextRank and LexRank) and abstractive (mBART, ruGPT3Small, ruGPT3Large, ruT5-base and ruT5-large) methods. For our experiments, we used the Russian-language corpus of news articles Gazeta and the Russian-language parts of the MLSUM and XL-Sum corpora. We computed the ROUGE-N, ROUGE-L, BLEU, METEOR and BERTScore metrics to evaluate the quality of summarization. According to the experimental results, the methods are ranked (from best to worst) as follows: ruT5-large, mBART, ruT5-base, LexRank, ruGPT3Large, TextRank, ruGPT3Small. The study also highlights the salient features of summaries obtained by the various methods. In particular, mBART summaries are less abstractive than those of ruGPT3Large and ruT5-large, and ruGPT3Large summaries are often incomplete and contain errors.
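For reference, the ROUGE-N metric used for evaluation counts overlapping n-grams between a generated summary and a reference summary; a minimal recall-oriented sketch (whitespace tokenization is a simplifying assumption):

```python
def rouge_n_recall(candidate, reference, n=1):
    # Fraction of reference n-grams that also occur in the candidate (clipped)
    def ngrams(text):
        toks = text.split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    ref = ngrams(reference)
    if not ref:
        return 0.0
    counts = {}
    for g in ref:
        counts[g] = counts.get(g, 0) + 1
    overlap = 0
    for g in ngrams(candidate):
        if counts.get(g, 0) > 0:  # clip matches to reference counts
            overlap += 1
            counts[g] -= 1
    return overlap / len(ref)
```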
Goncharov A.A., Kobozeva I.M.
Once more on the noun prichina 'cause': constructions with a sentential argument introduced by the conjunction chto
The article examines the syntactic constructions причина, что P and причина того, что P on corpus material. Its aim is to describe the syntactic and semantic properties of these constructions in modern Russian. To this end, the existing descriptions of the semantics of the noun причина and its argument structure were critically reviewed, after which a representative sample of examples was analyzed in which this noun is used within the constructions in question. The study yielded the following results: 1) the description of the government pattern of the noun причина was supplemented; 2) it was shown that the restriction on что-clauses without то with deverbal nouns, identified in [10], also holds for the (synchronically) non-derived noun причина; 3) it was established which valency, Cause or Effect, a что-clause fills with the noun причина depending on the syntactic function of the constructions; 4) it was shown that both possible meanings of the noun причина, objective and subjective cause, can be realized in both constructions.
Gorbova E.V., Chuikova O.Yu.
Suffixal imperfectivization of prefixed verbs: record-holders and outsiders (in the dictionary, the corpus, and the Runet)
The article summarizes a study of the imperfectivizability of Russian prefixed verbs using such sources of language material as a dictionary (the Small Academic Dictionary), a corpus (the Russian National Corpus), and the Runet. The focus is on those subsets of prefixed perfectives that behave distinctively with respect to suffixal imperfectivization, showing either elevated (record-holders) or reduced (outsiders) imperfectivizability compared to the average level. The former are represented by denominal and deperfective derivatives, the latter by most Aktionsart types and by verbs with the formant -i(zi)rova-. Systemic and morphonological explanations are proposed for the distinctive behavior of these subsets.
Gusev I.
Russian Texts Detoxification with Levenshtein Editing
Text detoxification is a style transfer task of creating neutral versions of toxic texts. In this paper, we use the concept of text editing to build a two-step tagging-based detoxification model using a parallel corpus of Russian texts. With this model, we achieved the best style transfer accuracy among all models in the RUSSE Detox shared task, surpassing larger sequence-to-sequence models.
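The tagging-based editing idea rests on a Levenshtein alignment between a toxic source and its neutral rewrite; a generic sketch of deriving per-token edit tags from a parallel pair (toy tag names, not the model's actual tag set):

```python
def edit_tags(source, target):
    # Levenshtein DP over tokens; backtrace yields per-token edit tags
    # (KEEP / REPLACE / DELETE / INSERT) of the kind a tagging model predicts.
    m, n = len(source), len(target)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i
    for j in range(1, n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete source token
                           dp[i][j - 1] + 1,          # insert target token
                           dp[i - 1][j - 1] + cost)   # keep or replace
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + (source[i - 1] != target[j - 1])):
            ops.append("KEEP" if source[i - 1] == target[j - 1] else "REPLACE")
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append("DELETE")
            i -= 1
        else:
            ops.append("INSERT")
            j -= 1
    return list(reversed(ops))
```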
I
Inkova O.Y., Nuriev V., Popkova N.
The Role of Paragraph in the Corpora of Annotated Texts
The paper focuses on the function of the paragraph both in text organization and in text annotation from the point of view of coherence. Taking as examples three major types of corpora (the RST, ANNODIS, and PDTB corpora), it shows whether and to what extent the existing approaches account for the paragraph when a discourse relation is annotated. It then presents the theoretical principles underlying text annotation in two databases: the Supracorpora database of connectives and the Supracorpora database of hierarchical logical-semantic relations (a new linguistic resource). Text coherence is shown to result from the interaction of various discourse phenomena acting at the level of local and global structures. In this approach, the paragraph is assigned to the meso-level, positioned between the local and global levels. The researcher may analyze the internal organization of the paragraph, limiting themselves to the intersentential level. Yet, to analyze and describe how paragraphs follow one another in the text, it is necessary to operate at the supra-sentential level, adopting a conceptual apparatus fundamentally different from the one used for the description of local text structure.
K
Knyazev S.V., Evstigneeva M.Y.
“Word-by-word” melodic contour in Russian dialects: quantitative approach
The paper presents the results of a quantitative analysis of phrasal tonal structure in two Northern Russian dialects with different types of "word-by-word" melodic contour. These dialects differ from Modern Standard Russian in the quantity of pitch accents: about 60% of their words bear a pitch accent, so the prosodic unit in them is not the (phonological) word but the accent group. In addition, the dialects differ from Standard Russian in the regular presence of a level tone on the accented vowel (in the Arkhangelsk dialect 86% of all accents have it; in the Vologda dialect it is less frequent: 33%) and in the higher frequency of pitch accents with an increased interval. The main differences between the Arkhangelsk and Vologda dialects are 1) the ratio of rising to falling pitch accents: 2.6% falling in the Arkhangelsk dialect vs. 56% in the Vologda dialect, which brings the latter closer to Standard Russian (53%); and 2) the level of the base tone on which the main tonal changes occur (high and medium, respectively). Thus the "word-by-word" melodic contour exists in at least two varieties: with a rising tonal movement, and with a falling tone in the function of an ornamental accent. In general, the intonation system of the Vologda dialect, despite many significant differences, is much closer to Modern Standard Russian than to the Arkhangelsk dialect.
Kolesnikova A., Kuratov Y., Konovalov V., Burtsev M.
Knowledge Distillation of Russian Language Models with Reduction of Vocabulary
Today, transformer language models serve as a core component for the majority of natural language processing tasks. Industrial application of such models requires minimization of computation time and memory footprint. Knowledge distillation is one approach to this goal. Existing methods in this field mainly focus on reducing the number of layers or the dimension of embeddings/hidden representations. An alternative option is to reduce the number of tokens in the vocabulary and therefore the embedding matrix of the student model. The main problem with vocabulary minimization is the mismatch between the input sequences and output class distributions of the teacher and student models. As a result, it is impossible to apply KL-based knowledge distillation directly. We propose two simple yet effective alignment techniques that make knowledge distillation possible for students with a reduced vocabulary. Evaluation of the distilled models on a number of common benchmarks for Russian, such as Russian SuperGLUE, SberQuAD, RuSentiment, ParaPhraser and Collection-3, demonstrated that our techniques allow us to achieve compression from 17× to 49×, while maintaining the quality of a 1.7×-compressed student with the full-sized vocabulary but a reduced number of Transformer layers. We make our code and distilled models available.
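One way the teacher/student output mismatch can be bridged is by restricting the teacher distribution to the student vocabulary and renormalizing, so that a KL-style loss is defined over a shared label set (a simplified sketch under that assumption; the paper's two alignment techniques may differ):

```python
def align_teacher_probs(teacher_probs, teacher_vocab, student_vocab):
    # Keep only the teacher probabilities for tokens the student knows,
    # then renormalize them into a proper distribution.
    kept = [(tok, teacher_probs[teacher_vocab[tok]]) for tok in student_vocab]
    total = sum(p for _, p in kept) or 1.0
    return {tok: p / total for tok, p in kept}
```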
Kolmogorova A.V., Kalinin A.A.
Emotion analysis of VKontakte posts: classifier or regressor?
The article discusses the results of solving two machine learning tasks: classifying Russian-language social media texts by dominant emotion, and a regression task in which the emotions in the same texts are predicted. The experiments are based on a dataset compiled by the authors of 3,879 texts from VKontakte public pages, annotated by 2,000 assessors on the Toloka crowdsourcing platform. Annotation was carried out using a purpose-built interface for non-discrete emotional labeling of texts.
Korzun V., Gadecky D., Berzin V., Ilin A.
Speaker-agnostic mouth blendshape prediction from speech
This paper describes a simple end-to-end deep learning approach to automated 3D lip animation from audio. Our solution is speaker-independent: once trained on one voice, the model can be applied to any voice without retraining. The solution requires only a small amount of data, which can easily be obtained with a modern iPhone. We also propose a new combined approach to evaluating blendshape prediction models.
Kotelnikov E., Loukachevitch N., Nikishina I., Panchenko A.
RuArg-2022: Argument Mining Evaluation
Argumentation analysis is a field of computational linguistics that studies methods for extracting arguments from texts and the relationships between them, as well as for building the argumentation structure of texts. This paper is a report by the organizers on the first competition of argumentation analysis systems dealing with Russian-language texts within the framework of the Dialogue conference. During the competition, the participants were offered two tasks: stance detection and argument classification. A corpus containing 9,550 sentences (comments on social media posts) on three topics related to the COVID-19 pandemic (vaccination, quarantine, and wearing masks) was prepared, annotated, and used for training and testing. The system that won first place in both tasks used the NLI (Natural Language Inference) variant of the BERT architecture, automatic translation into English to apply a specialized BERT model retrained on Twitter posts discussing COVID-19, as well as additional masking of target entities. This system showed the following results: an F1-score of 0.6968 for the stance detection task and an F1-score of 0.7404 for the argument classification task. We hope that the prepared dataset and baselines will help foster further research on argument mining for the Russian language.
Кротова Е.Б., Цветаева Е.Н., Шарандин А.В., Добровольский Д.О.
Вариативность грамматической нормы: возможности корпусного и количественного анализа (на материале немецкого предлога wegen)
Studying variation in grammatical norms using corpus data is no longer difficult in itself; however, corpus annotation relies on particular linguistic theories and does not always take the empirical material sufficiently into account. Using the German preposition wegen 'because of' and the variation in its case government as an example, the article shows why the tools available in corpus managers prove insufficient, and presents an experiment in which corpus data are automatically annotated by a purpose-built algorithm. Analysis of the corpus material showed that the genitive remains the basic case governed by wegen (about 60% of all uses), while the dative, recognized by modern linguistics as a legitimized variant, is relatively rare (5%). The analysis also revealed that in 35% of cases the noun after wegen carries no case marking at all, a fact not mentioned in the corresponding entries of any known lexicographic source. It was impossible to determine the share of such cases using the standard annotation of the corpora examined, since the annotation did not provide for a category of caseless government.
Кустова Г.И.
Сентенциальные актанты ментальных предикатов с союзом когда (по данным Национального корпуса русского языка)
It is well known that matrix predicates of emotion and evaluation take sentential complements not only with the conjunction что 'that' (Обидно / плохо, что команда проиграла 'It is a shame / bad that the team lost'), but also with the conjunctions когда 'when' and если 'if': Плохо, когда / если команда проигрывает 'It is bad when / if the team loses'. Data from the Russian National Corpus show that predicates of other semantic classes, not mentioned in grammars, can also take clauses with когда and если. The article discusses examples of когда-clauses with mental predicates (знать 'know', помнить 'remember', понимать / понятно 'understand / it is understandable'): Я помню, когда по Бородинскому мосту ходили трамваи 'I remember when trams ran across the Borodinsky Bridge'; Понятно, когда клетки формируются в ходе развития зародыша, но во взрослом организме? 'It is understandable when cells form during embryonic development, but in an adult organism?'.
L
Левонтина И.Б.
Милый идеал
The paper examines the semantics, polysemy structure, and government pattern of the word идеал 'ideal'. At first glance, идеал does not seem complicated either in its semantics and polysemy structure or in its argument structure. X is the ideal of Y from the point of view of Z means that the object X, real or imaginary, belongs to the class Y and fully matches Z's notion of what an object of class Y should be; the match is so complete, in fact, that it hardly ever occurs in real life. The paper considers some non-trivial properties of this word. In particular, it turns out that a genitive attached to идеал can fill three slots: not only идеал Пети 'Petya's ideal' (= the ideal Petya holds) and идеал жены 'the ideal of a wife' (the notion of what a wife should be), but also идеал служения 'the ideal of service', in the sense that service itself constitutes the content of the ideal. Moreover, the semantics and polysemy structure of идеал have changed over the word's history in Russian. Pushkin's phrase Татьяны милый идеал implies not that Tatyana holds some ideal, but that Tatyana herself is the ideal; more precisely, идеал Татьяны here means the same as the image of Tatyana. For different words sharing this root, the polysemy structure developed in different ways.
Li Bin, Weng Yixuan, Song Qiya, Deng Hanjun
Artificial Text Detection with Multiple Training Strategies
As deep learning rapidly advances, artificial texts created by generative models are commonly used in news and social media. However, such models can be abused to generate product reviews, fake news, and even fake political content. This paper proposes a solution for the Russian Artificial Text Detection shared task at Dialogue 2022 (RuATD 2022), whose goal is to distinguish which model from a given list was used to generate a text. We apply the DeBERTa pre-trained language model with multiple training strategies to this shared task. Extensive experiments conducted on the RuATD dataset validate the effectiveness of our proposed method. Moreover, our submission took second place in the evaluation phase of RuATD 2022 (Multi-Class).
Lobanov B.M., Zhitko V.A.
Method and Software Model for Evaluating the Statistical Characteristics of a Speech Melody
A method for estimating the statistical characteristics of speech melody is proposed. The procedure of constructing histograms of the distribution of the pitch frequency over sufficiently long intervals of speech is described. A distinctive feature of the method is that the discrete pitch values of speech are measured only within vowel intervals. Two pitch scales used for the analysis of melody characteristics are selected, namely linear for speech and logarithmic for singing. A method for estimating three parameters of the histogram is proposed: register (R), range (D), and asymmetry (A). Numerous examples are given showing the effectiveness of the proposed method in assessing the individuality of a speaker's speech melody as well as their emotional state. A description of the prototype of the SpeechMelodyMeter (SMM) system is given (see also https://intontrainer.by). SMM is a software implementation of the proposed method for assessing the statistical characteristics of speech melody.
M
Maloyan N., Nutfullin B., Ilyshin E.
DIALOG-22 RuATD Generated Text Detection
Text Generation Models (TGMs) succeed in creating text that matches human language style reasonably well. Detectors that can distinguish between TGM-generated and human-written text play an important role in preventing abuse of TGMs. In this paper, we describe our pipeline for the two DIALOG-22 RuATD tasks: detecting generated text (binary task) and classifying which model was used to generate a text (multiclass task) (Shamardina et al., 2022). We achieved 1st place on the binary classification task with an accuracy score of 0.82995 on the private test set and 4th place on the multiclass classification task with an accuracy score of 0.62856 on the private test set. We propose an ensemble method of different pre-trained models based on the attention mechanism.
Movsesyan A.A.
Russian neural morphological tagging: do not merge tagsets
There are multiple morphologically annotated corpora of Russian available. They have different tagsets and annotation guidelines, which makes them difficult to use together. We proposed a neural morphological tagger for Russian based on a multitask learning technique which is able to predict the morphological tags of words for different tagsets. We evaluated our model on various corpora and showed that utilising multiple corpora without merging them not only improves tagging performance but also allows for scalable indirect conversion between multiple tagsets in all directions. Furthermore, we showed that treating each corpus separately is more efficient than merging the corpora even if they share the same tagset.
O
Orzhenovskii M.V.
Detecting Auto-generated Texts with Language Model and Attacking the Detector
We propose a simple approach to the detection of automatically generated texts. A pre-trained language model, fine-tuned on the shared task's dataset, achieved 3rd place on the binary task leaderboard with 82.6% accuracy. On the multiclass leaderboard, the language model achieved an F1 score of 64.5% after being fine-tuned with the same procedure. To investigate the weaknesses of this approach, we explore two possible attacks on the detector: selecting from language model outputs and directed beam search. These attacks reduce the likelihood of detecting the generated texts without significant loss in quality. Neither attack requires retraining the generative model, and both are applied at inference time.
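The first attack, selecting from language model outputs, can be sketched abstractly: sample several candidate generations and keep the one the detector rates as least likely to be machine-generated. Everything below is a hypothetical illustration, not the authors' code; the toy `detector_score` heuristic, the stand-in `generate` function, and the sample count are invented for the sketch.

```python
import random

def detector_score(text):
    """Hypothetical detector: returns a pseudo-probability of being generated.
    Here a toy heuristic (density of exclamation marks) stands in for a model."""
    return text.count("!") / max(len(text.split()), 1)

def generate(prompt, rng):
    """Hypothetical generator producing one of several canned continuations."""
    endings = ["indeed.", "obviously!", "as expected.", "wow!!"]
    return prompt + " " + rng.choice(endings)

def evade_detector(prompt, n_samples=8, rng=None):
    """Sample n_samples continuations and return the most human-looking one."""
    rng = rng or random.Random(0)
    candidates = [generate(prompt, rng) for _ in range(n_samples)]
    # Keep the candidate the detector is least confident about.
    return min(candidates, key=detector_score)
```

The generative model itself is untouched; only its outputs are filtered at inference time, which is why the attack is cheap to mount.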
P
Пекелис О.Е.
Русские временные клаузы на шкале семантико-синтаксической интеграции (на примере сочинительного союза когда)
The article argues for the existence in Russian of a coordinating temporal conjunction когда 'when'. Because the semantic difference between coordinating and subordinating когда is elusive, the two are hard to tell apart using the standard criteria of coordination and subordination. The paper resolves this difficulty as follows: first, contexts are identified in which the two kinds of когда are clearly distinguishable on semantic and formal grounds; then, in these contexts, когда is analyzed against the criteria. In the syntactic literature, temporal clauses are usually considered more tightly integrated with the main clause than, for example, causal or concessive ones. The conclusion that coordinating temporal conjunctions exist calls for a revision of this view.
Petrova M., Ponomareva M., Ivoylova A.
The Pilot Corpus of the English Semantic Sketches
The paper is devoted to the creation of semantic sketches for English verbs. The pilot corpus consists of English-Russian sketch pairs and aims to show what kind of contrastive studies the sketches help to conduct. Special attention is paid to cross-language differences between sketches with similar semantics. Moreover, we discuss the process of building a semantic sketch and analyse the mistakes that could give insight into the linguistic nature of sketches.
Pletenev S.
Between Denoising and Translation: Experiments in Text Detoxification
This paper describes a solution for the RUSSE Detoxification competition held as part of the Dialogue 2022 conference. The paper presents experiments based on autoregressive and non-autoregressive models. Two approaches are described: 1) detoxification as a special case of the text style transfer problem, using modern approaches to this task in Russian; 2) using an Automatic Post-Editing algorithm, treating detoxification as translation from toxic to normative Russian text. The article provides an analysis of the listed models and their results in detoxifying sentences, as well as an analysis of errors and the reasons why the models gave such diverse results.
Подлесская В.И.
«Потому что больше никто не читает прозу»: грамматика и просодия автономных причинных придаточных по корпусным данным
Drawing on the multimedia subcorpus of the Russian National Corpus, the paper examines the syntactic, pragmatic, and prosodic properties of clauses with the conjunction потому что 'because' in autonomous uses. Quantitative analysis showed that in spoken discourse autonomous clauses, i.e. clauses forming a separate utterance, account for more than 30% of all occurrences of потому что clauses. Qualitative analysis showed that such uses (1) follow a fragment realized with the intonation of completeness and are separated from it by a prosodic break; (2) can carry an illocutionary force that differs from that of adjacent discourse fragments and can form an independent turn in dialogue; and (3) allow rightward dislocation of the conjunction, which is impossible in non-autonomous uses.
Posokhov P., Skrylnikov S., Makhnytkina O.
Artificial text detection in Russian language: a BERT-based Approach
This paper describes our solution for the RuATD (Russian Artificial Text Detection) competition held within the Dialogue 2022 conference. Our approach is based on the idea of transfer learning, using the pre-trained RuRoBERTa, RuBERT, RuGPT3, and RuGPT2 models. The final solution included byte-level Byte-Pair Encoding tokenization and a fine-tuned RuRoBERTa model. The system achieved an accuracy of 0.65 and took first place in the multiclass classification task.
Post M.
Spoken corpora of spontaneous speech as a source to study polar question intonation in Russian dialects
The emergence of several online spoken corpora of Russian regional speech opens new possibilities for the study of regional Russian intonation. The Russian dialect corpora of the Linguistic Convergence Laboratory [32; 1–10] were used to study the intonation of polar (yes/no) questions in regional rural speech. Although using spontaneous speech to study intonation is a challenge, the corpora are large enough to show general tendencies. The typical rising-falling pitch accent of most polar questions in Central Standard Russian is predominant in the regional corpora as well, but with possible variation in phonetic implementation and in the association of the fall. This accent is the most common even in the majority of question utterances with lowered questionhood, and dominates even in the regions known for rising accents in questions. The corpora show that tag questions are frequent in these interview data, unlike the question particles li, ti and či. Not only the dialectal particles ti and či, but also the Standard Russian question particle li shows a varying regional distribution.
R
Rozhkov I.S., Loukachevitch N.V.
Machine Reading Comprehension Model in RuNNE Competition
The paper studies the machine reading comprehension (MRC) model (Li et al., 2020) as applied to extracting nested named entities (nested NER) in the RuNNE-2022 evaluation (Artemova et al., 2022). The model transforms the named entity recognition task into a question-answering task. In this paper we compare several approaches to formulating "questions" for the MRC model, such as entity type names (keywords), entity type definitions, the most frequent examples from the training set, and combinations of definitions and examples. We found that using the two most frequent examples from the training set is comparable in nested NER quality to gathering qualitative definitions from different dictionaries, which is much more complicated. In the RuNNE evaluation, the MRC model obtained the best results among models without any manual work (rules or additional manual annotation of texts). Keywords: nested named entities, RuNNE evaluation, machine reading comprehension.
S
Shamardina T., Mikhailov V., Chernianskii D., Fenogenova A., Saidov M., Valeeva A., Shavrina S., Smurov I., Tutubalina E., Artemova E.
Findings of the The RuATD Shared Task 2022 on Artificial Text Detection in Russian
We present the shared task on artificial text detection in Russian, organized as part of the Dialogue Evaluation initiative held in 2022. The shared task dataset includes texts from 14 text generators, i.e., one human writer and 13 text generative models fine-tuned for one or more of the following generation tasks: machine translation, paraphrase generation, text summarization, and text simplification. We also consider back-translation and zero-shot generation approaches. The human-written texts are collected from publicly available resources across multiple domains. The shared task consists of two sub-tasks: (i) to determine if a given text is automatically generated or written by a human; (ii) to identify the author of a given text. The first task is framed as a binary classification problem, the second as a multi-class classification problem. We provide count-based and BERT-based baselines, along with a human evaluation on the first sub-task. A total of 30 and 8 systems have been submitted to the binary and multi-class sub-tasks, respectively. Most teams outperform the baselines by a wide margin. We publicly release our codebase, human evaluation results, and other materials in our GitHub repository.
Шерстинова Т. Ю., Москвина А. Д., Кирина М. А., Карышева А. С., Колпащикова Е. О.
Тематическое моделирование русского рассказа 1900–1930: наиболее частотные темы и их динамика
The article reports the results of an experiment in building topic models of short Russian prose (the Russian short story) for three consecutive historical periods of the early 20th century: 1) the early 20th century up to and including 1913, 2) the war and revolution period (1914–1922), and 3) the early Soviet period (1923–1930). Using Latent Dirichlet Allocation (LDA), nine models were built (three samples of different sizes per period: 100, 500, and 1000 short stories). Each model turned out to contain highly frequent "themes" (topics) characterizing a substantial share of the texts in each sample with high probability, and these frequent topics show meaningful dynamics across the time periods, which makes it possible to treat them as thematic-stylistic markers of the analyzed text collections alongside more traditional quantitative measures of text analysis. The diversity of frequent topics proved higher in the second and third periods (for the samples of 500 and 1000 stories), which can be explained by the greater lexical and stylistic diversity of the prose of the "era of change".
T
Татевосов С.Г., Киселева К.Л.
Русский делимитатив: линейный порядок или движение к кульминации?
The article discusses restrictions on the distribution of delimitatives derived from verbs describing culminating processes. Building on H. R. Mehlig's idea that homogenization of such processes is a necessary condition for delimitative formation, we propose a modal interpretation of the notion of homogeneity: processes are homogeneous if, as they unfold, their position on a scale reflecting the size of the set of metaphysically accessible non-culminating worlds does not change substantially.
Trofimchuk D.
Distilled Model for Russian News Clustering: much lighter and faster, still accurate
This paper explores the abilities of knowledge distillation for news clustering, which can also be generalized as an event detection task. We used a BERT-based clustering model as a teacher and tested various student networks based on different architectures (RNN, FFN, convolutional, and Transformer-based networks) in order to obtain a faster, lightweight analogue that is more likely to be deployed in real products. We tried two distillation strategies: the first combined the original loss function of the initial model with a distillation objective; for the second we used only a specific distillation loss. The latter approach turned out to be more successful: it let us extend the training and validation datasets and gave significantly better results. One of our distilled models scores about 1% lower than the teacher network but is more than 20 times smaller and 5 times faster at inference.
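The two strategies can be sketched abstractly. The snippet below is a minimal illustration of a combined distillation objective, not the authors' implementation; the temperature, the mixing weight, and the pure-Python formulation are assumptions made for clarity.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the student's,
    scaled by T^2 (conventional, so gradient magnitudes stay comparable)."""
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

def combined_loss(task_loss, student_logits, teacher_logits, weight=0.5):
    """First strategy: the original task loss mixed with the distillation term."""
    return task_loss + weight * distillation_loss(student_logits, teacher_logits)

# A student that matches the teacher exactly incurs no distillation loss;
# the second strategy corresponds to training on distillation_loss alone.
print(round(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]), 6))  # prints 0.0
```

The second strategy drops `task_loss` entirely, which also frees the student to train on unlabeled data: any text the teacher can score can extend the training set, as the abstract notes.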
V
Voloshina E., Serikov O., Shavrina T.
Is neural language acquisition similar to natural? A chronological probing study
The probing methodology allows one to obtain a partial representation of the linguistic phenomena stored in the inner layers of a neural network, using external classifiers and statistical analysis. Pre-trained transformer-based language models are widely used for both natural language understanding (NLU) and natural language generation (NLG) tasks, making them a common choice for downstream applications. However, little analysis has been carried out on whether these models are pre-trained sufficiently or contain knowledge correlated with linguistic theory. We present a chronological probing study of transformer English models such as MultiBERT and T5. We sequentially compare the information about the language learned by the models in the process of training on corpora. The results show that 1) linguistic information is acquired in the early stages of training; 2) both language models demonstrate the ability to capture various features from various levels of language, including morphology, syntax, and even discourse, while they can also inconsistently fail on tasks that are perceived as easy. We also introduce an open-source framework for chronological probing research, compatible with other transformer-based models: https://github.com/EkaterinaVoloshina/chronological_probing
Vychegzhanin S.V., Kotelnikov E.V.
Collocation2Text: Controllable Text Generation from Guide Phrases in Russian
Large pre-trained language models are capable of generating varied and fluent texts. Starting from a prompt, these models generate a narrative that can develop unpredictably. Existing methods of controllable text generation, which guide the narrative of a text in a user-specified direction, require creating a training corpus and an additional time-consuming training procedure. The paper proposes and investigates Collocation2Text, a plug-and-play method for automatic controllable text generation in Russian which does not require fine-tuning. The method is based on two interacting models: the autoregressive ruGPT-3 language model and the autoencoding ruRoBERTa language model. The idea of the method is to shift the output distribution of the autoregressive model according to the output distribution of the autoencoding model in order to ensure a coherent transition of the narrative towards the guide phrase, which can contain single words or collocations. The autoencoding model, which is able to take into account the left and right contexts of a token, "tells" the autoregressive model which tokens are the most and least logical at the current generation step, increasing or decreasing the probabilities of the corresponding tokens. Experiments on generating news articles with the proposed method showed its effectiveness in automatically generating fluent texts that contain coherent transitions between user-specified phrases.
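The logit-shifting idea can be illustrated with a toy sketch. Everything below is a schematic assumption rather than the authors' code: the function names, the mixing weight `alpha`, and the four-token vocabulary are invented for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def shifted_logits(ar_logits, mlm_logits, alpha=0.5):
    """Shift autoregressive next-token scores by masked-LM preferences.

    ar_logits:  next-token logits from the left-to-right model (ruGPT-3-like).
    mlm_logits: logits for the same position from the bidirectional model
                (ruRoBERTa-like), which sees the right context containing
                the guide phrase and so "knows" where the narrative heads.
    alpha:      strength of the shift (an assumed hyperparameter).
    """
    ar = log_softmax(ar_logits)
    mlm = log_softmax(mlm_logits)
    # Raise tokens the bidirectional model finds logical, lower the rest.
    return [a + alpha * b for a, b in zip(ar, mlm)]

# Toy 4-token vocabulary: the AR model prefers token 0, but the MLM,
# aware of the upcoming guide phrase, strongly prefers token 2.
ar = [2.0, 1.0, 0.5, 0.1]
mlm = [0.0, 0.0, 5.0, 0.0]
combined = shifted_logits(ar, mlm)
best = max(range(len(combined)), key=combined.__getitem__)
print(best)  # prints 2: the shift steers the greedy choice toward the MLM's pick
```

Because the shift is applied at each decoding step, no retraining of either model is needed, which is what makes the method plug-and-play.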
Y
Янко Т.Е.
Метод поиска просодических данных по ключевым словам
The keyword method can be applied to information retrieval in a body of spoken data. The paper analyzes communicative meanings and their compositions that are expressed by prosody, as well as the structure of the segmental material carrying these meanings. Meanings and their compositions expressed prosodically can have their own segmental correlates: lexemes and other language units. It is these correlates that are used as search keywords. The search results serve as material for analyzing the syntactic structure of the communicative components of sentences, such as the themes and rhemes of messages, components of special illocution types (dreams, recollections, justifications), and compositions of illocutions with discourse incompleteness. The spoken material comes from the Multimodal subcorpus of the Russian National Corpus, the corpus "Stories about dreams" and other corpora of spoken speech (Spokencorpora.ru), and the video hosting service YouTube. Instrumental analysis of the data was carried out with the Praat speech analysis system. The paper is illustrated with plots of pitch changes in the sound data.
Z
Zimmerling A.V.
Historical Text Corpora and the Conclusiveness of Linguistic Analysis
I discuss the methodology and conclusiveness of corpus-based historical linguistics and analyze two formal models predicting language-internal variation in Early Old Russian syntax. Linguistic models claiming a rigid distribution of grammatical features, such as the ± overt realization of agreement markers, activate hidden corpus characteristics such as profiles of text genres, chronology, vector of change, ± impact of L2, and ± presence of supra-dialect features. In that case they can be evaluated and checked on text samples where genre features are stable while location and time vary.
Zinina A., Kotov A., Zaidelman L., Arinkin N.
Human Communicative Responses to Different Modes of Gaze Management by the Robot
We investigated the communicative reactions of people (N = 46) telling stories to two companion robots that reacted differently to the human gaze (head turning). In response to a human gaze, the "aversive" robot averted its gaze away from the user, while the "responsive" robot lifted its head and showed a responsive gaze. We found that users with a high level of emotional intelligence prefer the gaze-responsive robot and better recognize the difference between the robots; thus, these users constitute the core group for the technology. In this paper, we further examine behavioral patterns of people in the experimental situation: (a) shift of attention to the story; (b) shift of attention to the robot; (c) joint attention. We also distinguish the communicative reactions of people, mainly from the core group, to the aversive and responsive gazes of the robots: positive responses to gaze contact and negative responses to gaze aversion. We show that for some users the responsive gaze behavior of the robot may serve as positive feedback, increasing the number of iconic gestures made while telling a story to the responsive robot and decreasing their number in a story told to the aversive robot.
