Proceedings 2023

Contents (SCOPUS)

Main volume (SCOPUS)

Supplementary volume (RSCI)

Begaev A., Orlov E.

Receipt-AVQA-2023 Challenge

In this work, we introduce a new challenging Document VQA dataset, named Receipt AVQA, and present the results of the associated RECEIPT-AVQA-2023 shared task. Receipt AVQA is comprised of 21835 questions in English over 1957 receipt images. The receipts contain a lot of numbers, which means discrete reasoning capability is required to answer the questions. The associated shared task has attracted 4 teams that have managed to beat an extractive VQA baseline in the final phase of the competition. We hope that the published dataset and promising results of the contestants will inspire further research on understanding documents in scenarios that require discrete reasoning.

Boguslavsky I., Dikonov V., Inshakova E., Iomdin L., Lazursky A., Rygaev I., Timoshenko S., Frolova T.

Constructing a Semantic Corpus for Russian: SemOntoCor

The SemOntoCor project focuses on creating a semantic corpus of Russian based on linguistic and ontological resources. It is a satellite project with regard to a semantic parser (SemETAP) being developed, the latter aiming at producing semantic structures and drawing various types of inferences. SemETAP is used to annotate SemOntoCor in a semi-automatic mode, whereupon SemOntoCor, when reaching sufficient maturity, will help create new parsers and other semantic applications. SemOntoCor can be viewed as a further step in the development of SynTagRus with its several layers of annotation. SemOntoCor builds on top of the morpho-syntactic annotation of SynTagRus and assigns each sentence a Basic Semantic Structure (BSemS). BSemS represents the direct layer of meaning of the sentence in terms of ontological concepts and semantic relations between them. It abstracts away from lexico-syntactic variation and in many cases decomposes lexical meanings into smaller elements. The first phase of SemOntoCor consists in annotating a Russian translation of the novel “The Little Prince” by Antoine de Saint-Exupery (1532 sentences, 13120 tokens).

Bolshakov V., Mikhaylovskiy N.

Pseudo-Labelling for Autoregressive Structured Prediction in Coreference Resolution

Coreference resolution is an important task in natural language processing, since it can be applied to such vital tasks as information retrieval, text summarization, question answering, sentiment analysis and machine translation. In this paper, we present a study on the effectiveness of several approaches to coreference resolution, focusing on the RuCoCo dataset as well as results of participation in the Dialogue Evaluation 2023. We explore ways to increase the dataset size by using pseudo-labelling and data translated from another language. Using such techniques, we managed to triple the size of the dataset, make it more diverse and improve the performance of autoregressive structured prediction (ASP) on the coreference resolution task. This approach allowed us to achieve the best results on the RuCoCo private test set, with an increase in F1-score of 1.8, Precision of 0.5 and Recall of 3.0 points compared to the second-best leaderboard score. Our results demonstrate the potential of the ASP model and the importance of utilizing diverse training data for coreference resolution.

Chistova E., Smirnov I.

Light Coreference Resolution for Russian with Hierarchical Discourse Features

Coreference resolution is the task of identifying and grouping mentions referring to the same real-world entity. Previous neural models have mainly focused on learning span representations and pairwise scores for coreference decisions. However, current methods do not explicitly capture the referential choice in the hierarchical discourse, an important factor in coreference resolution. In this study, we propose a new approach that incorporates rhetorical information into neural coreference resolution models. We collect rhetorical features from automated discourse parses and examine their impact. As a base model, we implement an end-to-end span-based coreference resolver using a partially fine-tuned multilingual entity-aware language model LUKE. We evaluate our method on the RuCoCo-23 Shared Task for coreference resolution in Russian. Our best model employing rhetorical distance between mentions has ranked 1st on the development set (74.6% F1) and 2nd on the test set (73.3% F1) of the Shared Task. We hope that our work will inspire further research on incorporating discourse information in neural coreference resolution models.

Chuikova O.

Partitive genitive in Russian: dictionary and corpus data

The paper aims at a comprehensive analysis of the verbs compatible with the partitive genitive object. Based on the Dictionary of Russian Language, a list of perfective verbal lexemes that are able to take the genitive object is compiled, and the semantic features that unite these verbs are revealed. The features are divided into two groups: aspectually relevant features and aspectually irrelevant features. The corpus-based analysis of the use of the verbs that take both genitive and accusative objects makes it possible to identify features that increase the likelihood of a certain object case-marking.

Dvoynikova A., Karpov A.

Bimodal sentiment and emotion classification with multi-head attention fusion of acoustic and linguistic information

This article describes solutions to two problems: CMU-MOSEI database preprocessing to improve data quality and bimodal multitask classification of emotions and sentiments. With the help of experimental studies, representative features for acoustic and linguistic information are identified among pretrained neural networks with Transformer architecture. The most representative features for the analysis of emotions and sentiments are EmotionHuBERT and RoBERTa for audio and text modalities respectively. The article establishes a baseline for bimodal multitask recognition of sentiments and emotions: 63.2% and 61.3%, respectively, measured with macro F-score. Experiments were conducted with two approaches to combining modalities: concatenation and multi-head attention. The most effective neural network architecture, with early concatenation of audio and text modalities and late multi-head attention for emotion and sentiment recognition, is proposed. The proposed neural network is combined with logistic regression, which achieves 63.5% and 61.4% macro F-score in bimodal (audio and text) multitask recognition of 3 sentiment classes and 6 binary emotion classes.

Fedorova O.

Introduction model in Russian «Pear reportages»: The role of common ground

In this study, the peculiarities of character introduction in the genre of live reportage were studied. The participants were 25 students of Lomonosov Moscow State University. Speech production was elicited by means of the “Pears Film” by W. Chafe. Different types of collective common ground were considered. It turned out that, unlike in narratives of other genres, the chronological scale is more important for the introduction than the status scale. It was also shown that, in terms of introduction peculiarities, the collected reportages are more similar to classical retellings than to sports reportages.

Filimonova E.

Foreground and background in Russian Sign Language narratives: the role of aspect and actionality

The paper explores the role of aspect and actionality in the foregrounding and backgrounding of clauses in Russian Sign Language narratives. A corpus study shows similarities to the functions of aspectual markers and actionality in spoken languages. Besides grammatical markers and predicate types, non-manual marking and prosodic features of the verbal sign can contribute to clause foregrounding and backgrounding.

Galitsky B., Ilvovsky D., Goncharova E.

Multimodal Discourse Trees in Forensic Linguistics

We extend the concept of a discourse tree (DT) in the discourse representation of text towards data of various forms and natures. The communicative DT, which incorporates speech act theory, the extended DT, which ascends to the level of multiple documents, and the entity DT, which tracks how discourse covers various entities, were defined previously in computational linguistics. We now proceed to the next level of abstraction and formalize the discourse of not only text and textual documents but also various kinds of accompanying data. We call such a discourse representation Multimodal Discourse Trees (MMDTs). The rationale is that the same rhetorical relations that hold between text fragments also hold between data values, sets and records, such as Reason, Cause, Enablement, Contrast, Temporal sequence. MMDTs are evaluated with respect to the accuracy of recognition of criminal cases when both text and data records are available. MMDTs are shown to contribute significantly to the recognition accuracy in cases where just keywords and syntactic signals are insufficient for classification and discourse-level information needs to be involved.

Gerasimenko N., Chernyavskiy A., Nikiforova M., Ianina A., Vorontsov K.

Incremental Topic Modeling for Scientific Trend Topics Extraction

The rapid growth of scientific publications and the intensive emergence of new directions and approaches pose a challenge to the scientific community: identifying trends in a timely and automatic manner. We define a trend as a semantically homogeneous theme that is characterized by a lexical kernel steadily evolving in time and a sharp, often exponential, increase in the number of publications. In this paper, we investigate recent topic modeling approaches to accurately extract trending topics at an early stage. In particular, we customize the standard ARTM-based approach and propose a novel incremental training technique which helps the model to operate on data in real time. We further create the Artificial Intelligence Trends Dataset (AITD) that contains a collection of early-stage articles and a set of key collocations for each trend. The conducted experiments demonstrate that the suggested ARTM-based approach outperforms the classic PLSA and LDA models and a neural approach based on BERT representations. Our models and dataset are open for research purposes.

Glazkova A.

Fine-tuning Text Classification Models for Named Entity Oriented Sentiment Analysis of Russian Texts

The paper presents an approach to named entity oriented sentiment analysis of Russian news texts proposed during the RuSentNE evaluation. The approach is based on RuRoBERTa-large, a pre-trained RoBERTa model for Russian. We compared several types of entity representation in the input text, and evaluated strategies for handling class imbalance and resampling entity tags in the training set. We demonstrated that some strategies improve the results of pre-trained models obtained on the dataset presented by the organizers of the evaluation.

Goloviznina V., Fishcheva I., Peskisheva T., Kotelnikov E.

Aspect-based Argument Generation in Russian

The paper explores argument generation in Russian based on given aspects. An aspect refers to one of the sides or properties of the target object. Five aspects were considered: "Safety", "Impact on health", "Reliability", "Money", "Convenience and comfort". Various approaches were used for aspect-based generation: fine-tuning, prompt-tuning and few-shot learning. The ruGPT-3Large model was used for experiments. The results show that the traditionally trained model (with fine-tuning) generates 51.6% of the arguments on the given aspects, compared with 33.9% for the prompt-tuning approach and 10.6% for few-shot learning. The model also demonstrated the ability to generate arguments on new, previously unknown aspects.

Golubev A., Rusnachenko N., Loukachevitch N.

RuSentNE-2023: Evaluating Entity-Oriented Sentiment Analysis on Russian News Texts

The paper describes the RuSentNE-2023 evaluation devoted to targeted sentiment analysis in Russian news texts. The task is to predict sentiment towards a named entity in a single sentence. The dataset for the RuSentNE-2023 evaluation is based on the Russian news corpus RuSentNE, which has rich sentiment-related annotation. The corpus is annotated with named entities and sentiments towards these entities, along with related effects and emotional states. The evaluation was organized using the CodaLab competition framework. The main evaluation measure was the macro-averaged F-measure over the positive and negative classes. The best result achieved was 66% macro F-measure (Positive+Negative classes). We also tested ChatGPT on the test set from our evaluation and found that the zero-shot answers provided by ChatGPT reached an F-measure of 60%, which corresponds to 4th place in the evaluation. ChatGPT also provided detailed explanations of its conclusions. This result can be considered quite high for a zero-shot application.
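As an illustration of the measure named above, the following is a minimal sketch of a macro-averaged F-measure restricted to the positive and negative classes. The label strings, the function names and the toy data are assumptions made for this example; the organizers' actual scoring script is not described in the abstract.

```python
def f1_for_class(gold, pred, cls):
    """F1 for a single class, computed from parallel label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1_pn(gold, pred, classes=("positive", "negative")):
    """Average the per-class F1 over the listed classes only.

    The neutral class is not averaged in, but neutral items still
    count as false positives or false negatives of the other classes.
    """
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


# Toy data (hypothetical labels, not taken from the RuSentNE corpus).
gold = ["positive", "negative", "neutral", "negative", "positive", "neutral"]
pred = ["positive", "neutral", "neutral", "negative", "negative", "positive"]
score = macro_f1_pn(gold, pred)
print(f"macro F1 (positive+negative): {score:.2f}")
```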

Gorbova E., Chuikova O.

Frequency dynamics as a criterion for differentiating inflection and word formation (in relation to Russian aspectual pairs)

The paper reports the results of a critical evaluation of the quantitative approach to the distinction between inflection and word formation through the analysis of the trends in the frequency of word forms. The possibility of such analysis is provided by voluminous corpus data and tools for visualizing these trends. Both the theoretical foundations of the proposed approach and the results of a pilot study of its application to Russian aspectual triplets were considered. These cast doubt on the validity of distinguishing between inflection and word formation based on the trends in the frequency of word forms as a reliable tool used to reveal the unity or difference of lexical semantics and thus to define textual units as belonging to the same or different language units.

Gruntov I., Rykov E.

Computer-assisted detection of typologically relevant semantic shifts in world languages

The paper describes a semi-automatic method for the detection of typologically relevant semantic shifts in the world’s languages. The algorithm extracts colexified pairs of meanings from polysemous words in digitised bilingual dictionaries. A machine learning classifier helps to separate those semantic shifts that are relevant to lexical typology. Clustering is applied to group similar pairs of meanings into semantic shifts.
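The extraction step can be pictured with a small sketch. It assumes a hypothetical dictionary format (language, then headword, then a list of meaning glosses) and toy entries invented for this example; it only counts unordered pairs of meanings that share a headword and does not reproduce the classifier or clustering stages.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy dictionaries: language -> headword -> list of glosses.
toy_dictionaries = {
    "lang_a": {"w1": ["moon", "month"], "w2": ["tree"]},
    "lang_b": {"w3": ["month", "moon"], "w4": ["hand", "arm"]},
}


def colexified_pairs(dictionaries):
    """Count unordered meaning pairs expressed by one polysemous word."""
    counts = Counter()
    for entries in dictionaries.values():
        for meanings in entries.values():
            unique = sorted(set(meanings))
            # Words with a single meaning yield no pairs.
            for pair in combinations(unique, 2):
                counts[pair] += 1
    return counts


pair_counts = colexified_pairs(toy_dictionaries)
for pair, n in pair_counts.most_common():
    print(pair, n)
```

Sorting the glosses before pairing makes ("moon", "month") and ("month", "moon") count as the same pair.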

Iriskhanova I., Kiose M., Leonteva A., Agafonova O.

Vague reference in expository discourse: multimodal regularities of speech and gesture

The paper looks into the vague reference expressed in speech and gesture distribution in expository discourse. The research data are the monologues of 19 participants with total length of 2 hours 38 minutes. In these monologues, the use of vague reference (expressed in placeholders and approximators, with total amount of 2528) and functional gesture types (deictic, representational, pragmatic and adaptors, with total amount of 2309) was explored, with the aim of identifying the regular patterns of speech and gesture distribution and co-occurrence. The multimodal regularities include 1) the proportional frequency of four gesture types use equal to 6.8 / 14.4 / 28.7 / 50.1, which manifests overall distribution of co-speech gesture in expository discourse, 2) the significant difference in co-speech gesture use with placeholders and approximators which manifests itself in the use of three gesture types, adaptors, representational and pragmatic gestures, 3) the individually maintained significant difference in co-speech gesture use with placeholders and approximators which manifests itself in adaptors. These regularities can serve as predictors for identifying the specifics of vague reference in multimodal expository discourse.

Ivanov V., Elbayoumi Mohamed Gamal

A new dataset for sentence-level complexity in Russian

Text complexity prediction is a well-studied task. Predicting complexity at the sentence level has attracted less research interest for Russian. One possible application of sentence-level complexity prediction is more precise and fine-grained modeling of text complexity. In the paper we present a novel dataset with sentence-level annotation of complexity. The dataset is open and contains 1,200 Russian sentences extracted from the SynTagRus treebank. Annotations were collected via the Yandex Toloka platform using a 7-point scale. The paper presents various linguistic features that can contribute to sentence complexity as well as a baseline linear model.

Ivoylova A., Dyachkova D., Petrova M., Michurina M.

The problem of linguistic markup conversion: the transformation of the Compreno markup into the UD format

Linguistic markup is an important NLP task. Currently, there are several popular markup formats (Universal Dependencies, Prague Dependencies, and so on), which are mostly focused on morphology and syntax. Full semantic markup can be found in the ABBYY Compreno model. However, the structure of its format differs significantly from the models mentioned above. In the given work, we convert the Compreno markup into the UD format, which is rather popular among NLP researchers, and enrich it with the semantic pattern. Compreno and UD present morphology and syntax differently as far as tokenization, POS-tagging, ellipsis, coordination, and some other things are concerned, which makes the conversion of one format into another more complicated. Nevertheless, the conversion allowed us to create UD markup containing not only morpho-syntactic but also semantic information.

Karpov D., Konovalov V.

Knowledge Transfer Between Tasks and Languages in the Multi-task Encoder-agnostic Transformer-based Models

We explore the knowledge transfer in the simple multi-task encoder-agnostic transformer-based models on five dialog tasks: emotion classification, sentiment classification, toxicity classification, intent classification, and topic classification. We show that these models’ accuracy differs from that of the analogous single-task models by ∼0.9%. These results hold for multiple transformer backbones. At the same time, these models have the same backbone for all tasks, which allows them to have about 0.1% more parameters than any analogous single-task model and to support multiple tasks simultaneously. We also found that if we decrease the dataset size to a certain extent, multi-task models outperform single-task ones, especially on the smallest datasets. We also show that while training multilingual models on the Russian data, adding the English data from the same task to the training sample can improve model performance for the multi-task and single-task settings. The improvement can reach 4-5% if the Russian data are scarce enough. We have integrated these models into the DeepPavlov library and the DREAM dialogue platform.

Kataeva V., Khodorchenko M.

Attention-based estimation of topic model quality

Topic modeling is an essential instrument for exploring and uncovering latent patterns in unstructured textual data that allows researchers and analysts to extract valuable understanding of a particular domain. Nonetheless, topic modeling lacks consensus on the matter of its evaluation. The estimation of obtained insightful topics is complicated by several obstacles, the majority of which are summarized by the absence of a unified system of metrics, the one-sidedness of evaluation, and the lack of generalization. Despite various approaches proposed in the literature, there is still no consensus on the aspects of effective examination of topic quality. In this research paper, we address this problem and propose a novel framework for evaluating topic modeling results based on the notion of attention mechanism and Layer-wise Relevance Propagation as tools for discovering the dependencies between text tokens. One of our proposed metrics achieved a 0.71 Pearson correlation and a 0.74 φK correlation with human assessment. Additionally, our score variant outperforms other metrics on the challenging Amazon Fine Food Reviews dataset, suggesting its ability to capture contextual information in shorter texts.

Kiose M., Rzheshevskaya A., Izmalkova A., Makeev S.

Foregrounding and accessibility effects in the gaze behavior of the readers with different cognitive style

This paper explores accessibility effects in the gaze behavior of readers with different cognitive styles, impulsive and reflective, as mediated by graphological and linguistic foregrounding in the discursive acts in 126 areas of interest (AOIs). The study uses 1890 gaze behavior probes available in the open-access multimodal corpus of oculographic reactions MultiCORText. We identified that while graphological foregrounding makes the initial or final components of a discursive act more accessible for impulsive readers, reflective readers also observe the components within the act. Linguistic foregrounding produces higher access with impulsive readers when the linguistic form is visually focalized (phonological foregrounding and parallel structures); meanwhile, with reflective readers it is the information density of elliptical and one-component sentences that maintains higher access.

Klokova K., Krongauz M., Shulginov V., Yudina T.

Towards a Russian Multimedia Politeness Corpus

Communication involves an exchange of information as well as the use of linguistic means to begin, sustain, and end conversations. Politeness is seen as one of the major language tools that facilitate smooth communication. In English, politeness has been an area of great interest in pragmatics, with various theories and corpus annotation approaches used to understand the relationship between politeness and social categories like power and gender, and to build Natural Language Processing applications. In Russian linguistics, politeness research has largely focused on lexical markers and speech strategies. This paper introduces the ongoing work on the development of the Russian Multimedia Politeness Corpus and discusses an annotation framework for oral communicative interaction, with an emphasis on adapting politeness theories for discourse annotation. The proposed approach lies in the identification of frames that encompass contextual information and the selection of relevant spatial, social, and relational features for the markup. The frames are then used to describe standard situations, which are marked by typical intentions and politeness formulae and paraverbal markers.

Knyazev M.

An experimental study of argument extraction from presuppositional clauses in Russian

The paper discusses two acceptability rating studies testing wh-interrogative and relative extractions of arguments from čto-clauses of presuppositional predicates like žalet’ ‘regret’, as contrasted with nonpresuppositional predicates like nadejat’sja ‘hope’ and nominalized (to čto) clauses. The results show a difference in extraction between bare and nominalized clauses but no difference between presuppositional and nonpresuppositional clauses, raising potential doubts about the analysis of presuppositional clauses as DPs with a silent D.

Korotaev N.

Collaborative constructions in Russian conversations: A multichannel perspective

The talk provides a multichannel description of how interlocutors co-construct utterances in conversation. Using data from the “Russian Pears Chats & Stories”, I propose a tripartite sequential scheme of collaborative constructions. When the scheme is fully realized, its first step not only includes the initial component of the construction, but also presupposes that the first participant makes a request for a co-operative action; the final component of the construction is provided by the second participant during the second step; while the third step consists of the first participant’s reaction. At each step, the participants combine vocal and non-vocal resources to achieve their goals. In some cases, non-vocal phenomena provide an essential clue to what is actually happening during co-construction, including whether the participants act in a truly co-operative manner. I distinguish between three types of communicative patterns that may take place during co-construction: “Requested Cooperation”, “Unplanned Cooperation”, and “Non-realized Interaction”. The data suggest that these types can be influenced by the way the knowledge of the discussed events is distributed among the participants.

Kozlova A., Shevelev D., Fenogenova A.

Fact-checking benchmark for the Russian Large Language Models

Modern text-generative language models are rapidly developing. They produce text of high quality and are used in many real-world applications. However, they still have several limitations, for instance, limited context length, degeneration processes, and a lack of logical structure and factual consistency. In this work, we focus on the fact-checking problem applied to the output of the generative models on classical downstream tasks, such as paraphrasing, summarization, text style transfer, etc. We define the task of internal fact-checking, set the criteria for factual consistency, and present a novel dataset for this task for the Russian language. The benchmark for internal fact-checking and several baselines are also provided. We study data augmentation approaches to extend the training set and compare classification methods on different augmented data sets.

Laposhina A.

Text complexity as a non-discrete value: Russian L2 text complexity dataset annotation based on Elo rating system

The task of assessing text complexity for L2 learners can be approached as either a classification or regression problem, depending on the chosen scale. The primary bottleneck in such research lies in the limited availability of appropriate data samples. This study presents a combined approach to create a dataset of Russian texts for L2 learners, placed on a continuous scale of complexity, involving expert pairwise comparisons and the Elo rating system. For this pilot dataset, 104 texts from Russian L2 textbooks, TORFL tests, and authentic sources were selected and annotated. The resulting data is useful for evaluation of the automated models for assessing text complexity.
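To make the rating idea concrete, here is a minimal sketch of how pairwise expert judgments could place texts on a continuous scale with the standard Elo update. The starting rating of 1500, the K-factor of 32, the text identifiers and the comparison list are all assumptions for illustration; the abstract does not state the settings used for the dataset.

```python
def expected_score(r_a, r_b):
    """Standard Elo expectation that item A 'wins' against item B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, harder, easier, k=32.0):
    """Update ratings in place after an expert judged `harder` as more complex."""
    gain = k * (1.0 - expected_score(ratings[harder], ratings[easier]))
    ratings[harder] += gain
    ratings[easier] -= gain  # the update is zero-sum


# Hypothetical texts, all starting from the same rating.
ratings = {"t1": 1500.0, "t2": 1500.0, "t3": 1500.0}

# Each tuple is (text judged more complex, text judged less complex).
comparisons = [("t3", "t1"), ("t3", "t2"), ("t2", "t1")]
for harder, easier in comparisons:
    update_elo(ratings, harder, easier)

# Texts ordered from least to most complex.
ranking = sorted(ratings, key=ratings.get)
print(ranking, ratings)
```

Because every update moves the two ratings by equal and opposite amounts, the total rating mass stays constant, and the final ratings form a continuous scale that can be used as a regression target.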

Levontina I., Shmeleva E.

Whose word? Problems of lexicographic representation of ideologically marked words (the lexicon of the Russian-Ukrainian conflict)

The article deals with the problems of presenting ideologically marked words in the dictionary. It is based on the analysis of the words that appeared in the Russian language or received new meanings during the Russian-Ukrainian conflict. The difficulty of the lexicographic representation of such words is that their evaluative potential is mobile; for example, offensive nicknames can be assimilated by the “offended” ones and become neutral words. Ideologically marked words can either exist in the lexicon for a long time or be quickly replaced by other lexical units. Therefore, in the interpretation of ideologically marked words, it is advisable to indicate the approximate time of their existence. In addition to temporal indicators, in the dictionary entry of such words, it is necessary to indicate whose word it is, that is, on whose behalf an assessment is given to a person or event. Since we believe that explanatory dictionaries should contain not only common names, but also proper names, the article also discusses geographical names.

Lukichev D., Kryanina D., Bystrova A., Fenogenova A., Tikhonova M.

Parameter-Efficient Tuning of Transformer Models for Anglicism Detection and Substitution in Russian

This article is devoted to the problem of Anglicisms in texts in Russian: the tasks of detection and automatic rewriting of the text with the substitution of Anglicisms by their Russian-language equivalents. Within the framework of the study, we present a parallel corpus of Anglicisms and models that identify Anglicisms in the text and replace them with the Russian equivalent, preserving the stylistics of the original text.

Lyashevskaya O., Afanasev I., Rebrikov S., Shishkina Y., Suleymanova E., Trofimov I., Vlasova N.

Disambiguation in context in the Russian National Corpus: 20 years later

An updated annotation of the Main, Media, and some other corpora of the Russian National Corpus (RNC) features the part-of-speech and other morphological information, lemmas, dependency structures, and constituency types. Transformer-based architectures are used to resolve the homonymy in context according to a schema based on the manually disambiguated subcorpus of the Main corpus (morphology and lexicon) and UD-SynTagRus (syntax). The paper discusses the challenges in applying the models to texts of different registers, orthographies, and time periods, on the one hand, and making the new version convenient for users accustomed to the old search practices, on the other. The re-annotated corpus data form the basis for the enhancement of the RNC tools such as word and n-gram frequency lists, collocations, corpus comparison, and Word at a glance.

Malkina M., Zinina A., Arinkin N., Kotov A.

Multimodal Hedges for Companion Robots: A Politeness Strategy or an Emotional Expression?

We examine the use of multimodal hedges (a politeness strategy, like saying A kind of!) by companion robots in two symmetric situations: (a) the user makes a mistake and the robot affects the user’s social face by indicating this mistake, (b) the robot makes a mistake, loses its social face and may compensate for it with a hedge. Within our first hypothesis we test the politeness theory, applied to robots: the robot with hedges should be perceived as more polite, and the threat to its social face should be reduced. Within our second hypothesis we test the assumption that multimodal hedges, as the expression (or simulation) of internal confusion, may make the robot more emotional and attractive. In our first experiment two robots assisted users in language learning and indicated their mistakes by saying Incorrect! The first robot used hedges in speech and gestures, while the second robot used gestures supporting the negation. In our second experiment two robots answered university exam questions and made minor mistakes. The first robot used hedges, while the second robot used an addressive strategy in speech and gestures, e.g., moved its hand to the user and said That’s it! We have discovered that the use of hedges as a politeness strategy in both situations makes the robot comfortable to communicate with. But the robot with hedges looks more polite only in the experiment where it affects the user’s social face, and not when the robot makes mistakes. However, the usage of hedges as an emotional cue works in both cases: the robot with hedges seems to be cute and sympathy-provoking both when it attacks the user’s social face and when it loses its own social face. This spectrum of hedge usage can demonstrate its transition from an expressive cue of a negative emotion (nervousness) to a marker of the speaker’s friendliness and competence.

Martynov N., Baushenko M., Abramov A., Fenogenova A.

Augmentation methods for spelling corruptions

The problem of automatic spelling correction is vital to applications such as search engines, chatbots, spellchecking in browsers and text editors. The investigation of spell-checking problems can be divided into several parts: error detection, emulation of the error distribution on the new data for model training, and automatic spelling correction. As the data augmentation technique, the adversarial training via error distribution emulation increases a model’s generalization capabilities; it can address many other challenges: from overcoming a limited amount of training data to regularizing the training objectives of the models. In this work, we propose a novel multi-domain dataset for spelling correction. On this basis, we provide a comparative study of augmentation methods that can be used to emulate the automatic error distribution. We also compare the distribution of the single-domain dataset with the errors from the multi-domain and present a tool that can emulate human misspellings.

Mikhaylovskiy N., Churilov I.

Autocorrelations Decay in Texts and Applicability Limits of Language Models

We show that the laws of autocorrelation decay in texts are closely related to the applicability limits of language models. Using distributional semantics, we empirically demonstrate that autocorrelations of words in texts decay according to a power law. We show that distributional semantics provides coherent autocorrelation decay exponents for texts translated into multiple languages. The autocorrelation decay in generated texts is quantitatively and often qualitatively different from that in literary texts. We conclude that language models exhibiting Markovian behavior, including large autoregressive language models, may have limitations when applied to long texts, whether for analysis or generation.
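
As a rough illustration of what a power-law decay exponent means, the sketch below fits C(d) ~ d^(-alpha) by least squares in log-log space. The function name, the lag values and the synthetic correlations are assumptions made for this example; they are not the authors' procedure or data.

```python
import math


def fit_power_law_exponent(lags, correlations):
    """Return alpha for C(d) ~ d ** -alpha via a log-log least-squares fit."""
    xs = [math.log(d) for d in lags]
    ys = [math.log(c) for c in correlations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return -slope  # the slope is negative for a decaying curve


# Synthetic illustration only: an exact power law with exponent 0.7.
lags = [1, 2, 4, 8, 16, 32]
correlations = [0.5 * d ** -0.7 for d in lags]
alpha = fit_power_law_exponent(lags, correlations)
```

On real text the autocorrelation values are noisy, so the fitted exponent would depend on the range of lags chosen for the fit.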

Moloshnikov I., Skorokhodov M., Naumov A., Rybka R., Sboev A.

Named Entity-Oriented Sentiment Analysis with text2text Generation Approach

This paper describes methods for sentiment analysis targeted toward named entities in Russian news texts. These methods are proposed as a solution for the Dialogue Evaluation 2023 competition in the RuSentNE shared task. This article presents two types of neural network models for multi-class classification. The first model is a recurrent neural network model with an attention mechanism and word vector representations extracted from language models. The second model is a neural network model for text2text generation. High accuracy is demonstrated by the generative model fine-tuned on the competition dataset and the open CABSAR dataset. The proposed solution achieves a macro-F1 of 59.33 over two sentiment classes and 68.71 for three-class classification.

Nikolaeva Y.

“Pears are big green”: gestures with concrete objects

The paper examines hand gestures when referring to inanimate referents. The aim of the study was to explore which factors determine the features of a gesture within the framework of modes of representation. Four main types of modes of representation were considered: drawing or shaping the form of the referent, acting, pointing, and presentation (PUOH); in addition, a new category of beat gestures was added. As a result, it was shown that communicative dynamism or other referent characteristics, such as control of the object or its inferability from the previous context, do not fully determine the use of gestures with the referent. As an alternative hypothesis, we propose a notion of gesture information hierarchy, where discursive factors, such as previous mentions of the referent and the introduction or change of the protagonist, along with the way an object is used, determine the form of the gesture.

Orlov A., Butenko Z., Demidova D., Starchenko V., Rakhilina E., Lyashevskaya O.

Russian Constructicon 2.0: New Features and New Perspectives of the Biggest Constructicon Ever Built

Russian Constructicon is an open-access linguistic database containing detailed descriptions of over 3,800 Russian grammatical constructions. In this paper we present a new, enlarged and updated version of Russian Constructicon (RusCxn) as well as new trajectories of development which were opened for the resource after the update. Since its first release, RusCxn has undergone many significant changes. Our team has expanded the number of constructions present in the database 1.5 times, introduced new meta-information features such as glosses, significantly reworked the architecture and the design of Russian Constructicon’s website, and improved the search facilities. The above-mentioned changes not only make RusCxn more attractive and convenient to use, but can also greatly facilitate typological research in the field of Construction Grammar and improve the mapping between constructicography-oriented resources for different languages.

Ostyakova L., Petukhova K., Smilga V., Zharikova D.

Linguistic Annotation Generation with ChatGPT: a Synthetic Dataset of Speech Functions for Discourse Annotation of Casual Conversations

This paper is devoted to examining the hierarchical and multilayered taxonomy of Speech Functions, encompassing pragmatics, turn-taking, feedback, and topic switching in open-domain conversations. To evaluate the distinctiveness of closely related pragmatic classes, we conducted comparative analyses involving both expert annotators and crowdsourcing workers. We then carried out classification experiments on a manually annotated dataset and a synthetic dataset generated using ChatGPT. We looked into the viability of using ChatGPT to produce data for such complex topics as discourse. Our findings contribute to the field of prompt engineering techniques for linguistic annotation in large language models, offering valuable insights for the development of more sophisticated dialogue systems.

Panysheva D.

Poly-predication in informal monological discourse (according to «What I saw» corpus)

The article discusses the relationship between the mode of discourse and quantitative metrics of poly-predication. Based on the material of the corpus "What I Saw", oral and written versions of stories are compared according to the relative frequency of polypredicative constructions and the representation of certain types of polypredication. The features of semantics and grammatical labeling of such structures are also described. Using the nonparametric Wilcoxon test, we show that the difference in the density of poly-predication between the oral and written parts of the corpus is not statistically significant.

Pekelis O.

Russian additive markers takže and tože: a synchronic and diachronic perspective

It is well known that Russian additive markers takže and tože differ in terms of information structure: the scope of takže is focus, while the scope of tože is topic. Based on data of several corpora of Russian, this paper shows that in modern Russian, takže and tože are opposed on other language levels as well, namely syntactically (in terms of word order), lexically (a variant of takže that is synonymous with tože including at the level of the information structure, is going out of use), stylistically and as far as their involvement in grammaticalization processes is concerned (takže but not tože developed into a coordinate conjunction and a discourse marker). However, as evidenced by Russian National Corpus data, most of these contrasts were absent or less pronounced in the Russian language of the 18th-19th centuries. Thus, in the last two centuries takže and tože evolved toward their consistent differentiation.

Petrova M., Ivoylova A., Bayuk I., Dyachkova D., Michurina M.

The CoBaLD Annotation Project: the Creation and Application of the full Morpho-Syntactic and Semantic Markup Standard

The current paper is devoted to the Compreno-Based Linguistic Data (CoBaLD) Annotation Project aimed at creating text corpora annotated with full morphological, syntactic and semantic markup. The first task of the project is to suggest a standard for the full universal markup which would include both morphosyntactic and semantic patterns. To solve this problem, one needs a markup model which includes all necessary markup levels and presents the markup in a format convenient for users. The latter implies not only the fullness of the markup, but also its structural simplicity and homogeneity. As a base for the markup, we have chosen the simplified version of the Compreno model, and as the data presentation format, we have taken Universal Dependencies. At the second stage of the project, a Russian corpus of 400 thousand tokens (CoBaLD-Rus) has been created, which is annotated according to the given standard. The third stage is devoted to the testing of the new format. For this purpose, we have held the SEMarkup Shared Task aimed at creating parsers which would produce full morpho-syntactic and semantic markup. Within this task, we have developed a neural network-based parser trained on our dataset, which allows one to annotate new texts according to the CoBaLD standard. Our further plans are to create fully annotated corpora for other languages and to carry out experiments on transferring the current markup to other languages.

Podberezko P., Kaznacheev A., Abdullayeva S., Kabaev A.

HAlf-MAsked Model for Named Entity Sentiment analysis

Named Entity Sentiment analysis (NESA) is one of the most actively developing application domains in Natural Language Processing (NLP). Social media NESA is a significant field of opinion analysis since detecting and tracking sentiment trends in the news flow is crucial for building various analytical systems and monitoring the media image of specific people or companies. In this paper, we study different transformer-based solutions for NESA in the RuSentNE-23 evaluation. Despite the effectiveness of BERT-like models, they can still struggle with certain challenges, such as overfitting, which appeared to be the main obstacle to achieving high accuracy on the RuSentNE-23 data. We present several approaches to overcome this problem, among them a novel technique: an additional pass over the given data with the entity masked before making the final prediction, so that we can combine logits from the model when it knows the exact entity it predicts sentiment for and when it does not. Utilizing this technique, we ensemble multiple BERT-like models trained on different subsets of data to improve overall performance. Our proposed model achieves the best result on the RuSentNE-23 evaluation data and demonstrates improved consistency in entity-level sentiment analysis.
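
The idea of combining logits from an entity-visible pass and an entity-masked pass can be sketched as follows. The abstract does not say how the logits are combined, so the weighted average used here, the `weight` parameter, the label order and the logit values are all assumptions made for illustration only.

```python
import math

LABELS = ["negative", "neutral", "positive"]  # assumed label order


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_passes(logits_visible, logits_masked, weight=0.5):
    """Weighted average of logits from the two passes (an assumed rule)."""
    return [
        weight * v + (1.0 - weight) * m
        for v, m in zip(logits_visible, logits_masked)
    ]


# Hypothetical logits for one sentence, not real model output.
visible = [2.0, 0.5, 0.1]  # pass where the entity is visible
masked = [0.2, 1.0, 0.3]   # pass where the entity is replaced by a mask token
combined = combine_passes(visible, masked)
probs = softmax(combined)
prediction = LABELS[probs.index(max(probs))]
```

The same function could be applied to each model of an ensemble before averaging across models, although the abstract does not specify that order of operations.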

Podlesskaya V.

Prosodic portrait of the Russian connector PRICHOM in the mirror of the multimedia corpus

Based on data from the multimedia subcorpus of the Russian National Corpus, the paper addresses prosodic features of discourse fragments introduced by the connector prichom ‘and besides’. The data of instrumental and perceptual analysis show that the fragment with prichom has communicative-prosodic autonomy: firstly, it has an internal thematic structure with an obligatory rheme and an optional theme; and secondly, there is a prosodic break before this fragment. The autonomy of the fragment introduced by prichom is preserved in a variety of contexts: (i) both in cases where this fragment is a complete clause and when it is a fragmented clause; (ii) both in those cases when the previous fragment is prosodically realized as final (projecting no continuation), and when it is realized as non-final (projecting continuation); (iii) both in those cases when the fragment introduced by prichom is an element of the main narrative chain, and when it is inserted parenthetically inside another fragment. In addition to the above, a fragment with prichom can form a separate turn in the conversation. Thus, the detected prosodic features of the fragment with prichom make it possible to objectify the idea earlier expressed in the literature (Kiselyova 1971, Vinogradov 1984, Inkova 2018, inter alia): that structures with prichom are built in two "communicative steps", or that they are used to express "concomitance established at the level of speech acts". Clauses connected by the relationship of syntactic subordination quite often lose their prosodic autonomy (Podlesskaya 2014 a, b), and vice versa, clauses in coordinated constructions tend to retain prosodic autonomy. Therefore, the prosodic autonomy of the components of the construction with prichom, retained in various contexts, speaks in favor of its coordinated status, while a number of syntactic tests proper speak of the opposite.

Potyashin I., Kaprielova M., Chekhovich Y., Kildyakov A., Seil T., Finogeev E., Grabovoy A.

HWR200: New open access dataset of handwritten texts images in Russian

Handwritten text image datasets are highly useful for solving many problems using machine learning. Such problems include recognition of handwritten characters and handwriting, visual question answering, near-duplicate detection, search for text reuse in handwriting and many auxiliary tasks: highlighting lines, words, and other objects in the text. The paper presents a new dataset of handwritten text images in Russian created by 200 writers with different handwriting and photographed in different environments. We describe the procedure for creating this dataset and the requirements that were set for the texts and photos. The experiments with the baseline solution on fraud search and text reuse search problems showed results of 60% and 83% recall respectively and 5% and 2% false positive rate respectively on the dataset.

Sidorova E., Akhmadeeva I., Kononenko I., Chagina P.

The role of Indicators in Argumentative Relation Prediction

The article presents a comparative study of methods for argumentative relation prediction based on a neural network approach. The distinctive feature of the study is the use of argumentative indicators in the preparation of the training sample. The indicators are generated based on the discourse marker dictionary. The experiments were carried out using an annotated corpus of scientific and popular science texts, including 162 articles available on the ArgNetBank Studio web platform. The set of all argumentative relations is described by internal connections of arguments and includes the conclusion and the premise. In the first stage of training set construction, fragments of text that included two consecutive sentences were examined. In the second stage, indicators were retrieved from the corpus texts and, for each indicator, statements presumably corresponding to the premise and conclusion of the argument were extracted. In total, 4.2 thousand indicator-based training contexts and 13.6 thousand pairs of sentences were obtained from the corpus with annotation of the presence of an argumentative relation. Based on this training sample, four classifiers were built: without indicators, with indicators marked in sentences using tags, with segmentation of text based on indicators, and with both segmentation and tags. The results of the experiments on argumentative relation prediction are presented.

Shmelev A.

Is it possible to make the Russian punctuation rules more explicit?

This paper deals with some issues related to the Russian punctuation rules and their account in computer checkers and correctors (both “analytic” and “synthetic”). It also discusses variation of punctuation. The paper offers a critical assessment of reference books devoted to punctuation and makes special reference to certain verbs of propositional attitude and their parenthetical use (in particular, dumat’ ‘to think,’ videt’ ‘to see,’ and slyshat’ ‘to hear’). It claims that the inherent characteristics of the verbs under consideration influence the punctuation, and therefore every verb deserves a detailed description (lexicographic portrait). In particular, videt’ and slyshat’ behave quite differently when used as parenthetical verbs. A step towards making the punctuation rules more explicit may consist in providing an index of words mentioned in the rules together with a subject index.

Surkov V., Evseev D.

Text VQA with Token Classification of Recognized Text and Rule-Based Numerical Reasoning

In this paper, we describe a question answering system on document images which is capable of numerical reasoning over extracted structured data. The system performs optical character recognition, detection of key attributes in text, generation of a numerical reasoning program, and its execution with the values of key attributes as operands. OCR includes the steps of bounding box detection and recognition of text from bounding boxes. The extraction of key attributes, such as quantity and price of goods, total, etc., is based on the BERT token classification model. For expression generation we investigated the rule-based approach and the T5-base model and found that T5 is capable of generalization to expression types unseen in the training set. The proposed architecture of the question answering system utilizes the structure of independent blocks, each of which can be enhanced or replaced while keeping other components unchanged. The proposed model was evaluated in the Receipt-AVQA competition and on the FUNSD dataset.

Sanochkin L., Bolshina A., Cheloshkina K., Galimzianova D., Malafeev A.

Simple Yet Effective Named Entity Oriented Sentiment Analysis

Sentiment analysis, i.e. the automatic evaluation of the emotional tone of a text, is a common task in natural language processing. Entity-Oriented Sentiment Analysis (EOSA) predicts the sentiment of entities mentioned in a given text. In this paper, we focus on the EOSA task for Russian news. We propose a text classification pipeline to solve this task and show its potential. Moreover, in general, EOSA implies labeling both named entities and their sentiment, which can require a lot of annotator labour and time and, thus, presents a major obstacle to the development of a production-ready EOSA system. To help alleviate this, we analyse the potential of applying an active learning approach to EOSA tasks. We demonstrate that by actively selecting instances for labeling in EOSA, the annotation effort required for training machine learning models can be significantly reduced.

Tatevosov S., Kisseleva X.

Scalar structure for polu- ‘half’

This paper explores restrictions on the distribution of polu- ‘half’ in combination with adjectival stems in Russian. Relying on the literature on degree semantics, we analyze polu- as a degree modifier that specifies the degree to which the adjective maps an individual as ½ of the maximal degree. This correctly predicts that polu- can only combine with upper closed scales. We argue that unlike half in English, polu- does not require the scale to be lower closed.

Tikhonova M., Fenogenova A.

Text simplification as a controlled text style transfer task

The task of text simplification is to reduce the complexity of the given piece of text while preserving its original meaning to improve readability and understanding. In this paper, we consider the simplification task as a subfield of the general text style transfer problem and apply methods of controllable text style to rewrite texts in a simpler manner preserving their meaning. Namely, we use a paraphrase model guided by another style-conditional language model. In our work, we perform a series of experiments and compare this approach with the standard fine-tuning of an autoregressive model.

Uryson E.

An attempt to determine a preposition and delimit the class of derived prepositions in Russian

The object of the paper is Russian words traditionally described as derived prepositions. The problem is that there is no formal definition of preposition in theoretical or applied linguistics. Non-derivative, or primitive, prepositions are given in grammars as a closed list, so strictly speaking there is no need to define this class of words. However, we must have criteria for determining derived prepositions. I suggest a set of necessary conditions that a preposition must satisfy. I demonstrate that so-called adverbial prepositions in Russian do not satisfy them and should be described as adverbs. Similarly, some Russian verbal prepositions and some Russian denominative prepositions should not be described as prepositions.

Veselov A., Eremeev M., Vorontsov K.

Estimating cognitive text complexity with aggregation of quantile-based models

In this paper, we introduce a novel approach to estimating the cognitive complexity of a text at different levels of language: phonetic, morphemic, lexical, and syntactic. The proposed method detects tokens with an abnormal frequency of complexity scores. The frequencies are taken from the empirical distributions calculated over the reference corpus of texts. We use the Russian Wikipedia for this purpose. Ensemble models are combined from individual models from different language levels. We created datasets of pairs of text fragments taken from social studies textbooks of different grades to train the ensembles. Empirical evidence shows that the proposed approach outperforms existing methods, such as readability indices, in estimating text complexity in terms of accuracy. The purpose of this study is to create one of the important components of the system of recommendation of scientific and educational content.

Vychegzhanin S., Kotelnikova A., Sergeev A., Kotelnikov E.

MaxProb: Controllable Story Generation from Storyline

Controllable story generation towards keywords or key phrases is one of the purposes of using language models. Recent work has shown that various decoding strategies are effective in achieving a high level of language control. Such strategies require fewer computational resources than approaches based on fine-tuning pre-trained language models. The paper proposes and investigates MaxProb, a method of controllable story generation in Russian that works at the decoding stage of text generation. The method uses a generative language model to estimate the probability of its tokens in order to shift the content of the text towards the guide phrase. The idea of the method is to generate a set of different short sequences of tokens from the language model vocabulary, estimate the probability of the guide phrase following each sequence, and choose the most probable sequence. The method allows evaluating the consistency of the token sequence for the transition from the prompt to the guide phrase. The study was carried out using a Russian-language corpus of stories with extracted events that make up the plot of the story. Experiments have shown the effectiveness of the proposed method for automatically creating stories from a set of plot phrases.
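
The selection step described above (generate short candidate sequences, score how likely the guide phrase is to follow each one, keep the best) can be sketched with a toy bigram table standing in for a generative language model. The table, its probabilities and all function names are hypothetical; this is not the authors' implementation, only a minimal instance of the idea.

```python
import math

# Toy bigram "language model": P(next token | previous token).
# All tokens and probabilities are invented for illustration.
BIGRAM = {
    "once": {"the": 0.5, "a": 0.5},
    "the": {"old": 0.5, "young": 0.5},
    "a": {"brave": 0.5, "tiny": 0.5},
    "old": {"dragon": 0.8, "knight": 0.2},
    "young": {"dragon": 0.1, "knight": 0.9},
    "brave": {"dragon": 0.3, "knight": 0.7},
    "tiny": {"dragon": 0.5, "knight": 0.5},
    "dragon": {"roared": 0.9, "slept": 0.1},
    "knight": {"roared": 0.1, "slept": 0.9},
}


def continuations(last_token, length):
    """All token sequences of the given length the toy model can produce."""
    if length == 0:
        return [()]
    result = []
    for token in BIGRAM.get(last_token, {}):
        for rest in continuations(token, length - 1):
            result.append((token,) + rest)
    return result


def guide_log_prob(last_token, guide):
    """Log-probability that the guide phrase follows the given token."""
    total = 0.0
    for token in guide:
        prob = BIGRAM.get(last_token, {}).get(token, 0.0)
        if prob == 0.0:
            return float("-inf")
        total += math.log(prob)
        last_token = token
    return total


def choose_sequence(prompt, guide, length=2):
    """Pick the candidate after which the guide phrase is most probable."""
    candidates = continuations(prompt[-1], length)
    return max(candidates, key=lambda seq: guide_log_prob(seq[-1], guide))


best = choose_sequence(["once"], ["dragon", "roared"])
```

With a real model the candidates would be sampled rather than enumerated, and the score would condition on the whole prompt plus candidate rather than on the last token alone.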

Yanko T.

The prosody of the Russian question

The analysis of Russian interrogative prosody is based on a model of a question as consisting of two components: the illocutionary proper component and the illocutionary improper component. The illocutionary improper component includes the data for information retrieval. The illocutionary proper component can be formed either by segmental means of expression (by an interrogative word or a particle) or solely by prosody (as in Russian yes-no questions). The prosody of Russian questions having interrogative words or the interrogative particle li is highly variable, whereas the prosody of Russian yes-no questions expressed by prosody is stable. The latter is the Russian rising accent, which has a rise on the tonic syllable of the accent-bearer followed by a fall on the post-tonics, if any. The illocutionary improper component can be located sentence-initially and carry a specific falling accent (namely, a late fall). A specific type of question with the interrogative proper component omitted is recognized. Such questions carry a late fall or a falling-rising accent on the accent-bearer. The analysis is exemplified by the frequency tracings of sound sentences taken from the Russian National Corpus and other open sources. As the instrument for verifying the acoustic data, we used the computer system Praat. The paper is illustrated throughout with pitch contours of sound records.

Zalizniak A., Dobrovol’skij D.

Parallel corpus as a tool for semantic analysis: The Russian discourse marker stalo byt' (‘consequently’)

The article examines the semantics of the Russian discourse marker stalo byt’, using the data obtained by analyzing translational correspondences extracted from parallel corpora of the Russian National Corpus (RNC). Typically, this discourse marker is an indicator of inferential evidentiality, by which the speaker marks the fact that the given statement is a conclusion made by the speaker on the basis of the information they received and accepted as true by default. In addition, stalo byt’ has two secondary types of usage – “rhetorical” and “narrative” – where the basic semantics of this discourse marker is subject to certain modifications. One of the key points of analysis is the reconstruction of semantic mechanisms providing the actual semantics of stalo byt’.

Zimmerling A.

Russian Predicatives and Frequency Metrics

This paper introduces five metrics for measuring the frequencies of dative predicatives in Russian. A dative predicative is a word or multiword expression licensing the dative-predicative structure, where the semantic subject of the non-agreeing non-verbal predicate is marked by the dative case. I measure the frequencies of the predicatives in the contact position <-1;1> with the same-clause dative subject pronouns in 1Sg (m-metric) and 3Sg (e-metric). The m-metric is applied for retrieving a list of dative predicatives from a corpus. I argue that for each large text collection there is a minimal m-value confirming that an item belongs to the core of the dative-predicative structure. The m/e score is the third metric; it shows whether an element is oriented towards use in the 1st person or not. Based on the m-metric, I retrieved 3 lists of predicatives in the subcorpus of 2000–2021 texts included in the Russian National Corpus. The A list includes 87 items with m ≥ 10, the B list includes 44 items with m ≥ 50, the C list includes 24 items with m ≥ 100. 72-79% of items in each list have an m/e value ≥ 1.25. A linguistic interpretation of this result is that for each list of dative predicatives it is true that the majority of its elements are autoreferential expressions oriented towards use in the 1st person present indicative in direct speech. The fourth metric shows the total number of occurrences of a word or multiword expression in the corpus (N). I argue that the N score must be measured before POS tagging and lemmatization. The fifth and last metric is the m/N score. The RNC data suggest an inverse correlation between the score of an item in the context specific for dative-predicative structures (m) and its overall frequency in the corpus (N). This effect is explained by the regular homonymy of high-frequency predicatives with high-frequency adverbials and parenthetical expressions.
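
A minimal sketch of how the five scores relate to one another, given raw counts for an item. The item names and counts are invented, and treating the thresholds (m of 10 and m/e of 1.25) as lower bounds is an assumption of this example rather than something the abstract states outright.

```python
def predicative_scores(m, e, n_total):
    """Derive the ratio scores from raw counts.

    m: occurrences next to a 1Sg dative subject pronoun
    e: occurrences next to a 3Sg dative subject pronoun
    n_total: total occurrences of the item in the corpus (N)
    """
    return {
        "m": m,
        "e": e,
        "N": n_total,
        "m/e": m / e if e else float("inf"),
        "m/N": m / n_total,
    }


# Hypothetical counts, not taken from the Russian National Corpus.
counts = {
    "item_a": (120, 60, 4000),
    "item_b": (12, 30, 100000),
    "item_c": (4, 1, 900),
}

scores = {name: predicative_scores(*c) for name, c in counts.items()}

# Assumed lower-bound thresholds for list membership and 1st-person orientation.
list_a = sorted(name for name, s in scores.items() if s["m"] >= 10)
first_person = sorted(name for name in list_a if scores[name]["m/e"] >= 1.25)
```

In this toy data the item with the largest N has the smallest m/N, which is the kind of inverse relation between m and N that the abstract reports for the corpus.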
