In this work, we introduce a new challenging Document VQA dataset, named Receipt AVQA, and present the
results of the associated RECEIPT-AVQA-2023 shared task. Receipt AVQA is comprised of 21835 questions in
English over 1957 receipt images. The receipts contain a lot of numbers, which means discrete reasoning capability
is required to answer the questions. The associated shared task has attracted 4 teams that have managed to beat an
extractive VQA baseline in the final phase of the competition. We hope that the published dataset and promising
results of the contestants will inspire further research on understanding documents in scenarios that require discrete
The SemOntoCor project focuses on creating a semantic corpus of Russian based on linguistic and ontological
resources. It is a satellite project with regard to a semantic parser (SemETAP) being developed, the latter aiming at
producing semantic structures and drawing various types of inferences. SemETAP is used to annotate SemOntoCor
in a semi-automatic mode, whereupon SemOntoCor, when reaching sufficient maturity, will help create new parsers
and other semantic applications. SemOntoCor can be viewed as a further step in the development of SynTagRus with
its several layers of annotation. SemOntoCor builds on top of the morpho-syntactic annotation of SynTagRus and
assigns each sentence a Basic Semantic Structure (BSemS). BSemS represents the direct layer of meaning of the
sentence in terms of ontological concepts and semantic relations between them. It abstracts away from lexico-syntactic variation and in many cases decomposes lexical meanings into smaller elements. The first phase of SemOntoCor
consists in annotating a Russian translation of the novel “The Little Prince” by Antoine de Saint-Exupery (1532 sentences, 13120 tokens).
Coreference resolution is an important task in natural language processing, since it can be applied to such vital
tasks as information retrieval, text summarization, question answering, sentiment analysis and machine translation.
In this paper, we present a study on the effectiveness of several approaches to coreference resolution, focusing on the
RuCoCo dataset as well as results of participation in the Dialogue Evaluation 2023. We explore ways to increase the
dataset size by using pseudo-labelling and data translated from another language. Using such techniques, we managed
to triple the size of the dataset, make it more diverse, and improve the performance of autoregressive structured prediction
(ASP) on the coreference resolution task. This approach allowed us to achieve the best results on the RuCoCo private test set,
with an increase in F1-score of 1.8, Precision of 0.5 and Recall of 3.0 points compared to the second-best leaderboard
score. Our results demonstrate the potential of the ASP model and the importance of utilizing diverse training data for
Coreference resolution is the task of identifying and grouping mentions referring to the same real-world entity.
Previous neural models have mainly focused on learning span representations and pairwise scores for coreference
decisions. However, current methods do not explicitly capture the referential choice in the hierarchical discourse, an
important factor in coreference resolution. In this study, we propose a new approach that incorporates rhetorical information into neural coreference resolution models. We collect rhetorical features from automated discourse parses
and examine their impact. As a base model, we implement an end-to-end span-based coreference resolver using a
partially fine-tuned multilingual entity-aware language model LUKE. We evaluate our method on the RuCoCo-23
Shared Task for coreference resolution in Russian. Our best model employing rhetorical distance between mentions
ranked 1st on the development set (74.6% F1) and 2nd on the test set (73.3% F1) of the Shared Task. We hope
that our work will inspire further research on incorporating discourse information in neural coreference resolution
The paper aims at a comprehensive analysis of the verbs compatible with the partitive genitive object. Based on
the Dictionary of Russian Language, the list of perfective verbal lexemes that are able to take the genitive object is
compiled and semantic features that unite these verbs are revealed. The features are divided into two groups: aspectually relevant features and aspectually irrelevant features. The corpus-based analysis of the use of the verbs that take
both genitive and accusative objects makes it possible to identify features that increase the likelihood of certain object
This article describes solutions to two problems: CMU-MOSEI database preprocessing to improve data
quality and bimodal multitask classification of emotions and sentiments. With the help of experimental studies, representative features for acoustic and linguistic information are identified among pretrained neural networks with
Transformer architecture. The most representative features for the analysis of emotions and sentiments are EmotionHuBERT and RoBERTa for audio and text modalities respectively. The article establishes a baseline for bimodal
multitask recognition of sentiments and emotions – 63.2% and 61.3%, respectively, measured with macro F-score.
Experiments were conducted with different approaches to combining modalities – concatenation and multi-head attention. The most effective neural network architecture, with early concatenation of the audio and text modalities and
late multi-head attention for emotion and sentiment recognition, is proposed. The proposed neural network is combined with logistic regression, which achieves 63.5% and 61.4% macro F-score in bimodal (audio and text) multitask recognition of 3 sentiment classes and 6 binary emotion classes.
In this study, the peculiarities of the character introduction in the genre of live reportage were studied. The
participants were 25 students of Lomonosov Moscow State University. Speech production was elicited by means
of the “Pears Film” by W. Chafe. Different types of the collective common ground were considered. It turned out
that, unlike narratives of other genres, the chronological scale is more important for the introduction than the status
scale. It was also shown that the collected reportages from the point of view of the introduction peculiarities are more
similar to classical retellings than to the sports reportages.
The paper explores the role of aspect and actionality in foregrounding and backgrounding of clauses in Russian
Sign Language narratives. A corpus study shows similarities to the functions of aspectual markers and actionality in spoken
languages. Besides grammatical markers and predicate types, non-manual marking and prosodic features of the verbal
sign can contribute to clause foregrounding and backgrounding.
We extend the concept of a discourse tree (DT) in the discourse representation of text towards data of various
forms and natures. The communicative DT (which incorporates speech act theory), the extended DT (which ascends to the level of
multiple documents), and the entity DT (which tracks how discourse covers various entities) were defined previously in computational linguistics. We now proceed to the next level of abstraction and formalize the discourse of not only text and
textual documents but also various kinds of accompanying data. We call such a discourse representation Multimodal
Discourse Trees (MMDTs). The rationale is that the same rhetorical relations that hold between text fragments also hold between data values, sets and records, such as Reason, Cause, Enablement, Contrast, Temporal
sequence. MMDTs are evaluated with respect to the accuracy of recognition of criminal cases when both text and
data records are available. MMDTs are shown to contribute significantly to the recognition accuracy in cases where
just keywords and syntactic signals are insufficient for classification and discourse-level information needs to be
The rapid growth of scientific publications and the intensive emergence of new directions and approaches pose a
challenge to the scientific community to identify trends in a timely and automatic manner. We define a trend as a
semantically homogeneous theme that is characterized by a lexical kernel steadily evolving in time and a sharp, often
exponential, increase in the number of publications. In this paper, we investigate recent topic modeling approaches to
accurately extract trending topics at an early stage. In particular, we customize the standard ARTM-based approach
and propose a novel incremental training technique which helps the model to operate on data in real-time. We further
create the Artificial Intelligence Trends Dataset (AITD) that contains a collection of early-stage articles and a set of
key collocations for each trend. The conducted experiments demonstrate that the suggested ARTM-based approach
outperforms the classic PLSA, LDA models and a neural approach based on BERT representations. Our models and
dataset are open for research purposes.
The paper presents an approach to named entity oriented sentiment analysis of Russian news texts proposed
during the RuSentNE evaluation. The approach is based on RuRoBERTa-large, a pre-trained RoBERTa model for
Russian. We compared several types of entity representation in the input text, and evaluated strategies for handling
class imbalance and resampling entity tags in the training set. We demonstrated that some strategies improve the
results of pre-trained models obtained on the dataset presented by the organizers of the evaluation.
The paper explores argument generation in Russian based on given aspects. An aspect refers to one of the
sides or properties of the target object. Five aspects were considered: "Safety", "Impact on health", "Reliability",
"Money", "Convenience and comfort". Various approaches were used for aspect-based generation: fine-tuning,
prompt-tuning and few-shot learning. The ruGPT-3Large model was used for experiments. The results show that
a traditionally trained model (with fine-tuning) generates 51.6% of the arguments on the given aspects, the prompt-tuning approach yields 33.9%, and few-shot learning yields 10.6%. The model also demonstrated the ability to generate
arguments on new, previously unknown aspects.
The paper describes the RuSentNE-2023 evaluation devoted to targeted sentiment analysis in Russian news
texts. The task is to predict sentiment towards a named entity in a single sentence. The dataset for RuSentNE-2023
evaluation is based on the Russian news corpus RuSentNE having rich sentiment-related annotation. The corpus
is annotated with named entities and sentiments towards these entities, along with related effects and emotional
states. The evaluation was organized using the CodaLab competition framework. The main evaluation measure
was the macro-averaged F-measure over the positive and negative classes. The best result achieved was 66% macro F-measure (Positive+Negative classes). We also tested ChatGPT on the test set from our evaluation and found that
the zero-shot answers provided by ChatGPT reached an F-measure of 60%, which corresponds to 4th place in the
evaluation and can be considered quite high
for a zero-shot application. ChatGPT also provided detailed explanations of its conclusions.
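For illustration, a macro-averaged F-measure restricted to the positive and negative classes can be computed roughly as follows. This is a minimal sketch and not the organizers' scoring script: the label strings, the presence of a "neutral" class in the data, and the function names `f1_for_class` and `macro_f1_pn` are assumptions made for the example.

```python
from typing import Sequence


def f1_for_class(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 of a single class, treating that class as the 'positive' outcome."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1_pn(gold: Sequence[str], pred: Sequence[str],
                labels: Sequence[str] = ("positive", "negative")) -> float:
    """Unweighted mean of per-class F1 over the listed labels only.

    Other labels (here: 'neutral', an assumed label name) still influence
    the score through false positives and false negatives, but get no
    F1 term of their own.
    """
    return sum(f1_for_class(gold, pred, lab) for lab in labels) / len(labels)


# Toy data, invented for the example.
gold = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
pred = ["positive", "neutral", "neutral", "negative", "negative", "positive"]
print(macro_f1_pn(gold, pred))  # 0.5 for this toy data
```

Averaging only over the two polar classes keeps a system from scoring well simply by predicting the neutral class everywhere.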
The paper reports the results of the critical evaluation of the quantitative approach to the distinction between
inflection and word formation through the analysis of the trends in the frequency of word forms. The possibility of
such analysis is provided by voluminous corpus data and tools for visualizing these trends. Both the theoretical foundations of the proposed approach and the results of a pilot study applying it to Russian aspectual triplets were
considered. These cast doubt on the validity of distinguishing between inflection and word formation based on the
trends in the frequency of word forms as a reliable tool used to reveal the unity or difference of lexical semantics and
thus to define textual units as belonging to the same or different language units.
The paper describes a semi-automatic method for the detection of typologically relevant
semantic shifts in the world’s languages. The algorithm extracts colexified pairs of meanings from polysemous
words in digitised bilingual dictionaries. A machine learning classifier helps to separate those semantic shifts that
are relevant to lexical typology. Clustering is applied to group similar pairs of meanings into semantic shifts.
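The extraction step can be pictured with a small sketch. It only shows how colexified meaning pairs could be counted once dictionary entries have been parsed; it is not the paper's algorithm, and the entry format, the toy data and the function name `colexified_pairs` are all assumptions for illustration. The classifier and the clustering step are not covered.

```python
from collections import Counter
from itertools import combinations


def colexified_pairs(entries):
    """Count unordered pairs of meanings that share a single word.

    `entries` is an iterable of (word, meanings) tuples, a hypothetical
    simplification of a parsed bilingual dictionary entry.
    """
    counts = Counter()
    for _word, meanings in entries:
        # Normalize and deduplicate so each pair is counted once per word.
        unique = sorted({m.strip().lower() for m in meanings if m.strip()})
        for pair in combinations(unique, 2):
            counts[pair] += 1
    return counts


# Toy entries, invented for the example.
toy_entries = [
    ("word_a", ["hand", "arm"]),
    ("word_b", ["Arm", "hand", "sleeve"]),
    ("word_c", ["moon"]),  # monosemous: contributes no pairs
]
pairs = colexified_pairs(toy_entries)
for pair, n in pairs.most_common():
    print(pair, n)
```

Pairs attested by many unrelated words are natural candidates for further filtering and grouping.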
The paper looks into vague reference expressed in speech and gesture distribution in expository discourse.
The research data are the monologues of 19 participants with a total length of 2 hours 38 minutes. In these monologues,
the use of vague reference (expressed in placeholders and approximators, 2528 in total) and functional
gesture types (deictic, representational, pragmatic and adaptors, 2309 in total) was explored, with the
aim of identifying the regular patterns of speech and gesture distribution and co-occurrence. The multimodal
regularities include 1) the proportional frequency of use of the four gesture types, equal to 6.8 / 14.4 / 28.7 / 50.1, which
reflects the overall distribution of co-speech gesture in expository discourse, 2) the significant difference in co-speech
gesture use with placeholders and approximators which manifests itself in the use of three gesture types, adaptors,
representational and pragmatic gestures, 3) the individually maintained significant difference in co-speech gesture
use with placeholders and approximators which manifests itself in adaptors. These regularities can serve as predictors
for identifying the specifics of vague reference in multimodal expository discourse.
Text complexity prediction is a well-studied task. Predicting complexity at the sentence level, however, has attracted less research interest for Russian. One possible application of sentence-level complexity prediction is more precise and
fine-grained modeling of text complexity. In the paper we present a novel dataset with sentence-level annotation
of complexity. The dataset is open and contains 1,200 Russian sentences extracted from the SynTagRus treebank.
Annotations were collected via the Yandex Toloka platform using a 7-point scale. The paper presents various linguistic
features that can contribute to sentence complexity as well as a baseline linear model.
Linguistic markup is an important NLP task. Currently, there are several popular markup formats
(Universal Dependencies, Prague Dependencies, and so on), which are mostly focused on morphology and syntax.
Full semantic markup can be found in the ABBYY Compreno model. However, the structure of the format differs
significantly from the models mentioned above. In this work, we convert the Compreno markup into the UD
format, which is rather popular among NLP researchers, and enrich it with the semantic pattern.
Compreno and UD present morphology and syntax differently with respect to tokenization, POS-tagging, ellipsis, coordination, and some other phenomena, which complicates the conversion of one format into the other. Nevertheless, the conversion allowed us to create UD markup containing not only morpho-syntactic
information but also semantic information.
We explore the knowledge transfer in the simple multi-task encoder-agnostic transformer-based
models on five dialog tasks: emotion classification, sentiment classification, toxicity classification, intent
classification, and topic classification. We show that these models’ accuracy differs from the analogous
single-task models by ∼0.9%. These results hold for the multiple transformer backbones. At the same
time, these models have the same backbone for all tasks, which allows them to have about 0.1% more
parameters than any analogous single-task model and to support multiple tasks simultaneously. We
also found that if we decrease the dataset size to a certain extent, multi-task models outperform single-task ones, especially on the smallest datasets. We also show that when training multilingual models
on the Russian data, adding the English data from the same task to the training sample can improve
model performance for the multi-task and single-task settings. The improvement can reach 4-5% if the
Russian data are scarce enough. We have integrated these models into the DeepPavlov library and into
the DREAM dialogue platform.
Topic modeling is an essential instrument for exploring and uncovering latent patterns in unstructured textual
data, allowing researchers and analysts to gain a valuable understanding of a particular domain. Nonetheless,
topic modeling lacks consensus on the matter of its evaluation. The estimation of obtained insightful topics is
complicated by several obstacles, the majority of which are summarized by the absence of a unified system of
metrics, the one-sidedness of evaluation, and the lack of generalization. Despite various approaches proposed in
the literature, there is still no consensus on the aspects of effective examination of topic quality. In this research
paper, we address this problem and propose a novel framework for evaluating topic modeling results based on the
notion of attention mechanism and Layer-wise Relevance Propagation as tools for discovering the dependencies
between text tokens. One of our proposed metrics achieved a 0.71 Pearson correlation and a 0.74 φK correlation
with human assessment. Additionally, our score variant outperforms other metrics on the challenging Amazon
Fine Food Reviews dataset, suggesting its ability to capture contextual information in shorter texts.
This paper explores accessibility effects in the gaze behavior of readers with different cognitive styles, impulsive
and reflective, as mediated by graphological and linguistic foregrounding in the discursive acts in 126 areas of interest
(AOIs). The study exploits 1890 gaze behavior probes available in the open-access multimodal corpus of oculographic
reactions MultiCORText. We identified that while graphological foregrounding makes initial or final components of
discursive act more accessible for the impulsive readers, reflective readers also observe the components within the
act. Linguistic foregrounding produces higher access for impulsive readers when the linguistic form is visually
focalized (phonological foregrounding and parallel structures); for reflective readers, by contrast, it is the information density of elliptical and one-component sentences that maintains higher access.
Communication involves an exchange of information as well as the use of linguistic means to begin, sustain,
and end conversations. Politeness is seen as one of the major language tools that facilitate smooth communication.
In English, politeness has been an area of great interest in pragmatics, with various theories and corpus annotation
approaches used to understand the relationship between politeness and social categories like power and gender, and
to build Natural Language Processing applications. In Russian linguistics, politeness research has largely focused
on lexical markers and speech strategies. This paper introduces the ongoing work on the development of the Russian
Multimedia Politeness Corpus and discusses an annotation framework for oral communicative interaction, with an
emphasis on adapting politeness theories for discourse annotation. The proposed approach lies in the identification
of frames that encompass contextual information and the selection of relevant spatial, social, and relational features
for the markup. The frames are then used to describe standard situations, which are marked by typical intentions
and politeness formulae and paraverbal markers.
The paper discusses two acceptability rating studies testing wh-interrogative and relative extractions of arguments from čto-clauses of presuppositional predicates like žalet’ ‘regret’, as contrasted with nonpresuppositional
predicates like nadejat’sja ‘hope’ and nominalized (to čto) clauses. The results show a difference in extraction
between bare and nominalized clauses but no difference between presuppositional and nonpresuppositional clauses,
raising potential doubts about the analysis of presuppositional clauses as DPs with a silent D.
The talk provides a multichannel description of how interlocutors co-construct utterances in conversation. Using
data from the “Russian Pears Chats & Stories”, I propose a tripartite sequential scheme of collaborative
constructions. When the scheme is fully realized, its first step not only includes the initial component of the
construction, but also presupposes that the first participant makes a request for a co-operative action; the final
component of the construction is provided by the second participant during the second step; while the third step
consists of the first participant’s reaction. At each step, the participants combine vocal and non-vocal resources to
achieve their goals. In some cases, non-vocal phenomena provide an essential clue to what is actually happening
during co-construction, including whether the participants act in a truly co-operative manner. I distinguish between
three types of communicative patterns that may take place during co-construction: “Requested Cooperation”,
“Unplanned Cooperation”, and “Non-realized Interaction”. The data suggest that these types can be influenced by the
way the knowledge of the discussed events is distributed among the participants.
Modern text-generative language models are rapidly developing. They produce text of high quality and are
used in many real-world applications. However, they still have several limitations, for instance, limited
context length, degeneration, lack of logical structure, and factual inconsistency. In this work, we focus on
the fact-checking problem applied to the output of the generative models on classical downstream tasks, such as
paraphrasing, summarization, text style transfer, etc. We define the task of internal fact-checking, set the criteria
for factual consistency, and present a novel dataset for this task for the Russian language. A benchmark for
internal fact-checking and several baselines are also provided. We research data augmentation approaches to extend
the training set and compare classification methods on different augmented data sets.
The task of assessing text complexity for L2 learners can be approached as either a classification or regression
problem, depending on the chosen scale. The primary bottleneck in such research lies in the limited availability of
appropriate data samples. This study presents a combined approach to create a dataset of Russian texts for L2 learners,
placed on a continuous scale of complexity, involving expert pairwise comparisons and the Elo rating system. For
this pilot dataset, 104 texts from Russian L2 textbooks, TORFL tests, and authentic sources were selected and annotated. The resulting data is useful for evaluating automated models for assessing text complexity.
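To make the rating step concrete, the sketch below shows how pairwise expert judgments ("text A is more complex than text B") could be turned into a continuous scale with the standard Elo update. It is an illustration only: the initial rating of 1500, the K-factor of 32 and the 400-point scale are conventional chess defaults, not values reported for this dataset, and the function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Elo-expected probability that item A 'wins' against item B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def rate_texts(text_ids, comparisons, initial=1500.0, k=32.0):
    """Derive complexity ratings from pairwise judgments.

    `comparisons` is a sequence of (harder, easier) pairs: the first text
    was judged more complex than the second. `initial` and `k` are assumed
    defaults, not parameters taken from the study.
    """
    ratings = {t: initial for t in text_ids}
    for harder, easier in comparisons:
        # The less expected the outcome, the larger the rating shift.
        surprise = 1.0 - expected_score(ratings[harder], ratings[easier])
        ratings[harder] += k * surprise
        ratings[easier] -= k * surprise
    return ratings


# Toy judgments, invented for the example: t3 > t2 > t1 in complexity.
ratings = rate_texts(["t1", "t2", "t3"],
                     [("t2", "t1"), ("t3", "t2"), ("t3", "t1")])
for text_id, rating in sorted(ratings.items(), key=lambda kv: kv[1]):
    print(text_id, round(rating, 1))
```

Because each update is zero-sum, the mean rating stays at the initial value, and only the relative positions of the texts carry information.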
The article deals with the problems of presenting ideologically marked words in the dictionary. It is based on the
analysis of the words that appeared in the Russian language or received new meanings during the Russian-Ukrainian
conflict. The difficulty of the lexicographic representation of such words is that their evaluative potential is mobile:
for example, offensive nicknames can be adopted by the “offended” side and become neutral words. Ideologically
marked words can either exist in the lexicon for a long time or be quickly replaced by other lexical units. Therefore,
in the interpretation of ideologically marked words, it is advisable to indicate the approximate time of their existence.
In addition to temporal indicators, the dictionary entry for such words should indicate whose word it is,
that is, on whose behalf an assessment is given to a person or event. Since we believe that explanatory dictionaries
should contain not only common names, but also proper names, the article also discusses geographical names.
This article is devoted to the problem of Anglicisms in texts in Russian: the tasks of detection and automatic
rewriting of the text, replacing Anglicisms with their Russian-language equivalents. In this study, we present a parallel corpus of Anglicisms and models that identify Anglicisms in the text and
replace them with Russian equivalents while preserving the style of the original text.
An updated annotation of the Main, Media, and some other corpora of the Russian National Corpus (RNC)
features part-of-speech and other morphological information, lemmas, dependency structures, and constituency
types. Transformer-based architectures are used to resolve the homonymy in context according to a schema based
on the manually disambiguated subcorpus of the Main corpus (morphology and lexicon) and UD-SynTagRus (syntax). The paper discusses the challenges in applying the models to texts of different registers, orthographies, and
time periods, on the one hand, and making the new version convenient for users accustomed to the old search
practices, on the other. The re-annotated corpus data form the basis for the enhancement of the RNC tools such as
word and n-gram frequency lists, collocations, corpus comparison, and Word at a glance.
We examine the use of multimodal hedges (a politeness strategy, like saying A kind of!) by companion robots in
two symmetric situations: (a) the user makes a mistake and the robot affects the user’s social face by indicating this mistake,
(b) the robot makes a mistake, loses its social face and may compensate for it with a hedge. Within our first hypothesis we
test the politeness theory, applied to robots: the robot with hedges should be perceived as more polite, and the threat to its
social face should be reduced. Within our second hypothesis we test the assumption that multimodal hedges, as the
expression (or simulation) of internal confusion, may make the robot more emotional and attractive. In our first experiment two robots assisted users in language learning and indicated their mistakes by saying Incorrect! The first
robot used hedges in speech and gestures, while the second robot used gestures, supporting the negation. In our second
experiment two robots answered university exam questions and made minor mistakes. The first robot used hedges,
while the second robot used addressive strategy in speech and gestures, e. g. moved its hand to the user and said That’s
it! We have discovered that the use of hedges as a politeness strategy in both situations makes the robot comfortable
to communicate with. Yet the robot with hedges looks more polite only in the experiment where it affects the user’s social
face, and not when the robot makes mistakes. However, the use of hedges as an emotional cue works in both cases:
the robot with hedges seems to be cute and sympathy-provoking both when it attacks the user’s social face and when it loses its
own social face. This spectrum of hedge usage can demonstrate its transition from an expressive cue of a negative
emotion (nervousness) to a marker of speaker’s friendliness and competence.
The problem of automatic spelling correction is vital to applications such as search engines, chatbots, spell-checking in browsers and text editors. The investigation of spell-checking problems can be divided into several
parts: error detection, emulation of the error distribution on the new data for model training, and automatic spelling
correction. As a data augmentation technique, adversarial training via error distribution emulation increases
a model’s generalization capabilities; it can address many other challenges: from overcoming a limited amount of
training data to regularizing the training objectives of the models. In this work, we propose a novel multi-domain
dataset for spelling correction. On this basis, we provide a comparative study of augmentation methods that can
be used to emulate the automatic error distribution. We also compare the distribution of the single-domain dataset
with the errors from the multi-domain dataset and present a tool that can emulate human misspellings.
We show that the laws of autocorrelation decay in texts are closely related to the applicability limits of language
models. Using distributional semantics, we empirically demonstrate that autocorrelations of words in texts decay according to a power law. We show that distributional semantics provides coherent autocorrelation decay exponents
for texts translated into multiple languages. The autocorrelation decay in generated texts is quantitatively and often
qualitatively different from the literary texts. We conclude that language models exhibiting Markovian behavior, including large autoregressive language models, may have limitations when applied to long texts, whether analysis or
This paper describes methods for sentiment analysis targeted toward named entities in Russian news texts.
These methods are proposed as a solution for the Dialogue Evaluation 2023 competition in the RuSentNE shared
task. This article presents two types of neural network models for multi-class classification. The first model is a recurrent neural network model with an attention mechanism and word vector representation extracted from language
models. The second model is a neural network model for text2text generation. High accuracy is demonstrated by
the generative model fine-tuned on the competition dataset and the open CABSAR dataset. The proposed solution
achieves a macro F1 of 59.33 over two sentiment classes and 68.71 for three-class classification.
The paper examines hand gestures when referring to inanimate referents. The aim of the study was to explore
which factors determine the features of a gesture within the framework of modes of representation. Four main types
of modes of representation were considered: drawing or shaping the form of the referent, acting, pointing, and presentation (PUOH); in addition, a new category of beat gestures was added.
As a result, it was shown that communicative dynamism or other referent characteristics such as control of the
object or its inferability from the previous context do not fully determine the use of gestures with the referent. As an
alternative hypothesis, we propose a notion of gesture information hierarchy, where discursive factors, such as previous mentions of the referent and the introduction or change of the protagonist, along with the way an object is used,
determine the form of the gesture.
The Russian Constructicon is an open-access linguistic database containing detailed descriptions of over 3,800 Russian grammatical constructions. In this paper we present a new, enlarged and updated version of the Russian Constructicon (RusCxn) as well as new trajectories of development which were opened for the resource after the update. Since
its first release, RusCxn has undergone many significant changes. Our team has increased the number of constructions present in the database by a factor of 1.5, introduced new meta-information features such as glosses, significantly reworked the architecture and the design of the Russian Constructicon’s website, and improved the search facilities. The
above-mentioned changes not only make RusCxn more attractive and convenient to use, but they can also greatly
facilitate typological research in the field of Construction Grammar and improve the mapping between constructicography-oriented resources for different languages.
This paper is devoted to examining the hierarchical and multilayered taxonomy of Speech Functions, encompassing pragmatics, turn-taking, feedback, and topic switching in open-domain conversations. To evaluate the
distinctiveness of closely related pragmatic classes, we conducted comparative analyses involving both expert annotators and crowdsourcing workers. We then carried out classification experiments on a manually annotated
dataset and a synthetic dataset generated using ChatGPT. We looked into the viability of using ChatGPT to produce
data for such complex topics as discourse. Our findings contribute to the field of prompt engineering techniques for
linguistic annotation in large language models, offering valuable insights for the development of more sophisticated
The article discusses the relationship between the mode of discourse and quantitative metrics of polypredication.
Based on the material of the corpus "What I Saw", oral and written versions of stories are compared according to the
relative frequency of polypredicative constructions and the representation of certain types of polypredication; the
semantic features and grammatical marking of such structures are also described. According to the nonparametric Wilcoxon
test, the difference between the density of polypredication in the oral and written parts
of the corpus is not statistically significant.
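The kind of comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which variant of the Wilcoxon test was used, so the sketch assumes a paired design (one oral and one written retelling per speaker) and the signed-rank variant with a normal approximation. The function name and the density values are hypothetical.

```python
import math

def wilcoxon_signed_rank(xs, ys):
    """Return (W, z) for paired samples, using the normal approximation.

    Assumption: xs[i] and ys[i] come from the same speaker
    (e.g. oral vs. written polypredication density).
    """
    # Zero differences carry no information about direction and are dropped.
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Normal approximation of the null distribution of W (no tie correction).
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return w, (w - mean) / sd

# Hypothetical per-speaker counts of polypredicative constructions.
oral = [1, 2, 3, 4, 5]
written = [2, 1, 5, 2, 8]
w, z = wilcoxon_signed_rank(oral, written)
significant = abs(z) > 1.96  # two-sided test at the 0.05 level
```

With samples this small, an exact table of critical values would be preferable to the normal approximation; the approximation is used here only to keep the sketch short.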
It is well known that Russian additive markers takže and tože differ in terms of information structure: the scope of
takže is focus, while the scope of tože is topic. Based on data of several corpora of Russian, this paper shows that in modern
Russian, takže and tože are opposed on other language levels as well, namely syntactically (in terms of word order), lexically
(a variant of takže that is synonymous with tože, including at the level of information structure, is going out of use),
stylistically and as far as their involvement in grammaticalization processes is concerned (takže but not tože developed into
a coordinate conjunction and a discourse marker). However, as evidenced by Russian National Corpus data, most of these
contrasts were absent or less pronounced in the Russian language of the 18th-19th centuries. Thus, in the last two centuries
takže and tože evolved toward their consistent differentiation.
The current paper is devoted to the Compreno-Based Linguistic Data (CoBaLD) Annotation Project aimed at
creating text corpora annotated with full morphological, syntactic and semantic markup. The first task of the project
is to suggest a standard for the full universal markup which would include both morphosyntactic and semantic
patterns. To solve this problem, one needs a markup model which includes all necessary markup levels and
presents the markup in a format convenient for users. The latter implies not only the fullness of the markup, but
also its structural simplicity and homogeneity. As a base for the markup, we have chosen the simplified version of
the Compreno model, and as the data presentation format, we have taken Universal Dependencies.
At the second stage of the project, a Russian corpus of 400 thousand tokens (CoBaLD-Rus) has been
created, which is annotated according to the given standard. The third stage is devoted to the testing of the new
format. For this purpose, we have held the SEMarkup Shared Task aimed at creating parsers which would produce
full morpho-syntactic and semantic markup. Within this task, we have developed a neural network-based parser
trained on our dataset, which allows one to annotate new texts according to the CoBaLD standard. Our further plans are
to create fully annotated corpora for other languages and to carry out experiments on transferring the
current markup to other languages.
Named Entity Sentiment analysis (NESA) is one of the most actively developing application domains in Natural
Language Processing (NLP). Social media NESA is a significant field of opinion analysis since detecting and
tracking sentiment trends in the news flow is crucial for building various analytical systems and monitoring the
media image of specific people or companies.
In this paper, we study different transformer-based solutions to NESA in the RuSentNE-23 evaluation. Despite
the effectiveness of BERT-like models, they can still struggle with certain challenges, such as overfitting,
which appeared to be the main obstacle to achieving high accuracy on the RuSentNE-23 data. We present several
approaches to overcoming this problem, among them a novel technique of an additional pass over the given data
with the entity masked before making the final prediction, so that we can combine logits from the model when it knows
the exact entity it predicts sentiment for and when it does not. Utilizing this technique, we ensemble multiple BERT-like models trained on different subsets of data to improve overall performance. Our proposed model achieves
the best result on the RuSentNE-23 evaluation data and demonstrates improved consistency in entity-level sentiment
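The two-pass idea described in this abstract (one pass with the entity visible, one with it masked, followed by combining the logits) might look roughly like the sketch below. It is not the authors' implementation: the equal weighting, the `[MASK]` token, the three-label set, and the `toy_model` stand-in for a fine-tuned BERT-like classifier are all assumptions made for illustration.

```python
import math

LABELS = ("negative", "neutral", "positive")  # assumed label set

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mask_entity(text, start, end, mask_token="[MASK]"):
    """Replace the entity span text[start:end] with a mask token."""
    return text[:start] + mask_token + text[end:]

def combine_logits(visible, masked, weight=0.5):
    """Weighted average of logits from the entity-visible and entity-masked passes."""
    return [weight * a + (1 - weight) * b for a, b in zip(visible, masked)]

def predict(text, start, end, model):
    """Run both passes through `model` and return the label with the highest probability."""
    visible = model(text)
    masked = model(mask_entity(text, start, end))
    probs = softmax(combine_logits(visible, masked))
    return LABELS[probs.index(max(probs))]

def toy_model(text):
    """Hypothetical stand-in for a classifier: returns one logit per label."""
    return [0.1, 0.2, 2.0] if "Acme" in text else [1.5, 0.3, 0.2]

label = predict("Acme posted record profits", 0, 4, toy_model)
```

The same averaging could be extended across several models to form an ensemble; how the logits were actually weighted is not stated in the abstract.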
Based on data from the multimedia subcorpus of the Russian National Corpus, the paper addresses prosodic
features of discourse fragments introduced by the connector prichom ‘and besides’. The data of instrumental and
perceptual analysis show that the fragment with prichom has communicative-prosodic autonomy: firstly, it has an
internal thematic structure with an obligatory rheme and an optional theme; and secondly, there is a prosodic break
before this fragment. The autonomy of the fragment introduced by prichom is preserved in a variety of contexts: (i)
both in cases where this fragment is a complete clause and when it is a fragmented clause; (ii) both in those cases
when the previous fragment is prosodically realized as final (projecting no continuation), and when it is realized as
non-final (projecting continuation); (iii) both in those cases when the fragment introduced by prichom is an element
of the main narrative chain, and when it is inserted parenthetically inside another fragment. In addition to the above,
a fragment with prichom can form a separate turn in the conversation. Thus, the detected prosodic features of the
fragment with prichom make it possible to objectify the idea earlier expressed in the literature (Kiselyova 1971,
Vinogradov 1984, Inkova 2018, inter alia): that structures with prichom are built in two "communicative steps", or
that they are used to express "concomitance established at the level of speech acts". Clauses connected by the
relationship of syntactic subordination quite often lose their prosodic autonomy (Podlesskaya 2014a, b), and vice
versa, clauses in coordinated constructions tend to retain prosodic autonomy. Therefore, the prosodic autonomy of
the components of the construction with prichom, retained in various contexts, speaks in favor of its coordinated
status, while a number of syntactic tests proper speak of the opposite.
Handwritten text image datasets are highly useful for solving many problems using machine learning. Such
problems include recognition of handwritten characters and handwriting, visual question answering, near-duplicate
detection, search for text reuse in handwriting and many auxiliary tasks: highlighting lines, words, other objects
in the text. The paper presents a new dataset of handwritten text images in Russian created by 200 writers with
different handwriting and photographed in different environments. We describe the procedure for creating this
dataset and the requirements that were set for the texts and photos. The experiments with the baseline solution on
the fraud search and text reuse search problems showed recall of 60% and 83%, respectively, and false positive rates of 5%
and 2%, respectively, on the dataset.
The article presents a comparative study of methods for argumentative relation prediction based on a neural
network approach. The distinctive feature of the study is the use of argumentative indicators in the preparation of the
training sample. The indicators are generated based on the discourse marker dictionary. The experiments were carried
out using an annotated corpus of scientific and popular science texts, including 162 articles available on the ArgNetBank Studio web platform. The set of all argumentative relations is described by internal connections of arguments and
includes the conclusion and the premise. In the first stage of training set construction, fragments of text that included
two consecutive sentences were examined. In the second stage, indicators were retrieved from the corpus texts and,
for each indicator, statements presumably corresponding to the premise and conclusion of the argument were extracted. In total, 4.2 thousand indicator-based training contexts and 13.6 thousand pairs of sentences were obtained
from the corpus with annotation of the presence of an argumentative relation. Based on this training sample, four
classifiers were built: (i) without indicators, (ii) with indicators marked in sentences using tags, (iii) with segmentation of text based on indicators, and (iv) with both segmentation and tags. The results of the experiments on argumentative
relation prediction are presented.
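As an illustration of what marking indicators in sentences using tags could involve, here is a minimal sketch. The indicator list, the tag names, and the English example are hypothetical; the study itself draws indicators from a discourse marker dictionary and works with Russian texts, and its actual tagging format is not given in the abstract.

```python
import re

# Hypothetical stand-in for entries of a discourse marker dictionary.
INDICATORS = ["because", "therefore", "as a result"]

def tag_indicators(sentence, indicators=INDICATORS,
                   open_tag="<ind>", close_tag="</ind>"):
    """Wrap every argumentative indicator found in the sentence in tags."""
    # Longer indicators go first so that multiword markers are matched whole.
    alternatives = sorted(indicators, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(a) for a in alternatives) + r")\b"
    return re.sub(
        pattern,
        lambda match: open_tag + match.group(1) + close_tag,
        sentence,
        flags=re.IGNORECASE,
    )

tagged = tag_indicators("The test failed because the sample was small.")
```

A classifier trained on such input sees the indicator boundaries explicitly, which is one plausible reading of the second configuration listed above.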
This paper deals with some issues related to the Russian punctuation rules and their account in computer checkers
and correctors (both “analytic” and “synthetic”). It also discusses variation of punctuation. The paper offers a critical
assessment of reference books devoted to punctuation and makes special reference to certain verbs of propositional
attitude and their parenthetical use (in particular, dumat’ ‘to think,’ videt’ ‘to see,’ and slyshat’ ‘to hear’). It claims that
the inherent characteristics of the verbs under consideration influence the punctuation, and therefore every verb deserves a detailed description (lexicographic portrait). In particular, videt’ and slyshat’ behave quite differently when
used as parenthetical verbs. A step towards making the punctuation rules more explicit may consist in providing an
index of words mentioned in the rules together with a subject index.
In this paper, we describe a question answering system on document images which is capable of numerical
reasoning over extracted structured data. The system performs optical character recognition, detection of key
attributes in text, generation of a numerical reasoning program, and its execution with the values of key attributes
as operands. OCR includes the steps of bounding box detection and recognition of text from bounding boxes. The
extraction of key attributes, such as quantity and price of goods, total etc., is based on the BERT token classification
model. For expression generation we investigated the rule-based approach and the T5-base model and found that
T5 is capable of generalization to expression types unseen in the training set. The proposed architecture of the
question answering system utilizes the structure of independent blocks, each of which can be enhanced or replaced
while keeping other components unchanged. The proposed model was evaluated in the Receipt-AVQA competition
and on the FUNSD dataset.
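The final step of the pipeline, executing a generated numerical reasoning program with key attribute values as operands, can be illustrated with the sketch below. The program syntax (an infix arithmetic expression over attribute names) and the attribute names are assumptions; the abstract does not specify the format produced by the rule-based or T5 generator.

```python
import ast
import operator

# Arithmetic operations allowed in a reasoning program.
OPERATIONS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def execute(program, attributes):
    """Evaluate an arithmetic expression whose variables are extracted key attributes."""
    def evaluate(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPERATIONS:
            return OPERATIONS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.Name):
            return attributes[node.id]  # operand taken from the extracted attributes
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported element in program: " + ast.dump(node))

    # Walking the syntax tree avoids calling eval() on generated text.
    return evaluate(ast.parse(program, mode="eval").body)

# Hypothetical attributes extracted from a receipt by the token classifier.
attributes = {"price": 2.5, "quantity": 4, "total": 12.0}
remainder = execute("total - price * quantity", attributes)
```

Keeping the executor separate from the generator matches the block structure described above: either component can be replaced without touching the other.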
Sentiment analysis, i.e. the automatic evaluation of the emotional tone of a text, is a common task in natural
language processing. Entity-Oriented Sentiment Analysis (EOSA) predicts the sentiment of entities mentioned in
a given text. In this paper, we focus on the EOSA task for Russian news. We propose a text classification
pipeline to solve this task and show its potential. Moreover, in general, EOSA implies labeling both
named entities and their sentiment, which can require a lot of annotator labour and time and, thus, presents a major
obstacle to the development of a production-ready EOSA system. To help alleviate this, we analyse the potential
of applying an active learning approach to EOSA tasks. We demonstrate that by actively selecting instances for
labeling in EOSA, the annotation effort required for training machine learning models can be significantly reduced.
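One common way to select instances actively is uncertainty sampling, sketched below. The abstract does not name the query strategy used, so entropy-based selection is an assumption here, and the probability values are made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pool_probs, batch_size):
    """Return indices of the unlabeled instances the model is least certain about."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,  # highest entropy first
    )
    return ranked[:batch_size]

# Hypothetical model predictions (negative, neutral, positive) for three unlabeled instances.
pool = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.34, 0.33, 0.33],  # almost uniform, most uncertain
    [0.60, 0.30, 0.10],
]
to_annotate = select_for_labeling(pool, batch_size=2)
```

In each round the selected instances would be sent to annotators, added to the training set, and the model retrained before the next selection.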
This paper explores restrictions on the distribution of polu- ‘half’ in combination with adjectival stems in Russian. Relying on the literature on degree semantics, we analyze polu- as a degree modifier that specifies the degree to
which the adjective maps an individual as ½ of the maximal degree. This correctly predicts that polu- can only combine with upper closed scales. We argue that unlike half in English, polu- does not require a scale be lower closed
The task of text simplification is to reduce the complexity of the given piece of text while preserving its original
meaning to improve readability and understanding. In this paper, we consider the simplification task as a subfield of the general text style transfer problem and apply methods of controllable text style to rewrite texts in a
simpler manner preserving their meaning. Namely, we use a paraphrase model guided by another style-conditional
language model. In our work, we perform a series of experiments and compare this approach with the standard
fine-tuning of an autoregressive model.
The subject of the paper is Russian words traditionally described as derived prepositions. The problem is that there
is no formal definition of a preposition in theoretical or applied linguistics. Non-derivative, or primitive, prepositions
are given in grammars as a closed list, so strictly speaking there is no need to define this class of words. However,
we must have criteria for determining derived prepositions. I suggest a set of necessary conditions that a preposition
must satisfy. I demonstrate that so-called adverbial prepositions in Russian do not satisfy them and should be described
as adverbs. Similarly, some Russian verbal prepositions and some Russian denominative prepositions should not be
described as prepositions.
In this paper, we introduce a novel approach to estimating the cognitive complexity of a text at different levels
of language: phonetic, morphemic, lexical, and syntactic. The proposed method detects tokens with an abnormal
frequency of complexity scores. The frequencies are taken from the empirical distributions calculated over the
reference corpus of texts. We use the Russian Wikipedia for this purpose. Ensemble models are combined from
individual models from different language levels. We created datasets of pairs of text fragments taken from social
studies textbooks of different grades to train the ensembles. Empirical evidence shows that the proposed approach
outperforms existing methods, such as readability indices, in estimating text complexity in terms of accuracy. The
purpose of this study is to create one of the important components of the system of recommendation of scientific
and educational content.
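The core detection step, flagging tokens whose complexity score is rare in a reference corpus, can be sketched as follows. Everything specific here is an assumption: token length stands in for a real complexity score at some language level, the tiny English word list stands in for the Russian Wikipedia reference corpus, and the threshold value is arbitrary.

```python
from collections import Counter

def complexity_score(token):
    """Hypothetical stand-in for a level-specific complexity score: length in characters."""
    return len(token)

def empirical_distribution(reference_tokens):
    """Relative frequency of each complexity score in the reference corpus."""
    counts = Counter(complexity_score(t) for t in reference_tokens)
    total = sum(counts.values())
    return {score: count / total for score, count in counts.items()}

def abnormal_tokens(tokens, distribution, threshold=0.1):
    """Tokens whose score occurs in the reference corpus with frequency below the threshold."""
    return [
        t for t in tokens
        if distribution.get(complexity_score(t), 0.0) < threshold
    ]

reference = "the cat sat on the mat and the dog ran to the big red barn".split()
distribution = empirical_distribution(reference)
flagged = abnormal_tokens(["the", "electroencephalography", "on"], distribution)
```

The share of flagged tokens at each language level could then serve as an input feature for an ensemble of the kind described above.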
Controllable story generation towards keywords or key phrases is one of the purposes of using language models.
Recent work has shown that various decoding strategies prove to be effective in achieving a high level of language
control. Such strategies require fewer computational resources than approaches based on fine-tuning pre-trained
language models. The paper proposes and investigates MaxProb, a method of controllable story generation in Russian, which works at the decoding stage in the process of text generation. The method uses a generative language
model to estimate the probability of its tokens in order to shift the content of the text towards the guide phrase. The
idea of the method is to generate a set of different small sequences of tokens from the language model vocabulary,
estimate the probability of following the guide phrase after each sequence, and choose the most probable sequence.
The method allows evaluating the consistency of the token sequence for the transition from the prompt to the guide
phrase. The study was carried out using the Russian-language corpus of stories with extracted events that make up
the plot of the story. Experiments have shown the effectiveness of the proposed method for automatically creating
stories from a set of plot phrases.
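The selection step described in this abstract (score each short candidate sequence by how probable the guide phrase is after it, then keep the best one) can be sketched as below. The toy bigram table stands in for a real generative language model, the candidate sequences are supplied by hand rather than generated, and all names and probabilities are hypothetical.

```python
import math

# Toy bigram probabilities standing in for a generative language model (made-up values).
BIGRAMS = {
    ("to", "the"): 0.5,
    ("home", "the"): 0.01,
    ("the", "dragon"): 0.2,
}

def log_prob(context, continuation, floor=1e-6):
    """Log-probability of `continuation` following `context` under the toy model."""
    total, previous = 0.0, context[-1]
    for token in continuation:
        # Unseen bigrams get a small floor probability instead of zero.
        total += math.log(BIGRAMS.get((previous, token), floor))
        previous = token
    return total

def choose_sequence(prompt, candidates, guide):
    """Pick the candidate sequence after which the guide phrase is most probable."""
    return max(candidates, key=lambda cand: log_prob(prompt + cand, guide))

prompt = ["the", "knight"]
guide = ["the", "dragon"]
candidates = [["rode", "home"], ["went", "to"]]
best = choose_sequence(prompt, candidates, guide)
```

Repeating this step, each time appending the chosen sequence to the prompt, would gradually steer the generated text towards the guide phrase.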
The analysis of Russian interrogative prosody is based on a model of a question as consisting of two components: the illocutionary proper component and the illocutionary improper component. The illocutionary improper
component includes the data for information retrieval. The illocutionary proper component can be formed either by
segmental means of expression (by an interrogative word or a particle) or solely by prosody (as in Russian yes-no
questions). The prosody of Russian questions having the interrogative words or the interrogative particle li is highly
variable, whereas the prosody of Russian yes-no questions expressed by prosody is stable. The latter is the Russian
rising accent, which has a rise on the tonic syllable of the accent-bearer followed by a fall on the post-tonics if any.
The illocutionary improper component can be located sentence initially and carry a specific falling accent (namely, a
late fall). A specific type of a question with the interrogative proper component omitted is recognized. Such questions
carry a late fall, or a falling-rising accent on the accent-bearer. The analysis is exemplified by the frequency tracings
of the sound sentences taken from the Russian National Corpus and other open sources. As the instrument for verifying
the acoustic data, we used the computer system Praat. The paper is illustrated throughout with pitch contours of sound
The article examines the semantics of the Russian discourse marker stalo byt’, using the data obtained by analyzing translational correspondences extracted from parallel corpora of the Russian National Corpus (RNC). Typically, this discourse
marker is an indicator of inferential evidentiality, by which the speaker marks the fact that the given statement is a conclusion
made by the speaker on the basis of the information they received and accepted as true by default. In addition, stalo byt’ has
two secondary types of usage – “rhetorical” and “narrative” – where the basic semantics of this discourse marker is subject to
certain modifications. One of the key points of analysis is the reconstruction of semantic mechanisms providing the actual
semantics of stalo byt’.
This paper introduces five metrics for measuring the frequencies of dative predicatives in Russian. A dative
predicative is a word or multiword expression licensing the dative-predicative-structure, where the semantic subject
of the non-agreeing non-verbal predicate is marked by the dative case. I measure the frequencies of the predicatives
in the contact position <-1;1> with the same-clause dative subject pronouns in 1Sg (m-metrics) and 3Sg (e-metrics).
The m-metrics is applied for retrieving a list of dative predicatives from a corpus. I argue that for each large text
collection there is a minimal m-value confirming that an item belongs to the core of the dative-predicative structure.
The m/e score makes up the third metric, which shows whether an element is oriented towards use in the 1st
person or not. Based on the m-metrics, I retrieved 3 lists of predicatives in the subcorpus of 2000–2021 texts
included in the Russian National Corpus. The A list includes 87 items with m ≥ 10, the B list includes 44 items
with m ≥ 50, the C list includes 24 items with m ≥ 100. 72-79% of items in each list have an m/e value ≥ 1.25. A
linguistic interpretation of this result is that for each list of dative predicatives it is true that the majority of its
elements are autoreferential expressions oriented towards the use in the 1st person present indicative tense in the
direct speech. The fourth metric shows the total number of occurrences of a word or multiword expression in the
corpus (N). I argue that the N score must be measured before POS tagging and lemmatization. The fifth and
last metric is the m/N score. The RNC data suggest an inverse correlation between the score of an item in the
context specific for dative-predicative structures (m) and its overall frequency in the corpus (N). This effect is
explained by the regular homonymy of high-frequency predicatives with high-frequency adverbials and parenthetical