The paper is a corpus study of pragmatic factors involved in disambiguating sentences with negation and a universal quantifier in written Russian and
English, such as Ja ne pozval vseh svoih dal’nih rodstvennikov, ‘I haven’t invited all of my distant relatives.’ Ambiguity results from differences in scope.
If negation scopes over the quantifier, we get partial negation: ‘I have invited
some, but not all of my distant relatives.’ If negation scopes over the verb,
we get total negation: ‘I haven’t invited any of my distant relatives.’ Our study
is based on Russian and English data extracted from a variety of corpora.
The paper aims at contributing to a typology of implicatures via their analysis in news headlines. By implicatures we mean cancellable implicit senses,
irrespective of whether they are inherent in lexical meanings or occur
in certain contextual conditions. While generally implicatures are difficult
to tie to a particular type of lexical environment, our analysis of headlines
allows us to take a step in this direction. Headlines often use implicatures
instead of assertions to convey information about the content of the article.
Causal implicatures are the most frequent type in our sample. We study two
types of causal implicatures. The first occurs in sentences with predicates
that have a semantic argument of Cause, syntactically unexpressed in the
sentence. If either the noun attribute or the noun itself contains an element
of value judgment, it can be interpreted as filling the Cause argument of the
predicate: to reward the hero (= ‘to reward a person for heroism’), to punish the criminal (= ‘to punish a person for the crime’). When Cause is thus
expressed, it is an implicature and is cancellable: He rewarded the winner
of the sports contest, though not for the victory but for volunteer work in a hospice. Another type of causal implicature occurs in utterances with expressions of temporal sequence, such as after: After their quarrel she called
it quits (= ‘Because of their quarrel, she decided to break up with him’).
While in some languages causal implicatures of temporal prepositions are
grammaticalized as new lexical meanings, Russian temporal prepositions
do not develop separate causal senses. This makes them an ideal context
for causal implicatures, and headlines use posle ‘after’ to imply a causal relationship between the events described in the article, without committing
the author to a definite statement to this effect. We also consider qualitative
and factual implicatures which occur in certain specific contexts.
Discourse structures provide a way to extract deep semantic information
from text, e.g., about relations conveying causal and temporal information
and topical organization, which can be gainfully employed in NLP tasks such
as summarization, document classification, sentiment analysis. But the task
of automatically learning discourse structures is difficult: the relations that
make up the structures are very sparse relative to the number of possible
semantic connections that could be made between any two segments within
a text; furthermore, the existence of a relation between two segments depends not only on “local” features of the segments, but also on “global” contextual information, including which relations have already been instantiated
in the text and where. It is natural to try to leverage the power of deep learning
methods to learn the complex representations discourse structures require.
However, deep learning methods demand a large amount of labeled data,
which becomes prohibitively expensive in the case of expertly-annotated
discourse corpora. One recent advance in the resolution of this “training
data bottleneck”, data programming, allows expert knowledge to be implemented in a weak supervision system for data labeling. In this article,
we present the results of our application of the data programming paradigm
to the problem of discourse structure learning for multi-party dialogues.
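As a rough illustration of the data programming idea, the sketch below encodes expert intuitions as small "labeling functions" that vote on whether two dialogue segments are attached. The three rules, the segment fields (`index`, `speaker`, `text`) and the distance threshold are hypothetical and are not taken from the paper. Data programming proper combines the votes with a learned generative model; a plain majority vote is used here only as a simple stand-in.

```python
from collections import Counter

ATTACHED, NOT_ATTACHED, ABSTAIN = 1, 0, -1

# Hypothetical labeling functions: each looks at a pair of dialogue segments
# and either votes for a label or abstains.
def lf_question_answer(first, second):
    """A question followed by another speaker's turn is probably attached."""
    if first["text"].rstrip().endswith("?") and first["speaker"] != second["speaker"]:
        return ATTACHED
    return ABSTAIN

def lf_far_apart(first, second):
    """Segments more than five turns apart are probably not attached."""
    if second["index"] - first["index"] > 5:
        return NOT_ATTACHED
    return ABSTAIN

def lf_same_speaker_continuation(first, second):
    """Adjacent turns by the same speaker are probably attached."""
    adjacent = second["index"] - first["index"] == 1
    if adjacent and first["speaker"] == second["speaker"]:
        return ATTACHED
    return ABSTAIN

LABELING_FUNCTIONS = [lf_question_answer, lf_far_apart, lf_same_speaker_continuation]

def weak_label(first, second):
    """Combine the votes by majority; abstain on ties or when nobody votes."""
    votes = [lf(first, second) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts or counts[ATTACHED] == counts[NOT_ATTACHED]:
        return ABSTAIN
    return ATTACHED if counts[ATTACHED] > counts[NOT_ATTACHED] else NOT_ATTACHED
```

The weak labels produced this way would then serve as noisy training data for a discriminative model, which is where the saving over expert annotation comes from.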
We examine the hypothesis that discourse words characterize a writer’s individual style. The object of study is the set phrase одним словом (‘in a word’), examined in representative corpora of Dostoevsky, Tolstoy, Saltykov-Shchedrin,
Turgenev and Goncharov. The analysis leads to the conclusion
that Dostoevsky and Saltykov-Shchedrin differ from their contemporaries both in how often they use одним словом
in a discourse function and in the semantic diversity of the expression. Dostoevsky is especially interesting in this respect, since his prose
exhibits all the discourse functions of одним словом: interpretation (interpretation proper, inference, clarification/explanation), new idea, regulative uses (interrupting the discourse,
marking difficulty in choosing a nomination, marking a change
of nomination to a basic, alternative or generalizing one), and the introduction of reported speech, whether direct
or free indirect. As for the non-discourse uses of одним словом, they are distributed more or less evenly across the authors considered.
Sentiment analysis is one of the most popular natural language processing tasks. In this paper we introduce pre-trained Russian language models which are used to extract embeddings (ELMo) to improve accuracy for
classification of short conversational texts. The first language model was
trained on a Russian Twitter dataset containing 102 million sentences, while
two others were trained on 57.5 million sentences of Russian News and
23.9 million sentences of Russian Wikipedia articles. Although classifiers
trained on top of language models perform better than those using fastText embeddings of the same language style, we show that
the domain of the language model also has a significant impact on accuracy. This
paper establishes state-of-the-art results for the RuSentiment dataset, improving the weighted F1-score from 72.8 to 78.5. All our models are available online
as well as the source code which allows everyone to apply them or fine-tune
on domain-specific data.
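For readers unfamiliar with the metric, the weighted F1-score averages per-class F1 values, weighting each class by its number of gold examples. The following is a minimal sketch of that computation; the three labels in the usage note are illustrative and are not the RuSentiment label set.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_gold in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n_gold - tp
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * n_gold
    return total / len(y_true)
```

With gold labels `["pos", "pos", "neg", "neu"]` and predictions `["pos", "neg", "neg", "neu"]`, the per-class F1 values are 2/3, 2/3 and 1, and the weighted score is 0.75.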
This paper reports our participation in the Automatic Gapping Resolution for
Russian shared task (AGRR-2019) within Dialogue Evaluation 2019. Our team
took first place ahead of nine other teams in all subtasks, which include
gapping presence-absence classification, gap resolution and full annotation.
The phenomenon of gapping is well studied theoretically. However, the
problem of automatic gapping resolution is new and there is no baseline for
it. We found it possible to reduce this task to sentence classification and token tagging problems and to solve them using recent advances in Natural Language Processing and deep learning. Training large language models with
millions of parameters on small data became possible with the development
of transfer learning methods. Using pretrained models for computer vision
problems is straightforward, and since the BERT language model was released
it has become possible to benefit from transfer learning in NLP. Our solution
is heavily based on BERT, but we found that parsing gapping constructions,
which are very structured, benefits from special postprocessing which includes modeling gapping in the form of a directed graph. Our solution may
be considered the first public baseline for the task of automatic gapping
resolution that is based on modern NLP practices.
The paper describes our experience of annotating pragmatic markers (PMs) in two Russian speech corpora: “One Day of Speech” (ORD;
dialogues) and “Balanced Annotated Text Library” (SAT;
monologues). To prepare for exhaustive PM annotation, we carried out
4 pilot annotations on samples from ORD and SAT, which allowed us
to compile the final list of PMs: 450 units that are variants of 53 basic structural types. Processing the results of the pilot annotation yielded preliminary
data on the frequency of individual pragmatic markers
and their types, as well as on how PM use depends on the speaker’s gender and level
of speech competence. As a result of data processing,
we obtained frequency lists both of the PMs themselves and of the ... they perform
We propose a method to resolve anaphoric pronouns in the framework of Winograd Schema Challenge (WSC) by means of SemETAP—a knowledge-based
semantic analyzer. WSC is a modern version of the famous Turing test. Its objective is to check a machine’s ability to exhibit intelligent behavior indistinguishable from that of a human. In contrast to other approaches to WSC, which
are based on machine learning, our method uses explicit knowledge. An important advantage of this approach is that it gives an opportunity to provide
an explanation of the result understandable for humans. SemETAP interprets
the text using both linguistic and extralinguistic (background) knowledge. The
former is stored in the grammar and the dictionary of the ETAP-4 system, and
the latter is provided by the SemETAP ontology, inference rules and the repository of individuals. We show how this knowledge is used for resolving WSC.
At the moment, the performance of the algorithm is not high—54%. This is due
to the incompleteness of the background knowledge supplied to the system.
It is shown, however, that if the background knowledge is complete and accurate enough, the WSC test is resolved well and it is easily understandable why
the system arrived at a particular conclusion.
The paper reports on the experimental comparison of several machine
learning models proposed in recent years for automatic morpheme segmentation of Russian words, including conditional random fields (CRF),
sequence-to-sequence neural network (Seq2seq), convolutional neural network (CNN) model, as well as a new model we have developed with
the aid of gradient boosted decision trees (GBDT). For more complete research, in our experiments we have also evaluated the semi-supervised
method of Morfessor. All the morpheme analysis models being compared
are briefly described in the paper; some of them perform only segmentation
of words into morphs, while the others produce segmentation with classification
of the resulting morphs. Since the linguistic rules for splitting Russian words into morphs (and also the classification of some morphs) may
differ, the experiments were performed on two data sets differing in labeling, which were obtained from CrossLexica’s dictionary and Tikhonov’s dictionary, respectively. The experimental evaluation has shown that the two best
models of morpheme segmentation with classification, namely GBDT and
CNN models have comparable quality, giving about 86–94% of word-level
Multilingual parallel corpora make possible the application of quantitative
methods in cross-linguistic research. Due to the lack of appropriate resources,
this has not become a widespread technique among linguists, but studies based on this idea are beginning to emerge. In our work, we focus on the application of logistic regression to the study of passive voice constructions with
an overtly expressed agent. The study is conducted on data extracted
from a multilingual parallel corpus that was created for this purpose. The issue we find noteworthy about voice alternation is the motivation for choosing
passive instead of active, i.e. when a person would say ‘This essay was written by Mary’ instead of ‘Mary wrote this essay’. Relying on theoretical studies,
we selected a set of features claimed to be important for this kind of choice
and used them for training logistic regression models. As a result, based on the
model coefficients we can detect which features appear to be passive triggers.
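To make the last step concrete, here is a minimal sketch of reading "passive triggers" off logistic regression coefficients. The two binary features (`patient_is_topic`, `agent_is_pronoun`) and the twelve toy observations are invented for illustration and do not come from the study's corpus; the model is fitted with plain batch gradient descent rather than with any particular statistics package.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, target in zip(X, y):
            error = sigmoid(bias + sum(w * x for w, x in zip(weights, row))) - target
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

# Toy data (hypothetical): columns are patient_is_topic, agent_is_pronoun;
# the label is 1 for a passive clause and 0 for an active one.
X = [[1, 0]] * 4 + [[1, 1]] * 2 + [[0, 0]] * 2 + [[0, 1]] * 4
y = [1, 1, 1, 0] + [1, 0] + [1, 0] + [1, 0, 0, 0]

weights, bias = fit_logistic(X, y)
for name, w in zip(["patient_is_topic", "agent_is_pronoun"], weights):
    role = "passive trigger" if w > 0 else "active trigger"
    print(f"{name}: {w:+.2f} ({role})")
```

A positive coefficient raises the log-odds of the passive, so in this toy setting a topical patient favours the passive and a pronominal agent favours the active.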
This article deals with an application of referential markup to a large multimodal resource “Russian Pear Chats and Stories”, annotated for vocal, oculomotor, manual and cephalic channels. Despite a large number of works
on referential choice, it has never been investigated within the framework
of multimodal communication. For this purpose, a special annotation
scheme in the ELAN environment is proposed, allowing one to annotate
different types of referential units and to conduct a simultaneous tracking
of referential expressions (full NPs, pronouns, demonstratives, zeroes, etc.)
with accompanying verbal and non-verbal units. The analysis of three recordings (overall duration of 141 minutes), where the new referential
annotation was introduced in addition to the existing multimodal markup,
reveals a range of understudied peculiarities of the referential choice.
It was found that the role of the Commentator in the conversation entails
a significantly larger number of constructions with a zero subject pronoun,
compared to the monologue discourse of the Narrator and the Reteller.
The analysis of referential expressions and accompanying pointing gestures agreed with more general data previously obtained for English
and showed that nouns are significantly more often accompanied
by a pointing stroke than personal pronouns, while demonstratives occupy
an intermediate position between nouns and personal pronouns as units
potentially accompanied by a gesture.
This paper addresses the task of automatic genre classification for Russian
within the Functional Text Dimensions (FTD) framework. Our aim in this study
was to build the optimal FTD classification model to annotate web texts
from the GICR corpus. For training data, we used an extended GICR dataset.
We used the Support Vector Machine method with linear kernel for classification and converted training data to lower case to increase accuracy. During our research we experimented with several classification parameters,
such as types of features, C-value and feature filtering to determine the best
option for the classification model of the GICR dataset. The resulting model
was able to achieve satisfactory classification accuracy and was used for
GICR annotation. We also looked at the most significant features for each
FTD in our best performing model and compared them to the most frequent
words in which these features occur. Finally, we applied our model to segments of the GICR and looked at the FTD components in these segments.
The paper reports a method to create a speaker’s prosodic fingerprint
based on the global characteristics of the pitch movement. A prosodic fingerprint is the distribution of f0 in the low, middle, and high ranges together with the distribution of pitch movements from one range into another [Šimko et al. 2017].
This fully automated method can be used to classify the records and to provide the reference level for more sophisticated analysis of the pitch movement and intonation strategies. We evaluate the method by applying it to the
spontaneous Russian spoken data recorded in different regions. We model
the correlation between the fingerprint and sociolinguistic features such
as age, gender, and region. The results of this analysis allow us to formulate
several sociolinguistic hypotheses that can further be tested with a more
detailed analytic technique.
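The fingerprint definition above can be sketched in a few lines. The abstract does not say how the three ranges are delimited, so this sketch assumes equal thirds of the speaker's observed f0 span and takes a plain list of f0 samples in Hz as input; both choices are assumptions made only for illustration.

```python
from collections import Counter

RANGES = ("low", "mid", "high")

def prosodic_fingerprint(f0_values):
    """Return (share of samples per range, share of each range-to-range move)."""
    lo, hi = min(f0_values), max(f0_values)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat contour

    def band(f0):
        # Assumption: ranges are equal thirds of the observed f0 span.
        return RANGES[min(2, int(3 * (f0 - lo) / span))]

    bands = [band(v) for v in f0_values]
    occupancy = Counter(bands)
    moves = Counter((a, b) for a, b in zip(bands, bands[1:]) if a != b)
    n_samples, n_moves = len(bands), sum(moves.values())
    occupancy_share = {r: occupancy[r] / n_samples for r in RANGES}
    move_share = {k: c / n_moves for k, c in moves.items()} if n_moves else {}
    return occupancy_share, move_share
```

For the samples `[100, 110, 150, 190, 200, 140, 105]`, three of the seven samples fall in the low range, and each of the four observed movements (low to mid, mid to high, high to mid, mid to low) accounts for a quarter of all movements.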
The paper considers the task of automatic discourse parsing of texts in Russian. Discourse parsing is a well-known approach to capturing text semantics across boundaries of single sentences. Discourse annotation was found
to be useful for various tasks including summarization, sentiment analysis, and question answering. Recently, the release of the manually annotated Ru-RSTreebank corpus unlocked the possibility of leveraging supervised machine learning techniques for creating such parsers for Russian.
The corpus provides the discourse annotation in a widely adopted formalisation, Rhetorical Structure Theory. In this work, we develop feature sets
for rhetorical relation classification in Russian-language texts, investigate
the importance of various types of features, and report results of the first experimental evaluation of machine learning models trained on the Ru-RSTreebank
corpus. We consider various machine learning methods including gradient
boosting, neural network, and ensembling of several models by soft voting.
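Soft voting, mentioned above, averages the class probabilities predicted by several models and picks the class with the highest mean. The sketch below shows the mechanism only; the three models, their probability outputs and the relation labels are made up for illustration and are not results from the paper.

```python
def soft_vote(prob_dists, weights=None):
    """Average per-class probabilities across models and return the winner."""
    weights = weights or [1.0] * len(prob_dists)
    total = sum(weights)
    averaged = {
        label: sum(w * dist[label] for w, dist in zip(weights, prob_dists)) / total
        for label in prob_dists[0]
    }
    return max(averaged, key=averaged.get), averaged

# Hypothetical outputs of three classifiers for one pair of discourse units.
gradient_boosting = {"Elaboration": 0.50, "Contrast": 0.30, "Cause": 0.20}
neural_network = {"Elaboration": 0.20, "Contrast": 0.60, "Cause": 0.20}
linear_model = {"Elaboration": 0.40, "Contrast": 0.35, "Cause": 0.25}

label, averaged = soft_vote([gradient_boosting, neural_network, linear_model])
```

In this toy case two of the three models prefer Elaboration, yet soft voting selects Contrast because one model is very confident in it. That sensitivity to confidence is what distinguishes soft voting from a simple majority vote over predicted labels.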
This paper introduces a knowledge-based semantic approach towards
bridging annotation of Russian texts. Our method simulates human
background knowledge by using compact domain descriptions based
on an extended version of SUMO ontology and lexical-semantic data from
the “Universal Dictionary of Concepts”. Our approach supports a wide and
extensible range of bridging relations. The tagger that implements it can
build complex bridges with multiple arcs, supports making assumptions
and can be adapted to annotate other languages supported by the underlying dictionary of concepts.
Nowadays the majority of tasks in the NLP field are solved by means of neural network language models. These models have already shown state-of-the-art results in classification, translation, named entity recognition and
so on. Pre-trained models are available on the internet, but the domain of a real-life
problem may differ from the original domain on which the network
was trained. In this paper, an approach to vocabulary expansion for a neural
network language model by means of hierarchical clustering is presented.
This technique makes it possible to adapt a pre-trained language model to a different
domain. In the experimental part, the proposed approach is demonstrated
on the specific domain of textual artifacts of the software development process.
This field is actively studied these days due to the cost of the process
and its impact on the modern world and society.
The paper introduces manually annotated test sets for the task of tracing diachronic (temporal) semantic shifts in Russian. The two test sets are
complementary in that the first one covers comparatively strong semantic
changes occurring to nouns and adjectives from pre-Soviet to Soviet times,
while the second one covers comparatively subtle socially and culturally determined shifts occurring in the years from 2000 to 2014. Additionally, the second test set offers a more granular classification of the degree of shift, but is limited to adjectives only.
The introduction of the test sets allowed us to evaluate several well-established algorithms for semantic shift detection (posing this as a classification problem), most of which have never been tested on Russian material.
All of these algorithms use distributional word embedding models trained
on the corresponding in-domain corpora. The resulting scores provide solid
comparison baselines for future studies tackling similar tasks. We publish
the datasets, code and the trained models in order to facilitate further research in automatically detecting temporal semantic shifts for Russian
words, with time periods of different granularities.
News headline generation is an essential problem of text summarization because it is constrained, well-defined, and is still hard to solve. Models with
a limited vocabulary cannot solve it well, as new named entities can appear regularly in the news and these entities often should be in the headline.
News articles in morphologically rich languages such as Russian require
model modifications due to a large number of possible word forms. This
study aims to validate that models able to copy words from
the original article perform better than models without such an option. The
proposed model achieves a mean ROUGE score of 23 on the provided test
dataset, which is 8 points greater than the result of a similar model without
a copying mechanism. Moreover, the resulting model performs better than
any known model on the new dataset of Russian news.
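The ROUGE scores quoted above measure n-gram overlap between a generated headline and the reference headline. Below is a simplified sketch of a ROUGE-N F1 computation; it assumes lowercasing and whitespace tokenization, which may differ from the evaluation script actually used, and it says nothing about how the reported mean was aggregated.

```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=1):
    """F1 over clipped n-gram overlap between candidate and reference."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # counts clipped to the reference
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because the score rewards exact token matches, a model that can copy a rare named entity from the article into the headline gains overlap that a model restricted to a fixed vocabulary cannot obtain.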
The annotation of parallel corpora, as well as building of supracorpora databases, challenges linguists with the question of how to define a functional
equivalent of the linguistic units that serve as an object of a given study. The
paper discusses the concept of divergent translation and whether it is theoretically important for the analysis of logical-semantic relations (LSR). It is shown
that relations between states of things can be expressed not only by connectives but also by lexical means (referred to as “alternative lexicalizations”
in the works of the Penn Discourse Treebank group) and grammatical tools
(syntactic constructions and morphological forms), and by marks of punctuation. While the two latter ways are mentioned in grammars, they are usually not taken into account when the alternative ways of tagging LSR are described, nor are they annotated in corpora or databases. The supracorpora
database of connectives, built on the basis of the French and Italian parallel subcorpora of the Russian National Corpus, introduces new functional capabilities. It stores a representative array of annotations tagged as “divergent
translation” (more than 1,250, i.e. 7.7 per cent of the total number), which allows users to collect various statistical data. With these data, one could establish: (1) which LSR tend to be expressed by alternative means and how often
they occur compared to connectives, (2) what these alternative means are,
(3) which divergent translations may be used to render a given marker of LSR
and how often each of them is used, (4) which alternative markers of LSR are
specifically employed to convey one or another relation and which of them are
able to express several LSR. The concluding part of the paper suggests that,
for the analysis of divergent equivalents, it is central that one and the same
alternative means is used by different translators when translating one and the
same textual fragment into one and the same language as well as into several languages, which speaks for its productivity. The further development
of multi-language and polyvariant parallel corpora and databases would let
us find out to what extent the means conveying LSR differ in various languages.
The paper presents a rule-based system of automated anaphora resolution
for Russian. The system is based on the resources of the ETAP-4 linguistic processor: the Russian combinatorial dictionary (RCD), the ETAP parser, and
the ontology OntoEtap. In this paper, I describe the ordered algorithms for
resolution of different pronouns and provide the results of their evaluation.
The paper continues a series of research studies into the microsyntax
of Russian. Two constructions that are sufficiently close to each other in syntactic structure and semantics are considered in detail: these are linguistic
units of the type kak možno lučše ≈ ‘in the best way possible’ and kak nel’zja
lučše ≈ ‘it can never be better’. In both constructions, the first two elements
are determined lexically while the third one is fixed grammatically since
it can be instantiated by (almost) any comparative form. It is demonstrated
that the two units possess substantial semantic differences; in particular,
the former unit is oriented prospectively (cf. sygraj kak možno lučše ‘play
as well as you possibly can’ but hardly ?sygral kak možno lučše ≈ ‘he has
played as well as he possibly could’) while the latter unit is, rather, oriented
retrospectively (cf. vse složilos’ kak nel’zja lučše ≈ ‘everything turned out
in a way that could never be better’ but hardly ?Reši etu zadaču kak nel’zja
lučše, čtoby sdat’ ekzamen ≈ ‘solve this problem in a way that could never
be better, to pass the exam’). The material under consideration is also used
to discuss certain general subtleties of the Russian comparative.
The paper presents a spoken corpus of contact-influenced Russian, which
consists of oral spontaneous Russian speech of bilingual speakers of indigenous languages of Northern Siberia and the Russian Far East (Samoyedic,
Tungusic, Chukotko-Kamchatkan). The texts included in the corpus were transcribed in ELAN in Standard Russian orthography and provided with a special
system of manual annotation of contact-induced features developed for the
corpus. The paper focuses mainly on this system of annotation, which is relevant in a wider context of annotating any kind of speech with “deviations” from
the standard language variety (bilinguals’, learners’, dialectal speech etc.).
The annotation tags are grouped in several separate levels: contact-induced
morphological, syntactic, phonetic, lexical features etc. The exact meanings
for the annotation tags were proposed on empirical grounds. The transcribed and
annotated texts are supplied with morphological annotation and a search interface
based on the Tsakorpus platform. The aim of the project is to provide a useful
resource for linguistic studies on language contact.
This paper contributes to the research field of multichannel discourse analysis. Multichannel discourse analysis explores numerous channels involved
in natural communication, such as verbal structure, prosody, manual gesticulation, head movements, eye gaze, torso postures, etc., and treats them
as parts of an integrated process. For the purposes of investigating the way
participants interact with one another and the way different communication
channels correlate, we introduce the notion of an integrated multichannel
annotation created with ELAN software. In particular, we consider three
topics: (1) temporal alignment between participants’ speech and manual
gesticulation; (2) distribution of participants’ visual attention as they watch
their interlocutors talking and gesticulating manually; (3) interrelationship
between participants’ torso postures and head movements.
The paper deals with the evolution of one part of a dialectal phonetic system (neutralization of non-high unstressed vowels in different allophones as a function of the stressed vowel’s length and/or quality) over the course of three
generations of speakers from one family that moved from a village to Moscow,
the Russian capital. We discuss some methods of phonetic analysis that
could be used to present the sound changes observed and argue
that results obtained from a large volume of data may be less informative than those obtained from a thorough analysis of every token.
Our results show that the phonetic system starts to change immediately after
the resettlement of a family, already in the first generation that moved. The
second and third generations display yet more dramatic changes, with only
a few markers of previous dialectal peculiarities remaining; along with this,
qualitative dissimilation survives somewhat longer than quantitative dissimilation.
This paper discusses the problems and results of a comparative analysis
of two fundamentally different types of prosodic phrasing labeling realized for
some literary Russian texts. The introduction examines the theoretical basis
of the study and formulates specific tasks, the solution of which was necessary for comparative analysis and the achievement of the final goal of the
study. The first section of the paper describes the experimental material, methods of research and the basic principles of experimental data processing. The second, central section of the work gives a detailed description of the parameters used in the comparative analysis of introspective and perceptual labeling.
The following parameters were taken into account in the comparative
analysis: the general distribution of frequency of occurrence of text spaces
with different indexes of word boundary strength; their contextual distribution
with respective frequency data; relationship of prosodic breaks’ strength with
pauses. This section also contains many illustrations that demonstrate the
main results of the comparative analysis of the target prosodic labeling of the
experimental text material. Section 3 analyzes the relationship between the
prosodic breaks’ strength and pauses’ duration in both types of labeling analyzed. The conclusion summarizes the results of the study and notes promising areas
for further research on the relevant topics.
In the present work, we consider the possibility of multilingual to monolingual
transfer. We use Russian as a target language for transfer. We show that it is possible
to train the monolingual model using multilingual initialization. To show this, we evaluated the multilingual model on a number of common NLP tasks in the target language. The model trained in a monolingual setting achieves substantially better performance than the multilingual model.
The paper introduces the opposition “level of the situation” vs. “level of the
story”. Within this opposition, features of the verbs denoting non-fully controlled situations are considered (to succeed vs. to happen): government
(infinitive vs. clause), combinability with negation and propositional pronouns. Propositional pronouns tak (‘so’) and eto (‘it’) and the matrix verbs
which they are combined with, imply a different conceptualization of the antecedent situation: My proigrali. Tak poluchilos’ (‘We lost. So it turned out’)
vs. My hoteli pobedit’, i nam eto udalos’ (‘We wanted to win, and we succeeded’). Tak is semantically related to the mode of action and in other
meanings implies a variable factor or aspect.
This paper presents the first results of a comparative corpus-based study of modern Russian language textbooks for primary
school children. We report volume and diversity statistics for the textbooks’ vocabulary
and the results of analyzing that vocabulary by frequency and thematic
groups.
Coreference Resolution (CR) is one of the most difficult tasks in the field
of Natural Language Processing because it requires a deep and comprehensive understanding of the semantic meaning of a mention not only in the
sentence-level context but also in the entire document-level context. To the
best of our knowledge, previously proposed models often address the
coreference resolution task in two steps: 1) detect all possible mention candidates, 2) score and cluster them into chains. We instead propose a new approach which reformulates the coreference resolution task as the task of learning
sentence-level coreferential relations. Additionally, by leveraging the power
of state-of-the-art language representation models such as BERT and ELMo,
it was possible to achieve cutting-edge results on Russian datasets.
The development of corpus linguistics quite often makes it necessary to revisit the items studied and comprehensively described in the “pre-corpus”
epoch. As a result we obtain a more voluminous or even radically different
picture of their functioning. This is especially true of linguistic units with idiosyncratic combinability that is motivated by their semantics in a complex way, such
as the Russian particle -ka. It is a study of a large array of linguistic data
that makes it possible to notice relatively rare, but regularly arising types
of combinations that reveal the semantic potential of this particle. In the
present work, we used the Russian National Corpus, as well as Yandex
search, which allowed us to assess whether a given type of combination is relevant for present-day usage. The study of corpus data not only contributes to our understanding of the properties of linguistic units (in this case,
the distribution of a particle), but also makes it possible to observe the linguistic mechanisms involved in relaxing co-occurrence restrictions. Thus, the analysis of the corpus material allowed us to find two fairly common
but very nontrivial types of combinations of -ka with non-imperative expressions: лучше-ка and знаешь-ка/знаете-ка. As we show, their occurrence
is due to the effect of completely different linguistic mechanisms.
Russian has an impressive set of psych-verbs with the general meaning of causing extreme irritation and exhausting one’s patience, which
we will henceforth refer to as EXASPERATE-verbs: достать; задолбать,
заколебать, замучить, бесить, etc. With these predicates, the experiencer is in the accusative, and the non-salient, inanimate or abstract
causer of irritation can be expressed by a noun phrase in the nominative,
or by an infinitival clause, e.g., Меня достало это выражение/разбирать
эти выражения. In addition, these verbs participate in another causative
construction, with a salient, agentive causer expressed by a noun phrase
in the nominative case, and the manner in which irritation is brought about
expressed by the instrumental phrase, with or without a preposition: Ты меня
достал (c) этими выражениями. In modern spoken Russian, we also find
a new agentive causative construction (NACC): Ты меня достал ныть! ‘You
drive me up the wall with your whining.’ The NACC is colloquial and is largely
used by younger speakers. Among the verbs that participate in the NACC
are vulgar lexical items, which further adds to its colloquial nature. (The
use of vulgar expressions to vent frustration is attested cross-linguistically,
so Russian is not exceptional in that regard.) We provide a detailed analysis
of the syntax of the NACC and argue that it instantiates obligatory adjunct
control by the subject. We hypothesize that the rise of the NACC is driven
by the analogy with the existing constructions with EXASPERATE-verbs
in standard Russian, and we address several other factors that contribute
to the development of the new construction.
Thesauri are one of the most widely used resources in natural language processing. At the same time, many of them are built manually, which takes a lot
of time and, due to human errors, can affect their quality and completeness.
We propose a procedure for automatic positioning of vocabulary in the ABBYY Compreno thesaurus using large monolingual corpora, a regular bilingual dictionary and a subset of already positioned words.
The main results of the update of the IntonTrainer system for the purposes
of analyzing and studying the prosodic signs of emotional intonation are
described. A distinctive functional feature of the updated system is the
creation of an expanded set of prosodic signs of emotional intonation. The
paper presents preliminary assessments of their effectiveness using the
created experimental database of emotional phrases of Russian speech.
The paper discusses the standardization efforts to create a morphological
standard for the Middle Russian corpus, which is part of the historical collection of the Russian National Corpus (RNC). To meet the needs of different
categories of corpus researchers as well as NLP developers, we consider
two styles of the morphological annotation (RNC schema and Universal
Dependencies schema). A number of specifications of the feature list are proposed to facilitate data reusability, linking and conversion.
The paper addresses the issue of intralingual variation in Tatar postpositional phrases. The nominal in Tatar postpositional phrases demonstrates
differential case marking: the choice between genitive and unmarked case
form is determined by the morphosyntactic class of the nominal. With postpositions derived from nouns with locative or abstract semantics, variation
in case assignment is accompanied by the presence or absence of the ezafe
marker on the postposition. In this paper we use corpus-based and experimental methods to investigate the distribution of grammatical variants and
estimate the current status of the variation. We argue that the existing grammatical descriptions do not capture the current state of affairs.
We show that pronouns and nouns do not form a homogeneous class
with respect to case marking in the postpositional phrase. The genitive case
marking is common for 1st / 2nd person personal pronouns and the 3rd person
singular personal pronoun. All other pronouns and nouns are primarily used
in an unmarked form, an observation supported by both corpus and experimental data.
We argue that the grammaticalization of denominal postpositions is not
complete. In both corpus and experimental studies, we observe a wide
range of features that unite postpositional phrases with nominal embedding ezafe constructions. First, genitive case marking for the complement is acceptable for non-personal pronouns and nouns. Second, the absence
of the ezafe marker is acceptable only with 1st / 2nd person personal pronouns and partially with 1st / 2nd person reflexive pronouns. Third, the case
marking of the nominal and the choice of the ezafe marker for the postposition are interrelated. When the complement is genitive, speakers prefer the
agreeing form of the postposition. When the complement is unmarked, the
postposition shows no agreement with the possessor. This contrast reflects
the opposition between ezafe-3 and ezafe-2 constructions, respectively.
Interestingly, the denominal postpositions demonstrate different degrees of grammaticalization. For instance, the postposition turɩnda ‘about’
is mostly used with a possessive affix that shows no agreement. We suppose
that the form with the non-agreeing ezafe affix is reanalyzed by the speakers
Another crucial observation concerns the reflexive pronoun üz. In both
experiments, 1st / 2nd person reflexive pronouns show syntactic behavior
similar to that of personal pronouns, while the 3rd person singular reflexive
pronoun patterns with interrogative pronouns.
As a result of the study, we compare different methodologies for investigating intralingual variation. We suggest that combining different sources of data, both corpus-based and experimental, provides a fuller
description of cases of intralingual variation than a single method. The experimental methods that we used differ in their sensitivity to various aspects of language
phenomena: elicited production is better at distinguishing deviation from
the grammatical pattern; acceptability judgements show to what extent
a grammatical innovation is used. Remarkably, the comparison of the different
sources of data allows us to determine the direction of language change and
estimate the current status of the variation.
The paper analyzes derivative meanings of the Russian indefinite adverb
kak-to, which are insufficiently described in the existing grammars and dictionaries. Besides its primary meaning of indefinite manner, cf. grabitel’
kak-to pronik v dom ‘the burglar somehow got into the house’, kak-to has
two derivative meanings. 1) It can refer to an indefinite moment in time, cf.
on kak-to mne rasskazal etu istoriju ‘he told me this story once’; 2) it can
function as a discursive marker of ‘general indefiniteness,’ which has two
varieties: a) kak-to can point to an underspecified aspect of a situation,
‘in some respect/in some measure/kind of’ (ona kak-to stranno posmotrela
na menja, on kak-to smutilsja, on kak-to po-bratski obnjal menja ‘she gave
me an odd glance, he felt somewhat confused, he hugged me in a kind
of brotherly way’); b) it can accentuate the idea of uncontrollability of a
situation (‘it happened so’): ja kak-to upustil iz vidu ‘I somehow overlooked it’.
Using data from the RNC, we have identified contexts correlating with each
of the meanings of kak-to. We have also demonstrated that its use as a discursive marker is much more frequent than its occurrences as an adverb
of manner proper. We used data from Russian-English and English-Russian
parallel subcorpora to demonstrate that in many instances, translators from
Russian leave the discursive kak-to without a translation, and, vice-versa,
translators into Russian frequently insert kak-to without a specific stimulus
for it in the original English text. We conclude that the usage of kak-to is regulated by a highly language-specific discursive strategy in Russian.
The paper examines the grammatical and semantic features of the word èto
when it precedes or follows a wh-word (cf. Gde èto ty byl?). In this context,
èto is usually considered to be a particle, with the only—and not clear-cut—
exception being a question with the wh-words kto and čto. However, the
data presented below suggest that as many as four different types of èto
used in an interrogative context have to be distinguished. It is demonstrated
that these types differ in their meaning, their syntactic distribution, and their
position within the “pronoun-particle” continuum.
The article examines the way in which deictic gestures with the active index finger are executed in Russian body language and focuses on the role
of the tension of the index finger (slightly curved vs. extended). Using the
data retrieved from the Russian Multimedia Corpus, we discover the dependency between the tension of the index finger and the tension of the arm,
which is engaged in executing the deictic gestures. We also reveal correlations between the tension of the index finger and (a) the primary / secondary
reference to the pointed object, (b) the closest and the farthest distance
between the speaker and the pointed object. We examine the difference
in meaning and usage of the deictic gestures with the slightly curved vs. extended index finger. We argue that the choice between these types of pointing may be influenced both by physical and pragmatic factors.
We hypothesize that deception in a text should be visible from its
discourse structure. The problem of deception detection is then formulated
as the classification of the text's discourse tree, built according to
Rhetorical Structure Theory. This discourse tree (DT) is extended with
speech act expressions attached as edge labels. We employ what we call
an ultimate deception dataset: a set of customer complaints in English
that includes descriptions of problems customers experienced with certain
businesses. It contains about 2,400 complaints about banks and provides
clear ground truth, based on available factual knowledge in the financial domain. The complaints are written by non-professional writers. We conduct
experiments to explore the correlation between implicit cues in the rhetorical
structure of texts and how truthful or deceptive these texts are. The results
show that deception in text can be detected reliably enough for industrial applications. Automated detection of text with misrepresentations
such as fake reviews is an important task for online reputation management.
The paper focuses on Russian constructions with clauses or VPs combined
by means of the conjunction I ‘and’. Prosodically, the construction may
occur in two forms: (a) integrated, i.e. as a single illocution with the first
clause pronounced with a rising pitch that projects discourse continuation,
and (b) disintegrated, i.e. as two separate illocutions with the first clause
pronounced with a falling pitch that projects no continuation. Using
data from the Prosodically Annotated Corpus of Spoken Russian, we analyzed
the prosody and grammar of coordinate constructions with the conjunction
I ‘and’ qualitatively and quantitatively. The results show that coordinated
clauses and VPs are more frequent than coordinated NPs and other types
of groups; in spoken narratives, coordinated clauses are more frequent
than VPs, while in written narratives, coordinated VPs are more frequent
than clauses; coordinated clauses and VPs more often come up as prosodically integrated than as prosodically disintegrated; the rate of integrated
constructions is higher in coordinated VPs than in coordinated clauses.
Self-initiated and other-initiated self-repairs (N=632) were investigated
in a subcorpus (1 h 14 min) extracted from the multichannel corpus “Russian
Pear Chats and Stories”. The subcorpus consists of three communication
sessions where participants retell and discuss the “Pear stories” film, hence
each session contains both monologue and dialogue discourse parts. The
overall rates of self-repairs and the distribution of their particular types
were compared in monologues and dialogues. The results show that while,
overall, speakers tend to repair more often in conversational than in retelling parts, particular types of repairs are distributed differently, e.g. (a) repetitions and restarts have higher rates in conversational parts, while corrections appear more often in retellings; (b) in retellings, reparandum and
reparans appear more often within the same discourse unit, while in conversational parts, they tend to appear in separate discourse units.
In this paper we present an unsupervised and resource-independent approach to the well-known task of discovery of multiword expressions (MWE)
in text corpora. We experimented on extracting Russian nominal phrases
(Adj-N and N-N.Gen) relevant for lexical resources (thesauri, WordNet,
etc.). Our approach is based on the assumption that idiosyncrasy of MWEs
can be due to different properties (morphosyntactic, semantic, pragmatic
and statistical), and thus, different types of measures (statistical, context,
distributional) are efficient at extracting different MWEs. We propose new
context measures as well as an unsupervised method of combining measures in which we cluster vectors of ranks assigned by individual measures.
The proposed method accounts for different properties of MWEs and
outperforms both individual measures and their simple sum or product.
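
The idea of clustering rank vectors can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy candidate phrases and scores, the measure names `pmi` and `context`, the use of a two-cluster k-means, and the rule that the cluster with the lowest average rank is treated as the MWE cluster.

```python
from statistics import mean


def rank_vectors(scores):
    """Turn {measure: {candidate: score}} into {candidate: [normalized ranks]}.

    Assumption: a higher score means "more MWE-like"; rank 0.0 is the best.
    """
    candidates = sorted(next(iter(scores.values())))
    vectors = {cand: [] for cand in candidates}
    for measure in sorted(scores):
        ordered = sorted(candidates, key=lambda c: -scores[measure][c])
        denom = max(len(ordered) - 1, 1)
        for position, cand in enumerate(ordered):
            vectors[cand].append(position / denom)
    return vectors


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def two_means(vectors, iterations=20):
    """Plain 2-cluster k-means with a deterministic start (hypothetical choice)."""
    names = sorted(vectors, key=lambda c: mean(vectors[c]))
    # Start from the best-ranked and the worst-ranked candidate.
    centroids = [list(vectors[names[0]]), list(vectors[names[-1]])]
    assignment = {}
    for _ in range(iterations):
        assignment = {
            cand: min((0, 1), key=lambda i: sq_dist(vec, centroids[i]))
            for cand, vec in vectors.items()
        }
        for i in (0, 1):
            members = [vectors[c] for c, a in assignment.items() if a == i]
            if members:
                centroids[i] = [mean(column) for column in zip(*members)]
    return assignment, centroids


def extract_mwes(scores):
    """Return candidates in the cluster whose centroid has the lowest mean rank."""
    vectors = rank_vectors(scores)
    assignment, centroids = two_means(vectors)
    best = min((0, 1), key=lambda i: mean(centroids[i]))
    return sorted(c for c, a in assignment.items() if a == best)


# Toy scores (invented for illustration only).
toy_scores = {
    "pmi": {"zheleznaya doroga": 9.1, "belyi medved": 7.4,
            "novyi dom": 1.2, "bolshoi stol": 0.8},
    "context": {"zheleznaya doroga": 0.7, "belyi medved": 0.9,
                "novyi dom": 0.2, "bolshoi stol": 0.3},
}
print(extract_mwes(toy_scores))
```

Working on ranks rather than raw scores keeps measures with very different scales comparable, which is one plausible reason to cluster rank vectors instead of summing or multiplying the scores directly.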
This article launches a series of studies in which popular word2vec vector
models are treated not as a component of an NLP application architecture
but as an independent object of linguistic research. Viewing vector models
as a surrogate for corpus contexts allows the linguist to uncover new
information about the distribution of individual semantic groups of vocabulary
and about the corpus from which these models are derived. In particular,
it is shown that layers of English and Russian vocabulary such as names of
professions, nationalities, toponyms, personal qualities, and time periods
are the least affected by a change of model and retain their position
relative to their neighbour words; that is, they have the most stable
contexts regardless of the corpus. It is also shown that the vocabulary from
the Swadesh list is statistically more resistant to a change of model than
frequency vocabulary is, and which word2vec models for Russian best
preserve ontological structures in the vocabulary.
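
One simple way to quantify how well a word "retains its position relative to its neighbour words" across models is to compare its nearest-neighbour sets. The sketch below is an illustration under stated assumptions: the Jaccard overlap of top-k cosine neighbours is a stand-in measure and not necessarily the one used in the article, and the tiny two-dimensional "models" are invented toy data rather than real word2vec output.

```python
from math import sqrt


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def neighbours(model, word, k):
    """Top-k cosine neighbours of `word`; `model` maps words to vectors."""
    others = [w for w in model if w != word]
    # Ties are broken alphabetically so the result is deterministic.
    others.sort(key=lambda w: (-cosine(model[word], model[w]), w))
    return set(others[:k])


def neighbour_stability(model_a, model_b, word, k=2):
    """Jaccard overlap of the word's neighbour sets in two models (assumed measure)."""
    near_a = neighbours(model_a, word, k)
    near_b = neighbours(model_b, word, k)
    return len(near_a & near_b) / len(near_a | near_b)


# Two toy "models" trained on hypothetical different corpora.
model_a = {"doctor": [1.0, 0.1], "nurse": [0.9, 0.2], "teacher": [0.8, 0.4],
           "river": [0.0, 1.0], "lake": [0.1, 0.9]}
model_b = {"doctor": [0.1, 1.0], "nurse": [0.2, 0.9], "teacher": [0.3, 0.8],
           "river": [1.0, 0.0], "lake": [0.9, 0.1]}

print(neighbour_stability(model_a, model_b, "doctor"))
```

Because only neighbour sets are compared, the two models do not need to share a coordinate system, which matters when they are trained independently on different corpora.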
The paper discusses the problem of rendering Church Slavonic text in the
modern Russian script, which is a common practice at present. The relevant
procedure would include the following stages: spelling out words with titla, replacing the letter-based denotation of numerical values with Arabic numerals,
replacing characters that are absent from the Russian alphabet with characters
with the same phonetic value, removing breathings, replacing different accent marks with a unified stress accent. Certain semantic and grammatical information will be lost in the resulting text while the sound will be kept. In other words,
the resulting text may be regarded as a practical transcription of the original
text. At the next stage, the procedure should aim at replacing the original punctuation with the common Russian punctuation (within certain limits) and at the
capitalization of certain words (the latter task might require a system of determining co-reference links). The need for a system of automatic punctuation
(when the input is a written text) and a system of automatic resolution of referential ambiguity poses challenges to computational linguistics.
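
One of the stages listed above, replacing letter-based numerals with Arabic numerals, can be sketched as follows. This is a simplified illustration, not the procedure from the paper: the function name is hypothetical, the table uses one common letter form per value and omits variant forms, and the sketch assumes the token has already been identified as a numeral (the titlo is simply skipped, so it does not detect numerals by itself).

```python
TITLO = "\u0483"      # combining titlo written over numerals and abbreviations
THOUSANDS = "\u0482"  # thousands sign: multiplies the following letter by 1000

# Simplified table of Cyrillic numeral values (variant letter forms omitted).
VALUES = {
    "а": 1, "в": 2, "г": 3, "д": 4, "\u0454": 5, "\u0455": 6,
    "з": 7, "и": 8, "\u0473": 9,
    "\u0456": 10, "к": 20, "л": 30, "м": 40, "н": 50, "\u046f": 60,
    "о": 70, "п": 80, "ч": 90,
    "р": 100, "с": 200, "т": 300, "у": 400, "ф": 500, "х": 600,
    "\u0471": 700, "\u0461": 800, "ц": 900,
}


def cyrillic_numeral_to_int(token):
    """Convert a letter-based numeral to an integer (hypothetical helper).

    The system is additive, so letter order does not matter; this also
    covers the teens, where the unit is written before the ten.
    """
    total = 0
    multiplier = 1
    for char in token.lower():
        if char == TITLO:
            continue
        if char == THOUSANDS:
            multiplier = 1000
            continue
        if char not in VALUES:
            raise ValueError(f"not a numeral letter: {char!r}")
        total += VALUES[char] * multiplier
        multiplier = 1
    return total


print(cyrillic_numeral_to_int("рк" + TITLO + "г"))
```

In a full pipeline this step would have to run before titla are expanded in ordinary words, since the same mark signals both numerals and abbreviations.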
The 2019 Shared Task on Automatic Gapping Resolution for Russian (AGRR-2019) aims to tackle a non-trivial linguistic phenomenon, gapping, which occurs in coordinated structures and elides a repeated predicate, typically
from the second clause.
In this paper we define the task and evaluation metrics, provide detailed
information on data preparation, annotation schemes and methodology,
analyze the results and describe different approaches of the participating
Selecting key information from large amounts of text data is becoming
increasingly relevant. This article proposes a deep neural network model
with a phrase-based attention mechanism for the automatic generation
of news headlines. The proposed architecture achieves a new
state-of-the-art result on the RIA news dataset.
In this paper we describe rule-based and neural approaches to the gapping
resolution task for Russian. Our study was conducted on the material
of the AGRR-2019 Shared Task. We demonstrate that the neural model clearly
outperforms the rule-based one even when only 2000 annotated sentences
are available. The rule-based model took 6th place in the AGRR-2019 competition (2nd in terms of precision), while the neural one was better than the
In this paper we study morphological parsing and lemmatization on Evenk
and Selkup data. We compare basic neural models with
their extensions that attempt to utilize additional linguistic information from
the training data. We show that the augmented model does not improve over
the baseline and even decreases performance on the lemmatization task.
We hypothesize that, to be helpful, additional information should be extracted
from external resources, if available, rather than from the corpus itself.
The study focuses on detecting depression by processing and classifying
short essays written by 316 volunteers. A set of 93 essays was
provided by two different teams of psychologists, who asked patients with
clinically confirmed depression to write short essays on a neutral topic.
The other 223 essays on the same topic were written by volunteers who
completed questionnaires designed to reveal depression status
and showed no signs of mental illness. The study describes
psycholinguistic and classic text features that were calculated using
natural language processing tools and used for the classification task. The machine learning classification models achieved an F1-score of up to 73%
on the task of identifying essays written by people with depression.
Headline generation is a task that has a good solution based on seq2seq
models with an attention mechanism. However, it is still quite challenging
to deal with morphologically rich languages, such as Russian, which have
many word forms and therefore larger vocabularies. To deal with complex
dependencies arising in such languages we propose several approaches
based on using stems and grammemes. We applied these approaches
to the pointer-generator network and took second place in the competition
on headline generation held by the conference Dialogue-2019.
The paper deals with some formal features of the completive prefix do-
(‘to finish, to complete’). It was claimed in previous studies that this prefix,
along with some others, has a range of formal properties that differ both
from the formal properties of productive “superlexical” prefixes (such as the
cumulative na- and the distributive po-) and from those of “lexical” (highly
integrated) ones. Two important features were mentioned among others. 1) It
can attach both to the perfective stem and to the imperfective one. 2) It
cannot attach to secondary imperfectives. In the paper, I verify and develop
these claims using corpus data. 1) I propose rules for the choice between the
perfective and the imperfective stem and describe the pool of variation.
2) I show that, contrary to expectations, in informal speech do- attaches to secondary imperfectives quite easily.
Following the recent success of neural language models in various downstream
language understanding tasks, including common sense reasoning, we investigate
the possible utility of such models in a domain-specific reasoning task:
proposing a preliminary diagnosis based on patient complaints presented
as natural language text. We demonstrate that a language model trained on
texts collected from online medical forums achieves considerable accuracy in
this task (73% within the top 10 suggestions) when evaluated on a dataset
constructed from clinical case reports published in specialized medical
journals. While preliminary, these findings indicate a possible new method
that can be used to augment online symptom checkers and clinical decision
support systems.
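
The "accuracy at top 10 suggestions" figure is a top-k accuracy: a case counts as correct if the reference diagnosis appears among the first k ranked suggestions. A minimal sketch of that metric follows; the function name and the toy cases are invented for illustration and do not come from the paper's data.

```python
def top_k_accuracy(ranked_suggestions, gold_diagnoses, k=10):
    """Share of cases whose gold diagnosis is among the first k suggestions."""
    if len(ranked_suggestions) != len(gold_diagnoses):
        raise ValueError("one ranked list is needed per gold diagnosis")
    if not gold_diagnoses:
        return 0.0
    hits = sum(
        gold in suggestions[:k]
        for suggestions, gold in zip(ranked_suggestions, gold_diagnoses)
    )
    return hits / len(gold_diagnoses)


# Toy example: three cases with model suggestions ordered by score.
suggestions = [
    ["flu", "cold", "angina"],
    ["migraine", "tension headache"],
    ["gastritis"],
]
gold = ["cold", "sinusitis", "gastritis"]
print(top_k_accuracy(suggestions, gold, k=2))
```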
In this paper we study approaches to assessing the quality of student theses
in pedagogy. We consider a specific subtask of thesis scoring: estimating
a thesis's adherence to its theme. A special document (the theme header)
comprising the theme, aim, object, and tasks of the thesis is formed. Theme
adherence is calculated as the similarity between the theme header
and thesis segments. For evaluation, we order theses by increasing value
of the calculated theme adherence and compare the ordering with expert
grades using the average precision measure. The best configuration for thesis ranking is based on the weighted average of word embeddings
(word2vec) and keywords extracted from the theme header.
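
The scoring pipeline described above can be sketched in a few lines. This is an illustration under several assumptions that are not stated in the abstract: segment similarities are aggregated by a plain mean, keyword weights are supplied as a dictionary with a default weight of 1.0, expert grades are reduced to a binary relevant/not-relevant flag for average precision, and the two-dimensional embeddings are toy values rather than real word2vec vectors.

```python
from math import sqrt


def weighted_average(tokens, embeddings, weights):
    """Weighted mean of the embeddings of known tokens (zeros if none known)."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token in embeddings:
            w = weights.get(token, 1.0)  # assumed default weight
            total = [t + w * x for t, x in zip(total, embeddings[token])]
            weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def theme_adherence(header_tokens, segments, embeddings, weights):
    """Mean cosine similarity between the theme header and each segment."""
    header_vec = weighted_average(header_tokens, embeddings, weights)
    sims = [cosine(header_vec, weighted_average(seg, embeddings, weights))
            for seg in segments]
    return sum(sims) / len(sims) if sims else 0.0


def average_precision(ranked_relevance):
    """Average precision over a ranked list of booleans (True = relevant)."""
    hits, precisions = 0, []
    for position, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy data (invented): keywords from the theme header get a higher weight.
toy_embeddings = {"school": [1.0, 0.0], "pupil": [0.9, 0.1],
                  "lesson": [0.8, 0.2], "bank": [0.0, 1.0], "loan": [0.1, 0.9]}
toy_weights = {"school": 2.0}
header = ["school", "pupil"]
on_topic = [["lesson", "pupil"], ["school"]]
off_topic = [["bank", "loan"]]

print(theme_adherence(header, on_topic, toy_embeddings, toy_weights))
print(theme_adherence(header, off_topic, toy_embeddings, toy_weights))
```

To mirror the evaluation described above, theses would be sorted by their adherence score and the resulting list of expert judgements passed to `average_precision`.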
The competition between Russian personal possessive and reflexive possessive
pronouns in bound uses (as in Я1 встретился с моими1 / со своими1 друзьями
‘I met with my friends’) has been studied for a long time, but not all
aspects of this phenomenon have been examined with quantitative methods
or described within a particular theory of syntax and semantics. The paper
investigates the behavior of pronominal possessors in direct-object noun
phrases bound by a 1st or 2nd person pronoun in subject position. The focus
is on how the choice of pronoun relates to the possibility or necessity
of a collective interpretation of the verb phrase and of the possession relation.
Using data from the Russian National Corpus and the Araneum Russicum
Maximum corpus, we show that the choice of strategy for expressing the
possessor is related to the number of the subject (both across the corpus
as a whole and for individual verbs). A questionnaire survey establishes
that the preference for the personal possessive pronoun is associated with
the collective reading; moreover, in the absence of such a reading, any
expression of the possessor is difficult with a singular object (unless the
head lexeme of the NP is a singulare tantum).
We propose an interpretation of the data based on the idea that the
possessive pronoun carries an interpretable feature of possessor number
(for example, наш ‘our’ denotes collective possession by a set of possessors),
while an object NP without a possessor can be reanalyzed as part of the predicate.
The paper is devoted to a corpus study of the Contrast relation between
discourse units in Russian. It is based on data from the Ru-RSTreebank,
annotated within the framework of Rhetorical Structure Theory [Mann,
Thompson 1988]. The research question is which cue phrases and lexical
and grammatical patterns are used to express the Contrast relation
as opposed to the Comparison relation. Since simple connectives such
as the conjunctions a and no ‘but’ are ambiguous, it may be useful
to single out specific cues for the Contrast relation and to find other linguistic features that help to differentiate Contrast from other relations,
such as Comparison. The investigation of cues signalling different types
of relations is important both for automatic discourse mining and
for theoretical research on text coherence. We test several hypotheses
presented in the reference literature on Russian against corpus data.
We describe a model for a robot that learns about the world and her companions through natural language communication. The model supports
open-domain learning, where the robot has a drive to learn about new concepts, new friends, and new properties of friends and concept instances.
The robot tries to fill gaps, resolve uncertainties and resolve conflicts. The
absorbed knowledge consists of everything people tell her, the situations
and objects she perceives and whatever she finds on the web. The results
of her interactions and perceptions are kept in an RDF triple store to enable
reasoning over her knowledge and experiences. The robot uses a theory
of mind to keep track of who said what, when and where. Accumulating
knowledge results in complex states to which the robot needs to respond.
In this paper, we look into two specific aspects of such complex knowledge states: 1) reflecting on the status of the knowledge acquired through
a new notion of thoughts and 2) defining the context during which knowledge is acquired. Thoughts form the basis for drives on which the robot
communicates. We capture episodic contexts to keep instances of objects
apart across different locations, which results in differentiating the acquired
knowledge over specific encounters. Both aspects make the communication more dynamic and result in more initiatives by the robot.
Russian dictionaries of idioms, winged words and quotations do not reflect
“the intertextual competence” of modern Russian speakers: on the one hand,
their vocabularies abound in obsolete, uncommon and even incomprehensible units; on the other hand, they lack some well-known and widely
used catchwords and Internet memes. The article deals with the structure
and principles for constructing a new dictionary, namely, “Intertextual Vocabulary of Modern Russian” (in paper and multimedia versions). The dictionary
will be based on corpus data and include over 1000 well-known catchphrases
from the 20th–21st centuries. The basic unit is a dictionary entry that will include the following parts: lexical input, meaning, source, examples, phraseological model and its transformations, comments; the last two parts are
optional. The arrangement is alphabetical by the first word; however, there
will be user-friendly indexes for locating all the catchphrases from the same
source, same topic, etc. The multimedia version is characterized by quantitative and qualitative increase in content: in addition to text information, the
dictionary will contain audio, video, photo fragments, graphics, animation,
etc. referring to the relevant “multimedia” sources of intertextual units (such
as movies, cartoons, paintings, songs, TV shows, etc.). Using hyperlinks, one
can easily find the required information related to a given entry.
The paper analyzes prosody in Russian yes-no questions with the particle LI.
Three basic patterns of Russian LI-questions, which are construed as
semantically minimal, are singled out. (Semantically minimal sentences are
understood here as those in which the prosodic structure makes a minimal
contribution to the semantic structure of the sentence.) Consequently, the
prosody of sentences involving contrast or discourse continuity is viewed
as derived from the prosody of the basic types.
The illocutionary force in LI-questions is marked not by prosody,
as in other Russian yes-no questions, but by a segmental means, namely
by LI. Hence, prosody in LI-questions is not a cue to illocutionary
force; rather, it shapes the sentence as an autonomous prosodic unit and marks non-illocutionary meanings: contrast and discourse continuity.
The accent on the first accented word can be either rising or falling without
any appreciable difference in meaning.
In questions with the particle LI, LI retains its Wackernagel properties, while the host of the clitic in most cases serves as the
first, or the only, accent-bearer of the sentence. However, in the context
of contrast, the first accent-bearer can be placed to the right of LI.
Under discourse continuity, LI-questions have two accent-bearers:
the first can be either rising or falling and, at the same time, either contrastive or non-contrastive, while the second one is always the
The prosodic patterns of LI-questions are exemplified here by spoken
fragments taken from the Multimodal Corpus of the Russian National Corpus and from a small working collection of Russian speech recordings
compiled specifically for this investigation. The Praat software was
used to analyze the audio data.
The paper demonstrates that Russian has a discourse word что-то that can
express a certain range of speaker attitudes toward some circumstance
(observed by the speaker) that deviates from the norm. Specifically, discourse
что-то can mark: the speaker's wish to draw the hearer's attention to the
reported fact without being specifically interested in its cause (cf. Что-то
я на склоне лет стал сентиментален ‘I seem to have grown sentimental in my
old age’); the speaker's wish to express disapproval (cf. Что-то она слишком
вырядилась сегодня ‘She is rather overdressed today’) or simply to report
something negative (cf. Что-то сегодня пасмурно ‘It is rather overcast
today’, but ??Что-то сегодня светит солнце ‘The sun is rather shining
today’); the wish to express anxiety or suspicion (cf. Что-то в детской
слишком тихо ‘It is somehow too quiet in the nursery’); the wish to soften
the categorical tone of a statement that is negative or potentially offensive
to the interlocutor, in particular to soften the bluntness of a refusal (cf.
the exchange Давай чай пить! / Что-то не хочется ‘Let's have some tea! /
I don't really feel like it’), etc. It is shown that the meaning of что-то
singled out in dictionaries, ‘for some unclear reason’, arises only under
certain contextual conditions. We identify the conditions under which this
meaning arises and its place in the chain of semantic derivation whose
starting point is the meaning of an indefinite object. The study is based
on data from the Russian National Corpus, including its parallel subcorpora.
The paper addresses the corpus grammar of Russian quantifier phrases
(QPs), with a focus on two issues: (i) subject-predicate agreement patterns
in sentences with a QP in the position of the grammatical subject, and
(ii) the choice of the agreeing or non-agreeing form of the adjective in QPs
with an embedded NP headed by a feminine noun. QPs license both the
plural and the singular form of the predicate. I argue that the singular form
optionally shown on the predicate instantiates non-canonical agreement
controlled by the QP and does not pattern with the so-called default
agreement in 3Sg.N. The analysis is based on complete statistics for all
Russian cardinal numerals used in QPs of the type ‘два человека/
пять человек’ in the Russian National Corpus (RNC). I show the correlations
between plural/singular agreement forms, word order (QP–V ~ V–QP) and
the communicative status of the QP. The choice of an agreeing preposed
NP-level adjective, as in dve interesnye knigi, does not constrain the form
of predicate agreement, while agreeing DP-level elements, as in eti dve knigi,
block the singular form on the predicate. Russian subject QPs are non-canonical
arguments, since in two thirds of the corpus data they lack the status
of a theme.
Oriented gestures play a crucial role in solving spatial problems.
We analyze the influence that a robot using oriented gestures has on a human.
In an experimental setting, the robot F-2 helped a human solve a “tangram”
puzzle. The robot indicated in speech which game element to take
and where to place it. In half of the tasks, the robot used oriented
communicative actions (hand gestures, head movements and gaze) to indicate
the required game element and then the position in which to place it. In the
other half of the tasks, the robot used non-oriented gestures. We show that
the use of oriented gestures increases the attractiveness of the robot to the human and raises overall satisfaction with the interaction.
In this paper, we present a dataset for the cross-language (Russian-English)
text alignment subtask of plagiarism detection. We compare different models for
detecting translated plagiarism. The first is based on several textual similarity
scores that exploit word embeddings. The second extends the first
with features obtained via neural machine translation. The third
is built on top of a pre-trained language representation model (BERT)
fine-tuned for our task. The BERT model outperforms the other
models. However, it requires far more computational resources than the simpler
models. Therefore, it seems reasonable to use context-free and
contextual models together in modern plagiarism detection systems.