The paper is a corpus study of pragmatic factors involved in disambiguating sentences with negation and a universal quantifier in written Russian and English, such as Ja ne pozval vseh svoih dal’nih rodstvennikov, ‘I haven’t invited all of my distant relatives.’ The ambiguity results from differences in scope. If negation scopes over the quantifier, we get partial negation: ‘I have invited some, but not all of my distant relatives.’ If negation scopes over the verb, we get total negation: ‘I haven’t invited any of my distant relatives.’ Our study is based on Russian and English data extracted from a variety of corpora.
The paper aims to contribute to a typology of implicatures by analyzing them in news headlines. By implicatures we mean cancellable implicit senses, irrespective of whether they are inherent in lexical meanings or occur in certain contextual conditions. While implicatures are generally difficult to tie to a particular type of lexical environment, our analysis of headlines allows us to take a step in this direction. Headlines often use implicatures instead of assertions to convey information about the content of the article. Causal implicatures are the most frequent type in our sample. We study two types of causal implicatures. The first occurs in sentences with predicates that have a semantic argument of Cause which is syntactically unexpressed in the sentence. If either the noun attribute or the noun itself contains an element of value judgment, it can be interpreted as filling the Cause argument of the predicate: to reward the hero (= ‘to reward a person for heroism’), to punish the criminal (= ‘to punish a person for the crime’). When Cause is expressed in this way, it is an implicature and is cancellable: He rewarded the winner of the sports contest, yet not for the victory, but for volunteer work in a hospice. The second type of causal implicature occurs in utterances with expressions of temporal sequence, such as after: After their quarrel she called it quits (= ‘Because of their quarrel, she decided to break up with him’). While in some languages causal implicatures of temporal prepositions are grammaticalized as new lexical meanings, Russian temporal prepositions do not develop separate causal senses. This makes them an ideal context for causal implicatures, and headlines use posle ‘after’ to imply a causal relationship between the events described in the article without committing the author to a definite statement to this effect. We also consider qualitative and factual implicatures, which occur in certain specific contexts.
Discourse structures provide a way to extract deep semantic information from text, e.g., about relations conveying causal and temporal information and about topical organization, which can be gainfully employed in NLP tasks such as summarization, document classification and sentiment analysis. But the task of automatically learning discourse structures is difficult: the relations that make up the structures are very sparse relative to the number of possible semantic connections that could be made between any two segments within a text; furthermore, the existence of a relation between two segments depends not only on “local” features of the segments, but also on “global” contextual information, including which relations have already been instantiated in the text and where. It is natural to try to leverage the power of deep learning methods to learn the complex representations discourse structures require. However, deep learning methods demand a large amount of labeled data, which becomes prohibitively expensive in the case of expertly annotated discourse corpora. One recent advance in the resolution of this “training data bottleneck”, data programming, allows expert knowledge to be implemented in a weak supervision system for data labeling. In this article, we present the results of our application of the data programming paradigm to the problem of discourse structure learning for multi-party dialogues.
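As a rough illustration of the data programming idea mentioned in this abstract, the sketch below writes expert heuristics as labeling functions that vote on whether two dialogue segments are attached. The three rules, the field names (`first`, `second`, `distance`) and the majority-vote combination are assumptions made for this example, not the authors' system; real data programming frameworks typically replace the majority vote with a learned model of labeling-function accuracies.

```python
from collections import Counter

ABSTAIN = None  # a labeling function may decline to vote

# Hypothetical labeling functions over a candidate pair of dialogue segments.
def lf_question_then_adjacent(c):
    # A question immediately followed by another turn is likely attached to it.
    return 1 if c["first"].endswith("?") and c["distance"] == 1 else ABSTAIN

def lf_long_distance(c):
    # Segments far apart in the dialogue are rarely attached.
    return 0 if c["distance"] > 5 else ABSTAIN

def lf_acknowledgement(c):
    # Short acknowledgements usually attach to a nearby segment.
    starts = c["second"].lower().startswith(("ok", "yes", "no"))
    return 1 if starts and c["distance"] <= 2 else ABSTAIN

LFS = [lf_question_then_adjacent, lf_long_distance, lf_acknowledgement]

def weak_label(candidate, lfs=LFS):
    """Combine labeling-function votes by simple majority (ties abstain)."""
    votes = [v for v in (lf(candidate) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]
```

The weak labels produced this way would then serve as noisy training data for a downstream discriminative model.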
We examine the hypothesis that discourse words characterize a writer’s individual style. The object of study is the set phrase одним словом (‘in a word’), investigated on representative corpora of Dostoevsky, Tolstoy, Saltykov-Shchedrin, Turgenev and Goncharov. The analysis leads to the conclusion that Dostoevsky and Saltykov-Shchedrin differ from their contemporaries both in the frequency with which they use одним словом in a discourse function and in the semantic diversity of the expression. Dostoevsky is of particular interest in this respect: his prose exhibits all the discourse functions of одним словом, namely interpretation (interpretation proper, inference, specification/clarification), new idea, regulative uses (interrupting the discourse, marking difficulty in choosing a nomination, marking a change of nomination, which may be to a basic, alternative or generalizing one), and the introduction of another’s speech, either as direct speech or as free indirect speech. As for non-discourse uses of одним словом, they are distributed more or less evenly across the authors considered.
Sentiment analysis is one of the most popular natural language processing tasks. In this paper we introduce pre-trained Russian language models which are used to extract embeddings (ELMo) in order to improve accuracy in the classification of short conversational texts. The first language model was trained on a Russian Twitter dataset containing 102 million sentences, while the two others were trained on 57.5 million sentences of Russian news and 23.9 million sentences of Russian Wikipedia articles. Although classifiers trained on top of language models perform better than those using fastText embeddings of the same language style, we show that the domain of the language model also has a significant impact on accuracy. This paper establishes state-of-the-art results for the RuSentiment dataset, improving the weighted F1-score from 72.8 to 78.5. All our models are available online, as is the source code, which allows everyone to apply them or fine-tune them on domain-specific data.
This paper reports our participation in the Automatic Gapping Resolution for Russian shared task (AGRR-2019) within Dialogue Evaluation 2019. Our team took first place, ahead of nine other teams, in all subtasks, which include gapping presence-absence classification, gap resolution and full annotation. The phenomenon of gapping is well studied theoretically. However, the problem of automatic gapping resolution is new and there is no baseline for it. We found it possible to reduce this task to sentence classification and token tagging problems and to solve them using recent advances in Natural Language Processing and deep learning. Training large language models with millions of parameters on small data became possible with the development of transfer learning methods. Using pretrained models for computer vision problems is straightforward, and since the BERT language model was released it has become possible to benefit from transfer learning in NLP as well. Our solution is heavily based on BERT, but we found that parsing gapping constructions, which are highly structured, benefits from special postprocessing which includes modeling a gapping construction as a directed graph. Our solution may be considered the first public baseline for the task of automatic gapping resolution that is based on modern NLP practices.
Annotation of Pragmatic Markers in a Russian Speech Corpus: Problems, Searches, Solutions and Results
The paper describes our experience of annotating pragmatic markers (PMs) in two Russian speech corpora: “One Day of Speech” (ORD; dialogues) and “Balanced Annotated Text Library” (SAT; monologues). To prepare for continuous PM annotation, four pilot annotations were carried out on samples from ORD and SAT, which made it possible to compile a final list of PMs: 450 units representing variants of 53 basic structural types. Processing the results of the pilot annotation yielded preliminary data on the frequency of individual pragmatic markers and their types, as well as on how PM usage depends on the speaker’s gender and level of speech competence. As a result of the data processing, we obtained frequency lists both of the PMs themselves and of the functions they perform.
We propose a method to resolve anaphoric pronouns in the framework of Winograd Schema Challenge (WSC) by means of SemETAP—a knowledge-based semantic analyzer. WSC is a modern version of the famous Turing test. Its objective is to check a machine’s ability to exhibit intelligent behavior indistinguishable from that of a human. In contrast to other approaches to WSC, which are based on machine learning, our method uses explicit knowledge. An important advantage of this approach is that it gives an opportunity to provide an explanation of the result understandable for humans. SemETAP interprets the text using both linguistic and extralinguistic (background) knowledge. The former is stored in the grammar and the dictionary of the ETAP-4 system, and the latter is provided by the SemETAP ontology, inference rules and the repository of individuals. We show how this knowledge is used for resolving WSC. At the moment, the performance of the algorithm is not high—54%. This is due to the incompleteness of the background knowledge supplied to the system. It is shown, however, that if the background knowledge is complete and accurate enough, the WSC test is resolved well and it is easily understandable why the system arrived at a particular conclusion.
The paper reports on an experimental comparison of several machine learning models proposed in recent years for automatic morpheme segmentation of Russian words, including conditional random fields (CRF), a sequence-to-sequence neural network (Seq2seq) and a convolutional neural network (CNN), as well as a new model we have developed based on gradient boosted decision trees (GBDT). For completeness, our experiments also evaluate the semi-supervised Morfessor method. All the models compared are briefly described in the paper; some of them only segment words into morphs, while the others also classify the resulting morphs. Since linguistic rules for splitting Russian words into morphs (and for classifying some morphs) may differ, the experiments were performed on two data sets with different labeling, obtained from CrossLexica’s dictionary and Tikhonov’s dictionary respectively. The experimental evaluation has shown that the two best models of morpheme segmentation with classification, GBDT and CNN, are of comparable quality, giving about 86–94% word-level accuracy.
Multilingual Parallel Corpora As a Source for Quantitative Crosslinguistic Grammar Research (The Case Of Voice Constructions)
Multilingual parallel corpora make it possible to apply quantitative methods in cross-linguistic research. Due to the lack of appropriate resources, this has not become a widespread technique among linguists, but studies based on this idea are beginning to emerge. In our work, we focus on the application of logistic regression to the study of passive voice constructions with an overtly expressed agent. The study is conducted on data extracted from a multilingual parallel corpus that was created for this purpose. The issue we find noteworthy about voice alternation is the motivation for choosing passive instead of active, i.e. when a person would say ‘This essay was written by Mary’ instead of ‘Mary wrote this essay’. Relying on theoretical studies, we selected a set of features claimed to be important for this kind of choice and used them to train logistic regression models. As a result, based on the model coefficients we can detect which features appear to be passive triggers.
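To make the idea of reading “passive triggers” off model coefficients concrete, here is a minimal sketch of logistic regression fitted by gradient descent. The two binary features (`patient_is_topic`, `agent_is_pronoun`) and the twelve toy observations are invented for illustration and are not taken from the study; a positive coefficient means the feature raises the odds of the passive in this toy data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Hypothetical toy data: columns are [patient_is_topic, agent_is_pronoun];
# y = 1 means the clause was rendered with a passive construction.
X = [[1, 0], [1, 0], [1, 0], [1, 0], [1, 1], [1, 1],
     [0, 0], [0, 0], [0, 1], [0, 1], [0, 1], [0, 1]]
y = [1, 1, 1, 0, 1, 0,
     1, 0, 1, 0, 0, 0]

weights, intercept = fit_logistic(X, y)
for name, coef in zip(["patient_is_topic", "agent_is_pronoun"], weights):
    direction = "favours passive" if coef > 0 else "favours active"
    print(f"{name}: {coef:+.2f} ({direction})")
```

In practice one would also inspect standard errors or use regularization before interpreting a coefficient as a trigger.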
This article deals with the application of referential markup to a large multimodal resource, “Russian Pear Chats and Stories”, annotated for the vocal, oculomotor, manual and cephalic channels. Despite a large number of works on referential choice, it has never been investigated within the framework of multimodal communication. For this purpose, a special annotation scheme in the ELAN environment is proposed, allowing one to annotate different types of referential units and to track referential expressions (full NPs, pronouns, demonstratives, zeroes, etc.) simultaneously with accompanying verbal and non-verbal units. The analysis of three recordings (total duration 141 minutes), where the new referential annotation was introduced in addition to the existing multimodal markup, reveals a range of understudied peculiarities of referential choice. It was found that the role of the Commentator in the conversation entails a significantly larger number of constructions with a zero subject pronoun, compared to the monologue discourse of the Narrator and the Reteller. The analysis of referential expressions and accompanying pointing gestures agreed with more general data previously obtained on English material and showed that nouns are significantly more often accompanied by a pointing stroke than personal pronouns, while demonstratives occupy an intermediate position between nouns and personal pronouns as units potentially accompanied by a gesture.
This paper addresses the task of automatic genre classification for Russian within the Functional Text Dimensions (FTD) framework. Our aim in this study was to build the optimal FTD classification model to annotate web texts from the GICR corpus. For training data, we used an extended GICR dataset. We used the Support Vector Machine method with a linear kernel for classification and converted the training data to lower case to increase accuracy. During our research we experimented with several classification parameters, such as types of features, C-value and feature filtering, to determine the best option for the classification model of the GICR dataset. The resulting model was able to achieve satisfactory classification accuracy and was used for GICR annotation. We also looked at the most significant features for each FTD in our best performing model and compared them to the most frequent words in which these features occur. Finally, we applied our model to segments of the GICR and looked at the FTD components in these segments.
The paper reports a method for creating a speaker’s prosodic fingerprint based on the global characteristics of pitch movement. A prosodic fingerprint is the distribution of f0 in the low, middle and high ranges together with the distribution of pitch movements from one range into another [Šimko et al. 2017]. This fully automated method can be used to classify recordings and to provide a reference level for more sophisticated analysis of pitch movement and intonation strategies. We evaluate the method by applying it to spontaneous spoken Russian data recorded in different regions. We model the correlation between the fingerprint and sociolinguistic features such as age, gender and region. The results of this analysis allow us to formulate several sociolinguistic hypotheses that can be tested further with a more detailed analytic technique.
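The definition of the fingerprint given in this abstract can be sketched in a few lines. The version below is only an illustration under stated assumptions: the three ranges are taken to be equal-width thirds of the speaker’s own f0 span, the input is a non-empty list of voiced f0 samples in Hz, and a “movement” is any change of range between consecutive samples. The cited method may define ranges and movements differently.

```python
from collections import Counter

def prosodic_fingerprint(f0_values):
    """Return (range distribution, movement distribution) for f0 samples in Hz."""
    lo, hi = min(f0_values), max(f0_values)
    # Assumed range boundaries: equal-width thirds of the speaker's f0 span.
    low_cut = lo + (hi - lo) / 3
    high_cut = lo + 2 * (hi - lo) / 3

    def band(f0):
        if f0 < low_cut:
            return "low"
        if f0 < high_cut:
            return "mid"
        return "high"

    bands = [band(v) for v in f0_values]
    counts = Counter(bands)
    range_dist = {b: counts[b] / len(bands) for b in ("low", "mid", "high")}

    # Count only transitions where the range actually changes.
    moves = Counter((a, b) for a, b in zip(bands, bands[1:]) if a != b)
    total = sum(moves.values())
    move_dist = {pair: c / total for pair, c in moves.items()} if total else {}
    return range_dist, move_dist

# Hypothetical f0 track for one speaker.
ranges, movements = prosodic_fingerprint([100, 105, 110, 115, 180, 220, 150, 120])
```

The two dictionaries together form a fixed-length description of a speaker that can be compared across age, gender or region groups.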
The paper considers the task of automatic discourse parsing of texts in Russian. Discourse parsing is a well-known approach to capturing text semantics across the boundaries of single sentences. Discourse annotation has been found useful for various tasks, including summarization, sentiment analysis and question answering. Recently, the release of the manually annotated Ru-RSTreebank corpus unlocked the possibility of leveraging supervised machine learning techniques to create such parsers for Russian. The corpus provides discourse annotation in a widely adopted formalisation, Rhetorical Structure Theory. In this work, we develop feature sets for rhetorical relation classification in Russian-language texts, investigate the importance of various types of features, and report the results of the first experimental evaluation of machine learning models trained on the Ru-RSTreebank corpus. We consider various machine learning methods, including gradient boosting, neural networks, and ensembling of several models by soft voting.
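Soft voting, mentioned at the end of this abstract, averages the class probabilities predicted by several models and picks the class with the highest mean. The sketch below shows the mechanism on made-up probabilities for three hypothetical classifiers; the relation labels and numbers are illustrative assumptions, not results from the paper.

```python
def soft_vote(prob_rows, weights=None):
    """Average per-class probabilities across models and return (label, averages)."""
    if weights is None:
        weights = [1.0] * len(prob_rows)
    total = sum(weights)
    classes = list(prob_rows[0])
    avg = {
        c: sum(w * probs[c] for w, probs in zip(weights, prob_rows)) / total
        for c in classes
    }
    return max(avg, key=avg.get), avg

# Hypothetical outputs of three models for one pair of discourse units.
model_outputs = [
    {"elaboration": 0.5, "contrast": 0.3, "cause": 0.2},  # e.g. gradient boosting
    {"elaboration": 0.2, "contrast": 0.5, "cause": 0.3},  # e.g. neural network
    {"elaboration": 0.4, "contrast": 0.1, "cause": 0.5},  # e.g. linear model
]
label, averaged = soft_vote(model_outputs)
```

Here each model prefers a different relation, so hard (majority) voting would tie, while soft voting still yields a decision because it uses the full probability distributions.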
This paper introduces a knowledge-based semantic approach towards bridging annotation of Russian texts. Our method simulates human background knowledge by using compact domain descriptions based on an extended version of SUMO ontology and lexical-semantic data from the “Universal Dictionary of Concepts”. Our approach supports a wide and extensible range of bridging relations. The tagger that implements it can build complex bridges with multiple arcs, supports making assumptions and can be adapted to annotate other languages supported by the underlying dictionary of concepts.
Nowadays the majority of tasks in NLP are solved by means of neural network language models. These models have already shown state-of-the-art results in classification, translation, named entity recognition and so on. Pre-trained models are available on the internet, but the domain of a real-life problem may differ from the domain on which the network was trained. In this paper we present an approach to vocabulary expansion for a neural network language model by means of hierarchical clustering. This technique makes it possible to adapt a pre-trained language model to a different domain. In the experimental part, the proposed approach is demonstrated on the specific domain of textual artifacts of the software development process. This field is actively studied these days because of the cost of the process and its impact on the modern world and society.
Tracing Cultural Diachronic Semantic Shifts in Russian Using Word Embeddings: Test Sets and Baselines
The paper introduces manually annotated test sets for the task of tracing diachronic (temporal) semantic shifts in Russian. The two test sets are complementary in that the first one covers comparatively strong semantic changes occurring to nouns and adjectives from pre-Soviet to Soviet times, while the second one covers comparatively subtle socially and culturally determined shifts occurring in years from 2000 to 2014. Additionally, the second test set offers more granular classification of shifts degree, but is limited to only adjectives. The introduction of the test sets allowed us to evaluate several well-established algorithms of semantic shifts detection (posing this as a classification problem), most of which have never been tested on Russian material. All of these algorithms use distributional word embedding models trained on the corresponding in-domain corpora. The resulting scores provide solid comparison baselines for future studies tackling similar tasks. We publish the datasets, code and the trained models in order to facilitate further research in automatically detecting temporal semantic shifts for Russian words, with time periods of different granularities.
News headline generation is an essential problem of text summarization because it is constrained, well-defined, and still hard to solve. Models with a limited vocabulary cannot solve it well, as new named entities appear regularly in the news and these entities often should be in the headline. News articles in morphologically rich languages such as Russian require model modifications due to the large number of possible word forms. This study aims to validate that models able to copy words from the original article perform better than models without such an option. The proposed model achieves a mean ROUGE score of 23 on the provided test dataset, which is 8 points higher than the result of a similar model without a copying mechanism. Moreover, the resulting model performs better than any known model on the new dataset of Russian news.
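For readers unfamiliar with the metric reported in this abstract, the sketch below computes a unigram-overlap ROUGE-1 F-score between a generated headline and a reference headline. It is an illustrative simplification: the abstract does not say which ROUGE variants were averaged, and the whitespace tokenization and lowercasing used here are assumptions. Published scores are usually multiplied by 100.

```python
from collections import Counter

def rouge_1_f(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference string."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it occurs in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical headlines.
score = rouge_1_f("central bank raises key rate", "bank raises key rate again")
```

A copying mechanism helps on this kind of metric because rare named entities in the reference headline can be reproduced verbatim from the article rather than generated from a limited vocabulary.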
The annotation of parallel corpora, as well as the building of supracorpora databases, challenges linguists with the question of how to define a functional equivalent of the linguistic units that serve as the object of a given study. The paper discusses the concept of divergent translation and whether it is theoretically important for the analysis of logical-semantic relations (LSR). It is shown that relations between states of things can be expressed not only by connectives but also by lexical means (referred to as “alternative lexicalizations” in the works of the Penn Discourse Treebank group), by grammatical tools (syntactic constructions and morphological forms), and by punctuation marks. While the latter two are mentioned in grammars, they are usually not taken into account when the alternative ways of tagging LSR are described, nor are they annotated in corpora or databases. The supracorpora database of connectives, built on the basis of the French and Italian parallel subcorpora of the Russian National Corpus, introduces new functional capabilities. It stores a representative array of annotations tagged as “divergent translation” (more than 1,250, i.e. 7.7 per cent of the total number), which allows users to collect various statistical data. With these data, one could establish: (1) which LSR tend to be expressed by alternative means and how often they occur compared to connectives, (2) what these alternative means are, (3) which divergent translations may be used to render a given marker of LSR and how often each of them is used, (4) which alternative markers of LSR are specifically employed to convey one or another relation and which of them are able to express several LSR.
The concluding part of the paper suggests that, for the analysis of divergent equivalents, it is central that one and the same alternative means is used by different translators when translating one and the same textual fragment into one and the same language as well as into several languages, which speaks for its productivity. The further development of multi-language and polyvariant parallel corpora and databases would let us find out to what extent the means conveying LSR differ in various languages.
The paper presents a rule-based system of automated anaphora resolution for Russian. The system is based on the resources of ETAP-4 linguistic processor: the Russian combinatorial dictionary (RCD), the ETAP parser, and the ontology OntoEtap. In this paper, I describe the ordered algorithms for resolution of different pronouns and provide the results of their evaluation.
Adding to the Treasury of Russian Microsyntactic Curiosities: Two Antonymic Syntactic Idioms with Comparatives
The paper continues a series of research studies into the microsyntax of Russian. Two constructions that are sufficiently close to each other in syntactic structure and semantics are considered in detail: these are linguistic units of the type kak možno lučše ≈ ‘in the best way possible’ and kak nel’zja lučše ≈ ‘it can never be better’. In both constructions, the first two elements are determined lexically while the third one is fixed grammatically, since it can be instantiated by (almost) any comparative form. It is demonstrated that the two units possess substantial semantic differences; in particular, the former unit is oriented prospectively (cf. sygraj kak možno lučše ‘play as well as you possibly can’ but hardly ? sygral kak možno lučše ≈ ‘he has played as well as he possibly could’) while the latter unit is, rather, oriented retrospectively (cf. vse složilos’ kak nel’zja lučše ≈ ‘everything turned out in a way that could never be better’ but hardly ? Reši etu zadaču kak nel’zja lučše, čtoby sdat’ ekzamen ≈ ‘solve this problem in a way that could never be better, to pass the exam’). The material under consideration is also used to discuss certain general subtleties of the Russian comparative.
The paper presents a spoken corpus of contact-influenced Russian, which consists of oral spontaneous Russian speech of bilingual speakers of indigenous languages of Northern Siberia and the Russian Far East (Samoyedic, Tungusic, Chukotko-Kamchatkan). The texts included in the corpus were transcribed in ELAN in Standard Russian orthography and provided with a special system of manual annotation of contact-induced features developed for the corpus. The paper focuses mainly on this system of annotation, which is relevant in a wider context of annotating any kind of speech with “deviations” from the standard language variety (bilinguals’, learners’, dialectal speech etc.). The annotation tags are grouped in several separate levels: contact-induced morphological, syntactic, phonetic, lexical features etc. The exact meanings for the annotation tags were proposed on empirical grounds. Transcribed and annotated texts gain morphological annotation and search implementation based on the Tsakorpus platform. The aim of the project is to provide a useful resource for linguistic studies on language contact.
This paper contributes to the research field of multichannel discourse analysis. Multichannel discourse analysis explores numerous channels involved in natural communication, such as verbal structure, prosody, manual gesticulation, head movements, eye gaze, torso postures, etc., and treats them as parts of an integrated process. For the purposes of investigating the way participants interact with one another and the way different communication channels correlate, we introduce the notion of an integrated multichannel annotation created with ELAN software. In particular, we consider three topics: (1) temporal alignment between participants’ speech and manual gesticulation; (2) distribution of participants’ visual attention as they watch their interlocutors talking and gesticulating manually; (3) interrelationship between participants’ torso postures and head movements.
The paper deals with the evolution of one part of a dialectal phonetic system (neutralization of non-high unstressed vowels in different allophones as a function of the stressed vowel’s length and/or quality) over the course of three generations of speakers from one family that moved from a village to Moscow. We discuss some methods of phonetic analysis that can be used to represent the observed sound changes and argue that results obtained from a large volume of data may be less informative than those achieved through thorough analysis of every token. Our results show that the phonetic system starts to change immediately after the family’s resettlement, already in the first generation that moved. The second and third generations display yet more dramatic changes, with only a few markers of the earlier dialectal peculiarities remaining; at the same time, the qualitative dissimilation survives somewhat longer than the quantitative one.
Introspective and Perceptual Labeling of Prosodic Phrasing (A Comparative Analysis On The Material Of R. I. Avanesov Texts Collection)
This paper discusses the problems and results of a comparative analysis of two fundamentally different types of prosodic phrasing labeling realized for some literary Russian texts. The introduction examines the theoretical basis of the study and formulates specific tasks, the solution of which was necessary for comparative analysis and the achievement of the final goal of the study. The first section of the paper describes the experimental material, methods of research and the basic principles of experimental data processing. In the second, central section of the work, a detailed description of the parameters of comparative analysis of introspective labeling and perceptual one is given. The following parameters were taken into account in the comparative analysis: the general distribution of frequency of occurrence of text spaces with different indexes of word boundary strength; their contextual distribution with respective frequency data; relationship of prosodic breaks’ strength with pauses. This section also contains many illustrations that demonstrate the main results of the comparative analysis of the target prosodic labeling of the experimental text material. Section 3 analyzes the relationship between the prosodic breaks’ strength and pauses’ duration in both types of labeling analyzed. In conclusion results of the study are summarized and promising areas for further research on the relevant topics are noted.
In the present work, we consider the possibility of multilingual to monolingual transfer. We use Russian as a target language for transfer. We show that it is possible to train the monolingual model using multilingual initialization. To show this, we evaluated the multilingual model on a number of common NLP tasks from the target language. The model trained in a monolingual setting achieves substantially better performance compared to the multilingual model.
The paper introduces the opposition “level of the situation” vs. “level of the story”. Within this opposition, features of the verbs denoting non-fully controlled situations are considered (to succeed vs. to happen): government (infinitive vs. clause), combinability with negation and propositional pronouns. Propositional pronouns tak (‘so’) and eto (‘it’) and the matrix verbs which they are combined with, imply a different conceptualization of the antecedent situation: My proigrali. Tak poluchilos’ (‘We lost. So it turned out’) vs. My hoteli pobedit’, i nam eto udalos’ (‘We wanted to win, and we succeeded’). Tak is semantically related to the mode of action and in other meanings implies a variable factor or aspect.
This paper presents the first results of a comparative corpus-based study of modern Russian language textbooks for primary school children. We report volume and diversity statistics for the textbooks’ vocabulary, together with the results of analyzing the vocabulary by frequency and thematic groups.
Coreference Resolution (CR) is one of the most difficult tasks in Natural Language Processing because it requires a deep and comprehensive understanding of the meaning of a mention in both its sentence-level and its document-level context. To the best of our knowledge, previously proposed models usually address the coreference resolution task in two steps: 1) detect all possible mention candidates, 2) score and cluster them into chains. We instead propose a new approach which recasts coreference resolution as the task of learning sentence-level coreferential relations. Additionally, by leveraging state-of-the-art language representation models such as BERT and ELMo, we were able to achieve cutting-edge results on Russian datasets.
The development of corpus linguistics quite often makes it necessary to revisit items that were studied and comprehensively described in the “pre-corpus” epoch. As a result we obtain a fuller or even radically different picture of their functioning. This is especially true of linguistic units whose unusual combinability is motivated in a complex way by their semantics, such as the Russian particle -ka. It is the study of a large array of linguistic data that makes it possible to notice relatively rare but regularly arising types of combinations that reveal the semantic potential of this particle. In the present work, we used the Russian National Corpus as well as Yandex search, which allowed us to assess whether a given type of combination is relevant for present-day spoken language. The study of corpus data not only contributes to our understanding of the properties of linguistic units (in this case, the distribution of a particle) but also makes it possible to observe the linguistic mechanisms involved in relaxing co-occurrence restrictions. Thus, the analysis of the corpus material allowed us to find two fairly common but very nontrivial types of combinations of -ка with non-imperative expressions: лучше-ка and знаешь-ка/знаете-ка. As we show, their occurrence is due to the effect of completely different linguistic mechanisms.
Russian has an impressive set of psych-verbs with the general meaning of causing extreme irritation and exhausting one’s patience, which we will henceforth refer to as EXASPERATE-verbs: достать, задолбать, заколебать, замучить, бесить, etc. With these predicates, the experiencer is in the accusative, and the non-salient, inanimate or abstract causer of irritation can be expressed by a noun phrase in the nominative, or by an infinitival clause, e.g., Меня достало это выражение/разбирать эти выражения. In addition, these verbs participate in another causative construction, with a salient, agentive causer expressed by a noun phrase in the nominative case, and the manner in which irritation is brought about expressed by an instrumental phrase, with or without a preposition: Ты меня достал (с) этими выражениями. In modern spoken Russian, we also find a new agentive causative construction (NACC): Ты меня достал ныть! ‘You drive me up the wall with your whining.’ The NACC is colloquial and is largely used by younger speakers. Among the verbs that participate in the NACC are vulgar lexical items, which further adds to its colloquial nature. (The use of vulgar expressions to vent frustration is attested cross-linguistically, so Russian is not exceptional in that regard.) We provide a detailed analysis of the syntax of the NACC and argue that it instantiates obligatory adjunct control by the subject. We hypothesize that the rise of the NACC is driven by analogy with the existing constructions with EXASPERATE-verbs in standard Russian, and we address several other factors that contribute to the development of the new construction.
Thesauri are among the most widely used resources in natural language processing. At the same time, many of them are built manually, which takes a lot of time and, because of human error, can affect their quality and completeness. We propose a procedure for automatically positioning vocabulary in the ABBYY Compreno thesaurus using large monolingual corpora, a regular bilingual dictionary and a subset of already positioned words.
Analysis of Prosodic Features of Emotional Intonation Using the “IntonTrainer” System (on the Example of Russian Phrases)
The paper describes the main results of updating the IntonTrainer system for analyzing and studying the prosodic features of emotional intonation. A distinctive functional feature of the updated system is an expanded set of prosodic features of emotional intonation. The paper presents preliminary assessments of their effectiveness on a purpose-built experimental database of emotional phrases in Russian speech.
The paper discusses efforts to create a morphological standard for the Middle Russian corpus, which is part of the historical collection of the Russian National Corpus (RNC). To meet the needs of different categories of corpus researchers as well as NLP developers, we consider two styles of morphological annotation (the RNC schema and the Universal Dependencies schema). A number of specifications of the feature list are proposed to facilitate data reusability, linking and conversion.
The paper addresses the issue of intralingual variation in Tatar postpositional phrases. The nominal in Tatar postpositional phrases demonstrates differential case marking: the choice between genitive and unmarked case form is determined by the morphosyntactic class of the nominal. With postpositions derived from nouns with locative or abstract semantics variation in case assignment is accompanied by presence/absence of the ezafe marker on the postposition. In this paper we use corpus-based and experimental methods to investigate the distribution of grammatical variants and estimate the current status of the variation. We argue that the existing grammatical descriptions do not capture the current state of affairs. We show that pronouns and nouns do not form a homogeneous class with respect to case marking in the postpositional phrase. The genitive case marking is common for 1st / 2nd person personal pronouns and 3rd person singular personal pronoun. All other pronouns and nouns are primarily used in an unmarked form, an observation supported by both corpus and experimental data. We argue that the grammaticalization of denominal postpositions is not complete. In both corpus and experimental studies, we observe a wide range of features that unite postpositional phrases with nominal embedding ezafe constructions. First, genitive case marking for the complement is acceptable for non-personal pronouns and nouns. Second, the absence of the ezafe marker is acceptable only with 1st / 2nd person personal pronouns and partially with 1st / 2nd person reflexive pronouns. Third, the case marking of the nominal and the choice of the ezafe marker for the postposition are interrelated. When the complement is genitive, speakers prefer the agreeing form of the postposition. When the complement is unmarked, the postposition shows no agreement with the possessor. This contrast reflects the opposition between ezafe-3 and ezafe-2 constructions, respectively. 
Interestingly, the denominal postpositions demonstrate different degrees of grammaticalization. For instance, the postposition turɩnda ‘about’ is mostly used with a possessive affix that shows no agreement. We suppose that the form with the non-agreeing ezafe affix is reanalyzed by speakers as uninflected. Another crucial observation concerns the reflexive pronoun üz. In both experiments, 1st / 2nd person reflexive pronouns show syntactic behavior similar to that of personal pronouns, while the 3rd person singular reflexive pronoun patterns with interrogative pronouns. As a result of the study, we compare different methodologies for investigating intralingual variation. We suggest that a combination of different sources of data, both corpus-based and experimental, provides a fuller description of cases of intralingual variation than a single method. The experimental methods that we used differ in their sensitivity to various aspects of language phenomena: elicited production is better at distinguishing deviation from the grammatical pattern, while acceptability judgements show to what extent a grammatical innovation is used. Remarkably, comparing the different sources of data allows us to determine the direction of language change and estimate the current status of the variation.
The paper analyzes derivative meanings of the Russian indefinite adverb kak-to, which are insufficiently described in existing grammars and dictionaries. Besides its primary meaning of indefinite manner, cf. grabitel’ kak-to pronik v dom ‘the burglar somehow got into the house’, kak-to has two derivative meanings. 1) It can refer to an indefinite moment in time, cf. on kak-to mne rasskazal etu istoriju ‘he told me this story once’; 2) it can function as a discourse marker of ‘general indefiniteness,’ which has two varieties: a) kak-to can point to an underspecified aspect of a situation, ‘in some respect/in some measure/kind of’ (ona kak-to stranno posmotrela na menja, on kak-to smutilsja, on kak-to po-bratski obnjal menja ‘she gave me an odd glance, he felt somewhat confused, he hugged me in a kind of brotherly way’); b) it can accentuate the idea of uncontrollability of a situation (‘it happened so’): ja kak-to upustil iz vidu ‘I somehow overlooked it’. Using data from the RNC, we have identified contexts correlating with each of the meanings of kak-to. We have also demonstrated that its use as a discourse marker is much more frequent than its occurrences as an adverb of manner proper. We used data from Russian-English and English-Russian parallel subcorpora to demonstrate that in many instances translators from Russian leave discourse kak-to untranslated, and, vice versa, translators into Russian frequently insert kak-to without a specific stimulus for it in the original English text. We conclude that the usage of kak-to is regulated by a highly language-specific discourse strategy in Russian.
The paper examines the grammatical and semantic features of the word èto when it precedes or follows a wh-word (cf. Gde èto ty byl?). In this context, èto is usually considered to be a particle, the only (and not clear-cut) exception being questions with the wh-words kto and čto. However, the data presented below suggest that as many as four different types of èto used in interrogative contexts have to be distinguished. It is demonstrated that these types differ in their meaning, their syntactic distribution, and their position within the “pronoun-particle” continuum.
The article examines how deictic gestures with an active index finger are executed in Russian body language and focuses on the role of the tension of the index finger (slightly curved vs. extended). Using data retrieved from the Russian Multimedia Corpus, we discover a dependency between the tension of the index finger and the tension of the arm that is engaged in executing the deictic gesture. We also reveal correlations between the tension of the index finger and (a) primary / secondary reference to the pointed object, (b) the closest and the farthest distance between the speaker and the pointed object. We examine the difference in meaning and usage between deictic gestures with a slightly curved and with an extended index finger. We argue that the choice between these types of pointing may be influenced by both physical and pragmatic factors.
We hypothesize that deception in a text should be visible from its discourse structure. The problem of deception detection is then formulated as the classification of the text's discourse tree, built according to Rhetorical Structure Theory. This discourse tree (DT) is extended with speech act expressions attached as labels to the edges. We employ what we call an ultimate deception dataset: a set of customer complaints in English that includes descriptions of problems customers experienced with certain businesses. It contains about 2,400 complaints about banks and provides clear ground truth, based on available factual knowledge in the financial domain. The complaints are written by non-professional writers. We conduct experiments to explore the correlation between implicit cues in the rhetorical structure of texts and how truthful or deceptive these texts are. The results show that deception in text can be detected reliably enough for industrial applications. Automated detection of text with misrepresentations, such as fake reviews, is an important task for online reputation management.
Prosody and Grammar of Predicate Coordination: Constructions with the Conjunction "i" Based on Data from a Prosodically Annotated Corpus
The paper focuses on Russian constructions with clauses or VPs combined by means of the conjunction I ‘and’. Prosodically, the construction may occur in two forms: (a) integrated, i.e. as a single illocution with the first clause pronounced with a rising pitch that projects discourse continuation, and (b) disintegrated, i.e. as two separate illocutions with the first clause pronounced with a falling pitch that projects no continuation. Based on data from the Prosodically Annotated Corpus of Spoken Russian, the prosody and grammar of coordinate constructions with the conjunction I ‘and’ were analyzed qualitatively and quantitatively. The results show that coordinated clauses and VPs are more frequent than coordinated NPs and other types of groups; in spoken narratives, coordinated clauses are more frequent than VPs, while in written narratives, coordinated VPs are more frequent than clauses; coordinated clauses and VPs occur more often as prosodically integrated than as prosodically disintegrated; the rate of integrated constructions is higher in coordinated VPs than in coordinated clauses.
Speaker Self-Repairs in Russian Monologic and Dialogic Discourse: A Corpus Study
Self-initiated and other-initiated self-repairs (N=632) were investigated in a subcorpus (1 h 14 min) extracted from the multichannel corpus “Russian Pear Chats and Stories”. The subcorpus consists of three communication sessions where participants retell and discuss the “Pear stories” film, hence each session contains both monologue and dialogue discourse parts. The overall rates of self-repairs and the distribution of their particular types were compared in monologues and dialogues. The results show that while, overall, speakers tend to repair more often in conversational than in retelling parts, particular types of repairs are distributed differently, e.g. (a) repetitions and restarts have higher rates in conversational parts, while corrections appear more often in retellings; (b) in retellings, reparandum and reparans appear more often within the same discourse unit, while in conversational parts, they tend to appear in separate discourse units.
In this paper we present an unsupervised and resource-independent approach to the well-known task of discovering multiword expressions (MWEs) in text corpora. We experimented with extracting Russian nominal phrases (Adj-N and N-N.Gen) relevant for lexical resources (thesauri, WordNet, etc.). Our approach is based on the assumption that the idiosyncrasy of MWEs can be due to different properties (morphosyntactic, semantic, pragmatic and statistical), and thus different types of measures (statistical, context, distributional) are efficient at extracting different MWEs. We propose new context measures as well as an unsupervised method of combining measures in which we cluster vectors of ranks assigned by individual measures. The proposed method accounts for different properties of MWEs and outperforms both the individual measures and their simple sum/product.
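The notion of a "vector of ranks assigned by individual measures" can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the measure names, the toy scores and the candidate phrases are hypothetical, and the code only builds the rank vectors and the simple rank-sum baseline mentioned in the abstract; the clustering step of the proposed method is not reproduced here.

```python
from typing import Dict, List

def rank_vectors(scores: Dict[str, Dict[str, float]]) -> Dict[str, List[int]]:
    """Map each candidate to its ranks (1 = best) under every measure.

    `scores[measure][candidate]` is a score where higher means "more MWE-like".
    Measures are processed in alphabetical order so vector positions are stable.
    """
    measures = sorted(scores)
    candidates = sorted(next(iter(scores.values())))
    vectors: Dict[str, List[int]] = {c: [] for c in candidates}
    for measure in measures:
        # Sort candidates from the highest score to the lowest for this measure.
        ordered = sorted(candidates, key=lambda c: -scores[measure][c])
        for rank, candidate in enumerate(ordered, start=1):
            vectors[candidate].append(rank)
    return vectors

def rank_sum_order(vectors: Dict[str, List[int]]) -> List[str]:
    """Baseline combination: order candidates by the sum of their ranks."""
    return sorted(vectors, key=lambda c: (sum(vectors[c]), c))

# Hypothetical scores from two hypothetical measures.
toy_scores = {
    "pmi": {"zheleznaja doroga": 9.1, "detskij sad": 7.5, "novyj dom": 1.2},
    "context": {"zheleznaja doroga": 0.8, "detskij sad": 0.9, "novyj dom": 0.1},
}
toy_vectors = rank_vectors(toy_scores)
toy_order = rank_sum_order(toy_vectors)
```

Each candidate ends up as a point whose coordinates are its ranks, which is the kind of object an unsupervised clustering step could then group.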
This article launches a series of studies in which popular word2vec vector models are considered not as a component of an NLP application's architecture but as an independent object of linguistic research. Viewing vector models as a surrogate for the contexts in a corpus, the linguist can uncover new information about the distribution of individual semantic groups of vocabulary and new knowledge about the corpus from which the models are derived. In particular, it is shown that layers of English and Russian vocabulary such as names of professions, nationalities, toponyms, personal qualities and time periods are the most independent of changes of model and retain their position relative to their neighbour words; that is, they have the most stable contexts regardless of the corpus. It is also shown that the vocabulary from the Swadesh list is statistically more resistant to changing the model than frequent vocabulary is, and which word2vec models for Russian best preserve the ontological structures in the vocabulary.
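One way to make "retaining a position relative to neighbour words" concrete is to compare a word's nearest neighbours in two models. The sketch below is only an illustration under that assumption: the abstract does not state which stability measure was used, and the Jaccard overlap, the function names and the toy two-dimensional vectors are all hypothetical.

```python
import math
from typing import Dict, List, Set

Vectors = Dict[str, List[float]]

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest(word: str, model: Vectors, k: int) -> Set[str]:
    """Return the k words most similar to `word` within one model."""
    others = [w for w in model if w != word]
    others.sort(key=lambda w: -cosine(model[word], model[w]))
    return set(others[:k])

def neighbour_stability(word: str, model_a: Vectors, model_b: Vectors, k: int) -> float:
    """Jaccard overlap of the word's k nearest neighbours in two models."""
    a, b = nearest(word, model_a, k), nearest(word, model_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy models: the coordinate axes differ, but "teacher" stays next to "doctor",
# whereas "bank" moves from the profession words to the water words.
model_a = {"teacher": [1.0, 0.0], "doctor": [0.9, 0.1], "bank": [0.8, 0.2],
           "river": [0.0, 1.0], "lake": [0.1, 0.9]}
model_b = {"teacher": [0.0, 1.0], "doctor": [0.1, 0.9], "bank": [0.9, 0.1],
           "river": [1.0, 0.0], "lake": [0.9, 0.2]}
stable = neighbour_stability("teacher", model_a, model_b, k=1)
unstable = neighbour_stability("bank", model_a, model_b, k=1)
```

A score close to 1 means the word keeps the same neighbours across models; a score close to 0 means its contexts depend heavily on the model or corpus.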
Rendering Church Slavonic Text in the Civil Script: Can It Be Obtained by a Formal Procedure?
The paper discusses the problem of rendering Church Slavonic text in the modern Russian script, which is a common practice at present. The relevant procedure would include the following stages: spelling out words with titla, replacing the letter-based notation of numerical values with Arabic numerals, replacing characters that are absent from the Russian alphabet with characters of the same phonetic value, removing breathings, and replacing the different accent marks with a unified stress mark. Certain semantic and grammatical information will be lost in the resulting text, while the pronunciation will be preserved. In other words, the resulting text may be regarded as a practical transcription of the original text. At the next stage, the procedure should aim at replacing the original punctuation with common Russian punctuation (within certain limits) and at capitalizing certain words (the latter task might require a system for determining co-reference links). The need for a system of automatic punctuation (when the input is a written text) and a system of automatic resolution of referential ambiguity poses challenges to computational linguistics.
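Three of the stages listed above (replacing characters absent from the Russian alphabet, removing breathings, unifying accent marks) are mechanical enough to sketch in code. The following is a minimal illustration, not the procedure from the paper: the function name and the small letter table are assumptions, only lowercase letters are covered, and titla expansion, numerals, punctuation and capitalization are left out entirely.

```python
import unicodedata

# Combining breathings (dasia and psili pneumata) are simply dropped.
BREATHINGS = {"\u0485", "\u0486"}
# Grave, acute and inverted breve accents are unified into one stress mark.
ACCENTS = {"\u0300", "\u0301", "\u0311"}
STRESS = "\u0301"  # combining acute accent

# Illustrative (incomplete) table of letters absent from the modern alphabet.
LETTERS = {
    "\u0461": "о",   # omega
    "\u0463": "е",   # yat
    "\u0456": "и",   # decimal i
    "\u0473": "ф",   # fita
    "\u0455": "з",   # dze
    "\u046f": "кс",  # ksi
    "\u0471": "пс",  # psi
    "\u0467": "я",   # little yus
}

def to_civil_script(text: str) -> str:
    """Apply the letter, breathing and accent stages to a Church Slavonic string."""
    out = []
    # Decompose first so that every diacritic is a separate code point.
    for ch in unicodedata.normalize("NFD", text):
        if ch in BREATHINGS:
            continue
        if ch in ACCENTS:
            out.append(STRESS)
        else:
            out.append(LETTERS.get(ch, ch))
    # Recompose so that letters such as й come out as single code points again.
    return unicodedata.normalize("NFC", "".join(out))

# "az" with a breathing and an acute; "vsem" with yat and a grave accent.
sample_one = to_civil_script("\u0430\u0486\u0301\u0437\u044a")
sample_two = to_civil_script("\u0432\u0441\u0463\u0300\u043c\u044a")
```

Letters whose value depends on context (for example izhitsa) cannot be handled by a plain lookup table like this one, which is one reason a purely formal procedure is non-trivial.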
The 2019 Shared Task on Automatic Gapping Resolution for Russian (AGRR2019) aims to tackle a non-trivial linguistic phenomenon, gapping, which occurs in coordinated structures and elides a repeated predicate, typically in the second clause. In this paper we define the task and evaluation metrics, provide detailed information on data preparation, annotation schemes and methodology, analyze the results and describe the different approaches taken by the participating solutions.
Nowadays the task of selecting key information from large amounts of text data is becoming increasingly relevant. This article proposes a deep neural network model with a phrase-based attention mechanism for automatic generation of news headlines. The proposed architecture achieves a new state of the art on the RIA news dataset.
In this paper we describe rule-based and neural approaches to the gapping resolution task for Russian. Our study was conducted on the data of the AGRR-2019 Shared Task. We demonstrate that the neural model clearly outperforms the rule-based one even when only 2000 annotated sentences are available. The rule-based model took 6th place in the AGRR-2019 competition (2nd in terms of precision), while the neural one was better than the second-ranked system.
In this paper we study morphological parsing and lemmatization on Evenk and Selkup data. We compare basic neural models with extensions that attempt to utilize additional linguistic information from the training data. We show that the augmented model does not improve over the baseline and even decreases performance on lemmatization. We hypothesize that, to be helpful, additional information should be extracted from external resources, if available, rather than from the corpus itself.
The study focuses on detecting depression by processing and classifying short essays written by 316 volunteers. A set of 93 essays was provided by two different teams of psychologists, who asked patients with clinically confirmed depression to write short essays on a neutral topic. The other 223 essays on the same topic were written by volunteers who completed questionnaires designed to reveal depression status and did not show any signs of mental illness. The study describes psycholinguistic and classic text features, which were computed with natural language processing tools and used for the classification task. The machine learning classification models achieved an F1-score of up to 73% on the task of identifying essays written by people with depression.
Headline generation is a task that has a good solution based on seq2seq models with an attention mechanism. However, it is still quite challenging to deal with morphologically rich languages, such as Russian, which have many word forms and therefore larger vocabularies. To deal with complex dependencies arising in such languages we propose several approaches based on using stems and grammemes. We applied these approaches to the pointer-generator network and took second place in the competition on headline generation held by the conference Dialogue-2019.
The paper deals with some formal features of the completive prefix do- (‘to finish, to complete’). Previous studies have claimed that this prefix, along with some others, has a range of formal properties that differ both from those of productive “superlexical” prefixes (such as the cumulative na- and the distributive po-) and from those of “lexical” (highly integrated) ones. Two important features were mentioned among others. 1) It can attach both to the perfective stem and to the imperfective one. 2) It cannot attach to secondary imperfectives. In the paper, I verify and develop these claims on corpus data. 1) I propose rules for the choice between the perfective and the imperfective stem and describe the pool of variation. 2) I show that, contrary to expectations, in informal speech do- attaches to secondary imperfectives quite easily.
Language Models for Unsupervised Acquisition of Medical Knowledge from Natural Language Texts: Application for Diagnosis Prediction
Following the recent success of neural language models in various downstream language understanding tasks, including common sense reasoning, we investigate the possible utility of such models in a domain-specific reasoning task: proposing a preliminary diagnosis based on patient complaints presented as natural language text. We demonstrate that a language model trained on texts collected from online medical forums achieves substantial accuracy in this task (73% at top 10 suggestions) when evaluated on a dataset constructed from clinical case reports published in specialized medical journals. While preliminary, these findings indicate a possible new method that can be used to augment online symptom checkers and clinical decision support systems.
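The figure "73% at top 10 suggestions" is a top-k accuracy: a case counts as correct if the true diagnosis appears among the first k suggestions. A minimal sketch of that metric follows; the function name and the toy diagnoses are hypothetical and are not taken from the paper's data.

```python
from typing import Sequence

def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   gold: Sequence[str], k: int) -> float:
    """Share of cases whose gold label is among the first k ranked suggestions."""
    if len(ranked_predictions) != len(gold):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for preds, label in zip(ranked_predictions, gold)
               if label in preds[:k])
    return hits / len(gold)

# Three hypothetical cases, each with suggestions ranked from most to least likely.
toy_predictions = [
    ["influenza", "pneumonia", "asthma"],
    ["migraine", "sinusitis", "anemia"],
    ["gastritis", "ulcer", "colitis"],
]
toy_gold = ["pneumonia", "anemia", "appendicitis"]

acc_at_1 = top_k_accuracy(toy_predictions, toy_gold, k=1)
acc_at_3 = top_k_accuracy(toy_predictions, toy_gold, k=3)
```

Because the metric grows with k, a top-10 figure should be read as "the correct diagnosis is on a shortlist of ten", not as the accuracy of a single best guess.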
In this paper we study approaches to assessing the quality of student theses in pedagogy. We consider a specific subtask of thesis scoring: estimating a thesis's adherence to its theme. A special document (the theme header) comprising the theme, aim, object and tasks of the thesis is formed. Theme adherence is calculated as the similarity between the theme header and thesis segments. For evaluation, we order theses by increasing value of the calculated theme adherence and compare the ordering with expert grades using the average precision measure. The best configuration for thesis ranking is based on the weighted average of word embeddings (word2vec) and keywords extracted from the theme header.
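The adherence score described above can be sketched as a cosine similarity between weighted averages of word vectors. The code below is an illustration under several assumptions that the abstract does not confirm: segment similarities are combined by a plain mean, keywords simply receive a larger weight, and the two-dimensional vectors stand in for real word2vec embeddings. All names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

def weighted_average(tokens: Sequence[str], vectors: Dict[str, List[float]],
                     weights: Dict[str, float]) -> List[float]:
    """Weighted mean of the vectors of known tokens (default weight 1.0)."""
    dim = len(next(iter(vectors.values())))
    total, norm = [0.0] * dim, 0.0
    for token in tokens:
        if token not in vectors:
            continue  # out-of-vocabulary tokens are skipped
        w = weights.get(token, 1.0)
        for i, x in enumerate(vectors[token]):
            total[i] += w * x
        norm += w
    return [x / norm for x in total] if norm else total

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = math.sqrt(sum(a * a for a in u)), math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def theme_adherence(header: Sequence[str], segments: Sequence[Sequence[str]],
                    vectors: Dict[str, List[float]], weights: Dict[str, float]) -> float:
    """Mean similarity between the theme header and each thesis segment."""
    header_vec = weighted_average(header, vectors, weights)
    sims = [cosine(header_vec, weighted_average(seg, vectors, weights))
            for seg in segments]
    return sum(sims) / len(sims) if sims else 0.0

# Toy vectors and a keyword weight, all hypothetical.
toy_vectors = {"pedagogy": [1.0, 0.0], "teaching": [0.9, 0.1], "school": [0.8, 0.2],
               "football": [0.0, 1.0], "match": [0.1, 0.9]}
toy_weights = {"pedagogy": 2.0}  # keyword extracted from the theme header
header = ["pedagogy", "teaching"]
theses = {"on_topic": [["teaching", "school"]], "off_topic": [["football", "match"]]}
scores = {name: theme_adherence(header, segs, toy_vectors, toy_weights)
          for name, segs in theses.items()}
ordering = sorted(scores, key=scores.get)  # increasing theme adherence
```

Sorting by increasing adherence puts the theses that drift furthest from their theme first, which is the ordering that is then compared with expert grades.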
The competition between Russian personal possessive and reflexive possessive pronouns in bound uses (as in Я1 встретился с моими1 / со своими1 друзьями ‘I1 met with my1 friends’) has been studied for quite a long time; however, not all aspects of this phenomenon have been examined with quantitative methods or described within a particular theory of syntax and semantics. The paper investigates the behavior of pronominal possessors in direct-object noun phrases bound by a 1st or 2nd person pronoun in subject position. The focus is on how the choice of pronoun relates to the possibility or necessity of a collective interpretation of the verb phrase and of the possession relation. Using data from the Russian National Corpus and the Araneum Russicum Maximum corpus, we show that the choice of strategy for expressing the possessor is related to the number of the subject (both across the corpus as a whole and for individual verbs). A questionnaire survey establishes that the preference for the personal possessive pronoun is associated with the collective reading; moreover, in the absence of such a reading, any expression of the possessor is degraded when the object is singular (unless the lexeme heading the NP is a singulare tantum). We propose an interpretation of these findings based on the assumptions that the possessive pronoun bears an interpretable feature for the number of the possessor (for example, наш ‘our’ denotes collective possession by a set of possessors) and that an object NP without a possessor can be reanalyzed as part of the predicate.
The paper is devoted to a corpus study of the Contrast relation between discourse units in Russian. It is based on data from the Ru-RSTreebank, annotated within the framework of Rhetorical Structure Theory [Mann, Thompson 1988]. The research question is what cue phrases and lexical and grammatical patterns are used to express the Contrast relation as opposed to the Comparison relation. Since simple connectives such as the conjunctions a or no “but” and others are ambiguous, it may be useful to single out specific cues for the Contrast relation and to find other linguistic features that can also help to differentiate Contrast from other relations, such as Comparison. The investigation of cues signalling different types of relations is an important issue both for automatic discourse mining and for theoretical research on text coherence. We test several hypotheses presented in the reference literature on Russian against corpus data.
We describe a model for a robot that learns about the world and her companions through natural language communication. The model supports open-domain learning, where the robot has a drive to learn about new concepts, new friends, and new properties of friends and concept instances. The robot tries to fill gaps, resolve uncertainties and resolve conflicts. The absorbed knowledge consists of everything people tell her, the situations and objects she perceives and whatever she finds on the web. The results of her interactions and perceptions are kept in an RDF triple store to enable reasoning over her knowledge and experiences. The robot uses a theory of mind to keep track of who said what, when and where. Accumulating knowledge results in complex states to which the robot needs to respond. In this paper, we look into two specific aspects of such complex knowledge states: 1) reflecting on the status of the knowledge acquired through a new notion of thoughts and 2) defining the context during which knowledge is acquired. Thoughts form the basis for drives on which the robot communicates. We capture episodic contexts to keep instances of objects apart across different locations, which results in differentiating the acquired knowledge over specific encounters. Both aspects make the communication more dynamic and result in more initiatives by the robot.
On the Project of the Dictionary “Intertextual Thesaurus of Modern Russian”: Print vs. Multimedia
Russian dictionaries of idioms, winged words and quotations do not reflect “the intertextual competence” of modern Russian speakers: on the one hand, their vocabularies abound in obsolete, uncommon and even incomprehensible units; on the other hand, they are short of some well known and widely used catchwords and Internet memes. The article deals with the structure and principles for constructing a new dictionary, namely, “Intertextual Vocabulary of Modern Russian” (in paper and multimedia versions). The dictionary will be based on corpus data and include over 1000 well-known catchphrases from the 20th–21st centuries. The basic unit is a dictionary entry that will include the following parts: lexical input, meaning, source, examples, phraseological model and its transformations, comments; the last two parts are optional. The arrangement is alphabetical by the first word; however, there will be user-friendly indexes for locating all the catchphrases from the same source, same topic, etc. The multimedia version is characterized by quantitative and qualitative increase in content: in addition to text information, the dictionary will contain audio, video, photo fragments, graphics, animation, etc. referring to the relevant “multimedia” sources of intertextual units (such as movies, cartoons, paintings, songs, TV shows, etc.). Using hyperlinks, one can easily find the required information related to a given entry.
The paper analyzes the prosody of Russian yes-no questions with the particle LI. Three basic patterns of Russian LI-questions, which are construed as semantically minimal, are singled out. (Semantically minimal sentences are understood here as those in which the prosodic structure makes a minimal contribution to the semantic structure of the sentence.) Consequently, the prosody of sentences that express contrast or discourse continuity is viewed as derived from the prosody of the basic types. The illocutionary force of LI-questions is marked not by prosody, as in other Russian yes-no questions, but by segmental means, namely by LI. Hence, prosody in LI-questions is not a cue to illocutionary force; rather, it shapes the sentence as an autonomous prosodic unit and marks the non-illocutionary meanings of contrast and discourse continuity. The accent on the first accented word can be either rising or falling, without any appreciable difference in meaning. In questions with LI, the particle retains its Wackernagel properties, while the host of the clitic in the majority of cases serves as the first, or the only, accent-bearer of the sentence. However, in contexts of contrast, the first accent-bearer can be placed to the right of LI. Under discourse continuity, LI-questions have two accent-bearers: the first can be either rising or falling and, at the same time, either contrastive or non-contrastive, while the second is always rising. The prosodic patterns of LI-questions are illustrated by spoken fragments taken from the Multimodal Corpus of the Russian National Corpus and from a small working collection of Russian speech recordings compiled specifically for this investigation. The Praat software was used to analyze the audio data.
The paper demonstrates that Russian has a discourse word что-то that can express a certain range of speaker attitudes toward some circumstance (observed by the speaker) that deviates from the norm. Specifically, discourse что-то can mark: the speaker's wish to draw the hearer's attention to the reported fact without being specifically interested in its cause (cf. Что-то я на склоне лет стал сентиментален ‘I seem to have grown sentimental in my old age’); the speaker's wish to express disapproval (cf. Что-то она слишком вырядилась сегодня ‘She is rather overdressed today’) or simply to report something negative (cf. Что-то сегодня пасмурно ‘It is kind of overcast today’, but ??Что-то сегодня светит солнце ‘The sun is kind of shining today’); the wish to express anxiety or suspicion (cf. Что-то в детской слишком тихо ‘It is somehow too quiet in the nursery’); the wish to soften the categorical tone of a statement that is negative or potentially offensive to the interlocutor, in particular to soften the bluntness of a refusal (cf. the exchange Давай чай пить! / Что-то не хочется ‘Let's have some tea! / I don't really feel like it’); and others. We show that the meaning of что-то identified in dictionaries, ‘for some unclear reason’, arises only under certain contextual conditions. We identify the conditions under which this meaning arises and its place in the chain of semantic derivation whose starting point is the meaning of an indefinite object. The study is based on data from the Russian National Corpus, including its parallel subcorpora.
The paper addresses the corpus grammar of Russian quantifier phrases (QPs), with a focus on two issues: (i) subject-predicate agreement patterns in sentences with a QP in the position of the grammatical subject, and (ii) the choice of the agreeing/non-agreeing form of the adjective in QPs with an embedded NP headed by a feminine noun. QPs license both the plural and the singular form of the predicate. I argue that the singular form optionally shown on the predicate instantiates non-canonical agreement controlled by the QP and does not pattern with the so-called default agreement in 3Sg.N. The analysis is based on complete statistics for all Russian cardinal numerals used in the Russian National Corpus (RNC) in QPs of the type ‘два человека / пять человек’. I show the correlations between plural/singular agreement forms, word order (QP-V ~ V-QP) and the communicative status of the QP. The choice of an agreeing preposed NP-level adjective, as in dve interesnye knigi, does not constrain the form of predicate agreement, while agreeing DP-level elements, as in eti dve knigi, block the singular form on the predicate. Russian subject QPs are non-canonical arguments, since in two thirds of the corpus data they lack the status of a theme.
Oriented gestures play a crucial role in solving spatial problems. We analyze the influence on a human of a robot that uses oriented gestures. In an experimental setting, the robot F-2 helped a human solve a “tangram” puzzle. The robot indicated in speech which game element to take and where to place it. In half of the tasks the robot used oriented communicative actions (hand gestures, head movements and gaze) to indicate the required game element and then the game position to place it in. In the other half of the tasks, the robot used non-oriented gestures. We show that the use of oriented gestures increases the attractiveness of the robot to the human and raises general satisfaction with the interaction with the robot.
In this paper, we present a dataset for the cross-language (Russian-English) text alignment subtask of plagiarism detection. We compare different models for detecting translated plagiarism. The first is based on different textual similarity scores that exploit word embeddings. The second extends the first with features obtained via neural machine translation. The last model is built on top of a pre-trained language representation model (BERT) fine-tuned for our task. The BERT model outperforms the other models. However, it requires far more computational resources than the simpler models. Therefore, it seems reasonable to use both context-free and contextual models together in modern plagiarism detection systems.