Proceedings 2018

Alekseev V. A., Bulatov V. G., Vorontsov K. V.
Intra-Text Coherence as a Measure of Topic Models Interpretability
The article is devoted to the problem of how to automatically measure the interpretability of topic models. Some new, intra-text, approaches to estimate the interpretability of the topics are proposed. Computational experiments are conducted with the use of text files from “PostNauka”, which is a collection of popular science content.
Anastasyev D. G., Gusev I. O., Indenbom E. M.
Improving Part-of-speech Tagging Via Multi-task Learning and Character-level Word Representations
In this paper, we explore ways to improve POS tagging using various types of auxiliary losses and different word representations. As a baseline, we use a BiLSTM tagger, which is able to achieve state-of-the-art results on sequence labelling tasks. We developed a new method for character-level word representation using a feedforward neural network. This representation gave us better results in terms of both speed and performance of the model. We also applied a novel technique of pretraining such word representations with existing word vectors. Finally, we designed a new variant of auxiliary loss for sequence labelling tasks: an additional prediction of the neighbour labels. This loss forces a model to learn the dependencies inside a sequence of labels and accelerates training. We test these methods on English and Russian.
Andriyanets V., Daniel M., Pakendor B.
Discovering Dialectal Differences Based on Oral Corpora
This paper discusses a method to detect statistically significant linguistic differences between corpora while factoring in possible variability within the very corpora to be compared. Specifically, we compare two small corpora of dialects of Even, Bystraja and Lamunkhin Even, in an attempt to identify morphemes that are more frequent in either of the corpora. To investigate whether this difference might be due to an over-representation of a speaker who happens to be an outlier in terms of using a particular morpheme, we use DP, a measurement of evenness of the distribution of a specific linguistic feature across subcorpora of the same corpus.
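As an illustration of the dispersion measure mentioned above, the following is a minimal sketch that assumes DP refers to the "deviation of proportions" measure commonly attributed to Gries: half the sum of absolute differences between each subcorpus's share of the corpus and its share of the feature's occurrences. The function name and the toy speaker counts are hypothetical and are not taken from the paper.

```python
def deviation_of_proportions(part_sizes, feature_counts):
    """Return DP: 0.0 means the feature is spread in proportion to part sizes;
    larger values mean it is concentrated in fewer parts."""
    total_size = sum(part_sizes)
    total_count = sum(feature_counts)
    if total_size == 0 or total_count == 0:
        raise ValueError("corpus and feature counts must be non-empty")
    # Expected share of the feature in each part (proportional to part size).
    expected = [size / total_size for size in part_sizes]
    # Observed share of the feature in each part.
    observed = [count / total_count for count in feature_counts]
    return 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))


# Hypothetical data: four speakers with equally sized subcorpora.
speaker_tokens = [1000, 1000, 1000, 1000]
even_counts = [10, 10, 10, 10]    # morpheme used equally by all speakers
skewed_counts = [40, 0, 0, 0]     # morpheme used by a single outlier speaker

dp_even = deviation_of_proportions(speaker_tokens, even_counts)
dp_skewed = deviation_of_proportions(speaker_tokens, skewed_counts)
```

A high DP for a morpheme that is also more frequent in one corpus would suggest that the frequency difference may be driven by one speaker rather than by the dialect as a whole.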
Apresjan V. Ju., Shmelev A. D.
Russian constructions chainik dolgo (ne) zakipaet, komp’iuter dolgo (ne) zagruzhaetsia…
The paper deals with a curious phenomenon of quasi-synonymy that occurs in Russian between sentences with non-negated and negated predicates in the construction with the adverb dolgo ‘for a long time’. Consider sentences like Chainik dolgo zakipal ‘It took the kettle a long time to boil, lit. Kettle for a long time boiled’ vs. Chainik dolgo ne zakipal ‘It took the kettle a long time to boil, lit. Kettle for a long time not boiled’. The paper is an attempt to define the semantic and pragmatic mechanisms of such quasi-synonymy, as well as semantic and aspectual classes of predicates where it occurs. It also considers subtle semantic, pragmatic and communicative differences associated with non-negated and negated construction, respectively. Such quasi-synonymy occurs primarily in cases when the predicate belongs to the aspectual class of accomplishments and denotes a telic process or action with a desired result (‘to boil’, ‘to cool down’, ‘to warm up’, ‘to grow up’, ‘to finish’, etc.). Those predicates include two major semantic components, that is, a lasting process or action and an instant result. In the imperfective aspect they allow at least two possible interpretations, namely, of a process and that of a result. Similar interpretations of sentences with such predicates occur due to different scope assignments of negation and dolgo. In sentences with non-negated predicate dolgo has scope over the ‘process’ component in the verb; in sentences with negated predicate negation has scope over the ‘result’ component of the verb while at the same time falling into the scope of dolgo. The former type of sentences describes long-lasting processes, whereas the latter type describes long-awaited results, which pragmatically amount to the same thing.
Apresyan V. Ju.
Disambiguation of scope in written English texts
The paper is a corpus study of the factors involved in disambiguating potential scope ambiguity in written sentences with negation and universal quantifier all, such as I cannot visit all these universities, which, depending on topic-focus assignment, can alternatively mean ‘I cannot visit any of these universities’ (cannot is focus) and ‘I cannot visit some of these universities’ (all is focus). The factors at play in scope disambiguation are the syntactic function of the constituent containing all (subject, direct complement, adjunct); the status of the main predicate and all with respect to the information structure of the utterance (topic vs. focus); veridical vs. nonveridical context; sentence type (unreal conditional, rhetorical question); and pragmatic implicatures pertaining to the situations described in the utterances. The paper also demonstrates differences in the frequency distribution of various scope readings and their underlying causes, as well as formulating typical contexts for each scope interpretation.
Belyy A. V., Dubova M. A.
Framework for Russian plagiarism detection using sentence embedding similarity and negative sampling
In this paper, we propose a new approach to advanced plagiarism detection in the Russian language. It is based on a classifier that uses two different types of sentence similarity measures: token set similarity and cosine similarity between sentence embeddings (based on pre-trained RusVectōrēs, unsupervised fastText, and supervised StarSpace models). The diversity of the feature space makes it possible to detect different types of plagiarism, from simple copy-and-paste cases to complex manual paraphrases. The proposed approach makes it possible to focus on identifying a particular plagiarism type while still training a universal model. The method performs well on the detection of different types of plagiarism and outperforms the previous approach.
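The two kinds of similarity features named above can be sketched as follows. This is only an illustration: the abstract does not define "token set similarity", so Jaccard overlap of whitespace tokens is assumed here, and the function names and toy inputs are hypothetical. In the paper the embeddings would come from the listed models; here they are plain lists of floats.

```python
import math


def token_set_similarity(sent_a, sent_b):
    """Jaccard overlap of lowercased whitespace tokens (an assumed definition)."""
    set_a = set(sent_a.lower().split())
    set_b = set(sent_b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pair_features(sent_a, sent_b, emb_a, emb_b):
    """Feature vector for one sentence pair, to be passed to a classifier."""
    return [token_set_similarity(sent_a, sent_b), cosine_similarity(emb_a, emb_b)]


# Hypothetical sentence pair with made-up 2-dimensional embeddings.
features = pair_features("the cat sat", "the cat ran", [1.0, 2.0], [2.0, 4.0])
```

Combining a surface-level feature with an embedding-based one is what lets a single classifier react both to verbatim copying and to paraphrase.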
Belyy A. V., Seleznova M. S., Sholokhov A. K., Vorontsov K. V.
Quality Evaluation and Improvement for Hierarchical Topic Modeling
Generic topics of large-scale document collections can often be divided into more specific subtopics. Topic hierarchies provide a model for such topic relation structure. These models can be especially useful for exploratory search systems. Various approaches to building hierarchical topic models have been proposed so far. However, there is no agreement on a standard approach, largely due to the lack of quality metrics to compare existing models. To bridge this gap we propose automated evaluation metrics which measure the quality of topic-subtopic relations (edges) of a topic hierarchy. We compare automated evaluations with human assessment to validate the proposed metrics. Finally, we show how the proposed metrics can be used to control and to improve the quality of existing hierarchical models.
Boguslavsky I. M., Frolova T. I., Iomdin L. L., Lazursky A. V., Rygaev I. P., Timoshenko S. P.
Semantic Analysis with Inference: High Spots of the Football Match
The paper describes a new version of the semantic analyzer SemETAP. Our approach is based on the assumption that the depth of understanding is growing with the number of inferences we can draw from the text. The salient features of SemETAP include: 1) intensive use of both linguistic and background knowledge. The former is incorporated in the Combinatorial Dictionary and the Grammar, and the latter is stored in the Ontology and Repository of Individuals. 2) Words and concepts of the ontology may be supplied with explicit decompositions for inference purposes. 3) Two levels of semantic structure are distinguished. Basic semantic structure (BSemS) interprets the text in terms of ontological elements. Enhanced semantic structure (EnSemS) extends BSemS by means of a series of inferences. 4) A new logical formalism Etalog is developed in which all inference rules are written. Semantic analysis with inference allows us to extract implicit information. The analyzer is tested on the task of interpreting high spots of the football match.
Bolshakova E. I., Ivanov K. M.
Term Extraction for Constructing Subject Index of Educational Scientific Text
A subject index, or back-of-the-book index, is a device intended to provide easy access to relevant fragments of a text document. Subject indexes usually contain particular single-word and multi-word terms from the corresponding documents. Such indexes are especially useful for reading large documents with specialized terminology, as well as educational texts in difficult scientific and technical areas. The central problem of back-of-the-book indexing is the recognition of terms to be included in the index. The paper describes a method developed for extracting and filtering terms from a given educational scientific text, with the purpose of reliable term selection in computer indexing systems. The method is primarily based on rules with lexico-syntactic patterns representing linguistic information about terms and typical contexts of their usage in Russian scientific and educational texts; simple occurrence statistics of terms are used as well. Experimental evaluation of the method has shown a considerable increase in precision and recall of term extraction compared with widely used standard techniques.
Bulygin M. V., Sharoff S. A.
Using Machine Translation for Automatic Genre Classification in Arabic
This paper addresses the task of automatic genre classification for Arabic within the Functional Text Dimensions (FTD) framework, which allows texts to get a reliable genre description while maintaining an adequate number of genre labels. Our aim in this study is to build an automatic classification model that can annotate any Web text in Standard Arabic in terms of genres. To build the training corpus, we translated English and Russian annotated texts into Arabic using Google MT. To build the model, we experimented with various machine learning approaches, such as Logistic Regression, SVM, and LSTM, and with different features, such as words, character n-grams and embedding vectors. To test the classification models, we collected our own corpus of Arabic Web texts and annotated it in terms of FTDs. The best performing model offers reasonable classification accuracy in spite of being based on a training corpus produced by MT.
Denisova V. A., Cienki A., Iriskhanova O. K.
Boundary Expression in Verbs and Gesture: Differences between L1 and L2 Speakers
The notion of event boundaries is closely connected with the category of aspect. Aspectual forms show different views of the “internal temporal constituency of a situation” (Comrie 1976:3) and, consequently, construe events in different ways. Recently scholars have started looking into the core of the aspectual distinction through multimodality, considering hand gestures. On the basis of Russian and French oral narratives produced by native speakers, we conducted a study testing our hypothesis about the existence of a direct correlation between the expression of boundaries in verbs and in gestures. The means of boundary expression considered on the verbal level were perfective (soveršennyj vid) and imperfective (nesoveršennyj vid) verbs for Russian, and passé composé and imparfait for French. On the kinesthetic level we distinguished between bounded gestures (i.e., involving a pulse of movement) and unbounded gestures (i.e., smooth by nature). While for French L1 we found a direct correlation between gesture boundary schemas and aspectual forms, the results for Russian L1 did not support our hypothesis. In view of these differences between the two languages, we studied the boundedness correlation in oral narratives produced by Russians speaking French as L2 (CEFR levels B2-C1). The comparison between L1 and L2 narratives revealed a certain change of gestural patterns: the Russian speakers of French L2 used almost the same number of unbounded and bounded gestures with the perfective verb forms and more unbounded gestures with the imperfective forms, thus moving closer towards French L1 speakers’ verb-gesture patterns. The use of gestures can be accounted for by a series of noise factors related to language peculiarities, the cognitive mechanism of profiling and the challenges of speaking in L2.
Dobrovol’skij D. O., Zalizniak Anna A.
German constructions with modal verbs and their Russian correlates: A supracorpora database project
The paper outlines the principles of analyzing German and Russian modal constructions. Our first task is to clarify the set of meanings of German modal verbs and the conditions for their implementation. The second task is to describe the means of expressing modal values in Russian that are encountered in parallel corpora as functional equivalents of constructions with German modal verbs. As empirical data we use a representative array of parallel German-Russian texts from the Russian National Corpus (RNC). A supracorpora database of translation correspondences is constructed, in which both the German constructions with modal verbs and their Russian translation equivalents are attributed an annotation of their relevant characteristics. This database, on the one hand, is a valuable linguistic resource that can be used, among other things, to create a new generation of electronic interactive German-Russian and Russian-German dictionaries. On the other hand, the inventory of Russian construction types with (implicit) modal meanings constructed on this database will contribute to the Construction Grammar and confirm the continuity between grammar and lexicon.
Egorova M. A.
Discourse marker tipa according to the data of the Russian National Corpus: its origin, semantics and pragmatics
The discourse marker tipa became widespread in colloquial Russian in the 1990s–2000s. However, until recently it received little attention. In this paper we use data from the Russian National Corpus to pursue the following goals: 1) to trace the origin of the discourse marker tipa from the noun tip ‘type’, 2) to describe the semantics of the discourse marker tipa as well as that of the partly grammaticalized element tipa as part of parametric constructions. We base our approach mainly on the results achieved by Suzanne Fleischman and Marina Yaguello.
Fomin V. V., Bondarenko I. Yu.
A study of machine learning algorithms applied to GIS queries spelling correction
The problem of spelling correction is crucial for search engines, as misspellings have a negative effect on their performance. It gets even harder when search queries are related to a specific area not quite covered by standard spell checkers, such as geographic information systems (GIS). Moreover, standard spell checkers are interactive, i.e. they can notice a misspelled word and suggest candidate corrections, but picking one of them is up to the user. This is why we decided to develop a spelling correction unit for 2GIS, a cartographic search company. To do this, we extracted and manually annotated a corpus of GIS lookup queries, trained a language model, performed various experiments to find the best feature extractor, then fitted a logistic regression using an approach suggested in SpellRuEval, and then used it iteratively to get a better result. We then measured the resulting performance by means of cross-validation, compared it against two baseline algorithms and observed a substantial increase. We also present an interpretation of the result achieved by calculating and discussing the importance of specific features and analyzing the output of the model.
Galitsky B., Taylor R.
Discovering and Assessing Heated Arguments at the Discourse Level
The problem of detecting heated arguments in text such as political debates and customer complaints is formulated as tree kernel learning of discourse structures. Affective argumentation structure is discovered in the form of discourse trees extended with edge labels for communicative actions. Extracted argumentation structures are then encoded as defeasible logic programs and are subject to dialectical analysis, to establish the validity of the main claim being communicated. We evaluate the accuracy of each step of this affect processing pipeline as well as overall performance.
Grashchenkov P. V., Kirillova A. A., Smirnova O. S.
The influence of syntax on prosody: the experimental data from a study of one Russian text
The paper examines dependencies between the syntactic and prosodic structure with particular attention to the pausation and different levels of prosodic boundary strength. The research is based on the prosodic data markup for a spoken Russian text and the manual tagging of this text with the relevant syntactic constituent boundaries. Two types of structures, the finite clause and the asyndetic coordination, exhibit a strong positive correlation with the appearance of a pause and the perceptual prosodic boundary. We also demonstrate the presence of a substantial correlation between the syntactic embedding depth and prosodic boundaries. The results of our research show a significant connection between some of the initially proposed syntactic factors and prosodic structure. We thus anticipate that prosodic modules of TTS systems can benefit from taking certain syntactic information into consideration.
Inkova O. Yu.
Supracorpora database as an instrument of the study of the formal variability of connectives
The article describes the formal variation of the connectors of the Russian language on the basis of a cognitive-semantic approach. Every discourse variant (DV) of a connector K, i.e. the specific form assumed by K in a discourse section, is singled out and registered in the supracorpora database of connectors (SCDB). The database includes a system of intersecting clusters, which makes it possible to assign the same DV to different structural clusters in the course of the annotation. In the next phase, on the basis of further semantic analysis, the DVs with a common element are combined into a structural-semantic complex around a basic form: the minimal linguistic unit that enables the speaker to express a certain logical-semantic relation, and the listener to identify it. In conclusion, criteria for describing the formal variation of the connectors are proposed, as well as examples of the “profiles” of the basic forms. They reflect the potential of linguistic means that the speaker has at his or her disposal to express a given logical-semantic relation or a combination of such relations.
Inkova O. Yu., Nuriev V. A.
To what extent is the conjunction khotya language-specific?
The paper describes the Russian connective khotya (‘although’) from a contrastive perspective. First, it focuses on the semantic description of the connective and proposes to differentiate its four meanings, namely, concessive propositional, concessive illocutionary, adversative propositional and adversative illocutionary. The paper analyzes the functioning of the connective khotya (prototypical marker of concessive relations) and that of the connective no (‘but’, prototypical marker of adversative relations). In so doing, it comes to the following conclusion: the adversative meaning of khotya develops on the basis of its concessive meaning as the connection between the situations presented in the textual fragments that are linked by the connective becomes less logical. Conversely, as the logical connection between situations becomes stronger, a concessive interpretation arises in utterances with no. Further, the paper takes a closer look at the French equivalents that khotya receives in each of its four meanings. The concluding section attempts to define the degree of language-specificity of khotya. To this end, several parameters are considered: (1) cases where the connective has a zero equivalent, (2) cases of divergent translation (the connective is translated by a non-connective), (3) number of translation patterns. To perform a contrastive analysis and to collect statistical data, the supracorpora database of connectives is used. The database is built upon the parallel Russian-French and French-Russian subcorpora of the RNC.
Iomdin L. L.
Once again on microsyntactic constructions formed with functional words: to i delo ‘every now and then’
The paper continues a series of research studies into the microsyntax of Russian, conducted by the author for a considerable period of time. Specifically, the focus is on the adverbial syntactic idiom to i delo ‘≈ every now and then’, which seems very interesting and instructive as it combines implicit semantic features and a unique set of syntactic facets that could be revealed by both present-day and diachronic linguistic data. This syntactic idiom is considered against the background of other microsyntactic elements that happen to be its neighbors in the dictionary but feature a substantially different set of linguistically relevant properties. It is shown how phraseological units of such kind can be presented in the Microsyntactic dictionary of Russian, under development by the author and his colleagues, and in the corpus of texts annotated with microsyntactic phenomena.
Ivanov V. V., Solnyshkina M. I., Solovyev V. D.
Efficiency of Text Readability Features in Russian Academic Texts
This paper addresses the problem of readability assessment for Russian texts and investigates the impact of 24 lexical, syntactic and frequency features. The research was conducted on the Russian Readability Corpus, which contains two sub-corpora: two sets of textbooks on Social studies for grade levels 5–11, written for native speakers of Russian. The sub-corpora were collected for research purposes, annotated and marked as BOG and NIK. The application of ridge regression has demonstrated the connection between readability and average sentence length, average number of coordinating chains, average number of sub-trees, and frequency and lexical features. The results of the study have the potential to be applied in a wide variety of areas, primarily education, as well as webpage design and document management.
Khristoforova E. A., Kimmelman V. I.
Corpus-based investigation of quotation in Russian Sign Language
This paper presents corpus-based research on quotation constructions in Russian Sign Language (RSL). Quotation constructions have been studied from different perspectives in different signed and spoken languages [Brendel, Meibauer, Steinbach 2011]; [Litvinenko et al. 2009]. Based on the corpus of spontaneous narratives recorded from RSL signers [Burkova 2015], we conducted a quantitative analysis of these constructions. We analyzed the constituents of the quotation construction, such as the indication of the source (author of the utterance), the introducing matrix predicate, and the quote. Our investigation of non-manual markers in the corpus revealed that non-manual marking of quotation is optional for RSL quotations. We distinguished direct and indirect quotations in our data based on the reference of indexical elements, the use of a subordinating conjunction, and the imperative mood. We found that in RSL non-manuals do not mark the direct/indirect type of quotation. Our data show that RSL signers tend to use direct quotation much more frequently than indirect quotation. In addition, we compared our findings with the data on quotation constructions in some other sign languages and with the studies of quotation in natural discourse of spoken languages. This comparison showed that RSL quotations share core properties with quotations in spoken and signed languages [Litvinenko et al. 2009].
Kibrik A. A., Fedorova O. V.
Language production and comprehension in face-to-face multichannel communication
Although language production and comprehension are parts of one and the same linguistic capacity, they have been studied separately for a long time. A key issue in the present day research is how the two processes are related, and whether transitions from thought to language and vice versa are accomplished by a single or two separate systems. Important progress in this area has been achieved in the field of psycho- and neurolinguistics; a brief review is provided in Section 1. In this paper we explore the production—comprehension relationship on the basis of our multichannel resource “Russian Pear Chats and Stories”. In Section 2 we describe this resource, including the stimulus material, data collection setup, participants and corpus size, and technical aspects. Section 3 lays out two main theoretical notions: a model of face-to-face multichannel communication and a scheme of the production-comprehension interweaving in each interlocutor. In subsequent sections we discuss three case studies of production—comprehension relationships: relative contributions of kinetic channels to discourse understanding (Section 4), turn-taking and eye gaze (Section 5), and multichannel continuity (Section 6). The evidence of the multichannel corpus suggests a cognitive architecture that integrates language production and comprehension.
Klyshinsky E. S., Lukashevich N. Y., Kobozeva I. M.
Creating a Corpus of syntactic co-occurrences for Russian
In the paper we discuss methods used to create CoSyCo, a corpus of syntactic co-occurrences, which provides information on syntactically related words in Russian. We describe a list of shallow parsing templates, which were used to collect data for CoSyCo. The paper includes an overview of the corpora collected for CoSyCo creation and an outline of how the noun ‘virus’ is used in its subcorpora as an example of the information which can be obtained from this online resource.
Konovalov V. P., Tumunbayarova Z. B.
Learning Word Embeddings for Low Resource Languages: the Case of Buryat
Word-vector representations have been extensively studied for resource-rich languages with large text datasets. However, only a few studies analyze semantic representations of low-resource languages, for which only a small corpus is available. In this study we introduce a methodology and compare techniques to learn semantic representations of low-resource languages. The proposed methodology consists of defining accurate preprocessing steps, applying a language-independent stemmer and learning word-vector representations. In addition, we propose a simple word embedding evaluation scheme that can be easily adapted to any language. Using this methodology, we learn word-vector representations for the Buryat language. In order to promote further research, we make the source code and the resulting word embeddings publicly available.
Korotaev N. A.
How intonation structures spoken narratives: non‑final phase contexts
Topic-focus articulation in Russian has been mainly studied on isolated utterances. In a categorical sentence, this communicative opposition is reflected in the linear-accentual structure [Paducheva 2015]. For a simple declarative sentence, that would normally mean that the topic (theme) comes first and has a rising phrasal accent, while the focus (rheme) completes the utterance and is pronounced with a falling accent. At the same time, these formal features do more than just differentiate between topics and foci; they also mark the discourse-semantic category of phase [Kodzasov 2009]. In syntactically simple utterances, topics tend to correlate with anticipated continuation, hence non-final phase; foci are usually phase-final. As I intend to show in this paper, the non-final phase provides a variety of contexts that challenge the topic-focus distinction. The study is based on “Stories about presents and skiing”, a collection of prosodically annotated spoken narratives. In Section 1, I concentrate on issues within a simple clause, where non-final verbal elements often have a fuzzy communicative interpretation. In Section 2, I analyze complex syntactic structures. The data show that non-final clauses may demonstrate both thematic and rhematic properties with regard to their intonation patterns, internal structure and discourse function. Hence, one can claim that some non-final clauses are topics, while others are foci. However, a majority of non-final clauses in the analyzed corpus cannot be unambiguously attributed to either of these categories. Section 3 provides a pilot study of complex intonation patterns. When only the phase distinction is considered, utterances with more than one accentual phrase may follow either (i) the basic adaptation strategy (comprising a non-final rising accent and a final falling accent), or, more often, (ii) a complicated strategy: (a) multiple parallel adaptation, (b) consecutive adaptation, or (c) parenthetical strategy.
Kotov A. A., Zaidelman L. Y., Arinkin N. A., Zinina A. A., Filatov A. A.
Frames Revisited: Automatic Extraction of Semantic Patterns from a Natural Text
Our project aims to design a syntactic parser that constructs a semantic representation in a frame format: a clause is represented as a table of valencies, filled in with semantic markers. This representation is compared to a list of scripts, which are used to disambiguate and classify the semantic representation as well as to select an appropriate reaction for a companion robot F-2.
Krivnova O. F., Smirnova O. S.
A database of wordbreaks discursive features in Russian oral speech: the structure, composition and application
The paper discusses the most important results of the project “Hierarchy of prosodic phrasing in spoken language: controlling factors and means of realization”. The project was aimed at expanding the empirical base of research on phrasal prosody, the inadequacy of which is noted in many scientific areas: discourse theory, syntax, intonational phonology, general phonetics, speech synthesis and recognition, etc. The introduction provides a brief description of the study background and formulates the tasks that had to be solved to reach the ultimate goal of the project, which was planned for 3 years of implementation. The first section describes the characteristics of the speech corpora created in the project for the construction of a complex linguistic-prosodic database required for the study and modeling of prosodic phrasing in Russian speech, which takes into account, if possible, all controlling factors and means of realization. The second section is devoted to the description of the structure and composition of the database of wordbreaks’ discursive features (BDF), obtained on the basis of annotated, prosodically graduated and acoustically analyzed speech corpora. The format and structure of the database as a computer resource are universal and flexible: its feature set can be freely extended and the parametric characteristics of the features can be detailed. The third section illustrates the application of the BDF to theoretical and statistical modelling of inter-level correlations between syntax and linguistic prosody and between linguistic prosody and the speech signal (acoustic speech), in both directions. The conclusion summarizes the results of the research and discusses some promising directions for further studies on relevant topics.
Kustova G. I.
Mental predicates in metatext
The paper deals with metatext (parenthetical) constructions (MC) with mental verbs (znat’ ‘know’, ponimat’ ‘understand’, verit’ ‘believe’ and the like) in the 2nd person. The following problems are considered: whether there is a semantic correlation between the proposition and the MC, and what illocutionary functions the MC and the proposition have. It is shown that some MCs are used only in interrogative sentences.
Kutuzov A. B.
Russian Word Sense Induction by Clustering Averaged Word Embeddings
The paper reports our participation in the shared task on word sense induction and disambiguation for the Russian language (RUSSE’2018). Our team was ranked 2nd for the wiki-wiki dataset (containing mostly homonyms) and 5th for the bts-rnc and active-dict datasets (containing mostly polysemous words) among all 19 participants. The method we employed was extremely naive. It implied representing contexts of ambiguous words as averaged word embedding vectors, using off-the-shelf pre-trained distributional models. Then, these vector representations were clustered with mainstream clustering techniques, thus producing the groups corresponding to the ambiguous word’ senses. As a side result, we show that word embedding models trained on small but balanced corpora can be superior to those trained on large but noisy data—not only in intrinsic evaluation, but also in downstream tasks like word sense induction.
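The pipeline described above (average the word vectors of each context, then cluster the averages) can be sketched as follows. This is a toy illustration, not the authors' system: the 2-dimensional embeddings, the example contexts, the function names, and the small k-means routine with a deterministic "first k points" initialization are all assumptions made for the example. The abstract mentions only "mainstream clustering techniques" without naming one.

```python
def average_vectors(tokens, embeddings):
    """Represent a context as the mean of the vectors of its known tokens."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token of the context has an embedding")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(points, k, iterations=10):
    """Tiny k-means; centroids start at the first k points (toy-sized data only)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical embeddings and contexts of the ambiguous word "bank".
toy_embeddings = {
    "river": [1.0, 0.0], "water": [0.9, 0.1],
    "money": [0.0, 1.0], "loan": [0.1, 0.9],
}
contexts = [["river", "water"], ["money", "loan"], ["water"], ["loan", "money"]]
context_vectors = [average_vectors(ctx, toy_embeddings) for ctx in contexts]
sense_labels = kmeans(context_vectors, k=2)
```

Each resulting cluster label stands for one induced sense; with real data, pre-trained embeddings and a library clustering implementation would replace the toy parts.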
Laposhina A. N., Veselovskaya T. V., Lebedeva M. U., Kupreshchenko O. F.
Automated Text Readability Assessment for Russian Second Language Learners
This paper outlines the construction of a readability assessment system for Russian language learning. The system is designed to help educators easily obtain information about the difficulty level of reading materials. The estimation task is posed here as a regression problem on a data set of 600 texts and a range of lexico-semantic and morphological features. The choice of scale and the issues of the annotated text collection are also discussed. Finally, we present the results of an experiment with learners of Russian as a foreign language that evaluates the quality of the predictive model.
Levin I., Andriyanets V., Iomdin B., Ambartsumian A.
Lexical Variation: Word Knowledge and Polysemy in Russian Everyday Life Lexicon
Many words that according to the dictionaries have just one meaning are in fact understood in different ways by different speakers. In this article we deal with Russian nouns denoting everyday life objects which are subject to much variation by age, gender, and region and are poorly described by the existing dictionaries. We report the results of a multilevel survey, propose some possible metrics of word knowledge and show to what extent the words we studied are known among a certain population. We also claim that different speakers possess different sets of meanings for each word, propose ways to discover the distribution patterns for these sets and introduce the notion of disperse polysemy. We believe that our findings may be useful in lexicography (providing detailed information on current word usage in different social groups), lexical semantics (researching meaning shifts and patterns of its distribution among speakers), and language testing (more precise detection of the vocabulary sizes both in native speakers and in language learners).
Levontina I. B.
Corpus-based study of non-canonical use of Russian interjections
The paper deals with the Russian interjections (oj, oh, aj, ogo, uh, etc.), namely their non-canonical use in collocations with K-words (Wh-words), mostly kak and kakoj. This type of use demonstrates a sort of syntactic recomposition: collocations such as oj kak, oh kakoj, etc. function as lexical units with the meaning of high degree, high quality or big quantity, although with very specific semantic shades. The paper makes use of corpus data (the Russian National Corpus as well as Internet data) to discover individual properties of interjections and their historical changes. Primary interjections are described against the background of interjections derived from words of different parts of speech. It turns out that in the non-canonical use of primary interjections the K-word can hardly be omitted, whereas derived interjections can function the same way even without a K-word. Non-canonical use of derived interjections, with and without K-words, is very popular in contemporary Russian, especially in slang.
Levontina I. B., Shmelev A. D.
The Russian aby: corpus-driven research (synchrony and diachrony)
The paper deals with the Russian aby as a marker of “free choice” (or, rather, not specified choice criteria) within indefinite pronouns against the background of other markers of “free choice” such as ugodno, popalo, pridetsia. It pays attention not only to the synchronic semantics of aby, but also to its history and claims that the modern meaning of aby is related to its usage as a conjunction. The paper makes use of corpus data (the Russian National Corpus as well as Internet data) to follow the changes in the use of the particle in question over the last two hundred years. It investigates the range of K-words that can collocate with aby: the most typical are collocations with kto, chto, kak and kakoi; however, collocations with other K-words are also present in the corpora. In addition, it discusses the question of negative polarity of aby and the increasing degree of its polarization.
Lobanov B. M., Solomennik A. I., Zhitko V. A.
An experience of the objective estimation of intonation quality of the synthesized Russian speech
The paper describes an experiment on instrumental evaluation of the intonation quality of synthesized Russian speech using the “Inton@Trainer” computer system. The system was originally designed to train learners in producing the basic intonation patterns of Russian speech. It is based on comparing the melodic portraits of a reference sentence and a sentence pronounced by the learner. Our approach to assessing the intonational quality of speech makes it possible to hold synthesized speech to the same strict requirements as are applied to students studying Russian as a second language. We describe the technology used for the instrumental evaluation and the acoustic database of reference phrases used to assess the intonation quality of synthesized speech. The paper presents the results of testing the intonation quality of two Russian synthetic voices. We discuss the results of the experiment and outline ways of improving the methods for objective evaluation of the prosodic quality of synthesized speech, as well as the possibility of applying the developed system to other linguistic tasks.
Loukachevitch N. V., Rusnachenko N.
Extracting Sentiment Attitudes from Analytical Texts
In this paper we present the RuSentRel corpus including analytical texts in the sphere of international relations. For each document we annotated sentiments from the author to mentioned named entities, and sentiments of relations between mentioned entities. In the current experiments, we considered the problem of extracting sentiment relations between entities for the whole documents as a three-class machine learning task. We experimented with conventional machine-learning methods (Naive Bayes, SVM, Random Forest).
Lyutikova E. A., Tatevosov S. G.
Re-interpreting events: notes on one linguistic innovation in Russian
The paper explores the distribution and interpretation of the discourse marker po(-)xodu (PX) and addresses a possible path of its diachronic development. We argue that the range of uses of PX attested in the corpora supports an analysis that identifies three meanings / functions of this item labeled eventive PX, epistemic PX and discourse-level PX throughout this paper. We propose that the latter two are the products of re-interpretation of the former. We argue for a presuppositional analysis of the eventive PX whereby it requires there be a set of background events that show a temporal overlap with the asserted event and add up to the integral whole. We analyze the epistemic PX as resulting from inferential reinterpretation of the relationship between background and asserted events, with the abductive reasoning being the key ingredient of this reinterpretation. Finally, we treat the discourse-level PX as a counterpart of the eventive PX in the domain of speech acts. We speculate that Krifka’s (2014) recent view of speech acts as index changers opens a way of accounting for this parallelism in a principled way. On the diachronic side, we identify PX as the product of diachronic development of the construction in which the argument of the noun xod ‘move’ is expressed by an overt DP. In the course of development, this DP was first replaced by pro, which gave rise to the eventive PX, and later on developed epistemic and discourse-level meanings / functions.
Miftahutdinov Z., Tutubalina E.
Leveraging Deep Neural Networks and Semantic Similarity Measures for Medical Concept Normalisation in User Reviews
A new yet powerful tool for drug repurposing and hypothesis generation has recently emerged. Text mining of different domains, such as scientific libraries or social media, has proven to be reliable in that application. One particular task in that area is medical concept normalization, i.e. mapping a disease mention to a concept in a controlled vocabulary such as the Unified Medical Language System (UMLS). This task is challenging due to the differences between the language of health care professionals and that of social media users. To bridge this gap, we developed end-to-end architectures based on bidirectional Long Short-Term Memory and Gated Recurrent Units. In addition, we combined an attention mechanism with our model. We carried out an exploratory study of the hyperparameters of the proposed architectures and compared them with an effective baseline for classification based on convolutional neural networks. A qualitative examination of the mentions in a dataset of user reviews collected from popular online health information platforms, as well as a quantitative one, both show improvements in the semantic representation of health-related expressions in user reviews about drugs.
Mikhalkova E. V., Ganzherli N. V., Karyakin Y. E., Grigoryev D. A.
Machine Learning Classification of User Interests Across Languages and Social Networks
Being a matter of cognition, user interests should be apt to classification independent of the language of users, social network and the essence of interest itself. To prove it, we built a collection of English and Russian Twitter and Vkontakte community pages manually classified according to the interests of their followers. First, we created a model of Major Interests (MaIs) with the help of expert analysis and then classified the mentioned set of pages using machine learning algorithms (SVM, Neural Network, Naive Bayes, Logistic Regression, Decision Trees, k-Nearest Neighbors) trying different optimization techniques. We take three interest domains that are typical of both English and Russian-speaking communities: football, rock music, vegetarianism. The results of classification show a greater correlation between Russian-Twitter and English-Twitter pages. The Logistic Regression with Bernoulli bag-of-words model proves to be the most effective classification algorithm.
Nedoluzhko A., Novák M., Ogrodniczuk M.
Analysis of coreferential expressions in PAWS (English-Czech-Russian-Polish Parallel Treebank with Anaphoric Relations)
In this paper, we describe the coreference annotation on a multilingual parallel treebank (PAWS), a portion of the Wall Street Journal translated into Czech, Russian and Polish, which continues the tradition of multilingual treebanks with coreference annotation. The paper focuses on language-specific differences. We analyse syntactic structures concerning anaphoric relations in the languages under analysis, such as personal and impersonal constructions in polypredicative constructions and pro-drop qualities.
Nedoluzhko A., Lapshinova-Koltunski E.
Pronominal Adverbs in German and their Equivalents in English, Czech and Russian: Evidence from the Parallel Corpus
The paper presents a contrastive analysis of pronominal adverbs in German (dabei, darauf, damit etc.) and their equivalents in English, Czech and Russian. The analysis is based on an empirical study of parallel news texts. Our main focus is to show the interplay between cohesive devices expressed through German pronominal adverbs in text and explore their equivalents in English, Czech and Russian. As the dataset at hand contains translations, we also focus on the influence of the translation factor in parallel texts.
Paducheva E. V.
Suspended assertion and nonveridicality
The paper addresses the notion of “snyataya utverditel’nost’” (suspended assertion). The author argues that the term “suspended assertion”, introduced by U. Weinreich in 1963, covers the same range of phenomena as the term nonveridicality (its suggested Russian equivalent is neveridicativnost’), which has become widespread due to the works by F. Zwarz, A. Giannakidou and many others. It is demonstrated that the notion of suspended assertion can be applied to interpret a number of facts of the Russian language, such as nibud’-pronouns, pronouns of negative polarity, the disappearance of a semantic argument of verbs with the direct (non-parametrical) diathesis, the mirror symmetry of past and future, the negation with an extended scope, nibud’-pronouns in the scope of negation, the interchangeability of eshche ‘yet’ and uzhe ‘already’. It is the author’s conviction that the notion of suspended assertion will be applicable in many other contexts.
Panchenko A., Lopukhina A., Ustalov D., Lopukhin K., Arefyev N., Leontyev A., Loukachevitch N.
RUSSE2018: a Shared Task on Word Sense Induction for the Russian Language
The paper describes the results of the first shared task on word sense induction (WSI) for the Russian language. While similar shared tasks were conducted in the past for some Romance and Germanic languages, we explore the performance of sense induction and disambiguation methods for a Slavic language that shares many features with other Slavic languages, such as rich morphology and virtually free word order. The participants were asked to group contexts of a given word in accordance with its senses, which were not provided beforehand. For instance, given a word “bank” and a set of contexts for this word, e.g. “bank is a financial institution that accepts deposits” and “river bank is a slope beside a body of water”, a participant was asked to cluster such contexts into a number of clusters not known in advance, corresponding in this case to the “company” and the “area” senses of the word “bank”. For the purpose of this evaluation campaign, we developed three new evaluation datasets based on sense inventories that have different sense granularity. The contexts in these datasets were sampled from texts of Wikipedia, the academic corpus of Russian, and an explanatory dictionary of Russian. Overall, 18 teams participated in the competition, submitting 383 models. Multiple teams managed to substantially outperform competitive state-of-the-art baselines from previous years based on sense embeddings.
Pekelis O. E.
Speech act conjunction: the scale of speech act use and its manifestation in grammar
This paper deals with the phenomenon of speech act conjunction in which the relation expressed by the conjunction holds on the level of speech act performance rather than on the level of states of affairs. It is argued that besides clearly speech act and clearly non-speech act uses, there is a class of constructions of an intermediate nature. The criteria are proposed that serve to distinguish between these three types of use. In particular, it is demonstrated that imperative sentences can only be of the “intermediate” type, while interrogative sentences can represent the clearly speech act use. The proposed distinction manifests itself in grammar. Namely, different conjunctions are compatible with different types of speech act use; the correlative item togda (‘then’) cannot be used within a clearly speech act construction.
Petrova M. A., Druzhkina A. A., Garashchuk R. V., Yudina M. V.
Semi-automatic Integration of a new Language into a multilingual NLP model: the case of Japanese
The current paper deals with the integration of the Japanese language into a multilingual NLP model, namely, the Compreno model. The formalism includes morphological, syntactic and semantic patterns, covering all possible semantic and syntactic dependencies a word can attach. The architecture of the model allows us to acquire nearly all semantic links of a word through its proper positioning in a thesaurus-like semantic hierarchy, where words are linked through semantic dependencies. The inheritance principle of the hierarchy simplifies the syntactic description of a newly added language as well. Unlike the traditional approach to Japanese parsing based on chunks, or bunsetsus, we suggest a Japanese parser based on constituents. Special attention is given to the tools that allow us to automate the language description process and significantly speed it up. The work on the Japanese model is still in progress; we therefore show the results achieved so far and point out problems that remain to be solved.
Piperski A. Ch.
Corpus Size and the Robustness of Measures of Corpus Distance
This paper studies the impact corpus size has on the robustness of various frequency-based measures of corpus distance (or similarity, respectively), such as Euclidean distance, Manhattan distance, Cosine distance, χ², Spearman’s ρ, and Simple-Maths Keyword distance. An experiment performed using the British National Corpus shows that Euclidean distance is least influenced by corpus size and thus is best suited for the purpose of comparing corpora.
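Three of the frequency-based measures named in this abstract can be sketched in a few lines. The sketch below makes its own assumptions: corpora are given as token lists, word frequencies are normalized to relative frequencies over the joint vocabulary, and both corpora are non-empty. Whether the paper normalizes in this way, and how it defines the remaining measures, is not stated in the abstract; the function names are hypothetical.

```python
import math
from collections import Counter

def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def corpus_distances(tokens_a, tokens_b):
    """Euclidean, Manhattan and cosine distance between two frequency profiles."""
    freq_a = relative_frequencies(tokens_a)
    freq_b = relative_frequencies(tokens_b)
    # Align both profiles on the joint vocabulary; missing words count as 0.
    vocab = sorted(set(freq_a) | set(freq_b))
    a = [freq_a.get(w, 0.0) for w in vocab]
    b = [freq_b.get(w, 0.0) for w in vocab]
    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    manhattan = sum(abs(x - y) for x, y in zip(a, b))
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cosine = 1.0 - dot / norm
    return {"euclidean": euclidean, "manhattan": manhattan, "cosine": cosine}
```

Studying robustness to corpus size, in the spirit of the abstract, would then amount to recomputing these values on samples of different sizes drawn from the same pair of corpora and observing how much each measure drifts.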
Podlesskaya V. I.
“A u nas v kvartire gaz. A u vas?”: the Russian conjunction a viewed through the prism of prosodically annotated corpus data
The paper focuses on Russian constructions with clauses (or VPs) combined by means of the discourse marker A, which behaves as a conjunction or as a particle in different contexts. Prosodically, the construction may come up in two forms: (a) as a single illocution with the first clause pronounced with a rising pitch that projects discourse continuation, and (b) as two separate illocutions with the first clause pronounced with a falling pitch that projects no continuation. Based on data from the Prosodically Annotated Corpus of Spoken Russian, the prosody and grammar of (a) and (b) were analyzed qualitatively and quantitatively. Type (b) appeared to be as frequent as type (a) and systematically favored in pragmatically marked contexts.
Rygaev I. P.
Referring Expression Generation for Question Answering and Graph Visualization
This paper describes a practical solution for the task of referring expression generation (REG) in the context of a question-answering system. When an answer to a question is found in the knowledge base, the system has to decide how to present the answer to the user, i.e. which properties uniquely distinguish the object found from other objects in the knowledge base. Another task where referring expressions would be useful is semantic graph visualization. Building on the graph-based approach presented by Krahmer et al. in 2003, this paper provides some practical improvements to the algorithm, namely: 1) we use breadth-first search instead of depth-first graph search, which is dramatically faster when the scene graph is big but the description graph to be found is small; 2) we limit the size (the number of edges) of the resulting description graph to increase performance and avoid uselessly long descriptions. A sketch of the linguistic realization of the referring expressions is also outlined.
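The two improvements listed in this abstract, breadth-first search and a limit on description size, can be illustrated on a deliberately simplified model. In the sketch below each object is just a set of property labels rather than a node in a scene graph, so this is not the graph-based algorithm of Krahmer et al. nor the author's implementation; the function name, the data layout and the default limit are all hypothetical.

```python
from collections import deque

def shortest_description(target, objects, max_size=3):
    """Breadth-first search for the smallest property set that matches only target.

    objects maps an object name to its set of properties. Returns a set of
    properties, or None if no distinguishing set of at most max_size exists.
    """
    props = sorted(objects[target])
    queue = deque([()])  # each state is a tuple of chosen properties
    while queue:
        chosen = queue.popleft()
        matching = [name for name, p in objects.items() if set(chosen) <= p]
        if chosen and matching == [target]:
            return set(chosen)
        if len(chosen) < max_size:
            # Extend only with properties sorted after the last one chosen,
            # so every combination is generated exactly once.
            start = props.index(chosen[-1]) + 1 if chosen else 0
            for prop in props[start:]:
                queue.append(chosen + (prop,))
    return None

# A hypothetical three-object knowledge base.
scene = {
    "d1": {"dog", "small", "brown"},
    "d2": {"dog", "large", "brown"},
    "c1": {"cat", "small", "brown"},
}
```

Because the queue is processed in order of description size, the first distinguishing set found is also a smallest one, and the size limit bounds how far the search can expand before giving up.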
Sherstinova T. Yu.
The structure of everyday dialogue as the sequence of speech acts
The structure of Russian everyday dialogue was studied on the basis of 73 microdialogues of everyday speech communication from the ʽOne Day of Speechʼ corpus (the ORD Corpus). The aim of the research was to find out what types of speech acts commonly initiate and complete everyday dialogues, as well as to reveal the most typical sequences of speech acts in these dialogues. Altogether, 2230 speech acts of 30 people from both professional and household conversations have been analysed. N-gram analysis has been used to calculate the most frequent sequences of speech acts. The results showed that dialogues are usually started by representatives, i.e. speech acts related to the exchange of information (38% of all cases); etiquette beginnings (greetings, vocatives) occur in 23% of the dialogues, and in 19% of cases the conversation begins with a regulative form. Speech acts ending dialogues show a greater variety: representatives contribute 2% of all dialogue ends, valuative judgments and regulatory forms cover 14% each, followed by directives (8%), commissions (8%), etiquette forms (8%) and emotional and expressive forms (7%). The most typical bigrams of speech acts are the following: two consecutive representatives (22.35%), a regulatory form followed by a representative (6.93%), a representative and a regulatory form (6%), a valuative with a following representative (5.21%), a representative and a valuative judgment (4.77%), as well as two combinations of a directive with a representative (2.77% each). In addition, the article presents data on the occurrence of the most frequent pairs of speech acts at the subtype level. Here, the most frequent one is the sequence ʽquestionʼ+ʽanswerʼ, which covers 2.45%.
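The bigram percentages reported in this abstract are the kind of figure that a simple count over annotated dialogues produces. The following sketch assumes that each dialogue is a list of speech act labels and that bigrams are counted within dialogues only, never across dialogue boundaries; the abstract does not say which convention was used, and the labels in the example are placeholders rather than data from the ORD corpus.

```python
from collections import Counter

def bigram_shares(dialogues):
    """Share of each ordered pair of adjacent speech acts among all bigrams."""
    counts = Counter()
    for acts in dialogues:
        # zip pairs every act with its successor inside the same dialogue.
        counts.update(zip(acts, acts[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}

# Two invented dialogues, for illustration only.
example = [
    ["representative", "representative", "regulative"],
    ["directive", "representative"],
]
shares = bigram_shares(example)
```

The same function applied to the first or last label of each dialogue, instead of adjacent pairs, would give the distribution of dialogue-initial and dialogue-final speech acts.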
Skachkov N. A., Vorontsov K. V.
Improving topic models with segmental structure of texts
Probabilistic topic modeling is a powerful tool of text analysis that reveals topics as distributions over words and then softly assigns documents to the topics. Even though the aggregated distributions can be good with basic models, a sequential topic representation of each document is often unsatisfactory. This work introduces a method that increases the quality of the topical representation of each single text using its segmental structure. Our approach is based on Additive Regularization of Topic Models (ARTM), which is a technique for imposing additional criteria on the model. The proposed method efficiently avoids the bag-of-words assumption by considering the topical connections of words that co-occur in a local segment. We assume that sequential sentences are topically and semantically coherent, while the number of topics in each particular text fragment is low. We apply our model to the topic segmentation task and achieve better quality than the current state-of-the-art TopicTiling algorithm. In further experiments we demonstrate that the proposed technique reveals an interpretable sequential structure of documents while keeping the number of topics low, i.e. the sparsity of the model increases. Apart from topic segmentation, the constructed topical text embeddings can be used in any other application where the analysis of document structure is desirable.
Skorinkin D., Fischer F., Palchikov G.
Building a Corpus for the Quantitative Research of Russian Drama: Composition, Structure, Case Studies
In this paper we introduce RusDraCor, an open corpus of Russian drama for digital literary and linguistic research. The corpus contains plays from the middle of the 18th to the first third of the 20th century, provided with structural (plus some semantic) markup and metadata. Texts are encoded in the XML-based standard TEI, widely used in building corpora for the humanities. We describe the contents and annotation layers of our corpus, provide some details on its development and enrichment, and finally describe three research cases. Each case demonstrates the use of RusDraCor to answer specific questions about the composition, structural features and historical evolution of Russian drama.
Slabodkina T. A., Fedorova O. V.
Speech disfluencies analysis in the discourse of 10–12-year-old native Russian-speaking children
The paper reviews the problem of speech disfluency, which over the years has become traditional for the “Dialogue” conference (see Podlesskaya, Komarova 2010; Laurinavichyute, Fedorova 2010; Fedorova 2010; Podlesskaya 2013; Bogdanova-Beglarian 2013; Podlesskaya 2014; Potanina et al. 2016). In this paper, we compare speech disfluencies in two corpora of dialogues: between children of 10–12 years old (section 1) and between adults (section 2). Both corpora were collected using the referential communication task “Tangrams” (to perform the task, participants had to agree on the nomination of some abstract figures). In the third section, the authors provide classifications of the speech disfluencies present in the dialogues, with examples. The results of the comparison and the methods of analysis are given in the fourth section. Finally, the last section contains a discussion of the results and of the prospects for further work. The paper shows that the speech of children of the given age group differs from adults’ speech in terms of disfluencies at the discourse level.
Slioussar N. A.
Gender, Declension and Stem-final Consonants: an experimental Study of Gender Agreement in Russian
Every adult native speaker of Russian knows that kon’ is masculine and lan’ is feminine, although 3rd declension nouns present some difficulties in the first and second language acquisition. However, will the fact that these nouns are less frequent than masculine nouns ending in a consonant or feminine nouns ending in -a/ja play a role for online subject-predicate agreement processing? Or will subject-predicate agreement processing be more problematic with subjects of a certain gender? Finally, some final consonants are more characteristic for feminine gender, while the others for masculine gender. Are speakers sensitive to this? We present two experiments addressing these questions. We found that all three factors play a role, but for different tasks (online agreement processing or determining the gender of a novel word) and at different processing stages.
Sorokin A. A.
Improving neural morphological Tagging using Language Models
We offer a new neural architecture for character-level morphological tagging, combining character-level networks with the output of a neural language model on morphological tags. Our proposal reduces tagging error by up to 10% in comparison with the baseline model and achieves state-of-the-art performance on both the ru_syntagrus and MorphoRuEval datasets.
Stoynova N. M.
Differential object marking in contact-influenced Russian Speech: evidence from the Corpus of Contact-influenced Russian Speech of Russian Far East and Northern Siberia
The paper deals with differential object marking in the Russian Speech of Nanai-Russian bilingual speakers, namely variation such as принес рыбу ~ принес рыба (‘{he} brought fish-acc ~ fish-nom’). The puzzle is that this peculiarity can result from a number of different processes: morphosyntactic borrowing from Nanai, penetration of dialectal features into the speech of bilinguals, under-acquisition or reinterpretation of the Standard Russian system. The data of a small corpus of contact-influenced Russian Speech is used to test all these hypotheses. The results are as follows. Nominative forms are used in DO-position in quite a systematic way, and such uses cannot be regarded as occasional “errors”. The main factors that influence the NOM~ACC distribution are a) information structure and b) the accentual type of the noun stem. The latter fact supports the hypothesis of a systematic reinterpretation of the Standard Russian system in the situation of incomplete acquisition. No significant correlations with animacy, definiteness, verb form and word order were attested. The DOM pattern of Nanai Russian differs from those of Russian dialects and reveals some similarity to those of Nanai. However, it cannot be considered a full morphosyntactic calque.
Tiskin D. B.
The interpretation of Russian pronouns in counteridentity contexts: a corpus study
This paper is a first step towards a corpus-based description of the semantics of Russian pronouns in intensional contexts. Having justified the use of corpus in (formal) semantic research, I delineate a particular issue within the topic: whether a given pronoun is interpreted de se or de re in counteridentity contexts. A counteridentity context is a clause within the scope of a counterfactual (clause or adverbial) that affects the identity of a real individual, e.g. if I were you, were I you, etc. If a pronoun such as I, my or the Russian reflexive possessive svoj is used in such a context, two options are theoretically possible: either it picks out the speaker’s real self (de re), or it refers to the identity assumed by the speaker in the contrary-to-fact situations introduced by the counterfactual (de se). Using data from the GICR corpus (approx. 20 billion tokens), I show that for the Russian first-person singular pronoun ja and its corresponding possessive moj, de se reference is possible but de re interpretation is more frequent. The opposite holds for the reflexive sebja, whereas svoj is interpreted de se with no exception. Special attention is paid to situations where more than one referential strategy is possible. The paper concludes with a couple of observations relevant for the future formal accounts of de se reference.
Toldova S., Pisarevskaya D., Kobozeva M., Vasilyeva M.
The cues for rhetorical relations in Russian: “Cause—Effect” relation in Russian Rhetorical Structure Treebank
The purpose of the paper is to investigate cues signalling the relations between discourse units in Russian. Building a lexicon of discourse connectives is an indispensable subtask in many discourse parsing applications as well as an essential issue in theoretical research on text coherence. In order to develop such a resource for Russian, we have conducted a corpus-based study of discourse connectives that were manually extracted from the Russian Rhetorical Structure Treebank (Ru-RSTreebank). The Treebank includes 79 texts annotated within the RST framework [Mann, Thompson 1988]. In order to provide a deeper analysis of connectives in Russian, we focus on causal relations only, namely, the ‘Cause-Effect’ relation. Some of the connectives (primary connectives) are enumerated in grammars and dictionaries. They primarily mark intra-sentential relations. However, there is an expansive class of less grammaticalized items (secondary connectives) that have received less attention so far. Some of them are based on content words (e.g. по причине ‘for the cause’). Secondary connectives often serve as linking devices for inter-sentential relations. We suggest a scheme for the annotation of connectives for Russian. We specify the basic patterns that can be used for mining less grammaticalized connectives in an unannotated corpus. In addition, we compare the two classes of connectives (primary vs. secondary ones). Our research has shown that these two classes differ in their properties. There is a statistically significant difference between them with respect to the nucleus/satellite position, intra- vs. inter-sentential relations and some other features.
Uryson E. V.
Syntax of prepositional adverbs: some difficult cases
The subject of this paper is Russian so-called adverbial prepositions; cf. vokrug (kostra) ‘around smth.’, daleko ot (doma) ‘far from smth.’, etc. By definition, an adverbial preposition either coincides with an adverb (cf. vokrug) or contains an adverb and a preposition (cf. daleko ot). As I have demonstrated in my previous works, an adverbial preposition and the underlying adverb have the same meaning, the only difference between them being in the mode of expression of the main semantic actant; cf. Gorel koster, vokrug (preposition) kostra stojali liudi ‘A fire was burning, people were standing around it’ vs. Gorel koster, vokrug (adverb) stojali liudi ‘A fire was burning, people were standing around’. From the modern point of view, a syntactic distinction is insufficient for interpreting such cases as different words (or different meanings of a word). So, an adverbial preposition and the underlying adverb should be interpreted as the same meaning of a given word. I argue that this word is an adverb (or a prepositional adverb). This paper deals with the syntax of these adverbs. Such adverbs have one or more semantic actants, at least one of them being expressed by a noun or a prepositional group. The problem is that in some cases it is not clear whether the prepositional group is governed by the adverb or by the verb governing this adverb (in which case the adverb and the prepositional group are co-governed by the verb). A criterion for distinguishing adverb government from verb government of such groups is discussed. Two Russian adverbs, zadolgo ‘for a long time before smth.’ and nezadolgo ‘shortly before smth.’, are described from this point of view.
Vilinbakhova E. L.
Chto budet, to (i) budet: on one pattern of tautologies in Russian
This paper contributes to the debate on the analysis of linguistic tautologies, i.e. structures that state an unquestionable truth by virtue of their logical form and therefore require a reinterpretation to be informative. While there is a great number of studies of nominal tautologies of the form ‘X is X’, clausal tautologies, i.e. conditionals ‘if P, P’, disjunctives ‘either P or not P’, free relatives ‘P, what P’, etc., are given less attention. This paper investigates one such pattern, namely, correlative tautologies, where the subordinate clause precedes the main clause, exemplified by the expression chto budet to (i) budet lit. ‘what will be that (EMPH) will be’. The data taken from the Russian National Corpus and the Internet, as well as dictionary definitions, show that tautologies of this kind exhibit various peculiar properties. First, some correlative tautologies can receive opposite interpretations in different contexts, i.e. chto bylo, to bylo lit. ‘what has been that has been’ can mean either ‘this fact cannot be denied’ [Bylugina, Shmelev 1997] or ‘the past should be forgotten for the sake of the future’ [Active Dictionary of Russian]. Next, the particle i, which is commonly used in Russian correlatives, cf. [Mitrenina 2010], is acceptable for some tautologies but not licensed in others. I argue that for correlative tautologies the crucial ingredient is the salience of the situation in question as presented by the speaker, which, along with the specific vs. generic readings available, results in four possible strategies of their interpretation.
Yanko T. E.
Imperatives, Vocatives, and Questions in Coherent Discourse: The Prosodic Markers of Incompleteness in the Russian Spoken Speech Corpora
One of the means of designating coherence in spoken discourse is demonstrating that the current utterance of the discourse is not terminal. Every step of a narrative consisting of a chain of statements can be marked as non-final. The prosodic cues for incompleteness applied to the speech act of a statement have been studied in detail in the linguistic literature. In this paper, discourse incompleteness is analyzed as composed not only of statements but of questions, imperatives, and vocatives as well. The results of the investigation are as follows. Wh-questions, imperatives, and vocatives can be freely combined with the meaning of discourse continuity, and they have specific prosodic cues for marking this combination of meanings. Yes-no questions, by contrast, do not accept the prosodic marking of incompleteness. The prosodic patterns of incompleteness and the accent placement in questions, vocatives, and imperatives are exemplified here by dialogues taken from the Multimodal Corpus of the Russian National Corpus, the Prosodically Annotated Corpus of Spoken Russian, and a minor working collection of Russian speech recordings specifically set up for this investigation. The software program Praat was used in analyzing the audio data.
Zalizniak A. A., Denisova G. V., Mikaelian I. L.
Russian kak-nibud’ Through the Prism of Parallel Corpora
The paper proposes a semantic analysis of the Russian indefinite adverb kak-nibud’ based on the data collected from the French-Russian, Italian-Russian, and English-Russian parallel subcorpora of the Russian National Corpus, as well as from the Data Base of the Russian Discourse Markers and their French equivalents. The study applies the “unidirectional method” of contrastive analysis, within which the translation by a professional translator is viewed as a quasi-lexicographic explication of a given unit revealing implicit components of its semantics. Our analysis demonstrates that kak-nibud’ is a highly language-specific Russian word. This is reflected in a high percentage of null equivalents of this unit in the three languages under investigation, whether Russian is taken as the source or the target language. The study has also allowed us to show that the analyzed adverb can function as a marker of non-controllability of a hypothetical event, similar to the function of the subjunctive mood in Romance languages. On the other hand, the use of kak-nibud’ (‘anyhow’, ‘poorly’) in a purely evaluative meaning cited by monolingual and bilingual dictionaries has shrunk in contemporary Russian compared to the Russian of the 19th century.
Zimmerling A. V.
Two Dialects of Russian Grammar: Corpus Data and Formal Models
This paper addresses the problem of parametric variation in Russian grammar, with a focus on copular constructions with agreeing and non-agreeing adjectival predicates. Based on the Russian National Corpus, I reconstruct two dialects of Russian morphosyntax. They differ regarding the assignment of the predicative instrumental case, raising conditions and the distribution of agreeing vs. non-agreeing predicates after быть 'be', стать 'become' and казаться 'seem'. Russian-A only licenses predicative instrumental on adjectives after SEEM (казалось странным, что P) and non-agreeing predicatives after non-zero forms of BE or BECOME (было странно, что P). Russian-B allows non-agreeing forms after SEEM (казалось странно, что P) and forms of the predicative instrumental case after non-zero forms of BE and BECOME (было странным, что P). I argue that the differences between Russian-A and Russian-B must be explained in terms of parametric settings and claim that Russian predicatives lack forms of the predicative instrumental. The assignment of the predicative instrumental to adjectival heads can be explained as subject control in all dialects, but only Russian-B allows raising of sentential arguments to the position of the matrix subject.
Zinina A. A., Arinkin N. A., Zaydelman L. Ya., Kotov A. A.
Development of a Communicative Behavior Model for the F-2 Robot Based on the “REC” Multimodal Corpora
The article describes an architecture developed for modeling natural communicative behavior on the F-2 robot. An important part of our work is the study of human communicative behavior and the transfer of this behavior to the robot. For this purpose, we are developing the Russian Emotional Corpus (REC), where video recordings of natural emotional dialogues are collected. We explore the features of natural communication and develop an architecture that takes these features into account. For example, using the architecture presented in the article, a robot can express any communicative function using one or more executive organs: for instance, it can express an appeal with facial expressions, head movements or gestures. The architecture also allows us to flexibly combine gestures with different communicative functions. It supports “split”, “join” and “single” modes for combining tags from different BML-packages, as well as synchronizing tags within a single BML-package. These features are important for modeling human-like behavior on the F-2 robot and are necessary to improve communication between a robot and a user.