Proceedings 2015

Adaskina Yu. V., Panicheva P. V., Popov A. M.
Syntax-based Sentiment Analysis of Tweets in Russian
The paper describes our approach to the task of sentiment analysis of tweets within SentiRuEval, an open evaluation of sentiment analysis systems for the Russian language. We took part in the task of object-oriented sentiment analysis of Russian tweets concerning two types of organizations: banks and telecommunications companies. On both datasets, the participants were required to perform a three-way classification of tweets: positive, negative or neutral. We used various statistical methods as a basis for our machine learning algorithms and checked which features would provide the best results. Syntactic relations proved to be a crucial feature for every statistical method evaluated, but SVM-based classification performed better than the others. Normalized words are another important feature for the algorithm. The evaluation showed that our method was rather successful: we ranked first in three out of four evaluation measures.
Apresjan V. Ju.
Correlation between Semantic and Communicative Properties of Words
The objective of this paper is to determine what semantic components in the meaning of a word facilitate its lexicalization as prosodically marked and aid its focalization in an utterance. The paper demonstrates that prosodic and communicative properties of a word correlate with its semantic properties. In particular, a case study of different senses of the words tol’ko ‘only’, pravda ‘true’, eshche ‘still, more’, voobshche ‘in principle, generally’, po krajnej mere ‘at least’ and some others reveals that focalization and prosodic marking in a word are triggered by the semantics of contrast, high degree, and addition. On the other hand, semantics of concession in the meaning of a word limits its ability for accentual marking and focalization. The observed correlations between semantics, on the one hand, and prosody and communicative properties, on the other, are confirmed by the multimedia corpus data.
Arefyev N. V., Panchenko A. I., Lukanin A. V., Lesota O. O., Romanov P. V.
Evaluating Three Corpus‑based Semantic Similarity Systems for Russian
This paper reports the results of our participation in the first shared task on Russian Semantic Similarity Evaluation (RUSSE). We compare three corpus-based systems that measure semantic similarity between words. The first one uses lexico-syntactic patterns to retrieve sentences indicating a particular semantic relation between words. The second one builds a traditional context window approach on top of Google N-Grams data to take advantage of the huge corpora it was collected from. The third system uses word2vec trained on a huge book collection. word2vec is one of the state-of-the-art methods for English. Our initial experiments showed that it yields the best results for Russian as well, compared to the other two systems considered in this paper. Therefore, we focus on the study of word2vec meta-parameters and investigate how the training corpus affects the quality of the produced word vectors. Finally, we propose a simple but useful technique for dealing with out-of-vocabulary words.
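As a rough illustration of the context window idea mentioned in the abstract above, the sketch below builds co-occurrence vectors from a token list and compares two words by cosine similarity. The toy sentence, the window size and the function names are assumptions made for illustration; the abstract does not describe how the authors weight or compare contexts.

```python
import math
from collections import Counter, defaultdict


def context_vectors(tokens, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (hypothetical); a real system would use large n-gram counts.
tokens = "the cat drinks milk the dog drinks water".split()
vectors = context_vectors(tokens, window=1)
similarity = cosine(vectors["cat"], vectors["dog"])
```

In this toy corpus "cat" and "dog" occur in identical contexts, so their similarity is maximal; a word that never occurs gets an empty vector and a similarity of 0.0.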
Baranov A. N.
Justice versus Injustice: Metaphorical Interpretations in Modern Russian Discourse (through Textcorpus of Print Media)
The paper considers the Russian words spravedlivost’ (justice) and nespravedlivost’ (injustice) and their corresponding concepts. Formally, the words spravedlivost’ and nespravedlivost’ are antonyms, because morphologically they differ only in the morpheme ne- (“no”). But their meanings differ in a more complicated way. The word spravedlivost’ has an abstract meaning: it denotes a value category. The extension of the word nespravedlivost’ is different: it is used to denote a wide range of situations in which features of justice as a value concept are violated. For this reason the word spravedlivost’ is de facto a singulare tantum: it has no plural. By contrast, the word nespravedlivost’ (injustice) has both forms in Russian speech, singular as well as plural. The semantic differences between the two words become apparent in the metaphorical models that speakers use to interpret justice and injustice in Russian public discourse, which is modeled here by a text corpus of print media.
Aleksandrs Berdičevskis, Hanne Eckhoff
Automatic Identification of Shared Arguments in Verbal Coordinations
We describe automatic conversion of the SynTagRus dependency treebank of Russian to the PROIEL format (with the ultimate purpose of obtaining a single-format diachronic treebank spanning more than a thousand years), focusing on analysis of shared arguments in verbal coordinations. Whether arguments are shared or private is not marked in the SynTagRus native format, but the PROIEL format indicates sharing by means of secondary dependencies. In order to recover missing information and insert secondary dependencies into the converted SynTagRus, we create a simple guessing algorithm based on four probabilistic features: how likely a given argument type is to be shared; how likely an argument in a given position is to be shared; how likely a given verb is to have a given argument; how likely a given verb is to have a given argument frame. Boosted with a few deterministic rules and trained on a small manually annotated sample (346 sentences), the guesser very successfully inserts shared subjects (F-score 0.97), which results in excellent overall performance (F-score 0.92). Non-subject arguments are shared much more rarely, and for them the results are poorer (0.31 for objects; 0.22 for obliques). We show, however, that there are strong reasons to believe that performance can be increased if a larger training sample is used and the guesser gets to see enough positive examples. Apart from describing a useful practical solution, the paper also provides quantitative data about and offers non-trivial insights into Russian verbal coordination.
Bergelson M. B., Akinina Yu. S., Dragoy O. V., Iskra E. V., Khudyakova M. V.
Markers of Word Production Difficulties in Normal and Clinical Discourse Production: Continuity of Norm in Language and Discourse
Aphasia is a language impairment due to brain damage. Word-finding and word-retrieval problems can be very prominent in the speech of people with aphasia, being detectable in almost every aphasic speaker. On the other hand, word-finding difficulties and speech errors can sometimes occur in the speech of neurologically healthy people. It is assumed that the same psycholinguistic levels of word-retrieval breakdown can account for the mistakes of both groups. At the same time, retrieval of a single word from the mental lexicon is not the only possible level of hindrance for a speaker: referential and lexical choices that take place at the more general discourse and pragmatic level can also be disturbed. The Russian CLiPS (Russian CLinical Pear Stories) is a corpus of film-elicited narratives collected following the methodology of Chafe (1980) from healthy and language-impaired cohorts. The aim of our research was to investigate the characteristics of formal markers of word retrieval difficulties in narratives of neurologically healthy people and people with aphasia. Three types of markers were considered (discourse markers, false starts and self-corrections) in the nominations of common referents of Pear Stories narratives. The markers at different breakdown levels are qualitatively analysed, creating a platform for future analysis.
Blinov P. D., Kotelnikov E. V.
Semantic Similarity for Aspect‑Based Sentiment Analysis
The paper investigates the problem of automatic aspect-based sentiment analysis. This variant is harder than general sentiment analysis, but it significantly pushes forward the limits of unstructured text analysis methods. The first part reviews previous approaches and works and describes the training and test collections. The second part of the article describes the methods for the main subtasks of aspect-based sentiment analysis. The method for explicit aspect term extraction relies on the vector space of distributed representations of words. The term polarity detection method is based on pointwise mutual information and a semantic similarity measure. Results from the SentiRuEval workshop for the automobile and restaurant domains are given. The proposed methods achieved good results in several key subtasks. In the aspect term polarity detection task and in sentiment analysis of the whole review by aspect categories, the methods showed the best result for both domains. In the aspect term categorization task our method placed second. For explicit aspect term extraction, the first result was obtained for the restaurant domain according to the partial match evaluation criterion.
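To make the pointwise mutual information ingredient mentioned above more concrete, here is a minimal sketch that estimates PMI from raw co-occurrence counts and derives a polarity score as the difference between a term's PMI with positive and with negative contexts. The scoring rule, the counts and the function names are assumptions for illustration only; the abstract does not say how the authors combine PMI with the semantic similarity measure.

```python
import math


def pmi(joint_count, x_count, y_count, total):
    """PMI(x, y) = log2(P(x, y) / (P(x) * P(y))), estimated from raw counts.

    No smoothing is applied, so a zero joint count yields negative infinity.
    """
    if joint_count == 0:
        return float("-inf")
    p_xy = joint_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))


def polarity_score(term_in_pos, term_in_neg, pos_total, neg_total):
    """Hypothetical score: PMI(term, positive) minus PMI(term, negative)."""
    total = pos_total + neg_total
    term_total = term_in_pos + term_in_neg
    return (pmi(term_in_pos, term_total, pos_total, total)
            - pmi(term_in_neg, term_total, neg_total, total))


# Made-up counts: the term occurs in 8 positive and 2 negative contexts,
# out of 50 positive and 50 negative contexts overall.
score = polarity_score(8, 2, 50, 50)
```

A positive score suggests the term leans positive and a negative score that it leans negative; with real data some form of smoothing would be needed for terms never seen with one of the classes.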
Bogdanov A. V., Gorbunova I. M.
The Case of Russian Subject Pro in Machine Translation System
This paper concerns the problem of Russian floating quantifiers (also known as semipredicatives) in machine translation. Floating quantifiers in Russian (such as оба ‘both’, один ‘alone’, сам ‘on one’s own’, etc.) are inflected for case, number and gender and agree in those categories with the subject of the minimal (finite) clause containing them. However, the case of a floating quantifier in an infinitive clause varies according to the type of PRO control applied and some other structural characteristics of the infinitive clause. This poses a problem for rule-based machine translation: choosing the correct case for the quantifier at synthesis, or linking it correctly to its antecedent at analysis. A model-based machine translation system, such as ABBYY Compreno, can handle the case choice problem, as this paper aims to show.
Igor Boguslavsky, Vyacheslav Dikonov, Leonid Iomdin, Alexander Lazursky, Victor Sizov, Svetlana Timoshenko
Semantic Analysis and Question Answering: a System Under Development
The paper presents a system of semantic analysis and a question answering system implemented on its basis for a specific subject domain: (European) football match news. As input, the system obtains a natural language question (in Russian), which it answers with an element (or elements) from the repository of individuals. The core part of the system is the semantic analyzer of natural language texts. For each sentence of the text processed, the special semantic analysis component of the ETAP-3 linguistic processor constructs a semantic structure, which consists of a set of triples of the type semantic_relation(individual, individual). Semantic relations and individuals constituting this structure correspond to the elements of the ontology, which can thus be viewed as a functional analogue of a dictionary for the semantic language. Semantic structures of sentences belonging to a particular text are integrated thanks to coreference and anaphora resolution and converted into an OWL document, which is later used as a database. This database is supplemented by background knowledge from the repository of individuals concerning specific teams, football players, and games. Thanks to this resource, we are able to find an answer to the question using not only the data contained in different sentences of the text but also in the repository of individuals. If the user asks “What team defeated the champion of Spain?” while we have a text reporting that “Slutsky’s players outplayed Atletico Madrid”, then the system will establish the correspondence between the question, the text, and the records in the repository of individuals, and will come up with the correct answer “CSKA”. The semantic structure obtained from the natural language question is converted into a SPARQL query addressed to the database. Currently, all parts of the system are operating in the test mode.
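The question answering example in the abstract above can be mimicked with a toy triple store. The sketch below is a plain Python stand-in, not the OWL and SPARQL machinery the abstract describes; the relation names (defeated, coach_of, champion_of) and the stored facts are hypothetical and chosen only to mirror the CSKA example.

```python
# Hypothetical facts in the form relation(individual, individual),
# stored as (relation, subject, object) tuples.
TRIPLES = {
    ("defeated", "CSKA", "Atletico Madrid"),      # from the analyzed text
    ("coach_of", "Slutsky", "CSKA"),              # background knowledge
    ("champion_of", "Atletico Madrid", "Spain"),  # background knowledge
}


def match(pattern, triples=TRIPLES):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


def who_defeated_champion_of(country):
    """Answer 'What team defeated the champion of <country>?' by joining two patterns."""
    champions = {subj for _, subj, _ in match(("champion_of", None, country))}
    winners = {subj for _, subj, obj in match(("defeated", None, None)) if obj in champions}
    return sorted(winners)


answer = who_defeated_champion_of("Spain")
```

The join over two triple patterns plays the role that a SPARQL query with two triple patterns and a shared variable would play against the real database.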
Anastasia Bonch-Osmolovskaya
Quantitative Methods in Diachronic Linguistic Studies: the Case of Russian Dative Subjects with Predicatives
The paper aims to demonstrate how quantitative corpus methods used in linguistics research may help to rank different realizations of the same phenomenon: the use of dative subjects in predicative and adjective constructions. The core idea of the research is to study the distribution of dative subject constructions with predicative and adjective forms that potentially can be used in such constructions, i.e. the tendency of the construction to express or omit the dative subject. While predicates are usually classified on the basis of whether they can potentially be used with a dative subject, I study the trends for explicit use of dative (or prepositional beneficiary) arguments among the “dative subject predicates”, and show that the frequency of actual use of dative subjects can be very different with different predicates. Separate analysis of different morphological forms of the same dative subject lexeme (i.e. adjectives in full and short forms, comparative adjectives and predicatives) shows that they may also exhibit different strategies with explicit dative subjects. Finally, data from the 18th and the 21st centuries are compared and hierarchical clustering is used to reveal some diachronic trends.
Daniel M. A.
Stem Initial Alternation in Russian Third Person Pronouns: Variation in Grammar
The paper discusses the present stage of the evolution of the initial [n]/[j] stem alternation in Russian third person pronouns. After providing a short overview of the origins of the forms, I focus on their category status, discuss Zalizniak’s ‘adpositionality’ in some detail, and then proceed to considering the cases where the ‘n’-forms are induced by a distant ‘controller’. I will show that the fact that the ‘n’-forms are essentially variants is better accounted for by the notion of ‘trigger’ of a morphological variant. In my view, this opens the way to a better understanding of the observed evidence than the conventional notion of morphosyntactic controller does, and certainly a better one than explanations in (morpho)phonological terms. In the end, I will briefly argue that, in a sense, the evolution of the alternation is similar to degrammaticalization, showing a movement from a morphophonologically conditioned external sandhi to a morphosyntactic category similar to government.
Dmitrij Dobrovol’skij, Irina Levontina
Modal Particles and the Actualization of Forgotten Details (Based on the Materials of Parallel Corpora)
The use of parallel corpora carries with it special problems, particularly when it comes to units that are typical of oral speech. Nevertheless, it is the presence of good Russian-English and Russian-German parallel texts in the RNC that has made the present study possible. Our analysis also demonstrates the limitations inherent in investigations based on parallel corpora, especially with respect to discursive words. Only a combination of various research methods is capable of producing adequate results. In this study we analyze a group of discursive words whose semantics actualize things that have been forgotten. There are two types of situations here. In the first, the speaker reminds the addressee of some object or event; in the second, the speaker attempts to remember some detail or name connected with the events s/he is talking about. Certain discursive words apply to both types of contexts, whereas others are special to one of them. Among the Russian discursive words that express these ideas are units such as biš’, tam, ešče, pomnite, ėtot and phrasemes like kak tam, kak ego, kak že, ėto samoe, etc. We also examine English and German equivalents used to translate these words. Russian has a rich repertory of discursive resources for actualizing forgotten details. In English, if the corresponding meanings are expressed at all, it tends to be done either syntactically or by means of explicit utterances. The German arsenal of discursive resources is no less extensive than the Russian, but there are no one-to-one correspondences between the Russian and German discursive words. They have different semantic configurations, and although the meaning components are often quite similar, they combine differently, so that translations of such particles in various contexts are rather diverse.
Dobrushina N. R.
Subjunctive Particle as a Part of Conjunction
The Russian subjunctive is expressed by an analytical form which consists of the subjunctive particle by (b) and a past indicative, an infinitive, or one of a few predicative adverbs and adjectives. The subjunctive particle is an enclitic. It often merges with subordinate conjunctions, which yields words functioning as conjunctions and containing the subjunctive particle. Historically, the particle by in conjunctions can be traced back to the marker of subjunctive. Synchronically, however, the group is not homogeneous. The aim of the paper is to find out which of the conjunctions with by should be considered as containing the marker of subjunctive, and to test whether the particle can be separated from the conjunction. Four criteria are used. The first and the second, namely, (a) the forms available in the subordinate clause with the conjunction and (b) the possibility of repetition of the particle by with the second predicate, show that comparative conjunctions do not synchronically contain the subjunctive marker. The third and fourth criteria, namely (c) the omission of the particle by and (d) its ability to be separated from the conjunction by other words, give different results.
Fedorova O. V.
Referent Introduction in Russian Spoken Narratives
In a series of papers published twenty years ago on the analysis of Russian, German and Shan tales, we examined the typology of referent introduction in written texts. The general purpose of the current study was to evaluate how far the “tale” model of introduction is applicable to spoken narratives; as an alternative approach we considered Chafe’s model, based on the English “Pear stories” (Chafe 1980). Twenty-five Russian participants took part in the experiment; all the participants described the same experimental film about a child stealing pears; thus we analyzed 25 narratives and 125 introductive sentences. Surprisingly, our model differs from both the “tale” introductive model and Chafe’s model on each of the following points: (1) type of the common ground, (2) speech disfluencies, (3) the character status and number of clauses, (4) the “light subject” constraint, (5) the “one new idea” constraint. However, all of these results need further empirical justification in new studies on Russian materials.
Galitsky B. A.
Document vs. Meta‑Document: are Their Rhetoric Structures Different?
The problem of classifying text with respect to belonging to a document or a meta-document (metalanguage and language object patterns) is formulated and its application areas are proposed. An algorithm is proposed for document classification tasks where word counts are insufficient to differentiate between such abstract classes of text as metalanguage and object-level text. We extend the parse tree kernel method from the level of individual sentences towards the level of paragraphs, based on anaphora, rhetoric structure relations and communicative actions linking phrases in different sentences. Tree kernel learning is then applied to these extended trees to leverage additional discourse-related information. We evaluate our approach in the domain of action-plan documents, as well as in the literature domain, recognizing some portions of text in Kafka’s novel “The Trial” as metalanguage patterns and differentiating them from the novel’s description in the studies of Kafka by others.
Galyashina E. I.
Linguistic Analysis in the Speaker Identification Systems: Integrated Complex Examination Approach Based on Forensic Science Technology
The article proposes the concept of integrated expert techniques for speaker identification in the domain of complex acoustic-phonetic and linguistic methods of oral speech analysis, defining general and special forensic expert competences. The result of forensic speaker identification used as evidence must exhibit a high level of reliability. The author examines the key concepts and terms of the procedure of individual-specific speaker identification from the standpoint of modern expertology (forensic science). The paper states the need to take into account that the role of professional linguistic competences increases in conditions where digitized speech signals are compared, the coding algorithm is not known and falsification of utterances is not excluded. To solve this problem the author proposes a multistage approach consisting of a parallel application of instrumental and technical methods together with aural-perceptual, waveform and sonogram investigation and sophisticated linguistic analysis. The main focus is on the linguistic component of the complex integrated approach, which is based on phonetic and semantic analyses. It is stated that an individualized speech unit is formed by a system of miscellaneous formal and semantic relations of structural speech components in linguistic contents. The proposed method of integration of the multilevel speech modules was implemented in the forensic linguistic methodology of speaker identification. This made it possible to considerably increase the reliability of the expert’s decision and provided an opportunity to use it as a component of the multistage system for authenticating speech utterances.
Goncharova M. B., Kozlova E. A., Pasyukov A. V., Garashchuk R. V., Selegey V. P.
Model-Based WSA as Means of New Language Integration into a Multilingual Lexical-Semantic Database with Interlingua
This paper presents a model-based approach to Word Sense Alignment (WSA) applied to new language integration within the ABBYY Compreno lexical-semantic database with interlingua. Using the model, i.e. semantic and syntactic compatibility, we perform semantic-syntactic analysis with a language-independent structure as a result. With the comprehensive description of core languages at our disposal, we analyze parallel resources, namely, the part of a bilingual dictionary and of a parallel corpus in a source language, and obtain a set of candidate concepts for meanings of a target language. In this way, we accomplish WSA between the dictionary meanings and the concepts of interlingua. Once the correspondences between the meanings and the concepts of the hierarchy are established, these new meanings can be incorporated into the lexical-semantic database. The integration is carried out semi-automatically, i.e. at the final stage the correspondences are to be approved by a linguist; however, the amount of manual work is reduced to a minimum.
Grishina E. A.
Quantifiers, Gesticulation, and Viewpoint
The study analyzes gestures that regularly accompany the Russian universal quantifiers ves’ ‘the whole of’, vse ‘all’, kazhdyy ‘every’, l’uboy ‘any’. The results of the study show that the accompanying gesticulation correlates more with the pragmatic features of the quantifiers (the speaker’s spatial and evidential position and the speaker’s modus operandi) than with the logical components of the quantifier’s semantic structure. The Multimodal Russian corpus (MURCO) has been used as a source of the data.
Grozin V. A., Dobrenko N. V., Gusarova N. F., Ning Tao
The Application of Machine Learning Methods for Analysis of Text Forums for Creating Learning Objects
Nowadays the concept of a learning object (LO) is widely used in the preparation of educational materials. Usually, LOs are parts or fragments of previously created educational content, which is very informative and pedagogically focused. However, in highly dynamic branches of science and technology LOs tend to become outdated and trivial, thus losing their educational value. In this situation, specialized text forums become a valuable source of knowledge. Forums contain the experience of people who actually used the technology and its features. They contain both positive and negative experience, something that is not available from official documentation at all. However, they also contain many trivial, repeated or simply irrelevant posts. Also, an expert needs to extract useful messages from text forums according to his individual learning objectives. The paper deals with the task of automatically identifying texts within text forums that are potentially useful for the preparation of textual educational materials. For our experiments, we have selected highly inflective languages with complex grammar and rather weak text analysis tools: French, German, Russian and Chinese (Mandarin). We have surveyed and analyzed non-semantic text and social features of a text forum that indicate suitability for the creation of a textual LO. For this purpose, we have constructed linear and non-linear machine learning models and conducted feature selection. Even for forums providing little information about the chosen topics and forums with a lot of off-topic text in the dataset, these models were better than the baseline selection methods.
Iomdin B. L.
Nuts: What are They?
When describing words which denote real life objects, dictionaries tend to use scientific terms and classifications, even when dealing with natural language. This approach may lead to misunderstanding, especially in cases when scientific classification (e.g. in biology) differs from what is found in natural language data. One such case is discussed here, namely the small but rather interesting class of nuts (Russian orexi). In the botanical world view nuts usually include hazelnuts and chestnuts, but do not include almonds (which are considered stone fruits), pine nuts (seeds), peanuts (legumes), pistachios (kernels), etc. The Russian orex, English nut, Latin nux exhibit similar behaviour here. Explanatory dictionaries of Russian more or less follow the botanical definitions, even though in many fields (such as cooking, food industry, medicine, etc.) nuts are classified differently. In order to establish the boundaries of nuts in Russian, more than 1,000 native speakers were questioned and multiple texts of different periods were studied. The result is a peculiar class which could not be identified with any of the natural language supercategories described by Anna Wierzbicka. A new lexicographic description is proposed for some words included in this class.
Ivanov V. V., Tutubalina E. V., Mingazov N. R., Alimova I. S.
Extracting Aspects, Sentiment and Categories of Aspects in User Reviews about Restaurants and Cars
This paper describes a method for solving aspect-based sentiment analysis tasks in the restaurant and car review subject domains. These tasks were articulated in the Sentiment Evaluation for Russian (SentiRuEval-2015) initiative. During SentiRuEval-2015 we focused on three subtasks: extracting explicit aspect terms from user reviews (task A), aspect-based sentiment classification (task C), and automatic categorization of aspects (task D). For tasks C and D we propose two supervised methods based on a Maximum Entropy model and Support Vector Machines (SVM), respectively, that use a set of term frequency features in the context of the aspect term and lexicon-based features. We achieved a macro-averaged F-measure of 40% for cars and 40.05% for reviews about restaurants in task C. We achieved a macro-averaged F-measure of 65.2% for cars and 86.5% for reviews about restaurants in task D. This method ranked first among 4 teams in both subject domains. The SVM classifier is based on unigram features and pointwise mutual information to calculate a category-specific score and associate each aspect with a proper category in a subject domain. In task A we carefully evaluated the performance of a method based on syntactic and statistical features incorporated in a Conditional Random Fields model. Unfortunately, the method did not show any significant improvement over a baseline. However, its results are also presented in the paper.
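Since the results above are reported as macro-averaged F-measure, a short sketch of that metric may help: per-class F1 is computed separately and the class scores are then averaged with equal weight. The labels and predictions are made up, and the exact set of classes averaged in the SentiRuEval evaluation is not stated in the abstract, so averaging over every label that occurs is an assumption here.

```python
def macro_f1(gold, predicted):
    """Average the per-class F1 over every label seen in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical gold labels and system output for four aspect terms.
gold = ["pos", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg"]
result = macro_f1(gold, predicted)
```

Because every class counts equally, a system that does well only on the majority class is penalized more heavily than under accuracy or micro-averaging.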
Kibrik A. A.
The Problem of Non-Discreteness and Spoken Discourse Structure
Language consists of units of various hierarchical levels, but the boundaries between the units are not always crisp, and non-discrete effects are observed. That applies not only to syntagmatic structure, but also to paradigmatics, diachrony, and even whole languages. Non-discreteness is a common property of language and cognition. In contrast to conventional discrete and continuous structures, I propose another kind of structure that can be called focal. Focal phenomena are simultaneously distinct and related. It is necessary to recognize focal structure as one of the major types of structures typical of natural language. Non-discrete effects can be observed at the level of discourse. Spoken discourse consists of elementary discourse units (EDUs), identifiable with the help of a set of behavioral criteria. Along with prototypical clausal EDUs, there are deviant EDUs of various kinds. Parcellated elaborations constitute an example of a paradigmatic outlier among the EDUs. Non-discrete boundaries between EDUs are an illustration of syntagmatic difficulties in EDU identification. Phonemes, EDUs, and other units are not as crisp and clean as our digital mind would want them to be. In order to address linguistic reality in its actual complexity, we have to recognize that segmentation follows the principles of focal structure, which is the general property of language and cognition.
Kipyatkova I. S., Karpov A. A.
Development of Factored Language Models for Automatic Russian Speech Recognition
In this paper, we present a study of factored language models (FLM) of Russian for rescoring N-best lists in automatic speech recognition (ASR) systems. We used 3-gram language models as the baseline. Both 3-gram and factored language models were trained on a text corpus collected from recent Internet online newspapers; the total size of the text corpus is about 350 million words (2.4 Gb data). For FLM creation, we used five linguistic factors: word-form, word lemma, stem, part-of-speech, and morphological tag. We studied several FLMs with two factors (word-form plus one of the other factors) using 2 fixed backoff paths: (1) the first drop was of the most distant word and factor, then of the less distant ones; (2) the first drop was of the words in time-distance order, then of the factors in the same order. We investigated the influence of the factor set and backoff paths on language model perplexity and word error rate (WER). We also created FLMs with some parallel generalized backoff paths. Optimization of the FLM parameters was carried out by means of a genetic algorithm. The FLMs were embedded in the automatic Russian speech recognition system with a very large vocabulary. Experimental results on a continuous Russian speech recognition task showed a relative WER reduction of 8% when the FLM was interpolated with the baseline 3-gram model.
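Two quantities in the abstract above lend themselves to a small worked example: linear interpolation of two language models and relative WER reduction. The sketch below assumes plain linear interpolation with a single weight, which the abstract does not confirm, and the probabilities and WER values are invented; the WER pair was chosen only so that the result echoes the reported 8%.

```python
def interpolate(p_flm, p_ngram, weight):
    """Linear interpolation: weight * P_FLM + (1 - weight) * P_3gram."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("interpolation weight must lie in [0, 1]")
    return weight * p_flm + (1.0 - weight) * p_ngram


def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline WER that was removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical word probabilities from the two models, mixed equally.
p_word = interpolate(p_flm=0.2, p_ngram=0.1, weight=0.5)

# Hypothetical WERs (in percent): dropping from 25.0 to 23.0 is an
# absolute reduction of 2 points but a relative reduction of 8%.
reduction = relative_wer_reduction(25.0, 23.0)
```

The distinction matters when reading the result: a relative reduction of 8% says nothing about the absolute WER, which the abstract does not give.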
Yuri Kiselev, Andrew Krizhanovsky, Pavel Braslavski, Ilya Menshikov, Mikhail Mukhin, Nataly Krizhanovskaya
Russian Lexicographic Landscape: a Tale of 12 Dictionaries
The paper reports on quantitative analysis of 12 Russian dictionaries at three levels: 1) headwords: the size and overlap of word lists, coverage of large corpora, and presence of neologisms; 2) synonyms: overlap of synsets in different dictionaries; 3) definitions: distribution of definition lengths and numbers of senses, as well as textual similarity of same-headword definitions in different dictionaries. The total amount of data in the study is 805,900 dictionary entries, 892,900 definitions, and 84,500 synsets. The study reveals multiple connections and mutual influences between dictionaries, uncovers differences in modern electronic vs. traditional printed resources, as well as suggests directions for development of new and improvement of existing lexical semantic resources.
Kisseleva X. L., Tatevosov S. G.
Notes on the Structure of Circumfixal Verbs in Russian
The paper argues for an analysis that reduces the derivation of non-compositional circumfixal verbs to a fully compositional combination of two pieces of morphology independently attested in Russian: a (resultative) prefix and the reflexive morpheme -sja. Circumfixal verbs are analyzed as involving the following steps of derivation. First, the activity event structure projected by a non-derived stem is augmented by the change-of-state component that turns it into an accomplishment event structure. Secondly, prefixation occurs that introduces the maximal degree of change along a relevant scale, creating a transitive verb. Such a verb, however, is ill-formed, since the change-of-state component has not been licensed via lexical insertion. To rescue the derivation, reflexivization is invoked, and the change-of-state subevent gets licensed through identification of its participant with the clausal subject. A wider theoretical implication of the analysis is that circumfixation, as a primitive type of affixation, is superfluous and is to be abandoned.
Klyachko E.
Using Folksonomy Data for Determining Semantic Similarity
This paper presents a method for measuring semantic similarity. Semantic similarity measures are important for various semantics-oriented natural language processing tasks, such as Textual Entailment or Word Sense Disambiguation. In the paper, a folksonomy graph is used to determine the relatedness of two words. The construction of a folksonomy from a collaborative photo tagging resource is described. The problems which occur during the process are analyzed and solutions are proposed. The structure of the folksonomy is also analyzed. It turns out to be a social network graph. Graph features, such as the path length, or the Jaccard similarity coefficient, are the input parameters for a machine learning classifying algorithm. The comparative importance of the parameters is evaluated. Finally, the method was evaluated in the RUSSE evaluation campaign. The results are lower than most results for distribution-based vector models. However, the model itself is cheaper to build. The failures of the models are analyzed and possible improvements are suggested.
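The two graph features named in this abstract, path length and the Jaccard similarity coefficient, can be sketched on a toy graph. The graph, the tag names, and the function names below are hypothetical illustrations; they do not reproduce the paper's folksonomy or its classifier.

```python
from collections import deque


def jaccard(a, b) -> float:
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def path_length(graph, source, target):
    """Shortest path length between two nodes (BFS); None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical tag graph: each tag maps to the tags it co-occurs with.
TOY_GRAPH = {
    "cat": {"pet", "animal"},
    "dog": {"pet", "animal", "walk"},
    "pet": {"cat", "dog"},
    "animal": {"cat", "dog"},
    "walk": {"dog"},
}

# Feature vector for the word pair ("cat", "dog"), as a classifier might receive it.
features = (
    path_length(TOY_GRAPH, "cat", "dog"),
    jaccard(TOY_GRAPH["cat"], TOY_GRAPH["dog"]),
)
```

Here the Jaccard coefficient is computed over the neighbour sets of the two words, which is one common choice; the paper may define the sets differently.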
Knyazev S. V.
Vowel Reduction as an Indicator of its Stress in Standard Modern Russian
The phonetic unity of the phonological word in Standard Modern Russian is governed to a large extent by phonological rules of vowel realization, which forbid a reduced vowel in the first pretonic syllable (with the exception of some clitic function words: prepositions and particles). The paper presents an instrumental study of phonetic markers of stress in compound (predominantly loan) disyllabic words in Standard Modern Russian (e.g. stop-krán ‘emergency brake’) as compared with non-compound native words of the same phonological structure (stoptát’ ‘tread down’). The paper states that compound disyllabic words under a phrase accent have both syllables stressed, stress being signaled by the mid vowels [o] // [e], which are impossible in unstressed position in Standard Modern Russian. In non-compound native words, the unstressed [o] vowel after “hard” consonants is replaced by [a] in all types of phrase positions (unstressed [a] being about 10 percent longer than stressed [o]). Our data show that in compound disyllabic words in a position with no tonal accent, phonetically unstressed /o/ is realized as the reduced vowel [ə] (not the standard [a]), which is roughly half as long as unstressed [a]. Thus, non-trivial reduction of the [o] vowel in compounds may serve as a phonetic cue of phonological stress, which is fully manifested only under the tonal accent. Phonetically, the units in question should be treated as a combination of two phonological words, and phonetic data may be used as grounds for the orthographic adaptation of loanwords.
Korotaev N. A.
Elementary Discourse Units in Spoken Monologues: Evidence from Communicative Prosody
The paper addresses the issue of spoken discourse segmentation. Using the corpus “Stories of presents and skiing”, I explore the concept of Elementary Discourse Unit (EDU), a chunk of speech flow defined on both prosodic and syntactic grounds. I propose a new procedure to establish EDUs’ boundaries. Compared to previous studies, a communicative perspective is added. I introduce the notion of communicative prosodic constituent, as well as a typology of those. It is based on three oppositions: (i) topic vs. comment; (ii) completion vs. transitional continuity; (iii) main line vs. parenthesis. These oppositions are defined independently of one another and provide for a six-fold classification. Several remarks should be made here. First, comments and (optionally) topics are found not only in statements, but also in other illocutionary types, such as questions, directives, vocatives, and so on. Second, it is sometimes hard to distinguish between comment constituents that express transitional continuity properties and topic constituents. I show that in some cases, this distinction can be made even though the intonation patterns are quite similar. Third, parenthetic constituents may as well have internal topics and comments. Next, EDUs’ boundaries are re-defined as a subset of communicative prosodic constituents’ boundaries. Comment constituents always imply EDUs’ boundaries, while a topic constituent needs syntactic support to do so. Finally, I provide an analysis of communicative structure and EDUs’ boundaries in an excerpt from the corpus.
Kotov A. A., Zinina A. A.
Functional Analysis of Non‑Verbal Communicative Behavior
In this study we present a functional annotation of the Russian Emotional Corpus (REC). The annotation is appended to the regular annotation of eye, eyebrow and hand movements, with supplementary annotation for head and body movements. The annotation records communicative functions, where a movement is intended for a particular goal or can be understood by the addressee as connected to a particular goal/stimulus. We show that a particular function can be expressed by different patterns, utilizing facial expression and/or hand/body movements. Functional annotation is also used as a non-terminal symbol in a generative grammar to produce nonverbal behavioral patterns.
Kreydlin G. E., Khesed L. A.
Human Body in an Oral Dialog: the Corporal Feature “Size of the Somatic Object”
The paper continues a series of works on multimodal oral communication, many of which were published in the Proceedings of previous «Dialogs». The interplay of verbal and nonverbal, mainly corporal, Russian sign codes in everyday communication is explored within the framework of the featured approach. The latter is based on the concept and instruments of the semiotic conceptualization of the human body, i.e. the naive map of how Russians think and talk about the body and its parts, organs, corporal liquids, covers, etc., and how Russians use somatic objects in various types of gestures, postures, sign movements, and other meaningful body units. The core of the semiotic conceptualization comprises several sets, such as those of somatic objects and their natural language names, the sets of corporal features, their values and names, etc. In this paper, we focus primarily on the series of features named «the size of the somatic object» and provide some results of their linguistic and nonverbal semiotic analyses. Two basic kinds of the features discussed, which we call absolute size and relative size, are distinguished, and the meaning and usage of many Russian expressions that reflect absolute and relative sizes are described. Also, some correlations between verbal and nonverbal Russian sign units of size are singled out.
Krivnova O. F.
The Depth of Prosodic Breaks in Spoken Text (Experimental Data)
This paper deals with the problem of prosodic phrasing in a spoken text. The introductory section provides a brief description of the background, clarifies basic terms and explains the concept of prosodic break and word boundary strength. The second section contains a short analysis of the current state of research in this area of phrasal prosody, highlights the main directions of modern fundamental studies and applications, and notes their relevance and the need to expand their empirical base. The third section deals with issues related to the local markers of prosodic phrasing, their hierarchy and phonetic means of realization. Examples are given of prosodic labeling of poetic and prose texts in the original transcription of the famous Russian linguists Scherba and Avanesov, with equivalent transcripts using a quantitative, graduated scale of prosodic indexes similar to the labeling scheme adopted in foreign prosodic studies. Particular attention is paid to the discussion of A. Sanderman’s study, which is the most thorough contemporary analysis of prosodic phrasing. The fourth section describes the aim, material, technique and results of a perceptual and instrumental analysis of the location and depth of prosodic breaks carried out by the author of this paper on Russian material. It is shown that native speakers quite consistently determine the location and depth of prosodic breaks using a 5-point rating scale, but breaks with minimum indexes are clearly opposed to the other types in the probability of their perceptual detection. The correlation of the perceptual evaluation of breaks with pause duration at word boundaries is also investigated. In conclusion, the material, methods and results of the experimental studies discussed in this paper are compared, the current trends in the use of the data are highlighted, and the prospects and challenges for further studies of prosodic phrasing in speech are outlined.
Krylova T. V.
Particles ‘VОТ’ and ‘VОN’: the Mechanisms of Secondary Meanings Formation on the Basis of Deictic Values
This article considers the particles VОТ and VОN. It is another attempt to derive the secondary meanings of VОТ and VОN from their deictic meanings. After analyzing the derived meanings of these particles, we found no parallelism in the structure of their polysemy. We suggest that this is due to differences in the localization of the object in their indexical meaning (‘proximity to the speaker’ vs. ‘remoteness from the speaker’). Next, we attempted to trace how these components are transformed in the secondary meanings of these words. We found that the component ‘proximity to the speaker’, which is included in the sense of VОТ, is transformed into the component ‘proximity to the moment of speech’. The latter passes through the following stages of transformation (each more metaphorical than the previous one): ‘proximity to the moment of speech’ → ‘temporal proximity of the events’ → ‘interdependency of events’.
Kudinov M., Piontkovskaya I.
Automatic Update of the Named Entities Database Based on the Users Queries
We describe an algorithm for updating a database of named entities that supports voice commands on a device. The update is performed automatically, with no human assistance, by analyzing the query logs of the dialogue system. The logs consisted of responses of the automatic speech recognition engine and thus contained erroneous recognitions. The search for such mistakes is also part of our method. The problem of named entity extraction was solved by means of an algorithm based on entropy and mutual information statistics. Recognition mistakes were detected by means of a novel data-driven probabilistic approach that takes into account grapheme substitution statistics in the data. Treating the grapheme alignment as hidden, we use the EM algorithm to train the model. As a result, we obtain a statistical model capable of assessing sequence similarity. The algorithm based on our similarity score performs better in terms of F1-measure than one using the classical Levenshtein distance.
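The baseline mentioned at the end of this abstract, the classical Levenshtein distance, can be sketched as follows. This illustrates only the baseline, not the paper's EM-trained similarity model; the length-normalized score is an added assumption for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classical edit distance with unit costs for insert, delete, substitute."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Hypothetical similarity in [0, 1]: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Unit costs treat every grapheme substitution as equally likely, which is the limitation that substitution statistics learned from data are meant to address.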
Kustova G. I.
Abstract Lexemes’ Valencies: Reduction vs. Specification
Abstract vocabulary of different semantic classes (interpretation: nepriyatnost’ (‘nuisance’, ‘trouble’, ‘annoyance’), promakh (‘a piece of carelessness’), razlad (‘discord’); event: referendum, soveshchanie (‘meeting’), turnir (‘tournament’); activities: mery (‘measures’) / rabota (‘work’) / usiliya (‘efforts’) [po uregulirovaniyu (‘on the settlement’)], et al.) is considered by analogy with speech and mental lexemes. The latter lexemes have a valency on content (usually expressed by a subordinate clause) and a valency on topic (usually expressed by the prepositional phrase o X ‘about X’). The valency of abstract lexemes on content / topic may also be expressed by the prepositional phrase po X [Dat.] ‘on X’: mekhanizm po privlecheniyu klientov ‘mechanism to attract customers’, negativnaya tendentsiya po ukhudsheniyu portfel’a (‘negative trend for the deterioration of the portfolio’), shagi po osushchestvleniyu mandata (‘steps to implement the mandate’). Valency on topic is both a reduction and a specification of the content of the situation.
Kuzmenko E. A., Mustakimova E. G.
Automatic Disambiguation in the Corpora of Modern Greek and Yiddish
The problem of morphological ambiguity is widely addressed in modern NLP. Ambiguity is mostly resolved with the use of large manually annotated corpora and machine learning. However, such methods are not always available, as good training data is not accessible for all languages. In this paper we present a method of disambiguation without gold standard corpora using several statistical models, namely, the Brill algorithm (Brill 1995) and unambiguous n-grams from the automatically annotated corpus. All the methods were tested on the Corpus of Modern Greek and on the Corpus of Modern Yiddish. As a result, more than half of the words with ambiguous analyses were disambiguated in both corpora, with high precision (>80%). Our method of morphological disambiguation demonstrates that it is possible to eliminate some of the ambiguous analyses in the corpus without specific linguistic resources, using only raw data in which all possible morphological analyses for every word are indicated.
Kutuzov A., Andreev I.
Texts in, Meaning out: Neural Language Models in Semantic Similarity Tasks for Russian
Distributed vector representations for natural language vocabulary get a lot of attention in contemporary computational linguistics. This paper summarizes the experience of applying neural network language models to the task of calculating semantic similarity for Russian. The experiments were performed in the course of the Russian Semantic Similarity Evaluation track, where our models took from 2nd to 5th position, depending on the task. We introduce the tools and corpora used, comment on the nature of the evaluation track and describe the achieved results. We found that Continuous Skip-gram and Continuous Bag-of-words models, previously successfully applied to English material, can be used for semantic modeling of Russian as well. Moreover, we show that texts in the Russian National Corpus (RNC) provide excellent training material for such models, outperforming other, much larger corpora. This is especially true for semantic relatedness tasks (although stacking models trained on larger corpora on top of RNC models improves performance even more). High-quality semantic vectors learned in this way can be used in a variety of linguistic tasks and promise an exciting field for further study.
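Semantic similarity between two word vectors is commonly scored by the cosine of the angle between them. The sketch below assumes that measure and uses made-up three-dimensional vectors; neither the vectors nor the function name come from the paper, and real Skip-gram or CBOW vectors have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "river": [0.9, 0.1, 0.0],
    "stream": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}
```

With these toy values, `cosine_similarity(toy_vectors["river"], toy_vectors["stream"])` is higher than the score for "river" and "invoice", which is the ordering a similarity benchmark would check.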
Lagutin M. B., Katinskaya A. Y., Selegey V. P., Sharoff S., Sorokin A. A.
Automatic Classification of Web Texts Using Functional Text Dimensions
The work addresses automatic genre classification of Web texts. We show that functional text dimensions can be used for this task, with their stable combinations (clusters) corresponding to genres. Based on a gold standard corpus, we construct a list of such genres. We also show that the values of functional dimensions can be automatically extracted from language features. In the conclusion, we discuss the application of our results to automatic annotation of large Web corpora.
Lobanov B. M.
An Experience of Creating Melodic Portraits of Complex Declarative Sentences of Russian
We proceed from the model of intonation patterns (IP) by Elena Bryzgunova, widely used in teaching Russian speech intonation. Bryzgunova distinguishes seven major Russian intonation patterns, named IP 1 to IP 7, of which only IP 1 is clearly used in declarative sentences to mark their completeness. The remaining six IPs are used for interrogative (IP 2 to IP 4) or exclamatory (IP 5 to IP 7) types of sentences. Declarative sentences predominate in professional and literary texts, particularly in professionally voiced texts of various genres (audio books). Most of them are not simple sentences and often consist of a mixture of complex and compound sentences. The present study continues the author's paper “Universal melodic intonation portraits of Russian speech”, presented at the Dialogue 2014 conference, which introduced the concept of the Universal Melodic Portrait (UMP). The present paper experimentally studies the intonation features of declarative sentences. It describes the results of auditory analysis and IP interpretation for declarative sentences of varying degrees of complexity, voiced by 3 speakers, and provides experimental representations of their intonation structures in the form of a sequence of UMPs. The paper is organized as follows. Section 1 describes the experimental procedure, including the characteristics of the selected text and audio material, the listening method and the method of constructing a sequence of UMPs of audio recordings. Section 2 presents the experimental results: the graphical representation of an experimental sequence of UMPs of the analyzed audio recordings. Section 3 offers an interpretation of the results.
Lopukhin K. A., Lopukhina А. A., Nosyrev G. V.
The Impact of Different Vector Space Models and Supplementary Techniques on Russian Semantic Similarity Task
This paper presents a system for determining semantic similarity between words that was an entry for the Dialog 2015 Russian semantic similarity competition. The system is primarily based on word vector models, supplemented with various other methods, both corpus- and dictionary-based. In this paper we compare the performance of two methods for building word vectors (word2vec and GloVe), evaluate how performance varies with corpus size and preprocessing techniques, and measure accuracy gains from supplementary methods. We compare system performance on word relatedness and word association tasks, and it turns out that different methods have varying relative importance for these tasks.
Loukachevitch N. V., Blinov P. D., Kotelnikov E. V., Rubtsova Y. V., Ivanov V. V., Tutubalina E.
SentiRuEval: Testing Object‑oriented Sentiment Analysis Systems in Russian
The paper describes the data, rules and results of SentiRuEval, an evaluation of Russian object-oriented sentiment analysis systems. Two tasks were proposed to participants. The first task was aspect-oriented analysis of reviews about restaurants and automobiles; that is, the primary goal was to find words and expressions indicating important characteristics of an entity (aspect terms) and then classify them into polarity classes and aspect categories. The second task was the reputation-oriented analysis of tweets concerning banks and telecommunications companies. The goal of this analysis was to classify tweets depending on their influence on the reputation of the mentioned company. Such tweets could express the user’s opinion or a positive or negative fact about the organization.
Olga Lyashevskaya, Egor Kashkin
Inducing Verb Classes from Frames in Russian: Morpho‑Syntax and Semantic Roles
The paper presents clustering experiments on Russian verbs based on the statistical data drawn from the Russian FrameBank. While lexicology has essentially abandoned the idea of syntactic transformations as the primary basis for grouping verbs into semantic classes (Apresjan 1967, Levin 1993), the hypothesis of the same lexical and syntactic distributional profiles underlying lexical clusters is still attractive. In computational linguistics, some attempts have been made to obtain verb classes for English, German and other languages using observable morpho-syntactic and lexical properties of context (Dorr and Jones 1996; Lapata 1999; Schulte im Walde 2006; Lenci 2014, among others). Our experiments on semantic classification of Russian verbs are based on two types of tags embedded in the annotation of argument constructions: a) semantic roles and b) morpho-syntactic patterns. The domain of speech verbs is classified automatically on vectors, and the resulting clusters are contrasted against the semantic classes of Babenko (2007) and three other manual classifications. The classes within the domain of possessive verbs are constructed using rule-based solutions and evaluated against Berkeley FrameNet verb clusters. We conclude that clustering on morpho-syntactic (purely formal) patterns loses the race to more intelligent approaches which take into account semantic roles.
Mayorov V., Andrianov I., Astrakhantsev N., Avanesov V., Kozlov I., Turdakov D.
A High Precision Method for Aspect Extraction in Russian
This paper presents work carried out by ISPRAS on the aspect extraction task at SentiRuEval 2015. Our team submitted one run for Task A and Task B and achieved the best precision among all participants for both tasks in all domains. Our method also showed the best F1-measure for exact aspect term matching in Task A for the automobile domain and in both Task A and Task B for the restaurant domain. The method is based on sequential classification of tokens with SVM. It uses local, global, syntax-based, GloVe, topic modeling and automatic term recognition features. In this paper we also evaluate the significance of different feature groups for the task.
Malafeev A. Yu.
Exercise Maker: Automatic Language Exercise Generation
Current trends in education, namely blended learning and computer-assisted language learning, underlie the growing interest in the task of automatically generating language exercises. Such automatic systems are especially in demand given the variability in language learning. Despite the abundance of resources for language learning, there is often a lack of specific exercises targeting a particular group of learners or ESP course. This paper gives an overview of a computer system called Exercise Maker that is aimed at flexible and versatile language exercise generation. The system supports seven exercise types, which can be generated from arbitrary passages written in English. Being able to tailor educational material to learners’ interests is known to boost motivation in learners (Heilman et al., 2010). An important feature of the system is the automatic ranking of the source passages according to their complexity/readability. As shown by expert evaluation, the automatically generated exercises are of high quality: the gap precision is about 97–98%, while the overall exercise acceptance rate varies from 90% to 97.5%. Exercise Maker is freely available for educational and research purposes.
Mustajoki A., Vepreva I. T.
Metalinguistic Portrait of Fashionable Words
The paper is based on a corpus study of the essential characteristics of fashionable words. The basic units were identified by retrieving utterances containing the metaoperators modnoe slovo (‘fashionable or trendy word’) and kak modno govorit’ (‘as it is fashionable to say’). The theoretical model of fashion worked out by A. B. Gofman served as the basis for the interpretation of the findings. The work distinguishes context markers that manifest the attributive characteristics of the fashionable object: modernity, universality, demonstrativeness and play. Based on the corpus of metalinguistic evaluations, three classes of fashionable words were distinguished, associated in everyday consciousness with two attributive fashionable values, modernity and universality. These include 1) new words naming a new reality; 2) words that give a new name to a known reality; 3) the only class of words meeting all the requirements of fashion, realizing play activity or the speaker’s aesthetic need to renew his speech. The demonstrativeness markers are distinguished by context indicators of the unusualness of the fashionable word form. In the first instance, foreign neologisms demonstrate external attractiveness. The class of fashionable lexical items can be presented as a field structure with a nucleus, having all the required value criteria of a fashionable object, and peripheral layers of various degrees, depending on the number of criteria they meet.
Muzychka S., Piontkovskaya I.
Graph-Based Approach in the Dependency Parsing Task for Russian Language
Dependency parsing is one of the key components in a large number of tasks of automatic processing of natural language texts. Effective dependency tree construction can be applied to a wide variety of machine translation systems, automatic speech synthesis and recognition, and so forth. The graph-based approach to dependency parsing has proved to be efficient for morphologically rich languages due to its ability to deal with non-projective dependency trees and flexible word order. Graph-based methods usually make it possible to perform probabilistic analysis over a distribution on the set of syntax trees. In some NLP tasks, a full syntactic parse (in particular, labels on the edges of the tree) is not required: it is enough to find a parent for a given token. In this case, the graph-based approach is more appropriate, because the likelihood that one token is an ancestor of another can be calculated by an explicit formula. We consider the task of automatic syntax tree construction with application to the Russian language corpus SynTagRus. We propose a novel technique which reduces the time costs of training and does not affect the resulting accuracy. Experiments show that our algorithm outperforms existing analogues on SynTagRus in the UAS (unlabeled attachment score) measure, i.e. the percentage of correctly identified unlabeled dependencies.
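The UAS measure used for evaluation can be computed directly from head indices. A minimal sketch, assuming each sentence is represented as a list in which position i holds the index of the head of token i (0 for the root); the representation and the function name are illustrative assumptions.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads) -> float:
    """Share of tokens whose predicted head equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold_heads:
        raise ValueError("cannot score an empty sentence")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)


# Four tokens; the parser attaches the last token to the wrong head.
gold = [2, 0, 2, 3]
predicted = [2, 0, 2, 2]
score = unlabeled_attachment_score(gold, predicted)  # 3 of 4 heads correct
```

Edge labels are ignored entirely, which is why the measure suits the parent-finding setting described in the abstract.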
Nedoluzhko A., Toldova S., Novák M.
Coreference Chains in Czech, English and Russian: Preliminary Findings
This paper is a pilot comparative study on coreference chaining in three languages, namely, Czech, English and Russian. We have analyzed 16 parallel English-Czech newspaper texts and 16 texts in Russian (similar to the English-Czech ones in length and topics). Our motivation was to find out what the linguistic structure of coreference chains in different languages is and what types of distinctions we should take into account for advancing the development of systems for coreference resolution. Taking into account theoretical approaches to the phenomenon of coreference we based our research on the following assumption: the recognition of coreference links for different structural types of noun phrases is regulated by different language mechanisms. The other starting point was that different languages allow pronominal chaining of different length and that coreference chains properties differ for the languages with different strategies for zero anaphora and different systems for definiteness marking. This work reports our first findings within the task of the structural NP types’ distribution comparison in three languages under analysis.
Nikolaeva Y. V., Fedorova O. V., Kibrik A. A.
Discourse Structure: a Perspective from Multimodal Linguistics
This paper is a step towards multimodal linguistics, considering the verbal form of spoken discourse along with prosodic and gestural phenomena, involved in the process of spoken communication. It is well established that spoken discourse is structured with the help of prosodic features. The basic segment of talk is elementary discourse unit (EDU), defined on the basis of a set of prosodic criteria and correlated with the semantico-syntactic unit known as clause. A hierarchically more complex unit is sentence. Sentence boundaries are also identified by prosodic features. Illustrative gestures can signal EDU combination into sentences, too. This is performed by gesture assimilation in formal (location of gesture, hand configuration, trajectory, and direction of movement) and content-related (referents, place and time of event) characteristics. One kind of gesture assimilation, catchment, correlates with spoken sentence, whereas the other kind, gestural inertia, with a higher level unit, namely episode. We thus observe partial correlation between the components of multimodal discourse.
Paducheva E. V.
Verbs Byt’ and Byvat’: Contemporary State and History
The verb BYVAT’ ‘to be’ (formed from byt’ with the iterative suffix -yva-) belongs to the class of verbs of the iterative Aktionsart, which includes such verbs as xazhivat’, slyxivat’, related to the imperfective xodit’, slyshat’. But BYVAT’ occupies a special place in that class. In particular, it is the only one to have an analytical future tense form. It is claimed that this form exists only in the context of the motional meaning of BYVAT’, cf. the acceptable Ja budu byvat’ u vas chashche ‘I shall be BYVAT’ at your place more often’ but *So vremenem ne budet byvat’ takix sluchaev ‘over time there won’t be BYVAT’ in such cases’. The following explanation is offered for this fact: motional BYVAT’ is included in the aspectual system of Russian not as an iterative of the imperfective byt’ but as an imperfective of the momentary perfective pobyvat’. No wonder that motional BYVAT’ possesses all the properties of an imperfective of a momentary verb: a complete set of tense forms, iterative and general factual meanings of aspect, etc. The relationship between the verb BYVAT’ and the verb byt’ is considered; the latter was earlier shown to be perfective in some of its uses and, thus, bi-aspectual. Attention is drawn to the fact that BYVAT’ is often used as an expressive correlate of byt’.
Panchenko A., Loukachevitch N. V., Ustalov D., Paperno D., Meyer C. M., Konstantinova N.
RUSSE: The First Workshop on Russian Semantic Similarity
The paper gives an overview of the Russian Semantic Similarity Evaluation (RUSSE) shared task held in conjunction with the Dialogue 2015 conference. There are many comparative studies on semantic similarity, yet no analysis of such measures has ever been performed for the Russian language. Exploring this problem for Russian is even more interesting, because this language has features, such as rich morphology and free word order, which make it significantly different from English, German, and other well-studied languages. We attempt to bridge this gap by proposing a shared task on the semantic similarity of Russian nouns. Our key contribution is an evaluation methodology based on four novel benchmark datasets for the Russian language. Our analysis of the 105 submissions from 19 teams reveals that successful approaches for English, such as distributional and skip-gram models, are directly applicable to Russian as well. On the one hand, the best results in the contest were obtained by sophisticated supervised models that combine evidence from different sources. On the other hand, completely unsupervised approaches, such as a skip-gram model estimated on a large-scale corpus, were able to score among the top 5 systems.
Piperski A. Ch.
To Be or not to Be: Corpora as Indicators of (Non-)Existence
This paper discusses the notions of acceptability, occurrence, grammaticality and existence, and focuses on the relationship between corpus linguistics and the question of the existence of lexical items. Since corpora are almost exclusively samples from larger populations, it is claimed that they cannot provide evidence for non-existence of words, collocations or constructions. This is because the upper limit of a confidence interval for frequency based on a sample is always greater than zero regardless of the sample frequency. The rule of thumb goes as follows: anything that does not occur in a corpus might have occurred in a similar same-sized corpus zero to five times. If an item occurs in a corpus, this fact can serve as a proof of its existence in the language, but the final decision depends on whether the relevant contexts from the corpus are judged representative of the language variety of interest. In conclusion, I claim that a corpus-based study cannot prove the non-existence of a linguistic item, although it can be used to prove its existence. However, the latter type of proof includes assessing the representativeness of a corpus, which might lead to subjectivity and value judgments.
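The claim that the upper confidence limit stays above zero even for an unattested item can be illustrated with a small worked calculation. The sketch assumes that corpus counts follow a Poisson distribution and uses a one-sided limit; both choices are assumptions made here for illustration, and the paper's "zero to five" rule of thumb may rest on a different calculation.

```python
import math


def upper_limit_for_zero_count(alpha: float = 0.05) -> float:
    """One-sided upper confidence limit on the expected count when 0 were observed.

    Under a Poisson model, P(X = 0 | rate) = exp(-rate). The limit is the rate
    at which observing zero occurrences has probability alpha, so it equals
    -ln(alpha).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return -math.log(alpha)


limit_95 = upper_limit_for_zero_count(0.05)  # about 3 expected occurrences
limit_99 = upper_limit_for_zero_count(0.01)  # about 4.6 expected occurrences
```

Under these assumptions, an item with an expected frequency of about three occurrences per corpus of this size would still be absent from roughly one sample in twenty, so absence alone cannot establish non-existence.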
Podlesskaya V. I.
“I ne drug, i ne vrag, a tak…”: Distribution and Prosody of Discourse Markers that Signal Irrelevance (Evidence from the Multimodal Subcorpus of the Russian National Corpus)
The paper focuses on three Russian discourse markers, tak, prosto and prosto tak, which fall under a broad category of what is called “loose uses” of language or “vague reference”. These are lexical, grammatical and prosodic resources that allow the speaker to refer to objects and events for which the speaker fails to retrieve the exact name, or simply finds the exact name to be unnecessary or inappropriate. The examined discourse markers are employed to signal that the actual state of affairs is less relevant than another (overtly mentioned or implied) one. The three markers are shown to be associated with different information, syntactic and prosodic structures (e.g. pitch movements). The qualitative and quantitative analysis provided is based on data from the multimodal subcorpus of the Russian National Corpus.
Polyakov P. Yu., Kalinina M. V., Pleshko V. V.
Automatic Object-oriented Sentiment Analysis by Means of Semantic Templates and Sentiment Lexicon Dictionaries
This paper studies the use of a linguistics-based approach to automatic object-oriented sentiment analysis. The original task was to extract users’ opinions (positive, negative, neutral) about telecom companies, expressed in tweets and news. We excluded news from the dataset because we believe that formal texts significantly differ from informal ones in structure and vocabulary and therefore demand a different approach. We confined ourselves to the linguistic approach based on syntactic and semantic analysis. In this approach a sentiment-bearing word or expression is linked to its target object at either of two stages, which are performed successively. The first stage uses semantic templates matched against the dependency tree, and the second stage involves heuristics for linking sentiment expressions and their target objects when syntactic relations between them do not exist. No machine learning was used. The method showed very high quality, roughly on par with the best results of machine learning methods and hybrid approaches (which combine machine learning with elements of syntactic analysis).
Ponomarev S. V.
Learning by Analogy in a Hybrid Ontological Network
This article describes the general principles of a question-answering (QA) system that produces answers to questions by analogy with the questions and answers in its training sets. As a knowledge base, the system uses ontological information about words and expressions from open-access sources, together with statistical information collected by processing large text corpora. The knowledge base is represented as a hybrid ontological network: a directed graph whose vertices are words and expressions and whose edges are the links between them. Each link between two words or expressions is directed, typed and weighted. The link type characterizes the information source from which the link and its type were extracted (for example, synonym from Wiktionary). The link weight is determined by the reliability of the information source. All links obtained from dictionaries and ontological bases have a weight equal to one. Links collected by processing text corpora have a weight equal to the frequency of the relevant agreed bigrams (for example, an adjective + noun bigram). The hybrid ontological network is characterized by a large number of links between its vertices. Besides direct links connecting two particular vertices, composite links that pass through intermediate vertices can be used, which drastically increases the number of possible paths between vertices. The article presents a training algorithm that establishes links between words and items in the hybrid ontological network in terms of combinations of weighted paths between network vertices.
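The network described in this abstract can be pictured with a minimal sketch of a directed, typed, weighted graph in which composite links are enumerated as paths through intermediate vertices. The class and method names, the toy words, and the choice to multiply link weights along a path are illustrative assumptions; the abstract does not say how path weights are combined.

```python
from collections import defaultdict
from typing import NamedTuple

class Link(NamedTuple):
    target: str
    link_type: str  # information source, e.g. "synonym:dictionary"
    weight: float   # 1.0 for dictionary links, bigram frequency for corpus links

class HybridNetwork:
    """Hypothetical container for a directed, typed, weighted word graph."""

    def __init__(self):
        self.out = defaultdict(list)

    def add_link(self, source, target, link_type, weight=1.0):
        self.out[source].append(Link(target, link_type, weight))

    def paths(self, start, goal, max_len=3):
        """Yield (vertices, weight) for each simple path of at most max_len links.
        Path weight is the product of link weights (an assumption)."""
        stack = [([start], 1.0)]
        while stack:
            verts, weight = stack.pop()
            if verts[-1] == goal and len(verts) > 1:
                yield verts, weight
                continue
            if len(verts) - 1 == max_len:
                continue
            for link in self.out[verts[-1]]:
                if link.target not in verts:  # keep paths simple (no cycles)
                    stack.append((verts + [link.target], weight * link.weight))

net = HybridNetwork()
net.add_link("fast", "car", "bigram:corpus", 12.0)         # toy frequency
net.add_link("fast", "automobile", "bigram:corpus", 3.0)   # toy frequency
net.add_link("car", "automobile", "synonym:dictionary", 1.0)
results = sorted(net.paths("fast", "automobile"))
for verts, weight in results:
    print(" -> ".join(verts), weight)
```

In this toy network the pair is connected by one direct link and one composite link through an intermediate vertex, which is the kind of path multiplicity the abstract refers to.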
Protopopova E., Antonova A., Misyurev A.
Acquiring Relevant Context Examples for a Translation Dictionary
This paper addresses the problem of automatic acquisition of parallel context examples for a translation dictionary. We extract them automatically from a parallel corpus, relying on word alignments and parse trees. The ranking of the extracted examples is an essential problem, since we need to select the most distinctive and informative contexts. We propose a machine learning approach as an alternative to simple ranking criteria, such as frequency or mutual information. We analyze common sources of inadequate context examples and design a set of features that can help distinguish the bad examples from the good ones. We also experiment with vector models (word2vec) in order to get features that are sensitive to semantics. The evaluation results show that the best of our ranking methods yields a 31% improvement in accuracy compared to ranking by frequency, and a 20% improvement over ranking by mutual information. Using vector models also improves the classification performance.
Shelmanov A. O., Smirnov I. V., Vishneva E. A.
Information Extraction from Clinical Texts in Russian
We present and evaluate a pipeline for processing clinical notes in Russian. The paper addresses the tasks of drug identification and disease template filling, which are related to entity recognition and relation extraction. Disease template filling consists of recognizing disease mentions in text, mapping them to concepts of a thesaurus, and discovering their attributes. Discovering attributes means identifying the corresponding spans in text, linking them to diseases, and normalizing them, i.e. determining their generalized meaning from a predefined set. We implemented tools for determining the following attributes of disease mentions: negation; the flag indicating that the disease mention is not related to a patient; severity; course; and body site. For different tasks, we used different techniques: rule-based patterns and several supervised machine-learning methods. Since there were no annotated corpora of clinical notes in the Russian language available for research purposes, we annotated a dataset, which we used for training and evaluation of the developed tools. The created corpus is available to researchers under a data use agreement.
Sheremetyeva S. O.
On Summarization Supporting Readability and Translatability
The article describes a methodology for developing an interactive computer system that supports single-document text-to-text summarization. The focus is on high readability and translatability of the generated summary, which in turn facilitates further human or automatic processing of the summary text, translation being the most important. Decisions on content selection are delegated to a human but are largely supported by the system. High readability and translatability of the generated text are provided by controlling the syntax of the nascent summary. The approach is a combination of empirical and rational NLP techniques and incorporates a language-independent algorithm and a language-dependent knowledge base. The validity of the approach was proved by its implementation in a summarizer for scientific papers in the domain of mathematical modelling in the Russian language. The summarizer is fully operational. The methodology presented in this paper is highly portable and allows for extending the summarizer to other domains and languages.
Shmelev A. D.
Russian Language-specific Lexical Units in Parallel Corpora: Prospects of Investigation and “Pitfalls”
The paper deals with language-specific lexical units as they appear in parallel corpora and the degrees of linguistic specificity. It discusses new insights into the languages compared that parallel corpora can provide as well as various pitfalls on the way to an accurate account of typological and cultural differences and similarities. In particular, it deals with Russian language-specific words, which defy translation into other languages. On the other hand, Russian language-specific words quite often appear in translations into Russian even though no exact equivalent exists in the language of the original text; special attention is given to particles. The lack of such a particle where the communicative situation calls for it every so often gives the impression that we deal with a word-for-word translation of the original text containing no similar marker. In the absence of the relevant particle, the wrong implicatures may appear, or the text may cease to have coherence, or the utterance is perceived as a manifestation of arch use of language.
Tarasov D. S.
Natural Language Generation, Paraphrasing and Summarization of User Reviews with Recurrent Neural Networks
Multi-document summarization and sentence generation are important challenges in natural language processing. This paper presents a recurrent neural network (RNN) architecture capable of producing abstractive document summaries, as well as generating novel paraphrases of input sentences in the same language. We demonstrate a practical application of our system to the task of summarizing multiple consumer reviews.
Tarasov D. S.
Deep Recurrent Neural Networks for Multiple Language Aspect-based Sentiment Analysis of User Reviews
Deep Recurrent Neural Networks (RNNs) are powerful sequence models applicable to modeling natural language. In this work we study the applicability of different RNN architectures, including uni- and bi-directional Elman and Long Short-Term Memory (LSTM) models, to aspect-based sentiment analysis, which includes the tasks of aspect term extraction and aspect term sentiment polarity prediction. We show that a single RNN architecture without manual feature engineering can be trained to do all these subtasks on English and Russian datasets. For the aspect term extraction subtask our system outperforms strong Conditional Random Fields (CRF) baselines and obtains state-of-the-art performance on the Russian dataset. For aspect term polarity prediction our results are below top-performing systems but still good enough for many practical applications.
Tutubalina E. V., Zagulova M. A., Ivanov V. V., Malykh V. A.
A Supervised Approach for SentiRuEval Task on Sentiment Analysis of Tweets about Telecom and Financial Companies
This paper describes a supervised approach to the task of sentiment analysis of tweets about banks and telecom operators. The task was articulated as a separate track in the Sentiment Evaluation for Russian (SentiRuEval-2015) initiative. The approach we proposed and evaluated is based on a Support Vector Machine model that classifies sentiment polarities of tweets. The set of features includes term frequency features, Twitter-specific features and lexicon-based features. Given a domain, two types of sentiment lexicons were generated for feature extraction: (i) manually created lexicons, constructed from Pros and Cons reviews; (ii) automatically generated lexicons, based on pointwise mutual information between unigrams in a training set. In the paper we provide the results of our method and compare them to the results of other teams that participated in the track. We achieved a macro-averaged F-measure of 35.2% for banks and 44.77% for tweets about telecom operators. The method described in the paper is ranked second and fourth among 7 and 9 teams, respectively. The best SVM setting after tuning the parameters of the classifier and an error analysis with common types of errors are also presented in this paper.
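A lexicon of type (ii) could be built along the following lines. The abstract does not give the exact formula, so this minimal sketch assumes one common variant: each unigram is scored by the difference between its pointwise mutual information with the positive class and with the negative class, with add-one smoothing. The function name and the toy English tokens are hypothetical.

```python
import math
from collections import Counter

def pmi_lexicon(tweets):
    """tweets: iterable of (tokens, label) pairs with label "pos" or "neg".
    Returns {word: PMI(word, pos) - PMI(word, neg)} in bits."""
    word_label = Counter()    # (word, label) -> token count
    label_tokens = Counter()  # label -> total token count
    for tokens, label in tweets:
        for tok in tokens:
            word_label[tok, label] += 1
            label_tokens[label] += 1
    vocab = {word for word, _ in word_label}
    size = len(vocab)
    scores = {}
    for word in vocab:
        # The PMI difference reduces to log2(P(word | pos) / P(word | neg));
        # add-one smoothing keeps the ratio finite for unseen pairs.
        p_pos = (word_label[word, "pos"] + 1) / (label_tokens["pos"] + size)
        p_neg = (word_label[word, "neg"] + 1) / (label_tokens["neg"] + size)
        scores[word] = math.log2(p_pos / p_neg)
    return scores

# Toy training set (assumption, not data from the paper).
train = [(["great", "service"], "pos"), (["terrible", "service"], "neg")]
lexicon = pmi_lexicon(train)
for word in sorted(lexicon, key=lexicon.get, reverse=True):
    print(word, round(lexicon[word], 3))
```

Words with positive scores lean towards the positive class and words with negative scores towards the negative class; scores of this kind can then be fed to a classifier as lexicon-based features.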
Uryson E. V.
Conjunctions I ‘And’ vs. No/A ‘But’ Between Two Coordinate Clauses (Refinement of The Terms “Expectation” and “Norm”)
The paper deals with the Russian coordinating conjunctions i ‘and’ vs. no/a ‘but’ in a compound sentence. It is common knowledge that in a sentence like “Q, i P” the conjunction i ‘and’ marks correspondence to a certain “norm”, while in a sentence like “Q no/a P” an adversative conjunction ‘but’ marks a discrepancy between the “norm” and the expressed state of affairs. The problem is that in some cases both i ‘and’ and no/a ‘but’ can be used. Another problem is that in some cases the usage of these conjunctions can hardly be interpreted in terms of “norm”. I demonstrate that the relevant facts can be adequately described if the basic concept of semantic interpretation is “expectation” rather than “norm”. Expectation can be induced by (a) common knowledge of laws of nature; (b) common notions of human life, social relations, etc.; (c) text grammar. The conjunction i ‘and’ marks correspondence to the expectation, and the conjunctions no/a ‘but’ mark the cancelled expectation in all cases. However, the cancelled expectation is obligatorily marked in case (a), but not in case (b). As for case (c), text grammar induces just two expectations: (c1) P has the same general “microtopic” as Q; (c2) P has the same object in the focus as Q. The propositions P and Q can have the same general “microtopic”, but different objects in the focus. It is the speaker who chooses the strategy for marking each case.
Ustalov D. A.
Russian Thesauri as Linked Open Data
Open linguistic data is a welcome, recently established trend that allows both researchers and developers in the field of natural language processing to create their own applications using high-quality dictionaries, thesauri, corpora, etc. At the same time, the published open data are stored in different formats, which makes them difficult to use efficiently without falling into vendor lock-in. This paper is devoted to the problem of representing popular lexical resources of the Russian language in the form of Linked Open Data. It summarizes the recent work in the field of thesauri representation formats and approaches to converting such formats to those of Linked Data. It also proposes an approach to converting popular Russian thesauri to the vocabularies that are the essential parts of the Linguistic Linked Open Data Cloud. The proposed approach has been implemented in open source software, and the resulting dataset has been made publicly available on NLPub in the Turtle format under the terms of a Creative Commons license.
Vasilyev V. G., Denisenko A. A., Solovyev D. A.
Aspect Extraction and Twitter Sentiment Classification by Fragment Rules
The paper deals with approaches, based on fragment rules, to explicit aspect extraction from user reviews of restaurants and to sentiment classification of Twitter messages about telecommunication companies. It presents a fragment rule model for sentiment classification and explicit aspect extraction. Rules may be constructed manually by experts or automatically by machine learning procedures. We propose a machine learning algorithm for sentiment classification that uses terms produced by fragment rules, and several rule-based techniques for explicit aspect extraction, including a method based on filtration rule generation. The article presents the results of experiments on test sets for Twitter sentiment classification of telecommunication companies and explicit aspect extraction from user reviews of restaurants. The paper compares the proposed algorithms with the baseline and with the best algorithm in the track. Training sets, evaluation metrics and experiments follow SentiRuEval. Directions for future work include applying semi-supervised methods for rule generation to reduce labor costs, using active learning methods, and constructing a visualization system for rule generation that supports interaction with experts.
Vilinbakhova E. L.
Article Means Article: on One Pattern of Tautologies in Russian
The study, based on Internet and corpus data [RNC], deals with a special pattern of tautological constructions in Russian called metalinguistic tautologies. The notion was briefly introduced in 1996 by E. Miki as a label for a set of quite heterogeneous examples, but was not further developed. While other tautologies describe entities in the real world, metalinguistic tautologies refer to the use of a linguistic expression. Such constructions show that the speaker is employing a word or an expression in its common, straight meaning. Therefore, they are most often used when context allows other possible interpretations of the linguistic expressions (such as euphemisation, irony, or hyperbole), and sometimes such alternatives are explicitly spelled out: Inexpensive means inexpensive, not poor quality. Metalinguistic tautologies are established in Russian with patterns X znachit X, X oznachaet X ‘X means X’, X eto X ‘X is X’, and are distinguished from homonymous constructions by their semantic and pragmatic features.
Yakovleva I. V.
The Metaphor of Caused Motion in the Air in the Target Domain of Speech Act Verbs in Russian from a Contrastive Perspective
The study is devoted to the metaphor of caused motion in the air in the semantic field of speech act verbs in Russian from a contrastive perspective. This semantic shift is typical for verbs of contactless motion implying contact only in the initial phase, like brosat’ (‘to throw’). In Russian, verbs implying an action performed with the hands demonstrate an outstanding capability of accomplishing the semantic shift under study. Another source domain, which includes predicates implying the ejection of a certain object through a kind of hole, mainly a mouth, is not very productive in Russian and is restricted by some negative connotations. The verbs implying emission of a kind of substance with the whole surface do not normally accomplish this type of semantic shift in Russian, and we distinguish this hypothetical source domain only when we study the Russian data against the typological background. The semantic shift under study is subject to the aspectual restrictions brought about by the co-occurrence of the basic speech act verbs govorit’ (‘to speak’) and skazat’ (‘to say’).
Yanko T. E.
Contrastive Analysis of Prosody: Odessa Regional Russian vs. Standard Russian
There is a considerable body of literature on the segmental phonetic, syntactic, lexical, and stylistic parameters of the Odessa accent. However, the prosodic peculiarities of spoken Odessa Russian are seriously underestimated, particularly if we take into consideration the role the Odessa language plays in Russian culture. This paper is aimed at contrasting Odessa prosody with the prosody of the standard Russian spoken language. The Russian standard prosody, as it is viewed here, is the system corresponding to the inventory of intonational constructions recognized by E. A. Bryzgunova, including their functions in the standard Russian spoken discourse. Prosody is thus referred to not only as a system of distinctive features of the spoken language but also as the basic means of manifesting the communicative meanings: the illocutionary force, the contrast, the discourse continuity. Prosody is analyzed at the level of distinctive features of pitch accents (such as the pitch movement patterns, patterns of the pitch alignment with the text, intensity), at the level of integral pitch accents as they are represented by E. A. Bryzgunova, and at the level of the pitch accents as manifestations of the communicative meanings. For the investigation, a small working corpus of Odessa speech recordings was set up. The corpus consists of interviews with speakers of Odessa Russian, short stories about Odessa told by its citizens, cooking recipes, jokes, and funny stories. The software programs Praat and Speech Analyzer were used to analyze the audio data. The results presented here are exemplified by frequency and intensity tracings of recordings from the oral corpus of Odessa speech.
Zakharov V. P.
Set Phrases: a View through Corpora
The study of word collocability is one of the main tasks of linguistics. Syntagmatic relations bind together language units that are in direct contact with each other. The combinatory ability of language units, or collocability, is one of the syntagmatic laws of language. This phenomenon is the main object of phraseology and lexicography. The article deals with set phrases of different types from the point of view of their numerical evaluation. Corpus linguistics understands set phrases as statistically determined units. This approach underlies various automatic methods of extracting idioms and collocations. The paper describes experiments which show how text corpora and corpus methods and tools such as association measures, word sketches and concordances can be used to expand the entries in existing dictionaries and how set phrases can be evaluated quantitatively. There are few works on the productivity of set phrases across time periods because historical corpora are small. In this research, the usage of the examined set phrases was studied diachronically on the basis of the large Russian corpus of the Google Books Ngram Viewer, which contains billions of tokens. The study argues that diachronic productivity is best evaluated by studying contexts, and the corpus tools used make this possible. Ultimately, it is shown that corpus linguistics methods and tools make it possible to create dictionaries of a new type, which should include a larger number of set phrases and collocations than before.
Zalizniak A. A.
Russian Language-specific Words as an Object of Contrastive Corpus Analysis
The paper summarizes the methodological principles and some preliminary results of the project “Contrastive corpus-based study of the specific features of Russian semantic system”, currently conducted by a research group on the basis of an aligned Russian-French and French-Russian parallel corpus. The purpose of the research project is to verify, by means of contrastive corpus-based analysis, a number of hypotheses concerning Russian “language-specific” words formulated in the course of previous investigations. We assume that translation equivalents of a language unit in another language can be considered as a source of information about the semantics of the latter. Such an approach is particularly efficient in the case of language-specific words, which usually do not have full-fledged equivalents in other languages. Indeed, there are at least three possible types of mismatch: a certain semantic component is lacking in the translation equivalent; the translation equivalent includes an additional component, which is absent in the original unit; a certain semantic component is rendered by supplementary means. Each of these types of mismatch provides us with linguistic information that helps clarify the semantics of the source language unit and thus verify the hypothesis of its language-specific status.
Anton Zimmerling, Ekaterina Lyutikova
Approaching V2: Verb Second and Verb Movement
The paper discusses the constitutive properties of V2 languages from the perspective of parametric typology. V2 languages are a small group of syntactically uniform languages sharing a number of parameters constraining the clausal architecture and the finite verb placement. We argue that whereas the generative procedure of deriving V2 by verb movement and feature composition of the target head is correct and has empirical validation, the broader definitions of V2 phenomena found in the contemporary work on the subject that loosen the diagnostic criteria on the single preverbal constituent are counterproductive. So-called ‘partial’ or ‘residual’ V2 languages, where verb movement to the left-peripheral position is allegedly characteristic of a part of root declaratives, do not exist; at the same time, verb movement by itself is not sufficient to produce the classic V2 profile.