This paper deals with morphological parsing of natural language texts. We propose a method
that combines comprehensive morphological description provided by ABBYY Compreno system
and sophisticated machine learning techniques used by the state-of-the-art POS taggers. The
morphological description contains information about the possible grammatical values of a dictionary word, which helps to identify a set of candidate hypotheses for each word during the morphological analysis stage. To analyse out-of-vocabulary words, we build a number of the most likely paradigms in the morphological model using the orthographic features of the analysed word. The proposed method reduces the number of hypotheses by using the context of each word. We use a bidirectional LSTM classifier to handle the context information and to predict the most probable grammatical value. The ambiguous grammatical values obtained from the morphological description are used as features for the classifier. We also use word embeddings and orthographic features to achieve better results.
Negative and positive polarity items (NPIs and PPIs) are among the well-explored topics in formal semantics and typology. However, the phenomenon of polarization has only been addressed on very limited linguistic material, such as indefinite pronouns (some vs. any), temporal adverbs (yet vs. already), certain idioms (not to lift one’s finger), and expressions of attitude (would rather, unfortunately). The main focus of polarity research is licensing contexts, that is, contexts that allow the use of NPIs or PPIs, respectively. The current paper introduces a considerable new body of NPIs and PPIs, explores the semantic underpinnings of polarization, demonstrates that polarization is an all-pervasive property of language, and shows that it is a scalar feature. The paper provides evidence that polarization is a regular phenomenon in polysemy, especially in verbs, where it characterizes single senses of polysemous words rather than words as a whole. It further establishes semantic connections between the phenomena of negative and positive polarization. The degree of polarization depends on semantic structure: the “weightier” the modal frame, the more polarized an item is.
The paper considers lesser-known aspects of the functioning of Russian lexical “xeno” markers, in particular the particle jakoby ‘allegedly, ostensibly’. Traditionally described as expressing the falsity of a proposition contained in somebody’s utterance, together with a negative assessment of the utterer as aware of its falsity, jakoby displays very different usages in the language of contemporary mass media. Namely, it is frequently used as a mere marker of evidentiality, without an obligatory assessment of the proposition as false or of its source as untruthful. In fact, it can even be used to refer to statements that are treated as true within the very same text, only to indicate that the source of this information is not the writer herself but somebody else (e.g., a different news agency), in what might be termed a “safety” strategy.
In addition, jakoby in its mass media usages demonstrates unusual syntactic behavior, namely shifts in scope, where it is placed before the speech verb rather than before the challenged proposition: jakoby utverzhdat’, chto P ‘jakoby claim that P’ instead of utverzhdat’, chto jakoby P ‘claim that jakoby P’.
However, the study of the Russian-English parallel corpus reveals that these usages are not as unusual as they may appear. In Russian translations of English texts jakoby sometimes functions as a translation of the English supposedly, allegedly, ostensibly or other (e.g., verbal) markers of uncertainty, but more frequently occurs with no apparent stimulus in the source, merely to mark indirect quotation.
It appears therefore that there is a certain need in the Russian language for a neutral evidentiality marker. It is occasionally filled with jakoby, which in this case displays a tendency for grammaticalization: it expresses that the source of information is other than the speaker herself (but contains no other semantic components), and takes syntactic scope over the speech verb instead of the proposition it challenges.
The paper presents a tentative scheme for the analysis and annotation of the factors that contribute to the metaphoricity of verbs in real-world data. The need for a comprehensive metaphor annotation scheme was demonstrated in a comparative study of four state-of-the-art automatic metaphor identification systems (Dunn, 2013c), which showed low agreement on a gold standard. The proposed scheme aims to capture the variety of factors involved in metaphoricity, their cumulative effect, and the gradual nature of metaphoricity.
The proposed inventory of metaphoricity factors includes nonbasic meanings (the less concrete, body-related and precise meaning(s) of a word); semantic shifts (additional semantic components of the meaning); newly attested meanings; morphologically motivated metaphoricity; different types of personification (a switch in the semantic class of the argument from animate to inanimate); a switch in the semantic class of the argument between concrete and abstract; the use of a metonymic argument; participation in idiomatic expressions; and direct metaphors, that is, the direct use of a word involved in a cross-domain mapping.
We also attempt to represent the dynamic interactions between these factors using graph-like visualization. Personifications and direct metaphors are described in terms of conceptual sources and targets. Secondary metaphoricity is attested for words that participate in extended metaphors. Metaphorically related words can be signaled by different types of cues: graphically, or by specific grammatical constructions, or by lexical cues.
The evidence is drawn from 2,036 occurrences of 100 Russian verbs sampled from the Russian syntactically annotated corpus SynTagRus.
The paper provides a corpus-based analysis of Russian idioms of drunkenness from the Thesaurus of Modern Russian Idioms (Moscow, 2007). As the starting point of the study, three text corpora were selected, differentiated by sphere of functioning of Russian: the Corpus of Russian Literature (texts from 1960 to 2000, 35 million typetokens), the Corpus of Mass-Media (texts from 1990 to 2010, 29 million typetokens) and the Corpus of Drama (plays from 1960 to 2010, 23 million typetokens).
The main idea is to compare the frequency of the idioms of drunkenness under consideration across the corpora. The analysis shows that Russian idioms of drunkenness form four groups according to their frequency characteristics. The first group is formed by idioms that are low-frequency in all corpora. The second group consists of idioms with the same relatively high frequency in all corpora. The third group includes idioms that show high variation in frequency across the corpora. The fourth group consists of idioms with a high frequency rank within every single corpus. The paper describes each of these groups and interprets the statistical data from the standpoint of stylistics. The high frequency range of an idiom in different corpora corresponds to its stylistic features. The paper argues that there is a strict correlation between the discourse frequency of an idiom and its stylistic properties.
This paper focuses on the evaluation of the discourse abilities of speakers with brain damage: people with dynamic aphasia (PWA(d)) and people with right hemisphere damage (RHD), as compared to healthy speakers of Russian. The study is based on material from the Russian CliPS corpus, which contains retellings of the Pear Film produced by PWA and RHD speakers, as well as by neurologically healthy controls.
The nature of the narratives in the corpus allows for a comparative investigation of discourse on the level of micro-structure (grammatical and lexical phenomena) and on the macro level: narrative structure, coherence and cohesion, interactional patterns and narrative discourse strategies. In this paper we present results of the comparative analysis of some macro level discourse strategies: the way interaction and empathy are realized in the stories by PWA(d), RHD and healthy speakers.
We have found significantly higher numbers of attitude expression markers, as well as significantly lower numbers of cognitive difficulties markers, in healthy speakers as compared to PWA(d). These results support what is known about difficulties that PWA(d) demonstrate in discourse production tasks. While PWA(d) use interactive markers to get a break from keeping with the story plan, they avoid using epistemic predicates whose subjects are the story characters.
The article deals with Russian 1Pl imperative constructions with the particles davay(te) / day(te). It is argued that a conversation-oriented approach to the analysis of these imperative utterances can be very productive. On the basis of the neighboring utterances and the specifics of dialogue organization, one can distinguish four semantic subtypes of the 1Pl imperative: propositive (which expresses a 'suggestion' and presupposes a verbal reaction from the listener), hortative (which expresses a 'mild order'; no verbal reaction from the listener is expected), factitive permissive (which expresses a request to remove obstacles to an action; no verbal reaction from the listener is expected), and permissive (which expresses 'permission'; the utterance appears as a reaction in the final part of the microdialogue). Imperative constructions are studied in two speech corpora. The difference in frequency of the four imperative subtypes between the two corpora can be explained by the difference in the proportion of everyday and institutional talk in the samples. In the corpus where institutional talk between “sellers” and “customers” predominates, most of the 1Pl imperative usages (70%) are propositive. This reflects the specific strategies used by the “sellers”. The research results can be used for the modeling of Russian dialogue.
The presented research was carried out on the material of the ORD speech corpus within a project dedicated to the study of sociolinguistic variation in Russian speech and aimed at identifying diagnostic features that characterize the everyday speech of major social groups (defined by age, gender, status, profession, etc.). The results showed that on practically every linguistic level one can observe features exhibiting very high similarity between different sociolects. In particular, the similarity is observed in the distribution of phonemes, the distribution of parts of speech, and the frequency of some syntactic structures. The distribution of phonemes was determined on a subcorpus of 172,000 allophones. The following ten phonemes are the most frequent in the speech of all social groups: /a/ (18.18%), /i/ (9.04%), /t/ (6.36%), /o/ (5.43%), /u/ (4.49%), /n/ (4.11%), /j/ (3.82%), /e/ (3.57%), /k/ (3.35%), /s/ (3.01%). The distribution of parts of speech in everyday speech was obtained on a linguistically annotated subcorpus of 125,437 tokens and has the following breakdown: V (17.43%), S (15.29%), S-PRO (14.13%), PART (13.35%), CONJ (9.47%), PR (7.09%), ADV-PRO (5.30%), ADV (4.51%), A-PRO (4.30%), A (3.73%), PRAEDIC (1.84%), INTJ (1.41%), NUM (1.29%), PARENTH (0.56%), ANUM (0.27%), PRAEDIC-PRO (0.01%). At the syntactic level, one-element structures prevail in the everyday speech of all social groups, the most frequent among them being D (particle / discursive word) (3.73%), S (2.26%), and V (1.88%). Statistical analysis of left-branching and right-branching verb groups has shown that the former significantly prevail in the speech of all sociolects. The revealed features reflect some constant, universal properties of everyday spoken Russian and can be used for the adjustment and improvement of speech synthesis and recognition systems.
Words denoting numbers (cardinal and ordinal numerals, or adjectives) represent a small (although potentially infinite) lexicographic type. In this article we deal with the polysemy structure of these two lexical classes. We propose a lexicographic pattern and study standard types of semantic shifts, including regular metaphors and metonymies. The words of both classes normally develop special senses with conversion into other parts of speech. Additional senses, different for different words, can appear due to cultural conventions.
The semantic analyser SemETAP is a module of the ETAP-3 Linguistic Processor. It uses two static semantic resources: the combinatorial dictionary and the ontology. The former contains multifarious information about words, and the latter stores extralinguistic (world) knowledge about concepts and serves as the metalanguage for semantic description. World knowledge is needed, on the one hand, to enhance text analysis and, on the other hand, to extract implicit information by means of inference. Both words and concepts are supplied with semantic descriptions. A semantic description consists of a definition in a formal language, which can optionally contain implications and expectations. For the user’s convenience, the description may also be supplemented with examples and a definition in natural language. Semantic descriptions of several words and concepts are given.
This paper describes experiments on humorous response generation for short text conversations. Firstly, we compiled a collection of 63,000 jokes from online social networks (VK and Twitter). Secondly, we implemented several context-aware joke retrieval models: BM25 as a baseline, query term reweighting, word2vec-based model, and learning-to-rank approach with multiple features. Finally, we evaluated these models in two ways: on the community question answering platform Otvety@Mail.ru and in laboratory settings. Evaluation shows that an information retrieval approach to humorous response generation yields satisfactory performance.
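The BM25 baseline mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `bm25_scores` and `retrieve_joke`, the whitespace tokenization, the parameter values `k1` and `b`, the non-negative IDF variant, and the three toy jokes. It is not the authors' implementation.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Non-negative IDF variant (assumption; other variants exist).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

def retrieve_joke(context, jokes):
    """Return the joke with the highest BM25 score for the given context."""
    tokenized = [j.lower().split() for j in jokes]
    scores = bm25_scores(context.lower().split(), tokenized)
    best = max(range(len(jokes)), key=scores.__getitem__)
    return jokes[best]

# Hypothetical toy collection, invented for this sketch.
JOKES = [
    "my cat thinks the keyboard is a bed",
    "monday is a state of mind not a day",
    "the printer only works when nobody needs it",
]
print(retrieve_joke("why does my cat sit on the keyboard", JOKES))
```

A real system would add proper tokenization and lemmatization for Russian; the sketch only shows how a conversational context can serve as the query against a joke collection.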
Alongside ordinary words, natural-language text also contains non-standard words (NSWs), such as abbreviations, acronyms, dates, phone numbers, currency amounts, etc. Before these text elements can be phonetized in Text-to-Speech synthesis, they must be normalized by replacing them with an appropriate ordinary word or word sequence. NSWs are increasingly diverse, and most of them require specific normalization rules. In this paper, we present a taxonomy of NSWs for the Russian language developed on the basis of news texts, software and car reviews, and instruction manuals. We grouped NSWs that have similar normalization rules or patterns, taking into account their graphic form and their context dependence. We propose five main groups of NSWs: abbreviations (including acronyms and initialisms), text elements containing numbers, special characters, foreign words written in the Latin alphabet, and mixed-type non-standard words. In this work, we describe these NSW types and address the issue of their normalization in Russian Text-to-Speech synthesis.
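To make the five groups concrete, here is a rough sketch that assigns a single token to one of them from its graphic form alone. The function name `classify_token`, the label strings, and all the rules are assumptions made for illustration; the taxonomy described above also takes context dependence into account, which this sketch ignores. Tokens are assumed to arrive without sentence-final punctuation.

```python
import re

def classify_token(token):
    """Guess a coarse NSW group for one token from its characters only."""
    # Periods and hyphens are treated as neutral joiners (assumption).
    core = token.replace(".", "").replace("-", "")
    has_cyrillic = bool(re.search(r"[А-Яа-яЁё]", core))
    has_latin = bool(re.search(r"[A-Za-z]", core))
    has_digit = bool(re.search(r"\d", core))
    has_special = bool(re.search(r"[^\w]", core))
    # More than one kind of material in a single token: mixed type.
    if sum([has_cyrillic, has_latin, has_digit, has_special]) > 1:
        return "mixed"
    if has_digit:
        return "number"
    if has_special:
        return "special_character"
    if has_latin:
        return "foreign_word"
    if has_cyrillic:
        # All-caps or dotted Cyrillic tokens are treated as abbreviations.
        if token.isupper() or "." in token:
            return "abbreviation"
        return "ordinary"
    # Only periods or hyphens remain.
    return "special_character"

for tok in ["США", "т.д.", "2017", "%", "Windows", "USB-кабель", "дом"]:
    print(tok, classify_token(tok))
```

A production normalizer would need much richer rules (dates, phone numbers, currency amounts) and context; the point here is only that each group can be tied to a recognizable graphic pattern.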
The representation of oral speech in literature differs sharply from real spoken language. It is therefore interesting to study Vladimir Sorokin’s novel Ochered’ (The Queue), which consists entirely of dialogue and gives the impression of conveying authentic orality.
The present study examines the use of the particle ну, which occurs some 300 times in the novel. The aim is to analyze various uses of its German and English equivalents, which will allow us to better describe the functioning of the Russian particle. On the other hand, comparison with translations will lead to a better understanding of this particle in the structure of literary texts.
Lexicographical descriptions indicate that the standard equivalent of Russian ну in German is the particle na. As is clear from our analysis, however, in many cases translations select a different variant. In a context involving coaxing (Nu, skazhite, ne vrednichayte), the equivalents aber and schon are found more often, while the particles na, naja, ja, also, so, eben or the interjections ach, tja are used more frequently to express overcoming embarrassment (Nu chto vy…). In English this play on discourse particles is not conveyed. This indicates that we need to refine the contrastive and lexicographic description of these particles on the basis of a representative corpus of parallel texts.
Our analysis shows that Sorokin uses ну not merely as a colloquial marker but also as a keyword, as this term is understood in linguistic poetics. In translation this function is not preserved or preserved only partially.
Collocation acquisition is a crucial task in language learning as well as in natural language processing. Semantics-oriented computational approaches to collocations are quite rare, especially for Russian language data, and require an underlying semantic formalism. In this paper we exploit the definition of collocation by I.A. Mel’čuk and colleagues (Iordanskaya, Mel’čuk 2007) and apply the theory of lexical functions to the task of collocation extraction. Distributed word vector models serve as a state-of-the-art computational basis for the tested method. Experiments of this type are conducted for the first time on available Russian language data, including the Russian National Corpus, SynTagRus and RusVectōrēs project resources. The resulting collocation lists are assessed manually and then evaluated by means of the precision and MRR metrics. The final scores are quite promising (reaching 0.9 in precision), and the described algorithm improvements yield a considerable gain in performance.
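For readers unfamiliar with the two evaluation metrics, the following sketch shows how precision at a cutoff and mean reciprocal rank (MRR) are commonly computed over ranked candidate lists. The function names and the toy candidates are hypothetical and are not taken from the paper's data.

```python
def precision_at_k(ranked, relevant, k):
    """Share of the top-k ranked candidates that were judged correct."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def mean_reciprocal_rank(rankings, relevant_sets):
    """Average of 1/rank of the first correct candidate in each ranked list."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first correct candidate counts
    return total / len(rankings)

# Hypothetical ranked collocate candidates for two headwords.
rankings = [["cand_a", "cand_b", "cand_c"], ["cand_x", "cand_y", "cand_z"]]
relevant_sets = [{"cand_b"}, {"cand_x"}]

print(precision_at_k(rankings[0], {"cand_a", "cand_c"}, 2))
print(mean_reciprocal_rank(rankings, relevant_sets))
```

Precision rewards lists in which many candidates are correct, while MRR rewards placing the first correct candidate near the top, so the two metrics complement each other.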
Multimodal communication – words, prosody, gesticulation, postures, eye gaze – is a fundamental part of language use. It is widely thought that communication is more effective when it is accompanied by gesticulation than when it is not. In search of systematic evidence of how the “speech–gesticulation” interface works, we turn to the referential communication task originally devised by Krauss and specified by Clark. In our experiment, two participants were seated on tatami separated by a low or high plastic screen (the within-participants factor “visibility”); in front of each were twelve cards with so-called Tangram figures. For the Director the cards were already arranged in a target sequence, and for the Matcher the same figures lay in a random sequence. The Director’s job was to get the Matcher to rearrange her figures to match the target ordering. Thirty-two participants carried out the task in four trials. All descriptions (over three hours long) were annotated, in Praat for the vocal channel and in ELAN for the kinetic channel. We consider six concrete hypotheses: (1) any kinetic activity benefits communication; (2) iconic gestures and postures benefit communication; (3) iconic postures benefit communication; (4) visible kinetic activity benefits communication; (5) visible iconic gestures and postures benefit communication; (6) iconic gestures and postures of the first trial benefit communication. However, none of these hypotheses was confirmed. The most likely explanation of the results lies in the complex interface between speech and gesticulation: the kinetic channel shows an interaction effect rather than an additive one.
Due to the process of globalization, the number of English borrowings in different languages is constantly growing. In natural language processing (NLP) systems, such as spell checkers, POS taggers, etc., the analysis of loanwords is not a trivial task and should be handled separately. This article continues our previous work on corpus-driven Anglicism detection by proposing an improved method of searching for loanwords by means of contemporary machine translation methods. It then describes the distribution of the borrowed lexicon in different online social networks (OSN) and blog platforms, showing that the Anglicism search task strongly depends on the corpus formation method. Our approach does not rely on any pre-prepared, manually acquired data and substantially automates Anglicism dictionary generation. We present an effective dictionary collection method that achieves, on a corpus 20 times smaller, the same coverage as a random user selection strategy. Our comparative study on LiveJournal, VKontakte, Habrahabr and Twitter shows that different social, gender, and even age groups have the same proportion of Anglicisms in their speech.
It is well known that syntax-level analysis of user-generated text such as tweets and forum postings is unreliable due to its poor grammar and incompleteness. We attempt to apply a higher-level linguistic analysis of rhetorical structure and investigate its potential application domains. We leverage the observation that discourse-level structure can be extracted from noisy text with higher reliability than syntactic links and named entities. As noisy text frequently includes informal interaction between agents, discussions, negotiations, arguments, and complaints, we augment discourse trees with speech acts. A Speech Act discourse tree (SADT) is defined as a discourse tree with verbs for speech acts as labels for its arcs. We identify text classification tasks that rely on tree kernel learning of SADTs: detection of negative mood (sentiment), text authenticity, and answer appropriateness for question answering in social domains. The proposed discourse-level technique outperforms traditional keyword-based algorithms in all three tasks.
The paper aims at establishing a list of semantic classes of Russian adjectives. Another goal is to enumerate the semantic roles relevant for the participants of Russian adjectives and the surface means of coding such adjectival participants. The discussion of adjectival classes starts from the classifications proposed in Western and Russian linguistics. Such classifications are often based on purely semantic grounds and sometimes have a rather speculative character. To overcome such shortcomings, the paper takes into consideration the participants introduced by the situations that adjectives describe. Such participants can be of two types. Participants of the first type are relevant for specific semantic classes of adjectives. Participants of the second type can be introduced by adjectives from different semantic classes (and, under certain circumstances, by adjectives that at first sight lack government). The overall conclusion is that adjectival government is more idiosyncratic and “more semantic” in nature than verbal government.
The paper considers the application of an ensemble algorithm based on rules and machine learning to anaphora resolution in Russian. The ensemble combines formal rules, the Extra Trees machine learning algorithm, and Balance Cascade, an algorithm for working with imbalanced training sets. The complexity of the approach lies in generating complex features from rules and vectorizing the syntactic context, with context data obtained from mystem (Yandex), SyntaxNet (Google) and Word2Vec.
The paper deals with the question of how to determine the degree of language-specificity of connectives. As connectives are non-descriptive linguistic units and their morphological nature may vary, the author suggests applying functional semantic criteria and distinguishes five types of interlingual isomorphism violation in Russian/French and Russian/Italian.
The author proposes to establish five categories of language-specific connectives:
1. The relation R expressed by the connective KA of language A is conveyed in language B by a linguistic unit UB not belonging to the functional class of connectives (for instance, the relation of simultaneousness expressed by pri ètom in Russian is frequently expressed in French or Italian by means of a present gerund);
2. The relation R expressed by the connective KA of language A is conveyed in language B by the signal KB of a relation R' which is semantically contiguous, but nevertheless possesses a more specific equivalent in language A (for instance, the pritom of illocutive simultaneousness is rendered in French with en outre, en plus, de plus, which have the more precise Russian equivalents k tomu že and krome togo);
3. The connective KA of language A possesses in language B a set of equivalents K_A^i, all occurring with the same frequency (this is the case of Russian vpročem, which is mainly translated with d’ailleurs and du reste);
4. The connective KA of language A has in language B a systemic equivalent KB which is the most frequent in translation, but does not reproduce all the values and uses of KA (for instance, the connective a to of negative alternative has the French equivalent sinon, but the Russian connective is polysemous: it can express both the relations of cause and alternative, as well as a relation of addition, which is not the case for sinon);
5. Language B has a most frequently used translation variant KB of the connective KA of language A, but this ‘equivalent’ KB does not possess the combinatory variety of KA (thus the Russian connective hotya introducing a subordinate clause combines in the main clause with the conjunction no, but hotya … no is translated in French either by a single bien que, the equivalent of hotya, or by a single mais, the equivalent of no).
The author also describes features that help to determine the degree of language-specificity of connectives, take into account the translation direction (from Russian to French/Italian and vice versa), and lend themselves to quantitative analysis. These features are: 1) frequency of occurrence of a “zero” functionally equivalent fragment (respectively, translation stimulus); 2) type of functionally equivalent fragment or translation stimulus (congruent or divergent); 3) number of possible translation patterns (respectively, translation stimuli). In conclusion, the author comments on statistical data confirming the described types of interlingual isomorphism violation for the functional class of connectives. The author points out that statistics require adequate interpretation and should be accompanied by semantic analysis, as the latter also serves as a tool for reviewing the data. The study was carried out using the Russian-French and Russian-Italian supracorpora database of connectives.
The paper discusses a number of challenging microsyntactic units of Russian formed with function words and words with broad semantics, such as pronouns, particles, conjunctions and auxiliary verbs. One microsyntactic unit, “kak byt'” (“what to do”), is described in full detail. Issues of the adequate presentation of such units in linguistic resources and their treatment in natural language processing tasks are considered. Special emphasis is laid on two types of resources: the Microsyntactic dictionary, developed as a specific lexicographic resource that can be used in theoretical linguistics and natural language processing, and a deeply annotated corpus of Russian, SynTagRus, supplied with microsyntactic tagging. The Microsyntactic dictionary consists of lexical entries compiled according to a uniform pattern that comprises 9 zones: (1) name of the microsyntactic unit; (2) type of unit (syntactic idiom or nonstandard construction); (3) lexical composition of the unit (individual words of the syntactic idiom or classes of words forming the nonstandard construction, if any); (4) analytical lexicographic definition of the unit; (5) government pattern; (6) syntactic properties of the unit presented in terms of dependency grammar, including the identification of syntactic relations that hold between the unit’s constituent elements, and relations that incorporate this unit into the sentence; (7) synonyms and analogues; (8) comments; and (9) illustrations. The microsyntactic markup of SynTagRus shows what units are present, how they fit into the structure, and which of the polysemous entities occur in the tagged sentence.
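The nine-zone entry pattern lends itself to a simple record type. The sketch below is only one possible representation: the class name `MicrosyntacticEntry`, the field names, and the helper `filled_zones` are assumptions, and every zone except the name is deliberately left empty, since the actual content of the entry for kak byt' is not given here.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional

@dataclass
class MicrosyntacticEntry:
    """One dictionary entry; the nine fields mirror the nine zones."""
    name: str                                   # (1) name of the unit
    unit_type: Optional[str] = None             # (2) syntactic idiom or nonstandard construction
    lexical_composition: List[str] = field(default_factory=list)   # (3)
    definition: Optional[str] = None            # (4) analytical lexicographic definition
    government_pattern: Optional[str] = None    # (5)
    syntactic_properties: Optional[str] = None  # (6) dependency-grammar description
    synonyms_and_analogues: List[str] = field(default_factory=list)  # (7)
    comments: Optional[str] = None              # (8)
    illustrations: List[str] = field(default_factory=list)         # (9)

    def filled_zones(self):
        """Names of the zones that currently hold any content."""
        return [f.name for f in fields(self) if getattr(self, f.name)]

# Only the unit's name is known from the text; other zones stay empty.
entry = MicrosyntacticEntry(name="kak byt'")
print(entry.filled_zones())
```

Keeping every zone as an explicit field makes it straightforward to check entries for completeness against the uniform pattern.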
We present our approach to the part-of-speech tagging and lemmatization tasks for Russian in the context of the MorphoRuEval-2017 Shared Task. The approach ranked second on the closed track and first on several test subsets.
We propose a filtration-based method that seamlessly integrates a classical morphological analyzer with machine-learning-based filtering. The method addresses both tasks in a unified fashion and consists of two stages. In the first stage we generate a set of candidate substitutions, each of which simultaneously recovers the normal form and provides all necessary morphological information. In the second stage we select the optimal substitution for the current word given its context.
The filtration stage of the presented method is based on linear SVMs extended with a hash kernel. The extension reduces the size of our model by an order of magnitude and makes it easy to tune the tradeoff between precision and model size.
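A hash kernel is commonly understood as the hashing trick: string-valued features are mapped into a fixed number of buckets, so the weight vector has a fixed size regardless of how many distinct features occur. The following sketch illustrates the idea under that assumption; the function names, the feature strings, the use of CRC32 as the hash function, and the bucket count are all hypothetical and are not taken from the described system.

```python
import zlib

def hash_features(features, n_buckets=2 ** 18):
    """Map named features to a sparse vector of at most n_buckets dimensions."""
    vec = {}
    for name, value in features.items():
        h = zlib.crc32(name.encode("utf-8"))
        index = h % n_buckets
        # Use the top hash bit as a sign so colliding features tend to cancel.
        sign = 1.0 if (h >> 31) & 1 == 0 else -1.0
        vec[index] = vec.get(index, 0.0) + sign * value
    return vec

def linear_score(weights, hashed):
    """Dot product of a dense weight list with a sparse hashed vector."""
    return sum(weights[i] * v for i, v in hashed.items())

# Hypothetical context features for one candidate analysis.
feats = {"suffix=ть": 1.0, "prev_pos=NOUN": 1.0, "cand_pos=VERB": 1.0}
hashed = hash_features(feats, n_buckets=16)
weights = [0.0] * 16  # an untrained model: every score is zero
print(hashed, linear_score(weights, hashed))
```

The number of buckets directly sets the size of the weight vector, which is how such a scheme exposes a tunable tradeoff between model size and accuracy: fewer buckets mean a smaller model but more feature collisions.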
The paper reports some results of research aimed at finding out (1) whether place coarticulation occurs in clusters of [labial or dental nasal + labiodental obstruent] within the phonological word and in external sandhi position in Modern Standard Russian, and (2) whether it may serve as a cue for detecting the presence of prosodic breaks and the order of phonological rules.
The results show that the F2 value of a nasal before a labiodental obstruent is significantly higher for the bilabial nasal and significantly lower for the coronal one in comparison with their F2 values before homorganic stops. This type of place coarticulation is found only within the phonological word and does not occur in external sandhi position; thus the absence of this type of coarticulation may serve as a cue for detecting the presence of a prosodic break.
In the case of clusters with a final palatalized labiodental obstruent, the F2 value of the bilabial nasal is found to be noticeably higher than that of the coronal one because of palatalization coarticulation, which exists in Modern Standard Russian for bilabials but not for coronals before labiodentals. Thus, we argue that the phonological rule of palatalization operates before the rule of place assimilation in Standard Russian.
The paper continues a series of studies on gaits as a particular type of complex nonverbal sign unit. It presents a detailed description of various types of gait characteristic of Russian people. The research is conducted within the framework of the feature-based approach, which is a main tool for constructing a semiotic conceptualization of somatic objects or of the human body in general. Using this approach we analyze physical features of gaits, such as width of pace, walking speed, type of contact with the surface, the noise produced while walking, the shape of the body or of particular somatic objects (e.g., legs, feet), etc. We describe how a particular manner of walking reflects various characteristics of a person. These are social characteristics, such as gender and age; character traits and psychological features, such as being nervous, depressed or happy; and various types of dysfunction, such as limping. Such parallels between the specific features of a gait and elements of the emotional and psychological state of a person highlight the connections between various systems of the human being.
In spite of its fundamental role in the processes of speech production, speech breathing has received insufficient experimental study. The main task of this paper was to obtain phonetic-acoustic data on breathing pauses (BP) with different text localization in oral Russian texts. In the introduction we formulate the problems under discussion, with the main goal of analyzing the correlation of BP acoustics with the boundaries of the principal text units (paragraphs, sentences, clauses), taking into account interspeaker variability in the reading of the same text. In our previous studies it was supposed that the distinctive function of BP as prosodic markers of macrosegmentation is realized by stable differences in their general phonetic patterns and in such acoustic parameters as the duration, intensity and noise spectrum of the inhalation phase. The quantitative analysis of these differences makes up the main part of the paper. The second section describes the material and the methods of research. The third section analyzes the general acoustic-physiological pattern of BP with different text localization, which makes it possible to establish regular objective differences between them. The fourth section contains data on BP duration in connection with text localization. It is shown that text localization is the main factor controlling BP duration. The fifth section presents similar information for average BP intensity. The sixth section contains an analysis of the spectral characteristics of the inhalation noise in BP. It is shown that this is the most invariant feature across all BP types, almost independent of text localization. The conclusion summarizes the results of the study, emphasizing that the presence of BP in any oral text can serve as a sufficient indication of its structure, but interspeaker variability shows that the realization of BP is not a necessary feature of a text boundary.
Lexical elements that signal text cohesion (though, moreover, etc.) are a convenient and attractive research object in Translation Studies. Unlike other words, connectives are structurally optional, more context-independent and, therefore, more revealing in terms of the motivations behind translators’ linguistic choices. Their frequencies are used to establish differences between translations and non-translations and are interpreted as a linguistic indicator of several tendencies in the translation process, such as explicitation, simplification and convergence. Particular patterns of translators’ choices result in different degrees of ‘being a translation’ and can be related to translation quality and translational norms. We set out to reveal tendencies in translational behavior at different competence levels by describing frequency distributions of a single functional type of discourse markers (connectives) in English-to-Russian translation of mass-media texts. To this end, we compared data from a parallel translational learner corpus and a corpus of professional translations to customized selections from English and Russian national corpora. Using independent predefined lists of connectives for each language, we explored cross-linguistic differences and their influence on students’ and professional translations. We hypothesized three possible tendencies: translations follow the source language pattern (interference); translations follow the target language pattern (normalization); or translations demonstrate independent idiosyncratic (over)use of connectives (explicitation). The observations were made with regard to the overall frequencies of the list items, their semantic groups and individual frequencies. Besides obvious pedagogical implications, the findings are useful in understanding the cognitive processes behind translation and are applicable in detecting translationese for a given language pair and in assessing the textual quality of translations.
The article discusses adverbial expressions (adverbials), cf. v ozhidanii (‘in expectation’), pod okhranoj (‘guarded by smb’), po priglasheniju (‘by invitation’), v blagodarnost’ (‘being thankful for’), as one of the stages in the grammaticalization of verbal nouns, namely the stage of their transformation into derivative prepositions.
An adverbial is regarded as a reduced predication related to the main predication. There are two types of adverbials: included adverbials and circumstantial ones. Circumstantial adverbials correlate with adverbial subordinate clauses. The article discusses semantic-syntactic properties of included adverbials.
Among included adverbials two types can be distinguished:
1) coreferential (having the same subject) adverbials that correlate with an adverbial participle: the subject of the adverbial is identical to the subject of the main predication; and the subject (agent) of the adverbial is not expressed and can’t be expressed syntactically, cf. (Passazhiry khodili po perronu v ozhidanii poezda ‘Passengers were walking along the platform waiting for the train to arrive’);
2) non-coreferential (having different subjects) adverbials that correlate with a passive participle: the subject of the adverbial is not identical to the subject of the main predication; and the subject (agent) of the adverbial is necessarily expressed (Prishel v soprovozhdenii advokata ‘came accompanied by a lawyer’). In contrast to such adverbials, nominalizations can express all their arguments.
In this paper, we present the results of preliminary experiments on finding the link between the surface forms of Russian nouns (as represented by their graphic forms) and their meanings (as represented by vectors in a distributional model trained on the Russian National Corpus). We show that there is a highly significant correlation between these two sides of a linguistic sign (in our case, a word). The correlation coefficient is 0.03 as calculated on a set of 1,729 monosyllabic nouns, and in some subsets of words starting with particular two-letter sequences the correlation rises as high as 0.57. The overall correlation value is higher than the one reported in similar experiments for English (0.016).
Additionally, we report correlation values for the noun subsets related to different phonaesthemes, supposedly represented by the initial characters of these nouns.
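The form-meaning comparison described above can be illustrated with a minimal sketch. The abstract does not state which distance measures or correlation statistic were used, so the choices here are assumptions: Levenshtein distance for graphic form, cosine distance for meaning, and Pearson correlation over all word pairs. The toy words and two-dimensional vectors are invented for illustration and are not taken from the Russian National Corpus model.

```python
from itertools import combinations
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def form_meaning_correlation(vectors):
    """Correlate pairwise form distances with pairwise meaning distances."""
    pairs = list(combinations(sorted(vectors), 2))
    form = [levenshtein(a, b) for a, b in pairs]
    meaning = [cosine_distance(vectors[a], vectors[b]) for a, b in pairs]
    return pearson(form, meaning)


# Hypothetical toy data: transliterated monosyllabic nouns with made-up vectors.
toy_vectors = {
    "dom": [1.0, 0.1],
    "tom": [0.9, 0.3],
    "les": [0.1, 1.0],
    "ves": [0.5, 0.4],
}
r = form_meaning_correlation(toy_vectors)
print(f"form-meaning correlation on toy data: {r:.3f}")
```

Because pairwise distances are not independent observations, the significance of such a coefficient is usually assessed with a permutation test rather than the standard test for Pearson's r; that step is omitted here.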
In the article, I focus on tense marking in Russian constructions with predicatives, such as xolodno '(it is) cold' and ploxo '(it is) bad'. Statistical data from the Russian National Corpus show that the frequency of past tense forms (e.g., combinations with the form bylo) is much greater for some predicatives than for others. This difference results from both semantic and formal factors. On the one hand, some predicatives denote evaluation (e.g. ploxo ‘bad’). Evaluation can be applied to events that have finished or have never been realized. What is relevant is that the evaluation is made at the moment of speech, and this is why the present tense (= the zero copula verb) is used. On the other hand, it is important that the present tense is unmarked with predicatives, while with verbs it is marked with special verbal affixes. The unmarked present tense form of a predicative can get its temporal meaning from the embedded verb. Interestingly, this phenomenon is in a sense the opposite of the well-known phenomenon of relative tense marking. While the latter presupposes that the tense assignment to the embedded event is anchored to the tense meaning of the main event, the tense value of the construction with evaluation predicatives is assigned by 'agreement' with the embedded verb.
This paper analyzes the Russian equivalents of the Italian focus particle "magari". This lexeme has attracted much attention among linguists from different countries due to its especially intriguing polyfunctionality, which sometimes has no parallel in other languages. Russian equivalents of "magari" (extracted from the Russian-Italian subcorpus of the Russian National Corpus) clearly demonstrate that "magari" corresponds in Russian to a wide range of lexemes/units with different modality: equipotential non exclusion of factuality, concessivity, weakening of the illocutionary force of the imperative, optative modality. The set of functions held by "magari" (non factual, non factual concessive, imperative, optative) also recalls in Russian the semantic network developed by several irrealis markers of non factuality. Moreover, in some contexts in the Italian translations from Russian "magari" appears while there is no concrete equivalent in the source language. In other words, the connotative range of "magari" is mostly achieved in Russian by different semantic mechanisms. Cross-linguistic analysis helps in clarifying the set of interlinguistic Russian-Italian correspondences of the lexeme "magari" and in circumscribing the different constructions and contexts where it occurs.
This paper contributes to the field of multimodal linguistics and discusses annotation of manual gestures in multichannel (multimodal) discourse, using a corpus of Russian discussions of the Pear Film (Chafe ed. 1980); the corpus is currently under construction (www.multidiscourse.ru). The study is based on a fully annotated subcorpus of 3 recordings, total time about one hour, with three participants in each recording. One of the key issues in gesture annotation is the importance of systematic distinction between communicatively significant gestures and general kinetic background. To address this issue, we introduce the notion of a speaker’s gesticulation portrait, involving properties such as (dis)inclination to stillness and to frequent use of physiologically motivated self-adaptors, as well as a set of habitual postures (“neutral hand positions”), serving as starting points for significant movements. An annotation scheme for the ELAN environment is proposed, allowing one to systematically annotate hierarchically organized gestural units as annotations in independent ELAN tiers: gesture chains, individual gestures, and gesture phases. Gesture phases are annotated separately for the left and the right hand; an individual gesture (sometimes called “gesture phrase”) is treated as a combined unit that can be one-handed or two-handed. To divide an ongoing stream of gesticulation into a series of gestures, annotators identify the points where kinetic features change. A simultaneous change of two or more parameters (effort, velocity, trajectory, movement direction, hand shape and orientation, location in gesture space, etc.) marks a phase boundary; a change of several features points to a gesture boundary. A set of subordinate ELAN tiers is used to describe each gesture, including handedness, phase structure, multi-strokes, gesture overlaps, gesture repetitions, etc. 
As left- and right-hand movements are often asynchronous, a set of formal rules is proposed for such instances. On the basis of evidence from the annotated subcorpus, quantitative observations were made concerning speakers’ preferences in gesture handedness. Three out of nine participants prefer (>60%) two-handed gestures, while six participants prefer either right- or left-handed gestures most of the time. While all the speakers identified themselves as right-handed, two of them show a preference for the left hand (>63%); the degree of right-hand preference varies strongly (from 60% to 97%). These data suggest that adding the “dominant hand” parameter for two-handed gestures may be quite useful. The proposed annotation scheme is oriented towards comparing manual gestures with units of other communication channels.
A software system is presented which is designed to train learners in producing the basic intonation patterns of Russian speech. The system is based on comparing the melodic portraits of a reference sentence and a sentence pronounced by the learner and involves active interaction of the learner with the system. While parametric representation of intonation features of the speech signal faces fundamental difficulties, the paper shows how they can be overcome. The basic algorithms of analyzing and comparing intonation features, used in the proposed learning system, are presented. The features of the acoustic database composed of reference sentences and used in the learning system are presented. The set of reference sentences represents intonation patterns of Russian speech (IP1 to IP7) and their basic varieties. The system’s interface is presented and the results of system operation are illustrated.
The assumption that senses are mutually disjoint and have clear boundaries has been called into doubt by several linguists and psychologists. The problem of word sense granularity is widely discussed both in lexicographic and in NLP studies. We aim to study word senses in the wild, that is, in raw corpora, by performing word sense induction (WSI). WSI is the task of automatically inducing the different senses of a given word, framed as an unsupervised learning task with senses represented as clusters of token instances. In this paper, we compare four WSI techniques: Adaptive Skip-gram (AdaGram), Latent Dirichlet Allocation (LDA), clustering of contexts and clustering of synonyms. We evaluate them quantitatively and qualitatively and perform an in-depth study of the AdaGram method, comparing AdaGram clusters for 126 words (nouns, adjectives, and verbs) with their senses in published dictionaries. We found that AdaGram is quite good at distinguishing homonyms and metaphoric meanings. It ignores disappearing and obsolete senses, but induces new and domain-specific senses which are sometimes absent in dictionaries. However, it works better for nouns than for verbs, ignoring structural differences (e.g. causative meanings or different government patterns). The AdaGram database is available online: http://adagram.ll-cl.org/.
In this paper we study several groups of features and machine learning methods in the shared task on Russian paraphrasing organized in 2016. We use four groups of features: string-based features, information-retrieval features, part-of-speech features and thesaurus-based features and compare three machine learning methods: SVM with linear and RBF kernels, Random Forest and Gradient Boosting. In our experiments, the best results were obtained with the Random Forest classifier with parameter tuning and using all groups of features. The results of Gradient Boosting with parameter tuning were slightly worse.
The paper explores the syntax and semantics of nominal counterfactuals, in which the antecedent contains a DP instead of a clause (“if BY not DP”). We address a number of issues related to their internal structure and semantic interpretation, including the obligatoriness of negation and its inability to license negative polarity items. We argue that despite the apparent nominal character of the antecedent, its interpretation is not different from that of regular counterfactuals, whereby the antecedent provides the restrictor for the universal quantifier in the domain of possible worlds. We propose that the nominal constituent undergoes re-interpretation along the following path. A pronominal element pro adjoins to DP, which takes its reference in the domain of events. This element finds an antecedent event in the standard way regular pronominals do. It is related to the denotation of the DP by an unpronounced thematic relation. This derives a proposition of the form “that there is an event identical to the referent of pro, which || DP || is a participant of”. Since the referent of pro is an eventuality in the evaluation world, the counterfactual only gets a coherent interpretation if the antecedent is negated. For the same reason, the antecedent is not an environment where NPIs can be licensed.
Tolerance is a complex and partly contradictory concept that can be understood differently not only in different cultures, but also within the same culture. This paper presents a comparative study of the perception of tolerance by Russian and English speakers based on analysis of corpus data. At the initial stage of the study, the authors semi-automatically compiled a pilot web-based corpus of texts about tolerance. The corpus consists of a Russian-language subcorpus of 199,607 words and an English-language subcorpus of 210,898 words. After the mini-corpus was analyzed, the results were verified on the data from the general corpora ruTenTen11 and enTenTen13 using the Sketch Engine platform. The authors compared the word sketches for толерантность (tolerantnost’), tolerance, толерантный (tolerantnyi) and tolerant. In particular, this implied analyzing various lexical-semantic fields and thematic groups of collocates, as well as the following patterns: X толерантности (tolerantnosti) and X of tolerance, толерантность к (tolerantnost’ k) X and tolerance towards X, толерантность и/или (tolerantnost’ i/ili) X and tolerance and/or X. In addition, various derivatives of толерантность (tolerantnost’) / tolerance were discovered in the corpora and analyzed, including numerous nonce words. The corpus analysis enabled a deep insight into the way tolerance is perceived by Russian and English speakers.
Call centers receive large numbers of incoming calls. The calls are regularly processed by an analytical system, which helps people inspect all the data automatically. Such a system requires a classification module that can determine the topic of conversation for each call. Because manual annotation is costly, the input for this module consists of automatically transcribed calls. Hence, the texts (i.e. automatic transcriptions) used for classification contain mistranscribed words, which may affect the classification process. Another important point is that this module has special requirements: it should be domain-independent and easy to set up. Document classification always requires an annotated data set for classifier training, but manually creating an annotated training set for each domain is too costly. In this paper, we propose an approach to classifying automatic speech recognition texts that allows the user to avoid full manual annotation and at the same time to control its quality.
As the volume of user-generated content in social media expands, so do the potential benefits of mining social media to learn about patient conditions, drug indications, and beneficial or adverse drug reactions. In this paper, we apply a Conditional Random Fields (CRF) model to extract expressions related to diseases from patient comments. Our method utilizes hand-crafted features including contextual features, dictionaries, and cluster-based and distributed word representations generated from unlabeled user posts in social media. We compare our CRF-based approach with deep recurrent neural networks and a dictionary-based approach. We examine different word embeddings generated from unlabeled user posts in social media and from scientific literature. We show that CRF outperformed the other methods and achieved F1-measures of 69.1% and 79.4% on recognition of disease-related expressions in the exact and partial matching exercises, respectively. Qualitative evaluation of the disease-related expressions recognized by our feature-rich CRF-based approach demonstrates the variability of reactions from patients with different health conditions.
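The difference between exact and partial matching in the evaluation above can be shown with a small sketch. The abstract does not define partial matching, so this assumes one common convention: a predicted span counts as a partial match if it overlaps a gold span by at least one token. The spans are hypothetical (start, end) token offsets with an exclusive end.

```python
def exact(a, b):
    """Exact match: both boundaries coincide."""
    return a == b


def overlaps(a, b):
    """Partial match (assumed definition): the spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def span_f1(gold, pred, match):
    """F1 over spans, where `match` decides whether two spans correspond."""
    matched_pred = sum(any(match(p, g) for g in gold) for p in pred)
    matched_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one comment: (start, end) token offsets.
gold_spans = [(0, 2), (5, 8)]
pred_spans = [(0, 2), (6, 8), (10, 11)]

exact_f1 = span_f1(gold_spans, pred_spans, exact)
partial_f1 = span_f1(gold_spans, pred_spans, overlaps)
print(f"exact F1 = {exact_f1:.2f}, partial F1 = {partial_f1:.2f}")
```

The partial score can never be lower than the exact score under this definition, which is consistent with the ordering of the two figures reported in the abstract.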
The article describes a model for the automatic analysis of puns, in which a word (the target word) is intentionally used in two meanings at the same time. We employ Roget’s Thesaurus to discover two groups of words, which, in a pun, form around two abstract bits of meaning (semes). They become a semantic vector, on the basis of which an SVM classifier learns to recognize puns, reaching an F-measure of 0.73. We apply several rule-based methods to locate intentionally ambiguous (target) words, based on structural and semantic criteria. It appears that the structural criterion is more effective, although it possibly characterizes only the tested dataset. The results we get correlate with the results of other teams at the SemEval-2017 competition (Task 7, Detection and Interpretation of English Puns), considering the effects of using supervised learning models and word statistics.
E. Benveniste (Benvenist 1974) proposed a new approach to the study of egocentricity in language. He delimits two contexts of language use: the context of discourse and the context of narration (plan de discours and plan de récit). Thereafter the notion of register of interpretation came into being. Two registers are distinguished: the dialogical register and the narrative register, see Apresjan 2003, Paducheva 1986, 1996. This distinction is of utmost importance for the egocentrical entities of language (egocentricals), whose semantics appeals to the implied speaker. The opposition of primary (rigid) and secondary (shiftable) egocentricals is essential: secondary egocentricals undergo projections (when the role of the speaker is fulfilled by another subject), while primary egocentricals don’t allow projections.
The paper demonstrates that the parenthetical nikak expresses an uncertain statement made on the basis of a directly observed situation. The following properties of nikak are distinguished. Nikak is often used in an interrogative sentence, but in an utterance that doesn’t require an answer; various particles and the second person pronoun are often used in a nikak-sentence; a nikak-sentence can express an explanation or interpretation of the situation before the eyes of the observer; a nikak-sentence can express surprise; nikak may have kazhetsja ‘it seems’ as a synonym. Nikak is used only in an independent clause (thus differing from neuzheli, which is appropriate in a subordinate context). It is used only in the dialogical register and is a primary egocentrical: it can only have the speaker as an implied subject and doesn’t allow hypotactical or narrative projection.
Our experiment aims to evaluate the performance of distributional semantic features in metaphor identification in Russian raw text. We apply two types of distributional features representing similarity between the metaphoric/literal verb and its syntactic or linear context. Our approach is evaluated on a dataset of contexts for nine Russian verbs, which is made available to the community. The results show that both sets of similarity features are useful for metaphor identification and do not replicate each other, as their combination systematically improves performance on individual verb sense classification, reaching state-of-the-art results for verbal metaphor identification. A combined verb classification demonstrates that the suggested features effectively generalize over metaphoric usage in different verbs and shows that linear coherence features perform as well as the combined feature approach. By analyzing the errors, we conclude that syntactic parsing quality is still modest for raw-text metaphor identification in Russian, and we discuss the properties of semantic models required for high performance.
The semantic halo of a meter (semantičeskij oreol metra) is a notion that was introduced by Mikhail Gasparov to describe semantic invariants of poetic texts composed using the same metrical scheme. Most studies addressing this phenomenon have been based on expert knowledge of the text corpus. In this study, I propose an automated approach to analyzing the semantic halo of various meters based on keyword extraction, using a simple measure of keyness as developed by Adam Kilgarriff. The method is applied to texts from the Poetic subcorpus of the Russian National Corpus. It allows us to discern basic motifs that are very close to those identified by literary scholars, which proves that it is a promising way to analyze the semantic halo of a meter. Some novel associations can also be inferred from the keyword lists. Clearly, keyword extraction cannot replace profound expert knowledge, but it can serve as a useful first step of the analysis.
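The keyword extraction step can be sketched with Kilgarriff's "simple maths" keyness score, which compares smoothed per-million frequencies in a focus corpus and a reference corpus. The tokenization, the smoothing constant `n = 1.0`, and the toy token lists are assumptions made for illustration; the abstract does not report the settings actually used on the Poetic subcorpus.

```python
from collections import Counter


def keyness(focus_tokens, reference_tokens, n=1.0):
    """Rank focus-corpus words by (fpm_focus + n) / (fpm_reference + n).

    fpm is frequency per million tokens; n is the smoothing constant that
    avoids division by zero and controls the preference for rarer or more
    common words (assumed value here).
    """
    focus = Counter(focus_tokens)
    reference = Counter(reference_tokens)
    focus_total = len(focus_tokens)
    reference_total = len(reference_tokens)

    scores = {}
    for word, count in focus.items():
        focus_fpm = count * 1_000_000 / focus_total
        reference_fpm = reference[word] * 1_000_000 / reference_total
        scores[word] = (focus_fpm + n) / (reference_fpm + n)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpora: poems in one meter versus all other poems.
focus_corpus = "road night road star wind".split()
reference_corpus = "road house wind wind table star house wind road star".split()

ranked = keyness(focus_corpus, reference_corpus)
for word, score in ranked:
    print(f"{word}\t{score:.3f}")
```

Words that head the ranking are candidates for the motifs associated with the meter; in practice a minimum frequency threshold is often applied first so that single occurrences do not dominate the list.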
The framework of Rhetorical Structure Theory (RST) can be used to reveal the differences between the structures of truthful and deceptive (fake) news. This approach has already been used for English. In this paper it is applied to Russian. The corpus consists of 134 truthful and deceptive news stories in Russian. Text annotations contain 33 relation categories. Three experimental data sets were created: one with rhetorical relation categories only (frequencies), one with rhetorical relation categories and bigrams of categories, and one with rhetorical relation categories and trigrams of categories. Support Vector Machines and a Random Forest classifier were used for text classification. The best result (0.65) was obtained using Support Vector Machines with a linear kernel on the first data set. The model could be used as a preliminary filter for fake news detection.
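The three feature sets described above can be illustrated as follows. The abstract does not say how bigrams and trigrams of categories were formed, so this sketch assumes that each text is represented by its rhetorical relations in linear order and that n-grams are taken over that sequence. The relation labels in the example are hypothetical and do not reproduce the 33-category inventory.

```python
from collections import Counter


def relation_features(relations, max_n=1):
    """Count relation-category n-grams of length 1..max_n in one text.

    max_n=1 gives category frequencies only, max_n=2 adds bigrams of
    categories, and max_n=3 adds trigrams as well.
    """
    features = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(relations) - n + 1):
            features["+".join(relations[start:start + n])] += 1
    return features


# Hypothetical relation sequence for a single news story.
story = ["Elaboration", "Contrast", "Elaboration", "Cause"]

unigram_features = relation_features(story, max_n=1)
bigram_features = relation_features(story, max_n=2)
print(dict(bigram_features))
```

Each `Counter` can then be turned into a fixed-length vector over the feature vocabulary of the whole corpus before being passed to a classifier.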
For many natural language processing tasks (machine translation evaluation, anaphora resolution, information retrieval, etc.) a corpus of texts annotated for discourse structure is essential. As of now, there are no such corpora of written Russian, which stands in the way of developing a range of applications. This paper presents the first steps in constructing a Rhetorical Structure Corpus of the Russian language. The main annotation principles are discussed, as well as the problems that arise and the ways to solve them. Since annotation consistency is often an issue when texts are manually annotated for something as subjective as discourse structure, we specifically focus on measuring inter-annotator agreement. We also propose a new set of rhetorical relations (modified from the classic Mann & Thompson set), which is more suitable for Russian. We aim to use the corpus for experiments on discourse parsing and believe that the corpus will be of great help to other researchers. The corpus will be made available for public use.
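As an illustration of inter-annotator agreement measurement, the sketch below computes Cohen's kappa for two annotators who label the same discourse units with rhetorical relations. The abstract does not name the agreement measure used for the corpus, so Cohen's kappa is an assumption chosen because it is a widely used chance-corrected statistic; the labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical relation labels assigned to four discourse unit pairs.
annotator_1 = ["Cause", "Cause", "Contrast", "Elaboration"]
annotator_2 = ["Cause", "Contrast", "Contrast", "Elaboration"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Kappa of this kind only covers relation labelling on units that both annotators segmented identically; agreement on segmentation and on tree structure requires separate treatment.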
The paper focuses on how prosody complements grammar in differentiating between two main strategies of quoting: direct and indirect speech (Peter said: “I am a linguist” vs. Peter said that he was a linguist). Based on corpus data from spontaneous spoken Russian, the paper addresses two research questions: (1) how do grammatical, lexical and prosodic parameters correlate in canonical cases of direct and indirect speech in Russian, and (2) what are the typical prosodic deviations from canonical patterns. Answering the first question, we identify prosodic features (pitch movements, pitch reset, localization of phrasal accents, pauses, etc.) that help to detect a prosodic break between the reporting frame (Peter said) and the following reported utterance in the case of direct deixis (when personal and spatiotemporal indexes are oriented towards the reported situation, cf. I am a linguist) and the lack of such a break in the case of indirect deixis (when personal and spatiotemporal indexes are shifted towards the reporting situation, cf. that he was a linguist). Answering the second question, we identify particular prosodic patterns used (i) to integrate otherwise prosodically disintegrated constructions with direct deixis, and (ii) to disintegrate otherwise prosodically integrated constructions with indirect deixis.
We discuss the status of orthoepy as a linguistic discipline; its scope and limitations; the nature of orthoepic prescriptions; and the evolution of dictionary marks in connection with the evolution of the stress and grammar of standard Russian. We consider the use of the Russian National Corpus as an instrument for predicting further orthoepic changes and envisaging the prescriptions resulting from this change.
1. Prescription cannot apply to contextual modifications of phonemes ([cм’]ех/[c’м’]ех), because they are not perceived by a “lay speaker”, are not subject to his or her conscious choice and cannot lead to communicative failures.
2. On the other hand, prescription may apply to variants that are different in ‘sound types’ (звукотип, e.g. ж[а]ле́ть/ж[ы]ле́ть, [ceйф]/[c’eйф], ти́[xъй]/ти́[x’и́й], [ч]то/[ш]то, е́[жж]у/е́[ж’ж’]y); these variants may be suggested to represent newer vs. older norms.
3. Stress variants are the most dynamic domain of orthoepy. Special focus is on the change in verbal stress as reflected in the difference between the prescribed vs. real usage and in the chronology of dictionary marks (дружи́т→дру́жит, родился́→роди́лся etc.).
4. The use of data from the Russian National Corpus and, more specifically, comparative analysis of the main corpus and the corpus of newspapers is an important method for analyzing grammatical variation. We consider the competition between such forms as поезжай/езжай, одеть/надеть, тычу/тыкаю, мучу/мучаю, их/ихний and provide charts showing positive dynamics of the more recent variants.
In conclusion, we stress the difference in the nature of orthographic and orthoepic marks in dictionaries. Orthographic marks may be prescriptive and prohibitive. Orthoepic marks, in most cases, can only have the status of recommendations; they cannot (be regarded as a means to) block language change.
The paper describes the diachronic development of the Russian construction with the verb govorit’ ‘to say’, the main marker introducing direct speech. The analysis relies heavily on data from XIX century Russian as presented in a special annotated corpus of Early Modern Russian.
This report presents the initial findings on automatic bridging anaphora recognition and resolution for the Russian language. We use a manually annotated bridging corpus and machine-learning techniques to develop a classifier that predicts bridging anaphors, bridging anchors, and bridging pairs, reaching an F-measure of 0.65 on resolution. In addition, we discuss the features used for the classifier and the importance of each feature. Experimental results show that our classifier works well; however, improvements are possible, and these will be explored.
Distributed representations of words are currently used in a variety of linguistic tasks. A specific branch of their possible applications is the automatic extraction of word-level grammatical information, formulated as a problem of word embedding classification. In this paper, we investigate the applicability of this approach to the prediction of several particular classifying grammemes. We focus on the animacy of Russian nouns and the transitivity of Russian verbs. These categories can serve as good examples of classifying grammatical categories in the Russian language, since their concrete values can hardly be predicted from the appearance of words and the morphemes that constitute them. We conduct experiments on a corpus of Russian texts from the Web with several widely used word-embedding algorithms and different parameter settings. Experimental evaluation includes a comparison of the performance of several classifiers, with distributed representations serving as the source of features for the classification task. Our findings show the feasibility of the approach and its potential for solving related tasks.
In this study we present a method of morphological tagging based on a deep learning neural network. The method includes two levels of input sentence processing: the level of individual characters and the word level. The comparison with other morphological analyzers was carried out on the SynTagRus dataset in its original format of morphological characters and in its versions in Universal Dependencies formats 1.3 and 1.4. The part-of-speech tagging accuracies achieved are 98.34%, 98.49% and 97.60%, respectively. The results are slightly higher than the Google SyntaxNet accuracies and higher than the accuracies of systems based only on Bidirectional Long Short-Term Memory models. At the MorphoRuEval competition the method took third place.
We present and evaluate neural network models for semantic role labeling of texts in Russian. The benchmark for evaluation and training was prepared on the basis of the FrameBank corpus. The paper addresses different aspects of learning a neural network model for semantic role labeling on different feature sets including syntactic features acquired with the help of SyntaxNet. In this work, we rely on architecture engineering and atomic features instead of commonly used feature engineering. We investigate the ability of learning a model for labeling arguments of “unknown” predicates that are not present in a training set using word embeddings as features for the replacement of predicate lemmas. We publish the prepared benchmark and the models. The experimental results can be used as a baseline for further research in semantic role labeling of texts in Russian.
The paper demonstrates the possibilities of “one-direction analysis” in contrastive studies based on the parallel corpora. We assume that one may regard translation equivalents and paraphrases of a linguistic unit extracted from real translated texts as a source of information about its semantics; translations into Russian may be even more revealing in this respect. Our methods are based on the above hypothesis: we correlate the explications of the discourse words given in earlier studies with the “stimuli to translation”, that is, fragments of the original text that may cause the appearance of these discourse words in a Russian translation as a “reaction” to those “stimuli”. Using this methodology, we seek to validate, disprove or improve the semantic analysis of these words made without recourse to electronic corpora. The results of the analysis of the Russian discourse words eshche, vidimo, po-vidimomu and vidno are set forth.
In this paper we apply network analysis to the study of literature. At the first stage of our investigation we automatically extract networks (graphs) of characters for each part of Leo Tolstoy’s novel War and Peace using two different techniques for network creation. Then we evaluate these two techniques against a set of manually created gold standard networks. Finally, we use the method that demonstrated better performance in our evaluation to test a literary hypothesis about Tolstoy’s novel. The hypothesis we intended to prove was that the parts of the novel describing war (i.e. those where the battlefield or military units are the primary settings) have a statistically lower density of interaction between characters, resulting in lower network density, higher network diameters and lower average node degrees. By showing this correlation we mean to demonstrate the applicability of network analysis to computational research on fictional narrative (e.g. detection of tension changes in the plot).
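The three network measures named in the hypothesis (density, diameter and average node degree) can be computed as in the sketch below for an undirected, unweighted character network. The abstract does not describe how the networks were extracted or whether edges were weighted, so the edge lists are hypothetical toy data and the unweighted treatment is an assumption.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (character, character) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def density(graph):
    """Share of possible character pairs that are actually connected."""
    n = len(graph)
    if n < 2:
        return 0.0
    edge_count = sum(len(neighbours) for neighbours in graph.values()) / 2
    return 2 * edge_count / (n * (n - 1))


def average_degree(graph):
    """Mean number of distinct interaction partners per character."""
    return sum(len(neighbours) for neighbours in graph.values()) / len(graph)


def diameter(graph):
    """Longest shortest path, found by breadth-first search from every node.

    Pairs of characters with no connecting path are ignored.
    """
    longest = 0
    for source in graph:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        longest = max(longest, max(distance.values()))
    return longest


# Hypothetical toy networks, not extracted from the novel.
peace_part = build_graph([("Pierre", "Natasha"), ("Pierre", "Andrei"),
                          ("Natasha", "Andrei"), ("Natasha", "Sonya")])
war_part = build_graph([("Andrei", "Kutuzov"), ("Kutuzov", "Bagration"),
                        ("Bagration", "Tushin")])

for name, graph in (("peace", peace_part), ("war", war_part)):
    print(name, round(density(graph), 3), diameter(graph), average_degree(graph))
```

In this toy example the chain-like "war" network has lower density, a larger diameter and a lower average degree than the "peace" network, which is the pattern the hypothesis predicts.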
The role of orthographic neighbors (e.g. bank – tank) in word processing has been discussed in many experimental studies. However, these studies have been conducted on a limited pool of languages, and many important questions are still unresolved. After creating StimulStat, a lexical database that contains various neighborhood parameters for Russian, we conducted the first experiment with substitution neighbors in Russian. We used a lexical decision task with priming and manipulated the following factors: whether the prime is more or less frequent than the target, whether the prime is a nominative singular (primary) form or an oblique form, and whether the substituted letter is word-final or in the middle of the word. The results suggest that noun forms undergo morphological decomposition at a very early stage and shed new light on the process of activating candidates during lexical access. The results also have practical significance because it is well known that spelling errors are influenced by neighborhood effects.
The paper presents a methodology and preliminary results for evaluating plagiarism detection algorithms for the Russian language. We describe the goals and tasks of the PlagEvalRus workshop, dataset creation, evaluation setup, metrics, and results.
The paper presents ParaPlag, a large text dataset in Russian for evaluating and comparing the quality metrics of different plagiarism detection approaches that deal with big data. The PlagEvalRus-2017 competition, which aims to evaluate plagiarism detection methods, uses ParaPlag as the main dataset for the source retrieval and text alignment tasks. ParaPlag is open and available on the Web. We propose a guide for writers who want to contribute to ParaPlag and extend it. We also present an analysis of the text rewriting techniques used by unscrupulous authors.
MorphoRuEval-2017 is an evaluation campaign designed to stimulate the development of automatic morphological processing technologies for Russian, both for normative texts (news, fiction, nonfiction) and those of a less formal nature (blogs and other social media). This article compares the methods participants used to solve the task of morphological analysis. It also discusses the problem of unifying the various existing training collections for the Russian language.
Different language versions of bilateral state treaties are supposed to be equally authentic and have identical meaning. This belief looks problematic from the linguistic point of view, at least on the sentence level and below. In this article, we discuss the use of negation markers in the Russian and Finnish versions of the state treaties between Russia and Finland. The research material is the Russian-Finnish subcorpus of the PEST corpus (Parallel Electronic Corpus of State Treaties) that includes all treaties starting from 1917, the year of the Russian revolution and Finland’s independence. We found that in the Russian versions of the documents explicit negation markers (particles, prepositions, etc.) are used much more frequently than in the Finnish ones, while implicit negation markers (verbs and nouns expressing prohibition, refusal, etc.) are more typical of the Finnish versions. This phenomenon was discovered by comparing frequencies of translation equivalents in the Russian and Finnish versions. The results were confirmed by studying parallel concordances: numerous examples of explicit negation markers being replaced by implicit ones were found. This phenomenon can be explained by the differences between the two languages, the differences in diplomatic discourse conventions, and the possible prevalence of translating the treaties from Russian into Finnish over composing texts directly in Finnish. At the current stage of the research, it is not yet clear which of these factors have an impact on the choice of negation markers and to what degree.
In this article we validate two measuring methods: Levenshtein distance and word adaptation surprisal as potential predictors of success in reading intercomprehension. We investigate to what extent orthographic distances between Russian and other East Slavic (Ukrainian, Belarusian) and South Slavic (Bulgarian, Macedonian, Serbian) languages found by means of the Levenshtein algorithm and word adaptation surprisal correlate with comprehension of unknown Slavic languages on the basis of data obtained from Russian native speakers in online free translation task experiments. We try to find an answer to the following question: Can measuring methods such as Levenshtein distance and word adaptation surprisal be considered as a good approximation of orthographic intelligibility of unknown Slavic languages using the Cyrillic script?
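As an illustration of the first of the two measures, the sketch below computes the standard Levenshtein (edit) distance between two orthographic forms. The normalization by the length of the longer word is an assumption made here for illustration only; the abstract does not state which normalization, alphabet mapping or cost scheme the authors use. Word adaptation surprisal is not covered by this sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Distance divided by the longer word's length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


# Russian "хлеб" vs. Ukrainian "хліб" ('bread'): one substituted letter.
print(levenshtein("хлеб", "хліб"), normalized_levenshtein("хлеб", "хліб"))
```

A lower normalized distance would then be read as a prediction of easier comprehension, which is the relationship the study sets out to test against the translation data.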
Coreference resolution aims at grouping textual references denoting the same real-world entities into clusters. Many state-of-the-art results have already been obtained for coreference resolution in European languages, but for Russian this area is still quite novel and underexplored. With this paper we try to fill this gap. Our article reviews existing approaches and presents their adaptation to the Russian language. We carry out a series of experiments to estimate the efficiency of the various machine learning methods and features used by the algorithms. Additionally, we propose a novel feature for the head detection subtask, which is based on clustering of word embeddings. As a result, we established a baseline implementation for the problem of coreference resolution in Russian. The key features of the developed approach are simplicity and extensibility. The presence of such a baseline opens many research directions for improving the quality of the algorithms; some potential improvements are already pointed out in this paper. We expect further work in this area to significantly raise the current level of state-of-the-art results for Russian coreference resolution, making it practically applicable in the near future.
We present a neural network semantic parser for the Russian language that utilizes a new copying mechanism, supervision of intermediate layers and explicit handling of the hierarchical nature of the output by means of RNN blocks operating on different timeframes. Due to the lack of a standard Russian dataset for validating semantic parsers, we develop our own small dataset in the domain of logistics and task management and demonstrate that our model can obtain good results on this dataset, despite its very limited size.
This paper presents the results of our experiments on building a general coreference resolution system for Russian. The main aim of those experiments was to set a baseline for this task for Russian using the standard set of features developed and tested for coreference resolution systems created for other languages. We propose several baseline systems, both rule-based and ML-based. We show that adding some semantic information is crucial for the task and that even a small amount of data can improve the overall result. We show that different types of semantic resources affect the performance differently and that sometimes more does not imply better.
The work deals with the ordering of multiple prenominal adjectives within a DP/NP in Russian. The aim is to examine the claim concerning the word order hierarchy for adjectives suggested in [Cinque 1994] via quantitative analysis of corpus data from the Russian National Corpus (RNC). The following hierarchy was checked: possessive > quantity > order > quality > size > shape > colour > nationality. We use the RNC semantic annotation taxonomy to determine the semantic class of an adjective. We checked the pairwise orders for different semantic classes (e.g. possessive > colour vs. colour > possessive). Our data confirm the claim (suggested in the experimental studies) that the possessive adjectives should be divided into two classes: ‘referential’ possessives (e.g. adjectives with suffix –in, Petin ‘Peter’s’) and so-called ‘generic’ possessives (e.g. chicken breast). The latter occupy the position closer to the noun. There is one case of significant hierarchy violation: the referential possessive adjectives occupy the intermediate position between gradable adjectives (namely, after ‘colour’) and ungradable ones (‘nationality’ class). Hence, the determiner-like possessives do not occupy the left-most position, as they do in languages with articles. These findings serve as additional evidence in favor of the theoretical claim that determiner-like adjectives in articleless languages acquire a determiner function only in the left-most position; otherwise they function as other adjectives. The corpus data confirm the claim that a change in the structural position of an adjective induces coercion of its semantic type. The detection of expected word-order violations can be helpful in word sense disambiguation and for error detection in the RNC semantic annotation.
In recent years, distributional semantics has shown a trend towards a deeper understanding of what semantic relatedness is and what it is composed of. This is attested, in particular, by the emergence of new gold standards like SimLex999, WS-Sim and WS-Rel. Evidence from cognitive psychology suggests that humans distinguish between two basic types of semantic relations: category-based similarity and thematic association. The paper presents a distributional model capable of differentiating between these relations, and a dataset consisting of 500 similar and 500 associated pairs of nouns that can be used for evaluation of such models.
A semantic word network is a network that represents the semantic relations between individual words or their lexical senses. This paper proposes WATLINK, an unsupervised method for inducing a semantic word network (SWN) by constructing and expanding the hierarchical contexts using both the available dictionary resources and distributional semantics methods for is-a relations. It has three steps: context construction, context expansion, and context disambiguation. The proposed method has been evaluated on two different datasets for the Russian language. The former is a well-known lexical ontology built by a group of expert lexicographers. The latter, LRWC (“Lexical Relations from the Wisdom of the Crowd”), is a new resource created using crowdsourcing that contains both positive and negative human judgements for subsumptions. The proposed method outperformed the other relation extraction methods on both datasets according to recall and F1-score. Both the implementation of the WATLINK method and the LRWC dataset are publicly available under libre licenses.
The study is dedicated to contradictions A ≠ A in Russian, also known in the literature as ‘negated tautologies’. They are often viewed as derived from equative tautologies A = A. Here I describe structural and semantic features of negated tautologies that are established in Russian with the patterns X ne X ‘X is not X’, X ne est’ X ‘X is not X’ and X – eto ne X ‘X this is not X’. Such constructions show that the speaker is not able to use the corresponding tautology X is X because (a) the referent of a linguistic expression X does not belong to the category x; (b) characteristics of the referent of a linguistic expression X or an attitude towards it differ from the norm; (c) the linguistic expression X is not employed in its common, literal meaning. In addition, negated tautologies are compared to the similar Russian constructions Sdat + X ne X2 ‘Sdat + X is not X’ and X ne v X2 ‘X is not in X’, and to the tautologies X est’ X ‘X is X’ and X – eto X ‘X this is X’.
The paper presents the results of using computer tools and of designing an inspection program for the purposes of the automated and semi-automated syntactic, lexical, and grammar error analysis of student essays in a learner corpus. The texts in the corpus were written in English by Russian learners of English. In our experiment we compare the parameters of the essays graded by professional examiners as the best and those graded the lowest in the pool of about 2000 essays. At the first stage in the experiment we applied a syntactic tool for parsing the sentences and collected data regarding mean sentence depth and the average number of relative, other adnominal, and adverbial clauses, then analyzed the results of lexical observations in those texts (such as average word length, number of academic words, number of linking words and some others), and finally collected the statistics related to the errors pointed out in manual expert annotation. The parameters that had very different values for the “good” and for the “bad” essays are regarded by the authors as worthy parts of the feedback a student can get for the text uploaded into the learner corpus.
This paper addresses the question of what the intonation of enumeration in Russian is. Our main result is that there is no universal intonation of enumeration in isolation from illocutionary meanings (i.e. the meanings forming a speech act, such as topics or foci) or discourse meanings (i.e. the meanings forming coherent discourse). Prosodic patterns of enumeration are based on a variety of pitch accents (which originally express the meanings of the topic, the focus, the discourse continuity) aligned with either homogeneous parts of the sentence or homogeneous sentences. The most common (but not the only) prosodic patterns of enumeration are represented by pitch accents IK-6, IK-2, IK-3, and IK-4 in the terminology of E.A. Bryzgunova. We argue that enumeration has a field semantic structure. The nucleus of the semantic field of enumeration consists of the meanings expressed by pitch accents IK-6, IK-2, IK-3, and IK-4. The semantic periphery of enumeration consists of such meanings as a roll call, or a specific Russian strategy of splitting a complex sentence into autonomous segments of information. The peripheral meanings as well as the nuclear meanings are expressed by a variety of pitch accents. For the investigation, a small working corpus of Russian speech recordings was set up. The corpus comprises recordings retrieved from the Multimodal subcorpus of the Russian National Corpus. The paper is illustrated throughout with frequency tracings of the spoken patterns of enumeration. The software program Praat was used to analyze the audio data.
The paper deals with collocation extraction from corpus data. A collocation is understood here as a special type of set phrase. Many modern authors and most corpus linguists understand collocations as statistically determined set phrases. This approach is the starting point of the paper, which aims to evaluate various statistical methods of automatic collocation extraction. There are several ways to calculate the degree of coherence between the parts of a collocation. A large number of formulae have been created to integrate different factors that determine the association between the collocation components. Usually, such formulae are called association measures.
We describe experiments whose objective was to study collocation extraction based on statistical association measures. We extracted collocations for the word вода (water) and some others by means of the Collocations tool of the NoSketch Engine system using 7 association measures. It is important to stress that the experiments were conducted on representative corpora and that a large number of the resulting collocations were examined. The data on the precision of the measures suggest that, when collocation extraction is not used for some special purpose, measures such as MI.log_f, log-Dice, and minimum sensitivity should be used. No measure is ideal, which is why various ways of integrating them are desirable and useful. We propose a number of parameters that allow collocates to be ranked in an integrated list, namely an average rank, a normalised rank and an optimised rank.
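To make the notion of an association measure concrete, the sketch below computes three measures from raw corpus frequencies and then combines several rankings into one list by averaging rank positions. The formulas are the commonly cited definitions of MI, log-Dice and minimum sensitivity; the abstract itself gives no formulas, so treating these as the variants used in the experiments is an assumption. The `average_rank` helper is a hypothetical, simplified reading of "average rank" (ties are broken arbitrarily) and is not the authors' definition.

```python
import math


def mi_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Mutual information: log2 of observed vs. expected co-occurrence.

    f_xy: co-occurrence frequency, f_x / f_y: word frequencies, n: corpus size.
    """
    return math.log2(f_xy * n / (f_x * f_y))


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """log-Dice: 14 + log2 of the Dice coefficient; independent of corpus size."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def min_sensitivity(f_xy: int, f_x: int, f_y: int) -> float:
    """Minimum sensitivity: the smaller of the two conditional proportions."""
    return min(f_xy / f_x, f_xy / f_y)


def average_rank(scores_by_measure: dict) -> dict:
    """Map {measure: {collocate: score}} to {collocate: mean rank position}.

    Rank 1 is the highest score within each measure.
    """
    ranks = {}
    for scores in scores_by_measure.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for position, collocate in enumerate(ordered, start=1):
            ranks.setdefault(collocate, []).append(position)
    return {c: sum(r) / len(r) for c, r in ranks.items()}


# Invented frequencies for two hypothetical collocates of one node word.
freqs = {"c1": (10, 100, 50), "c2": (4, 100, 2000)}  # (f_xy, f_x, f_y)
corpus_size = 10_000
scores = {
    "MI": {c: mi_score(*f, corpus_size) for c, f in freqs.items()},
    "logDice": {c: log_dice(*f) for c, f in freqs.items()},
    "min_sens": {c: min_sensitivity(*f) for c, f in freqs.items()},
}
print(average_rank(scores))
```

Averaging rank positions rather than raw scores sidesteps the fact that the measures live on different scales, which is one plausible motivation for rank-based integration.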
This paper aims at measuring the active vocabulary of Russian predicatives licensing dative-predicative structures (DPS) and presents the results of a sociolinguistic experiment paired with a corpus study for the same set of stimuli. The sociolinguistic experiment aimed at measuring the active DPS vocabulary in the idiolects of 18 native speakers of Russian and produced a sample of 422 stimuli ranked according to their acceptance rate. The same set of stimuli was tested on data from the Russian National Corpus, which produced the second sample based on the frequency ranks of DPS clauses in the diagnostic context мне Z-во, with the priority dative argument placed contiguously to the predicative at the distance <-1:1> in the same clause. The correlation of the two ranked samples predicts the volume of the DPS vocabulary in an average idiolect of Russian and helps to establish the proportion between the lexical nucleus of the DPS construction and its extension. The analysis verifies the hypothesis that native speakers of Russian apply the same underlying principles of semantic selection for DPS elements while using non-identical sets of lexical elements. The paper introduces a formal model of lexical extension, with the entire DPS vocabulary grouped into 15 thematic classes. The results show that each thematic class includes high-, middle- and low-frequency DPS elements. Low-frequency DPS predicatives are modelled after high-frequency elements from the same thematic class.
The paper describes an approach to plagiarism detection within the PlagEvalRus-2017 competition. Our system leverages deep parsing techniques to detect moderately disguised plagiarism. We participated in both tracks of the competition: source retrieval (source detection) and text alignment (paraphrased plagiarism detection). The datasets of both tracks contain various cases of plagiarism, which differ in the level of disguise applied when reusing text. The results show that our method performed quite well in detecting moderately disguised forms of plagiarism.