In this paper we compare different models for measuring synonymy. We consider methods based on monolingual text corpora and parallel texts. We experiment with features based on context similarity, translation similarity, and similarity of neighbors in parse trees. We provide an analysis of the strong and weak points of the different approaches and show that their combination can improve the results. The considered methods can handle large-scale vocabularies and may be useful for the automatic construction of human-oriented synonym dictionaries.
The paper considers the Russian synonymic verbs ischeznut’ ‘to disappear’ and propast’ ‘to vanish’ and analyzes the semantic differences which motivate their differences in syntactic, aspectual and collocational properties, as well as in their polysemies. The semantic oppositions that distinguish between these two verbs, namely the type and referential status of the disappearing object, the cause of disappearance, the speaker’s expectations, the speed and completeness of disappearance, and the presence of an observer, can be applied to the analysis of the entire semantic domain of the ‘end of existence’.
The paper considers the senses of the Russian adjective poslednij ‘last’. Its polysemy is analyzed as deriving from a certain core semantic structure that is common to all its meanings. The core structure has two semantic valencies: that of a sequence and that of a sequence element. Modifications of the core structure, including additional valencies (point of reference and landmark), account for its polysemy, as well as for the diversity of its collocational and syntactic properties. The paper also demonstrates the role of pragmatics and the lexicalization of grammatical and syntactic forms in disambiguating different meanings of poslednij, against the backdrop of its English correlates.
Although there exist comprehensive morphologically annotated corpora for many morphologically rich languages, there have been no such corpora for any polysynthetic language so far. Developing a corpus of a polysynthetic language poses a range of theoretical and practical challenges for corpus linguistics. Some of these challenges have been partly addressed when developing corpora for languages with extensive morphological inventories and numerous productive derivation models, such as Turkic or Uralic languages, while others are unique to this type of language. As we are currently working on a corpus of the polysynthetic West Circassian language, we had to identify these challenges and propose theoretical and practical solutions. These include the tokenization problem, which involves delimiting morphology from syntax, the problems of lemmatization and part-of-speech tagging, and a number of glossing and search issues. The solutions proposed in the paper are partly implemented and will be available for public testing when the preliminary version of the corpus is released.
The paper presents an evaluation of three neural-network-based approaches to the Twitter sentiment analysis task at SentiRuEval-2016. The task focuses on sentiment classification of tweets about banks and telecommunication companies.
Our team submitted three solutions based on different supervised classifiers:
a Gated Recurrent Unit neural network (GRU), a convolutional neural network (CNN), and an SVM classifier with domain adaptation combined with the previous two classifiers. We used vector representations of words obtained with the word2vec model as features for the classifiers. The classifiers were trained on labeled data provided by the organizers of the evaluation. Additionally, we collected several million posts and comments from social networks for training the word2vec model.
According to the evaluation results, the GRU-based solution shows the best macro-averaged F1-score in both domains (banks and telecommunication companies) and also has the best micro-averaged F1-score in the banks domain among all solutions submitted to SentiRuEval.
The paper deals with linguistic disfluencies (hesitations, repetitions, revisions, false starts, and incomplete utterances) in Russian-speaking language-impaired (N=12) vs. typically developing (N=12) preschoolers. The corpus-based study aimed at evaluating and comparing linguistic disfluency in narrative vs. dialogue discourse within and between the groups. Following the Russian Assessment Instrument for Narrative (RAIN) methodology, each subject performed two tasks, i.e. storytelling and story retelling according to wordless picture sequences; each task was followed by a structured dialogue based on ten comprehension questions. Both narratives and dialogues were transcribed and annotated for automated linguistic analysis. Finally, individual measures (the number of each category of disfluencies per utterance) were estimated and submitted to statistical analysis.
The results of our study indicate that linguistic disfluencies are caused mainly by distinct strategies of speech production, which depend on the subject’s level of language competence, cognitive resources, and the circumstances of narrative and dialogue production.
The paper discusses different modes of evaluation in Russian. Evaluation is considered as a speech act based on a cognitive procedure of the following form: (i) evaluating an object X as possessing a feature q consists in comparing the parameter Q with X and singling out q as the value of Q for the argument X; (ii) the feature q presupposes recommendations for decision making in connection with the object X. The cognitive procedure of describing an object X as possessing a feature q does not presuppose any recommendation for decision making.
In some discursive modes the semantics of evaluation loses its force, or at least that force is weakened. A discursive mode is defined as a sphere of functioning of speech forms in discourse in which their meaning regularly changes. Different discourses allow different kinds of discursive modes. The paper discusses the following discursive modes that modify evaluative force: irony, language game, common nomination, and indefinite reference.
Our paper deals with the rapidly developing area of corpus linguistics referred to as Web as Corpus (WaC), i.e., the creation of very large corpora composed of texts downloaded from the web. Some problems of the compilation and usage of such corpora are addressed, most notably the “language quality” of web texts and the inadequate balance of web corpora, the latter being an obstacle both for corpus creators and for their users. We introduce the Aranea family of web corpora, describe the various processing procedures used during its compilation, and present an attempt to increase the size of its Russian component by an order of magnitude. We also compare its contents, from the user’s perspective, across the various sizes of the Russian Aranea, as well as with other large Russian corpora (RNC, ruTenTen and GICR). We also intend to demonstrate the advantage of a very large corpus for the linguistic analysis of low-frequency language phenomena, such as the usage of idioms and other types of fixed expressions.
We describe and compare two tools for processing Middle Russian texts. Both tools provide lemmatization, part-of-speech and morphological annotation. One (“RNC”) was developed for annotating texts in the Russian National Corpus and is rule-based. The other (“TOROT”) is being used for annotating the eponymous corpus and is statistical. We apply the two analyzers to the same Middle Russian text and then compare their outputs with high-quality manual annotation. Since the analyzers use different annotation schemes and spelling principles, we have to harmonize their outputs before we can compare them. The comparison shows that TOROT performs considerably better than RNC (lemmatization 69.8% vs. 47.3%, part of speech 89.5% vs. 54.2%, morphology 81.5% vs. 16.7%). If, however, we limit the evaluation set to those tokens for which the analyzers provide a guess and additionally consider an RNC response correct if one of the multiple guesses it provides is correct, the numbers become comparable (88.5% vs. 91.9%, 93.9% vs. 95.2%, 81.5% vs. 86.8%). We develop a simple procedure which boosts TOROT lemmatization accuracy by 8.7% by using RNC lemma guesses when TOROT fails to provide one and matching them against the existing TOROT lemma database. We conclude that a statistical analyzer (trained on large material) can deal with non-standardised historical texts better than a rule-based one. Still, it is possible to make the analyzers collaborate, boosting the performance of the superior one.
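The fallback step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation; the function and data names are invented, and the toy lemmas are purely illustrative:

```python
# Sketch of the fallback procedure: when the statistical analyzer ("TOROT")
# fails to lemmatize a token, take the rule-based analyzer's ("RNC") lemma
# guesses and keep only those already present in the TOROT lemma database.

def combine_lemmas(torot_lemma, rnc_guesses, torot_lemma_db):
    """Return a lemma, preferring TOROT and falling back to RNC guesses."""
    if torot_lemma is not None:
        return torot_lemma
    # Match RNC guesses against the existing TOROT lemma inventory.
    for guess in rnc_guesses:
        if guess in torot_lemma_db:
            return guess
    return None  # neither analyzer produced a usable lemma

# Toy example (invented tokens and lemmas):
db = {"byti", "zhena", "grad"}
print(combine_lemmas(None, ["zheno", "zhena"], db))  # falls back to an RNC guess
print(combine_lemmas("grad", ["gradu"], db))         # TOROT answer kept
```

The matching against the TOROT lemma database is what keeps the rule-based analyzer's multiple guesses from introducing lemmas outside the statistical analyzer's inventory.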
We introduce a pattern-based approach to semantic relation retrieval and semantic modeling. Our method relies on a general-knowledge lexical semantic network built, shaped, and maintained through crowdsourcing and GWAPs (games with a purpose). Implementing constraints on the semantic relations available in the network not only increases the efficiency of the relation extraction process but also opens a semantic modeling perspective. In terms of (mostly horizontal) relation extraction, we tested our method on radiology reports in French. Our results show the value of using a general-knowledge lexical semantic network for domain-specific textual analysis, as well as the value of implementing series of constraints on semantic relations for relation retrieval. We recently turned to the analysis of cooking recipes, which serve as examples of domain-specific instructional texts. Thus, in addition to semantic relation discovery, we are building a method for the semantic modeling and conceptualization of cooking instructions. Its first results are presented below. At present, our results are available for French, but we aim to extend the lexical network coverage to other languages in the next few years.
The primary goal of the present study is to improve methods for contrastive corpus investigations. Our data is the Russian construction дело в том, что and its parallels in English, German and Swedish. This construction, which appears to present no difficulty for translation into other languages, is in fact language-specific with respect to at least one parameter. It displays a large number of different parallels (translation equivalents) in other languages, and possesses a complex semantic structure. The configuration of semantic elements comprising the content plane of this construction is unique. The empirical data have been collected from the corpus query system Sketch Engine, subcorpus OPUS2 Russian, and the Russian National Corpus (RNC). The analysis shows that the construction дело в том, что has more than 50 parallels in English, over 30 in German, and about 30 in Swedish. In all three languages the most common means
of translating the construction is to omit it. Also frequent are the English equivalents the fact/thing/point/truth is (that); (it’s/this/that is) because; the German expressions nämlich; die Sache ist, die; denn; and the Swedish constructions saken är den att; problemet/faktum är att. The semantic structure of дело в том, что includes the following components: 1) substantiation of something stated previously; 2) indication of the reason something has happened; 3) emphasis on the significance of what has been stated. The different translations of the construction are motivated by the fact that each specific context focuses on one of these meanings.
This paper presents an algorithm for generating a domain-specific Russian sentiment dictionary using a graph model. It is important to emphasize that the described algorithm does not require any human labeling, only a sufficiently large corpus of Russian texts from the subject area, which can be collected automatically for most domains. Our algorithm is not strictly confined to the Russian language and, if necessary, can be generalized to develop dictionaries for other languages.
Dictionaries of positive and negative words are created by analyzing a graph constructed on an unlabeled corpus of domain-specific Russian texts. The graph was built using the approach described in , pre-adapted to texts in Russian. The applicability of this method for building a graph to predict the polarity of adjectives in Russian-language reviews is evaluated experimentally.
An original method of graph processing for splitting the vertex set of this graph into subsets of positive and negative words was proposed and implemented. The algorithm starts by gathering a small seed set of adjectives whose polarity is unambiguous irrespective of the subject area (for example, “bad”, “good”, “terrible”, “excellent”).
Then words are distributed iteratively: at each step a vertex is added to the set with whose existing vertices it is most strongly associated. Several weighting functions on the edges were compared, as well as functions of attraction to the sets of positive and negative words, with the aim of composing the most accurate dictionaries of positive and negative adjectives for a specific subject area.
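The iterative distribution step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the edge weights, the attraction function (sum of edge weights to a set), and all words in the toy graph are invented for the example:

```python
# Sketch of the iterative seed-set expansion: starting from unambiguously
# positive and negative seed adjectives, repeatedly attach the unassigned
# vertex that is most strongly attracted to one of the two sets.

def expand_seed_sets(graph, pos_seed, neg_seed):
    """graph: dict mapping (u, v) -> edge weight; treated as symmetric."""
    pos, neg = set(pos_seed), set(neg_seed)
    words = {w for edge in graph for w in edge}
    unassigned = words - pos - neg

    def attraction(word, target):
        # Assumed attraction function: sum of edge weights to the target set.
        return sum(graph.get((word, t), 0) + graph.get((t, word), 0)
                   for t in target)

    while unassigned:
        # Pick the unassigned vertex with the strongest attraction to either set.
        word, a_pos, a_neg = max(
            ((w, attraction(w, pos), attraction(w, neg)) for w in unassigned),
            key=lambda x: max(x[1], x[2]))
        if a_pos == a_neg == 0:
            break  # remaining words are disconnected from both sets
        (pos if a_pos >= a_neg else neg).add(word)
        unassigned.discard(word)
    return pos, neg

# Toy co-occurrence graph (invented weights):
graph = {("good", "great"): 2, ("bad", "awful"): 3, ("great", "nice"): 1}
pos, neg = expand_seed_sets(graph, {"good"}, {"bad"})
```

Comparing several weighting and attraction functions, as the abstract describes, amounts to swapping out the `attraction` definition above and measuring the accuracy of the resulting dictionaries.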
This study contributes to the field of multimodal linguistics. Multimodal linguistics explores the numerous channels involved in natural communication, such as verbal structure, prosody, gesticulation, facial expressions, eye gaze, etc., and treats them as parts of an integral process. Among the key issues in multimodal studies is the question of temporal coordination between illustrative manual gestures (that is, spontaneous co-speech gestures) and elementary discourse units (that is, the basic quanta of the local structure of spoken discourse). We address this issue with the help of a novel multimodal corpus, “Pear chats and stories”, which is currently under construction. It has been shown in a number of studies that gesture onset usually precedes speech onset. In order to verify this claim on our materials, we developed an analytic method that allowed us to conduct a more detailed study. According to our results, fewer than half of all gestures are produced before the corresponding fragment of talk. The most likely explanation of these results involves a gesture’s affiliation with a certain functional class, which is strongly dependent on discourse genre and speakers’ individual differences.
In this paper we show that using deep textual parsing, that is, extracting complex features such as the syntactic and discourse structures of a text, helps to improve the quality of style and genre classification. These results challenge the claims of many researchers who have repeatedly stated that using syntactic or morphological patterns for style and genre classification results in poor precision and recall. The best practice so far has been to use n-gram patterns for this type of text classification problem. Syntactic and discourse structures, however, allow capturing style- and genre-specific patterns of texts and reaching an average precision higher than 95% on binary multi-genre classification.
In this paper, we use evidence from the Multimodal Russian Corpus (MURCO) to explore gesture properties that enable distinction between perfective and imperfective Russian verbs. The properties identified are duration, repetition, and energy. We show that repetition and energy differentiate perfective and imperfective verbs because these properties are salient in gestures accompanying one group of verbs but are not manifest in gestures accompanying verbs in the other group. Gesture duration, on the other hand, can be used to identify either aspect.
The paper discusses the problem of formal variability of Russian two-part correlative connectors, using the example of ne to chtoby…no and ne to chto…a. The results of the analysis, carried out with both formal and functional-semantic criteria, allow us to state that ne to chtoby…no and ne to chto…a are two separate linguistic units, the first expressing substitution aimed at greater descriptive adequacy and the second expressing, beyond that, substitution aimed at greater argumentative relevance. This semantic difference is due to the different scope of the negative particle ne, which is part of both markers, even though in both cases the gradation is rising. The position of ne to chtoby and ne to chto is not fixed, and ne to chtoby is subject to phonetic (ne to chtob) and morphological (ne tak chtob(y)) variability. Since the forms ne to chtoby and ne to chto can express relations of substitution on their own, they may be considered basic or minimal markers of such relations. The use of these forms as two-part correlative connectors with the adversative conjunctions no and a and other lexical units is dictated by the speaker’s communicative intention, the syntactic construction and other discursive parameters. Russian National Corpus data confirm our claims.
When words have several senses, it is important to describe them properly in a dictionary (a lexicographic task) and to be able to distinguish them in a given context (a computational linguistics task, WSD). Different senses normally have different frequencies in corpora. We introduce several techniques for determining sense frequency based on dictionary entries matched with data from large corpora. Information about word sense frequency is useful not only for explanatory lexicography and WSD; it may also enrich language learning resources. Learners of a foreign language who encounter a word similar to one in their native language are often tempted to assume that the foreign word and its equivalent have the same meaning structure. Sometimes, however, this is not the case, and the most frequent sense of a word in one language may be much less frequent for its cognate. We propose a method for detecting such cases. Having selected a set of Russian words included in the Active Dictionary of Russian that have more than two dictionary senses and have cognates in English, we estimated the frequencies of the English and Russian senses using SemCor and the Russian National Corpus respectively, matched the senses in each pair of words and compared their frequencies. Thus we revealed cases in which the most frequent senses, and whole meaning structures, differ substantially across the languages, and studied them in more detail. This technique can be applied not only to cognates, but also to pairs of words which dictionaries usually offer as translation equivalents of each other.
This paper provides an alternative method for extracting object-based sentiment in text messages, based on a modification of the method previously proposed by Mingbo , in which we first parse the syntax and then correlate the sentiment with the object of analysis (also referred to as the entity; the two terms are used interchangeably in this article). We present two approaches to sentiment polarity classification: syntactic rule patterns and a convolutional neural network (CNN). Even without a domain-specific vocabulary and sophisticated classification algorithms, the rule-based approach achieves an average macro-F1-based rank among the participants, whereas domain-specific vocabularies yield a slightly higher macro-F1 score, still close to the average result. The CNN approach uses syntactic dependencies and linear word order to obtain more extensive information about object relations. The convolution patterns designed in this approach are very similar to the rules obtained with the rule-based approach. In our proposed approach, the neural network was trained with different Word2Vec (WV) models, whose performance we compared against each other. In this paper, we show that learning a domain-specific WV model offers a slight gain in performance. The resulting macro-F1 scores place our approach in the top three of the overall results among the competitors participating in the 2016 SentiRuEval event. We did not submit our results to this competition at the time it was held, but were able to compare them post-hoc. We also combine the CNN approach with the rule-based approach and discuss the resulting differences. All training sets, evaluation metrics and experiments follow SentiRuEval 2016.
The paper describes a new branch of corpus linguistics that deals with building and using very large corpora. We introduce several new large Russian corpora that have recently become available. The paper gives a survey of these corpora and analyzes a number of Russian nouns across corpora of different sizes: the Russian Web corpus by S. Sharoff (187.97 mln tokens), ruTenTen (18.28 bln tokens) and its sample (1.25 bln tokens). The research focuses on a discussion of these corpora, their comparison, and the study of frequency properties of high- and low-frequency Russian nouns, comparing them with data published in the Frequency Dictionary. The analysis shows that the lists presented in the frequency dictionary of Russian differ from the corpus data, depending on the type of noun.
The paper reports some results of research aimed at finding out whether regressive and/or progressive voice coarticulation in clusters of homorganic labiodental consonants /v/ + /v/ in an external sandhi position in Modern Standard Russian may serve as a cue for detecting the location and depth of prosodic breaks.
Combinations of labiodental fricatives /v/ + /v/ at word junctures result in [ff], [vv] or [fv] pronunciation (in decreasing order of frequency) in Modern Standard Russian. The percentage ratio of these pronunciation types depends on the strength of the prosodic break between the two words:
• in the position within an intonation group (no prosodic break), [ff] pronunciation appears fairly stable and accounts for about 70% of all cases, while the percentage of [fv] pronunciation (corresponding to the absence of coarticulation) varies in the range of 1% to 11%;
• in the position around a prosodic break between two word groups, [fv] pronunciation is detected in more than 80% of the cases studied.
The article explores the pragmatic functions of “communicative mummery”, a phenomenon observable in everyday communication that consists in referential shifts within the trivial schema of the communicative act “I talk to You Here and Now”: the speaker can communicate as if he were not himself but someone else, or as if he were talking to another person rather than his real communicative partner. Unlike simple trickery, communicative mummery is not hidden from the speaker’s interlocutor. On the contrary, while undergoing such transfigurations (I speak as if it were not me, etc.), the speaker intentionally uses a range of prosodic and/or non-verbal markers, such as a special gaze, gesturing that is not habitual for him, a specific accent or prosodic contours, to attract his partner’s attention. The discursive manifestations of communicative mummery share some features with reported speech and with polyphonic conversational humor but, at the same time, display their own particular properties and perform rather special functions in conversation. First observed in mother-child communication as a particular maternal practice of child socialization, the phenomenon was also found in heterogeneous speech interactions between adults; these, collected in a corpus of 52 items, served as the data for our analysis. The analysis shows that the main pragmatic function of communicative mummery is to prevent the loss of the speaker’s social face when he violates the social conventions regulating communicative behavior in situations of social enforcement, social guilt or self-praise.
Automatic assessment of sentiment in large text corpora is an important goal in the social sciences. This paper describes the methodology and results of the development of a system for Russian-language sentiment analysis that includes a publicly available sentiment lexicon, a publicly available test collection with sentiment markup, and a crowdsourcing website for such markup. The lexicon is aimed at detecting sentiment in user-generated content (blogs, social media) related to social and political issues. Its prototype was formed on the basis of other dictionaries and of topic modeling performed on a large collection of blog posts. Topic modeling revealed relevant (social and political) topics and, as a result, relevant words for the lexicon prototype and relevant texts for the training collection. Each word was assessed by at least three volunteers in the context of three different texts where the word occurred, while the texts received sentiment scores from the same volunteers as well. Both texts and words were scored from -2 (negative) to +2 (positive). Of 7,546 candidate words, 2,753 received non-neutral sentiment scores. The quality of the lexicon was assessed with the SentiStrength software by comparing human text scores with the scores obtained automatically on the basis of the created lexicon. 93% of texts were classified correctly at the error level of +/-1 class, which closely matches the result of SentiStrength’s initial application to English-language tweets. Negative classes were much larger and better predicted. The lexicon and the text collection are publicly available at http://linis-crowd.org.
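The volunteer-based scoring scheme can be illustrated with a small sketch. The aggregation by mean and the neutrality threshold below are assumptions for illustration, since the abstract does not specify how individual -2..+2 ratings are combined:

```python
# Sketch of aggregating volunteer ratings for a candidate word: each word is
# rated by at least three volunteers on a -2..+2 scale, and only words with a
# non-neutral aggregate rating enter the lexicon.

def aggregate_word_score(ratings, threshold=0.5):
    """Return the mean rating, or None if too few ratings or near-neutral."""
    if len(ratings) < 3:
        return None  # require at least three volunteer judgments
    mean = sum(ratings) / len(ratings)
    # Assumed rule: treat small-magnitude means as neutral and exclude them.
    return mean if abs(mean) >= threshold else None

print(aggregate_word_score([2, 1, 2]))   # clearly positive word
print(aggregate_word_score([0, 1, -1]))  # near-neutral word: excluded
```

Under this scheme, the 2,753 words with non-neutral scores would be exactly those whose aggregate rating clears the neutrality threshold.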
In natural language processing, distributional semantic models are known as an efficient data-driven approach to word and text representation, which allows meaning to be computed directly from large text corpora in the form of word embeddings in a vector space. This paper addresses the role of linguistic preprocessing in enhancing the performance of distributional models, and in particular studies pronominal anaphora resolution as a way to exploit more co-occurrence data without directly increasing the size of the training corpus.
We replace three different types of anaphoric pronouns with their antecedents in the training corpus and evaluate the extent to which this affects the performance of the resulting models in lexical similarity tasks. CBOW and SkipGram distributed models trained on the Russian National Corpus are the focus of our research, although the results are potentially applicable to other distributional semantic frameworks and languages as well. The trained models are evaluated against the RUSSE '15 and SimLex-999 gold standard data sets. As a result, we find that models trained on corpora with pronominal anaphora resolved perform significantly better than their counterparts trained on baseline corpora.
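The substitution step underlying this preprocessing can be sketched as follows. The anaphora resolution itself is assumed to come from an external resolver; only the corpus rewrite is shown, and the example sentence is invented:

```python
# Sketch of rewriting a training corpus so that anaphoric pronouns are
# replaced by their resolved antecedents before embedding training.

def resolve_pronouns(tokens, antecedents):
    """tokens: list of word tokens; antecedents: {token index: antecedent}."""
    # Replace each pronoun position listed in `antecedents` with its
    # antecedent; all other tokens pass through unchanged.
    return [antecedents.get(i, tok) for i, tok in enumerate(tokens)]

sentence = ["Maria", "bought", "a", "book", ";", "she", "read", "it"]
resolved = resolve_pronouns(sentence, {5: "Maria", 7: "book"})
print(resolved)  # pronouns replaced by their antecedents
```

After this rewrite, a word such as "Maria" co-occurs with "read" directly, which is the extra co-occurrence signal the paper exploits without enlarging the corpus.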
Sentiment lexicons are an important part of many sentiment analysis systems. There are many ways to build such lexicons automatically, but the resulting lexicons are often too large and contain errors.
The paper presents an algorithm for creating sentiment lexicons for a given domain based on a hybrid (manual and corpus-based) approach. This algorithm is used to develop sentiment lexicons, each built by four human annotators, for five domains: user reviews of restaurants, cars, movies, books and digital cameras. The created sentiment lexicons are analyzed for inter-annotator agreement, part-of-speech distribution and correlation with automatic lexicons.
The performance of sentiment analysis based on the created sentiment lexicons is evaluated and compared with the performance of existing sentiment lexicons. Experiments with text corpora in various domains based on SVM show the high quality and compactness of the human-built lexicons.
The problem of speaker identification in whispered speech is of interest for cognitive science, as well as for forensic and language minority studies. The work is devoted to an experimental study of the problem of familiar and unfamiliar speaker identification in voiced speech and in whisper. The experiment simulated in some respects the task and conditions of identification of a speaker by a listener. The initial number of participants (18 persons) was increased with the help of a questionnaire survey on the Internet; the number of remote listeners amounted to 125 people. The experiment took place online. Each subject listened to and identified at least 16 pairs of recordings. At the end of the experiment, the subject indicated whether he had recognized any of the speakers. It was found that the main clues for correct recognition of a familiar speaker are the individual characteristics of his articulation, which are components of the extraverbal part of his "speech portrait". Other features, such as individual speech style and individual manner of pausing and text macrosegmentation, did not have any significant effect on speaker identification in whispered speech.
The paper deals with the interaction between natural language, together with its analogue, the natural-like language of geometry, and the language of geometric sketches within two domains of intellectual activity. These are (1) synthesis and analysis of natural language texts together with the corresponding non-verbal signs, and (2) oral and written multimodal communicative acts. Some linguistic (morphological, syntactic and semantic) as well as semiotic peculiarities (the use of special signs, font markup, color, etc.) of these languages are discussed. The correspondence between some fragments of the natural-like language of geometry and some sketches is established. The problem of representing logical connectives and quantifiers in sketches is partially solved by constructing sketch analogues of the natural-like units and their combinations. It is argued that a more profound understanding of geometric facts and problems can be achieved through fluent command of both languages and special translation skills.
In this paper we discuss the results of speech breathing research undertaken to expand the empirical base for modeling prosodic phrasing in Russian speech. The introductory section provides a brief description of the background, clarifies basic terms, and explains the concept of the breathing pause (BP) and its correlation with prosodic breaks and prosodic phrasing. In the second section we formulate the problems discussed in this paper, the main task being to analyze the correlation of BPs with the boundaries of the principal text units: paragraphs, sentences and clauses, taking into account interspeaker variability in readings of the same text. The third section describes the material and methods of the experimental analysis, with particular attention to the possibilities of computer detection of BPs in a spoken text, as well as to the material adequate to the study. The fourth section outlines the general features of speech breathing in readings of the same text by different speakers. It is shown that one of the most common features is the different number of BPs that speakers make when reading the same text. It was also found that this variability is not related to the gender of the speakers or their place in a ranking of the best readings. Some correlation was found with individual speech rate, i.e. the number of syllables spoken per second. However, despite this variability, all speakers use intonation pauses in the experimental text for breathing rather often: BPs account for 62% of all intonation pauses on average, ranging from 52% to 74% across speakers. A specific feature of BPs is that they reflect the hierarchical structure of the text, with individual clauses as its basis.
Namely, text units whose end is accompanied by a BP are arranged in the direction of decreasing BP probability as follows (in parentheses, the frequency of BPs averaged over 10 speakers is given): paragraph (100%) > sentence inside a paragraph (94%) > clause inside a sentence (65%) > component in a clause (34%). In conclusion, the study is summed up with the implication that BPs in prosodic phrasing can serve as a sufficient signal of semantic text boundaries, but interspeaker variability shows that a BP is not a necessary indication of them. The differentiating function of this prosodic marker is supported by the fact that BPs with different text localizations show stable differences in their overall phonetic picture and in such acoustic features as the duration and intensity of breathing noise.
Pronominal complexes may be units of the pronoun system (drug druga; chto ugodno; neizvestno gde) or may be constructions. A construction occupies an intermediate position between free combinations and idioms. This paper discusses bipronominal complexes (cf. Razbezhalis’ kto kuda), which have distributive meaning. Their first component is a distributive; the second may be (1) an interrogative pronoun (Kto kogda priehal? ‘When did each person arrive?’); (2) an indefinite pronoun (Zanimayutsya kto chem ‘everyone does something of their own’); or (3) a relative pronoun (Pomogaet komu chem mozhet ‘he helps everyone however he can’).
Unlike other constructions with distributive semantics (cf. Chemodany popadali na pol ‘The suitcases fell to the floor’; Kazhdy poluchil po 10 tsentov ‘They received ten cents each’), distributive bipronominal complexes are insufficiently investigated. The paper discusses the semantic and syntactic properties of bipronominal complexes of types (2) and (3): whether the pronouns are referential or non-referential, and what determines the choice of the pronouns’ cases.
The paper is devoted to some polysemous Russian particles and the problem of lexicalized prosody. The phenomenon of lexicalized prosody in Russian drew linguists’ attention about 30 years ago. Investigation is usually confined to phrasal stress as the most frequently lexicalized, and therefore the most lexicographically interesting, prosodic pattern. However, as far as discourse particles are concerned, not only phrasal stress but also the intonation pattern is of great interest. Two Russian discourse particles are discussed. One of them is the particle -to, whose different usages imply very different prosodic patterns. The other is vot, which can be used not only as a demonstrative particle but also as a xenomarker (quotation marker), which requires a specific prosody.
This study is an extension of the author's work presented at the Dialogue 2014 and Dialogue 2015 conferences. According to the concept of the universal melodic portrait (UMP), phrase intonation can be described as a sequence of UMPs of the accentual units (AUs) that make up the phrase. The present paper describes the results of pilot studies in which melodic portraits of English and Russian phrases were compared. The examined phrases were drawn from simple situational dialogues and were spoken by native English and Russian speakers. The study was restricted to phrases with a one-accent-unit structure representing the three main types of phrase intonation: affirmative statements, special questions, and general questions.
The described UMP model makes it possible to investigate tonal differences across languages by applying precise quantitative assessments. The method can be used effectively for solving problems of language interference. Moreover, the UMP model could find an effective application in foreign language learning: using software that implements the described stages of UMP construction, a learner could visually compare the intonation of a pronounced phrase with its target intonation portrait and work to eliminate a foreign accent through appropriate training.
Word sense disambiguation (WSD) methods are useful for many NLP tasks that require semantic interpretation of input. Furthermore, such methods can help estimate word sense frequencies in different corpora, which is important for lexicographic studies and language learning resources. Although previous research on disambiguation of Russian polysemous verbs established some important and interesting results, it mostly focused on reducing ambiguity or determining the most frequent sense rather than on evaluating WSD accuracy. To the best of our knowledge, there is no comprehensively evaluated method that can perform semi-supervised word sense disambiguation for Russian verbs. In this paper we present a WSD method for verbs that reaches an average disambiguation accuracy of 75% using only available linguistic resources: examples and collocations from the Active Dictionary of Russian and large unlabeled corpora. We evaluate the method on contexts sampled from the web-based corpus RuTenTen11 for 10 verbs, with 100 contexts per verb. We compare different variations of the method and analyze its limitations. The method's implementation and the labeled contexts are available online.
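The core idea of matching a context against dictionary examples can be illustrated with a Lesk-style bag-of-words similarity classifier. This is a minimal sketch under our own simplifying assumptions (plain cosine similarity over raw word counts), not the authors' actual method:

```python
from collections import Counter
import math

def cosine(c1, c2):
    # cosine similarity between two bag-of-words Counters
    num = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    den = (math.sqrt(sum(v * v for v in c1.values()))
           * math.sqrt(sum(v * v for v in c2.values())))
    return num / den if den else 0.0

def disambiguate(context, sense_examples):
    # pick the sense whose dictionary examples/collocations are most
    # similar, in bag-of-words terms, to the context of the target verb
    ctx = Counter(context)
    return max(sense_examples,
               key=lambda s: cosine(ctx, Counter(sense_examples[s])))
```

In a real system the sense inventories would come from the Active Dictionary of Russian and the representations would be enriched from unlabeled corpora; here the toy English sense sets only show the classification step.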
In this paper we describe the semi-automatic process of transforming the Russian-language thesaurus RuThes (in its RuThes-lite 2.0 version) into a WordNet-like thesaurus called RuWordNet. In this procedure we attempted to achieve two main characteristic features of wordnet-like resources: division of the data into part-of-speech-oriented structures with cross-references between them, and provision of a set of relations similar to those of WordNet-like resources. The published version of RuWordNet contains more than 115 thousand Russian words and phrases presented in the form of three lexical nets for nouns, verbs and adjectives. Between synsets, such relations as hyponym-hypernym, meronymy, part-of-speech synonymy and antonymy are established. In the paper we compare the web-page representations of RuThes 2.0 and RuWordNet: RuThes looks like an ontology describing concepts and their relations, while RuWordNet looks like a net of words. Researchers can obtain both types of thesauri and compare them in applications. In the future, we will continue to add new types of relations to RuWordNet, including the domain relation, the cause relation, the entailment relation, etc.
In this paper we present the Russian sentiment analysis evaluation SentiRuEval-2016, devoted to reputation monitoring of banks and telecom companies in Twitter. We describe the task, the data, the procedure of data preparation, and the participants' results. At the previous evaluation, SentiRuEval-2015, it was noticed that the presented machine-learning approaches depended significantly on the training collection, which was insufficient for high-quality classification of the test collection because of data sparsity and the time gap. The current results of the participants at SentiRuEval-2016 show that they have made successful steps toward overcoming the above-mentioned problems by combining machine-learning approaches with additional manual and automatically generated lexical resources.
The article discusses what modern tools offer for corpus-based lexical research in Russian. As an example, we analyze how the adjective gordy ‘proud’ is used in modern news texts. We studied data from two general Russian-language corpora (RNC, GICR) and a corpus of syntactic co-occurrences containing information on the syntactic relations of Russian words (CoSyCo). If a corpus includes a variety of genres and allows fine-grained distinctions between text sources, it helps to highlight important style- and genre-dependent differences. Our comparison demonstrates that there are quite significant differences in the usage of gordy which become clear when general news and IT news corpora are studied separately, although in general they show similar tendencies. It is also shown that taking a wider variety of genres into account makes visible style and genre features that are otherwise easy to miss.
The argument constructions of adjectives have largely been out of the scope of research on semantic roles, both in theoretical and applied fields. Before adding the roles of adjectival arguments to a network of semantic roles, it is important to determine whether the adjectival roles form a separate list or can be seen as an extension of the roles assigned to the patterns of verbs and nominalizations. We discuss the general principles of how an inventory of adjectival roles should be organized in comparison with existing inventories of verbal roles. To verify our claims, we carry out an experimental survey aimed at measuring the similarity between adjectival and verbal roles. The results show that both the semantic interpretation of roles and their typical morpho-syntactic expression are significant for the evaluation and should be taken into account in working out the inventory. Besides, the specificity of adjectives lies in their prototypically stative semantics, which motivates some differences in assigning semantic roles as compared to verbs. The results of the survey also provide some evidence for the verification and development of the inventory of verbal semantic roles.
This paper aims at evaluating formal theories of case assignment with respect to their applicability to the modeling of case variation. Crosslinguistically, differential case marking exhibits significant variation in many parameters, including the licensing factors of case variation, the correlation of case with linear position, and the feeding of predicative or possessive agreement. In this paper, I consider the two most elaborated formal theories of case, the minimalist syntactic case theory and the configurational case theory, and explore their expressive power in modeling various types of differential case marking. I show that neither theory is superior to the other: each naturally accommodates a specific type of case variation but is unsuitable for expressing the other types. The minimalist syntactic case theory is more flexible in that it is compatible with additional mechanisms deriving the morphologically observable case variation, and more restrictive in that it predicts a one-to-one correspondence between case assignment and agreement. The prime advantage of the configurational theory is that it can directly represent non-local dependencies between the case marking of different arguments.
For the last decade, grammatical dictionaries have become not only objects of theoretical value but essential tools in many fields of applied linguistics. However, the manual creation of a grammatical dictionary remains time- and labor-consuming. In this paper, a two-stage algorithm for automatic dictionary compilation that does not require annotated texts is proposed. As source data, the system requires a formalized grammar description and the frequency distribution of a relatively large corpus (a hundred thousand tokens). Extending principles commonly applied to Indo-European languages, the research focuses on machine learning methods for corpus-based dictionary formation. Four machine learning models (SVM, random forest, linear regression and perceptron) are tested on the material of four languages (Albanian, Udmurt, Katharevousa, and Kazakh) and compared to a heuristic approach. While the linear models proved ineffective, the other models' results were more promising: in an experiment with training and test sets formed from the same language's material, random forest reached a 63% F-score, and SVM also outperformed the baseline. When the training and test sets were based on the material of different languages, the best classifier was SVM, while random forest was unsuccessful in this setting. As a by-product of the experiments, restrictions on the input were postulated: the approach ‘as is’ is not applicable to languages whose inflections are strongly homonymic, and, on the contrary, is promising when applied to agglutinative languages.
We present a corpus-based analysis of the use of possessive and reflexive possessive pronouns in a newly created English-Czech-Russian parallel corpus (PCEDT-R). Automatic word alignment was applied to the texts, and the alignments were subsequently corrected manually. In the word-aligned data, we manually annotated all correspondences of possessive and reflexive possessive pronouns from the perspective of each analysed language. The collected statistics and the analysis of the annotated data allowed us to formulate assumptions about differences between the languages. Our data confirm the relatively high frequency of possessive pronouns in English as compared to Czech and Russian, which we explain by the category of definiteness in English. To confirm some of our hypotheses, we used other corpora and questionnaires. We compared the translated Czech and Russian texts in our corpus with original texts from other corpora, in order to find out to what degree the translation factor might influence the frequency of possessives.
The paper presents quantitative data about web segments in minority languages of Russia. An ad hoc search procedure makes it possible to locate sites and social network pages that contain texts in a given language of Russia. According to our data, there are texts on the Internet in at least 48 of the examined languages. We compared the gathered statistics with data from Wikipedia and with the numbers of native speakers, and found that none of the “live” online data correlates well with the offline life of a language.
In the context of a sentence, grammatical aspect (apart from its function of expressing multiplicity) characterizes a situation with respect to the moment of perception. In the context of discourse, the moment of perception likewise does its job in the case of a sequence of identical aspectual forms. In fact, the notion of the “moment of perception” makes it possible to derive the textual (discursive) meaning of the Perfective and Imperfective forms from their meaning in an isolated utterance. The question arises: what happens in the context of juxtaposed or conjoined different aspectual forms? The subject of attention in the paper is the morphosyntactic configuration “Perfective verb + conjunction i ‘and’ + Imperfective verb”. It is demonstrated that the temporal relationships between situations in this configuration depend heavily on the moment of perception. If the perception moment of the Imperfective verb is synchronous with the perfective state of the Perfective one, a sequential relationship between the situations arises. If the perception moment of the Imperfective verb is synchronous with the event denoted by the Perfective verb, the relationship between the situations is synchronicity.
The paper presents the initial, preparatory stage of a study of variation between hard and soft consonants before e in loanwords (ka[f]e). The main goal is to compile a database of relevant words for use in sociolinguistic research. The database is based on a list of word forms containing relevant contexts in users’ queries to Yandex. All entries in the database are annotated for parameters that may be important in a variational study of the phenomenon. The article describes how the list was compiled and the principles of its annotation. The latter includes the consonant; the position of the consonant relative to the stressed syllable; the type of syllable in which it occurs (open/closed); the year of the word’s first occurrence in the Russian National Corpus; the language from which it was borrowed; and its frequency. The database may be used to select stimuli for experimental studies of variation in modern speech and of its social correlates (age, gender, education, etc.).
Russian lexical stress exhibits both inter-speaker variation, defined by the speaker’s regional affiliation, social status, age, etc., and intra-speaker variation. The latter is difficult to capture because it requires large corpora of spoken text produced by a single speaker. Such corpora are lacking, but they can be replaced with poetic corpora. We automatically analyze poetic texts by 10 poets, drawn from the Russian National Corpus, in order to find word forms that admit stress variation. The number of such forms for an individual speaker ranges between 30 and 200 words, distributed among different parts of speech. We propose a quantitative measure of overall stress variability that is independent of corpus size and show that this variability tends to diminish over time, at least in poetic texts.
The paper focuses on Russian coordinate constructions with clauses (or VPs) combined by means of the adversative conjunction NO. Prosodically, the construction may surface in two forms: (a) as a single illocution, with the first clause pronounced with a rising pitch that projects discourse continuation, and (b) as two separate illocutions, with the first clause pronounced with a falling pitch that projects no continuation. Based on data from the Prosodically Annotated Corpus of Spoken Russian, the prosody and grammar of (a) and (b) were analyzed qualitatively and quantitatively. Type (b) proved to be less frequent (approximately 30% of the total number of occurrences) and is systematically favored in contexts where the second clause is complicated by a “heavy” topical constituent.
It is well known that working memory capacity correlates with individual differences in speech comprehension (Daneman, Carpenter 1980). At the same time, the relationship between working memory capacity and speech production remains relatively unexplored. In this paper, we attempt to partially fill this gap and test the hypothesis of a correlation between working memory capacity and the number of lexical and grammatical markers of difficulty in the production of spontaneous narratives. 19 Russian participants took part in two tests: the Speaking Span test, with which we measured their working memory capacity, and a speech production test based on retelling the Pear Film (Chafe 1980). The Speaking Span test was designed in (Daneman and Green 1986) for English-speaking individuals: during the test, increasingly longer sets of words are presented to participants; at the end of each set, they are to use each word to generate a separate sentence (the word must appear in the same grammatical form in which it was presented). Speaking span is measured as the maximum number of semantically and grammatically correct sentences produced in the experiment. This test was adapted to Russian: the words in the sets were balanced by syntactic category, lexeme frequency and grammatical form frequency. The collected narratives were transcribed and manually annotated for lexical and grammatical markers of production difficulty. The documented number of such markers varied between 0.77 and 8.58 per 100 words, which matches average rates reported previously in the literature. The study demonstrates a statistically significant correlation between working memory capacity measured by the Speaking Span test and verbal fluency measured as the number of lexical and grammatical markers of production difficulty.
In this paper, we propose a method for machine-translated text detection. By ‘machine-translated’ texts we mean, principally, the output of statistical machine translation systems. We focus on the syntactic correctness and semantic consistency of the sentences that constitute a text. More specifically, we attempt to detect a phenomenon that typically occurs in machine-translated documents: small parts of a sentence, each correctly translated, are combined in an improper way. The proposed method is based on a supervised approach with a number of handcrafted features. First, we build N-gram language models on a set of authentic scientific papers and on a set of machine-generated texts, and assess the probability of each sentence according to these models. In addition, we build N-gram language models over the part-of-speech tag sequences corresponding to the given texts. Furthermore, we explore the effectiveness of features obtained from two trained word2vec models (CBOW and skip-gram). We assess the quality of the method on a sample of Russian scientific papers and English scientific documents machine-translated into Russian. Preliminary results demonstrate the feasibility of the approach.
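The sentence-probability features mentioned above can be sketched with a minimal add-one smoothed bigram model; the function names and the toy smoothing scheme are our simplification, not the system described here:

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    # add-one (Laplace) smoothed bigram model over tokenized sentences
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams)
    def logprob(sent):
        toks = ["<s>"] + sent
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(toks, toks[1:]))
    return logprob

def perplexity_feature(logprob, sent):
    # per-token perplexity; a sentence that the "authentic text" model
    # finds improbable gets a high value and is a candidate MT artifact
    return math.exp(-logprob(sent) / max(len(sent), 1))
```

Scores from a model trained on authentic texts and one trained on machine-generated texts would then be handed to a supervised classifier as features.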
This paper presents a new set of basic tools for morphosyntactic tagging of Russian texts coming from social media. This has been developed within GICR, a project for creating a very large corpus of the Russian-speaking Internet.
This toolset includes a new tagset, obtained by extending and adapting the tagset proposed by Sharoff et al. It has been tested on a gold-standard test corpus of modern social media of about 2 million tokens. A particular feature of our approach is a fully automated process for developing training corpora. Instead of manual annotation, we started with the output of the Compreno syntactic parser. This annotation was subsequently improved by automatic correction of systematic errors detected through the processing of texts from social media. In this paper we show that existing tagging tools (in particular, TnT) produce consistently better results if they are trained on our corpus rather than on other available corpora, in particular those using the disambiguated portion of the Russian National Corpus.
The resulting test corpus is openly available.
The article is devoted to communicatives: words, idioms and short sentences specially used in dialogical positions as stereotyped responses (response particles), intended to express agreement or disagreement, to answer etiquette formulae, or to express various emotions. These are conversational formulae like Da; Net uzh; Kakoe tam; Obaldet’!; Na zdorovje!, etc. Communicatives consist of particles and idiomatic constructions; they are semantically empty and pragmatically specific. These units are regularly used in conversation, but no Russian dictionary has yet appeared in which one could find complete information about communicatives and their occurrence in conversation. Only very few communicatives can be found in explanatory dictionaries and dictionaries of idioms, and even in those rare cases their description is limited to their intention (affirmative response, doubt, etc.) or their function: either an etiquette answer (used to express thanks or regret) or an emotional response (used to express surprise, joy or grief). In some cases, communicatives may be marked as synonyms. Some words and idioms may function both as discourse words and as communicatives, and some modern dictionaries claim to contain full information concerning their semantics and use. But the attention in those dictionaries is focused mostly on narrative, not dialogical, contexts, which distorts the picture of communicatives’ actual use. The objective of the article is to compare the characteristics of discourse words and communicatives. With a few examples we demonstrate the differences between the meaning and usage of these units and argue for compiling a special Dictionary of Communicatives.
The paper presents the most frequent words of everyday spoken Russian that form the upper zones of several word frequency lists compiled on the material of the Russian speech corpus “One Speaker’s Day” (the ORD corpus), which contains real-life recordings of everyday communication. All speech data in the corpus are annotated for communication settings, including 1) type of communication (spoken style), 2) social role of the speaker, 3) locus, etc. This information allows speech to be filtered on user request and therefore makes it possible to study speech variation depending on particular communication settings. The present study was made on the transcripts of 152 real-life macroepisodes comprising 232,370 words. The sample represents the speech of 209 persons (95 men, 94 women, 20 children). The following word frequency lists have been compiled: a) a general frequency list, b) a male frequency list, c) a female frequency list, and d) four frequency lists for different styles of spoken speech: informal conversations, professional/business conversations, educational communication, and customer-service communication. The men’s and women’s frequency lists were compiled on subsamples of 83,371 and 115,110 words respectively. The analysis of the word lists shows that Russian women pay more attention to maintaining the conversation, use fewer hesitations, and are more inclined to use intensifiers, emotional words, hedges and interjections. Men generally use fewer personal pronouns, while numerals and expletives are among the most frequent words used by men in everyday conversations. In general, these observations are similar to those described earlier for gender variation by other linguists.
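The filtering-then-counting workflow behind such setting-specific frequency lists can be sketched as follows; the data layout (a list of episode dicts with metadata keys and a "words" list) is our own illustrative assumption, not the ORD corpus format:

```python
from collections import Counter

def frequency_list(episodes, **filters):
    # keep only episodes whose metadata matches all requested filters
    # (e.g. gender="f" or style="informal"), then count their words
    counts = Counter()
    for ep in episodes:
        if all(ep.get(k) == v for k, v in filters.items()):
            counts.update(w.lower() for w in ep["words"])
    return counts.most_common()
```

Calling it with no filters yields the general list; passing metadata keywords yields the male/female or style-specific sublists described in the abstract.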
Our pilot study aims at building a lexicon of effective pronunciation variants on the basis of canonical pronunciations, to be incorporated into an automatic speech recognition system for Russian. We focus on phonetic changes in word pronunciation caused by various factors operating in spontaneous speech. Our speech data include three corpora of the conversational type. Manual expert processing and analysis of the audio data are used. The lexicon construction procedure is given, and some statistics on pronunciation variation in Russian, obtained from the speech data, are presented together with a description of frequent types of this phenomenon. Parallel and sequential pronunciation variants are discussed. Ways of formulating general phonetic variation rules and predicting potential contexts in which pronunciation variation is likely to appear are considered. The test data, the phone set used, and the automatic speech recognition (ASR) parameters are described. Preliminary results for ASR and keyword spotting (KWS) are shown, and the appropriateness of using a multi-pronunciation lexicon is discussed.
This paper studies different aspects of a linguo-political conflict over the choice between two Russian toponymic variants, Belorussia and Belarus’, as well as between the adjectives belorusskij (Belorussian) and belarus(s)kij (Belarusian) and the ethnonyms belorus and belarus. The core of the problem is that in the Russian language of Russia the variant Belorussia is used, which many Belarusians consider insulting, preferring the variant Belarus when speaking Russian. In an attempt to understand the structure of this conflict, we analyze how and why the toponym Belarus appeared and spread through the newspapers of the 1990s, study data from two online polls and the distribution of some words derived from the two toponymic variants, and finally discuss scenarios of conflict communication in discussions on various social media. One of the polls shows the social distribution of the two toponymic variants, and the other examines the attitude of Belarusians towards the toponym Belorussia and its derivatives. We show that each side of the conflict has its own limited set of ideas that reappear in conflict communication in comments under different articles on the Internet.
This paper reports on the first competition on automatic spelling correction for the Russian language, SpellRuEval, held within the framework of the Dialogue Evaluation initiative. The competition aims to bring together groups of Russian academic researchers and IT companies in order to gain and exchange experience in automatic spelling correction, with a particular focus on social media texts. The data for the competition were taken from the Russian segment of LiveJournal.
Seven teams took part in the competition; the best results were achieved by a model using edit distance and phonetic similarity for candidate search and an n-gram language model for candidate reranking. We discuss in detail the algorithms used by the teams, as well as the evaluation methodology for automatic spelling correction.
This paper deals with the automatic induction and prediction of morphological paradigms for Russian. We apply the longest-common-subsequence method to extract abstract paradigms from inflectional tables. We then experiment with automatic paradigm detection using a linear classifier with lexeme suffixes and prefixes as features. We show that Russian noun paradigms can be detected automatically with 77% per-paradigm and 93% per-word-form accuracy; for Russian verbs, per-paradigm accuracy reaches 76% and per-form accuracy is 89%. Using corpus information and character n-grams improves these results to 82% and 95% for nouns and 86% and 95% for verbs.
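The longest-common-subsequence extraction step can be sketched as follows. This is a deliberately simplified version under our own assumptions (the shared part is treated as a contiguous stem and replaced by the variable "1"); real abstract paradigms also handle discontinuous shared material:

```python
from functools import reduce

def lcs(a, b):
    # standard dynamic-programming longest common subsequence of two strings
    dp = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            if ca == cb:
                dp[i + 1][j + 1] = dp[i][j] + ca
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[-1][-1]

def abstract_paradigm(forms):
    # fold LCS over all forms of a lexeme, then abstract the shared part
    # away as the variable "1", leaving only the inflectional material
    stem = reduce(lcs, forms)
    return stem, [f.replace(stem, "1", 1) for f in forms]
```

Inflection tables that yield the same abstract pattern (e.g. "1a / 1i / 1e") are grouped into one paradigm, which the classifier then predicts from lexeme suffixes and prefixes.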
This paper describes an automatic spelling correction system for Russian. The system utilizes information from different levels, using edit distance for candidate search and a combination of weighted edit distance and a language model for candidate hypothesis selection. The hypotheses are then reranked by logistic regression with the edit distance score, language model score, etc. as features. We also experimented with morphological and semantic features but did not gain any advantage from them. Our system won the first SpellRuEval competition for Russian spell checkers on all metrics, achieving an F1-measure of 75%.
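The candidate-search-plus-selection pipeline can be illustrated with a minimal Norvig-style sketch. This is our own simplification (a Latin toy alphabet, unigram frequencies instead of the weighted edit distance, n-gram language model and logistic regression reranker of the actual system):

```python
def edits1(word):
    # all strings at edit distance 1: deletions, adjacent transpositions,
    # replacements and insertions over a small toy alphabet
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def correct(word, vocab_freq):
    # keep an in-vocabulary word as-is; otherwise pick the most frequent
    # in-vocabulary candidate at edit distance 1; fall back to the input
    candidates = ({word} if word in vocab_freq else set()) \
        or {w for w in edits1(word) if w in vocab_freq} \
        or {word}
    return max(candidates, key=lambda w: vocab_freq.get(w, 0))
```

In the described system the frequency lookup would be replaced by the combined weighted-edit-distance and language-model score, with logistic regression on top.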
In this paper, we describe the rules and results of the FactRuEval information extraction competition held in 2016 as part of the Dialogue Evaluation initiative in the run-up to Dialogue 2016. The systems were to extract information from Russian texts and competed in two named entity extraction tracks and one fact extraction track. The paper describes the tasks set before the participants and presents the scores achieved by the contending systems. Additionally, we dwell upon the scoring methods employed for evaluating the results of all three tracks and provide some preliminary analysis of the state of the art in information extraction for Russian texts. We also provide a detailed description of the composition and general organization of the annotated corpus created for the competition by volunteers using the OpenCorpora.org platform. The corpus is publicly available and is expected to evolve in the future.
This paper presents a rule-based approach to the Information Extraction (IE) task within the FactRuEval-2016 competition. Our system is based on ABBYY Compreno technology, which uses the results of deep syntactic-semantic analysis; this significantly reduces the number of necessary rules and keeps them laconic. The evaluation was conducted on the FactRuEval dataset. FactRuEval is an open evaluation of IE systems in which participants could take part in three tracks. The first track required detecting the boundaries and types of named entities in a text. The second track required extracting normalized attributes and performing local identification of named entities. The third track required extracting facts of certain types from a text. We took part in all three tracks under the nickname ‘violet’. Our method proved successful: we achieved high F-measures in the named entity recognition tracks and the highest F-measure in the fact extraction track.
The paper deals with motion-cum-purpose constructions in Russian. Constructions of the type «causation-of-motion verb + infinitive» (vesti kogo-to delatj čto-to, lit. ‘to lead somebody to do something’) are examined using data from the Russian National Corpus. The problem of infinitive control is in focus. The Ø-subject of the infinitive can be coreferential with the subject or with the object of the finite verb in such a construction: cf. on vedet jejo ubivatj, lit. ‘hei is leading herj Øi to kill’, vs. on vedet jejo umiratj, lit. ‘hei is leading herj Øj to die’. The frequency of uses with subject control correlates with the degree of formal and semantic cohesion within the purpose construction. One parameter is word order, which reflects the degree of syntactic cohesion. Another is the spatial type of the causation-of-motion verb: neutral unprefixed (vesti) / lative prefixed (privesti) / elative prefixed (otvesti); this roughly reflects the degree of semantic cohesion (neutral > lative > elative). The choice of controller is also conditioned by the referential type of the object (person / non-person / inanimate) and by the semantics of the purpose event.
Named entity recognition and classification is an important natural language processing task aimed at finding words and word sequences that denote named entities of different types in plain texts. This challenge was addressed in Task 1 of the FactRuEval-2016 evaluation.
In the context of this evaluation, our team, acting for the Institute for System Programming of the Russian Academy of Sciences, proposed two approaches to exploiting information mined from Wikidata and Wikipedia for improving the quality of named entity detection methods. In the first approach, word2vec word embeddings computed on Wikipedia are used along with basic features in token classification. The second approach utilizes both Wikipedia and Wikidata to automatically construct a representative training corpus for named entity recognition and classification. Additionally, Wikidata, treated as a property graph, is used to collect dictionaries of words specific to named entities.
Our approaches (marked with the identifier 'Orange' in the FactRuEval-2016 organizers’ quality evaluation reports) show promising results, performing especially well on such a well-defined class as persons, while remaining suitable for detecting named entities of other types as well.
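The first approach described above can be illustrated with a minimal sketch of per-token feature construction: basic surface features concatenated with an embedding vector, with zeros for out-of-vocabulary tokens. The embedding table, its dimensionality, and the feature set here are illustrative stand-ins, not the actual Wikipedia-trained model or the authors' feature inventory.

```python
# Sketch: combining basic surface features with a (hypothetical) word2vec
# embedding lookup for NER token classification. All values are toy data.

EMB_DIM = 4  # real Wikipedia-trained embeddings would be e.g. 100-300-dimensional

# toy table standing in for a word2vec model trained on Wikipedia
EMBEDDINGS = {
    "moscow": [0.9, 0.1, 0.0, 0.3],
    "visited": [0.0, 0.7, 0.2, 0.1],
}

def basic_features(token):
    """Simple surface features commonly used in NER token classification."""
    return [
        1.0 if token[:1].isupper() else 0.0,            # initial capital
        1.0 if token.isupper() else 0.0,                # all caps
        1.0 if any(ch.isdigit() for ch in token) else 0.0,  # contains a digit
        float(len(token)),                              # token length
    ]

def token_features(token):
    """Concatenate basic features with the embedding (zero vector if OOV)."""
    emb = EMBEDDINGS.get(token.lower(), [0.0] * EMB_DIM)
    return basic_features(token) + emb

vec = token_features("Moscow")
print(len(vec))  # 8 = 4 basic features + 4 embedding dimensions
```

The resulting vectors would then feed any standard sequence classifier; the point is only that embedding features extend, rather than replace, the basic feature set.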
The paper provides evidence for the claim that the Russian conditional conjunction esli ‘if’ is itself devoid of either conditional or determiner semantics. The argument proceeds as follows. I demonstrate that with conjoined conditionals, just like with some NPs/DPs and with free relatives, one gets not only the immediately obvious “parallel” reading (‘for A and for B’) but also the “co-determinative” reading (‘for those A which are also B’). The sort of reading identified in the literature as “appositional” turns out to be a subclass of co-determinative readings.
It has been proposed that appositional readings for NPs/DPs result from the fact that the pertinent DPs denote properties, whereas their conversion into referring or quantificational expressions is performed by type-shifting rules. Applying the same technique to conditionals, I conclude that the conditional esli cannot have the semantics of the definite determiner in the domain of possible worlds. Given the influential view (Lewis etc.) that if does not quantify over worlds either (that work being done rather by “adverbs of quantification”, which may be overt, e.g. always or usually, or covert), esli ends up free from any semantic duty, except that, as I argue, it determines whether the quantification takes place over worlds or over time instances (cf. esli vs. kogda ‘when’).
The proposed analysis may be used as guidance for the development of automatic recognition and analysis rules for such constructions.
The work deals with adapting the annotation system of the Russian coreference corpus RuCor (used for written Russian) to a corpus of Russian oral narratives from the Russian Clinical Pear Stories Corpus (Russian CliPS) (Khudyakova et al., 2016). Russian CliPS is a corpus of retellings of the “Pear stories” movie (Chafe, 1980) by clinical populations as compared to neurologically healthy people. The analysis deals with 11 texts by healthy people and 9 texts by people with various types of aphasia. The focus is on the specificity of reference choice in oral retellings and on the parameters the annotation procedure should use to register deviations in referential choice in spoken discourse as compared to written discourse. The specific features for annotating referential choice in clinical populations are also discussed. The main claims are as follows. Certain types of speech disfluencies should be integrated into the coreference annotation scheme, namely noun phrases that repeat a previous mention of a referent, rename a referent, or correct a name. Such occurrences can influence referent activation; on the other hand, they can shed some light on the process of choosing a referential expression. NP morphosyntactic structure and zero anaphora should have a more fine-grained set of features for coreference devices, as these are more diverse in spoken discourse. Moreover, certain structures, such as adjective postposition, and some types of zeros are characteristic of referential expressions in spoken discourse.
This paper describes the extraction of multiword expressions (MWEs) from corpora for inclusion in a large online lexical resource for Russian. The novelty of the proposed approach is twofold: 1) we use two corpora, the Russian National Corpus and Russian Wikipedia, in parallel, and 2) we employ an extended set of features based on both data sources. To combine syntactic and statistical features derived from the two corpora, we experiment with several learning-to-rank (LETOR) methods that have proven highly effective in information retrieval (IR) scenarios. We use bigrams from existing dictionaries for learning, which keeps manual annotation effort to a minimum. Evaluation shows that machine-learned rankings with rich features significantly outperform traditional corpus-based association measures and their combinations. Analysis of the resulting lists supports the claim that multiple features and diverse data sources improve the quality of extracted MWEs. The proposed method is language-independent.
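A typical member of the traditional corpus-based association-measure family against which such learned rankings are compared is pointwise mutual information (PMI). A minimal sketch on a toy corpus (the data and the choice of PMI as the illustrated baseline are this note's assumptions, not taken from the paper):

```python
# Sketch of one classical association measure (PMI) used for ranking
# candidate MWE bigrams. The toy corpus below is illustrative only.
import math
from collections import Counter

tokens = "the hot dog sat near the mat the cat sat near the mat".split()
bigrams = list(zip(tokens, tokens[1:]))

unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams)
N = len(tokens)    # number of unigram observations
M = len(bigrams)   # number of bigram observations

def pmi(w1, w2):
    """PMI(w1, w2) = log2( p(w1, w2) / (p(w1) * p(w2)) )."""
    p_joint = bigram_counts[(w1, w2)] / M
    return math.log2(p_joint / ((unigram_counts[w1] / N) * (unigram_counts[w2] / N)))

# rank attested bigrams by PMI, highest (most strongly associated) first
ranked = sorted(set(bigrams), key=lambda b: pmi(*b), reverse=True)
print(ranked[0])  # ('hot', 'dog') ranks first
```

In a LETOR setting, scores like this become individual features in the learned ranking function rather than the ranking criterion itself.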
The paper deals with Russian aspectual pairs like umirat’ – umeret’, risovat’ – narisovat’ (but not obizhat’sa – obidet’sa). The imperfective verb in a pair designates a process or an action, while the perfective verb designates the “resulting event” completing the process or action. It is well known that in some diagnostic contexts the imperfective member of a pair substitutes for its perfective correlate and thus designates an event (the Maslov criterion). It will be demonstrated that this substitution is due to certain semantic components in the lexical meaning of a verb. For this purpose, the progressive meaning of imperfective verbs will be analysed. We will argue that the component ‘the resulting event’ is part of the meaning of both the perfective and the imperfective verb in an aspectual pair. The status of this component in the lexical meaning of an imperfective verb will be discussed, and Maslov’s diagnostic contexts will be examined. In addition to the Maslov criterion, a criterion for determining the imperfective correlate of a given perfective verb in some controversial cases (cf. est’ / s’edat’ – s’est’) will be suggested.
Corpus and experimental approaches in linguistics are often seen as incompatible, and there are very few studies of grammatical phenomena that rely on both of them without one or the other being subsidiary. In this paper, we would like to show that they are complementary and can be fruitfully combined, using the example of the Russian phrasal enclitic že. We analyze various factors influencing its position in the sentence, in particular, whether it obeys Wackernagel’s law, which applied to phrasal enclitics in Old Russian.
Data from the Russian National Corpus show that že appears in the strict Wackernagel position in the absolute majority of cases in the main subcorpus and the newspaper subcorpus, while the subcorpus of spoken Russian exhibits more variation. Corpus data allow tracing diachronic tendencies and identifying several factors (primarily, the semantics of že). Experimental data let us estimate the role of these and other factors on a carefully balanced set of examples. Apart from syntactic and semantic factors, the age and educational level of participants were demonstrated to influence the results.
In most studies dedicated to tautologies it goes without saying that these constructions are commonly used in everyday speech. Further analysis is based on the hearer’s point of view, concentrating mostly on possible ways of interpreting tautologies. At the same time, the perspective of the speaker remains largely unexplored. This study, based on Internet and corpus data [RNC], deals with some aspects of the use of tautologies in communication, in order to understand why the speaker should opt for uttering tautologies instead of being more straightforward and what communicative profit he gains from it. It seems that the advantages of using tautologies for the speaker are based on their structural and semantic features: (a) their recognizable form X cop X, which makes tautologies look like a cliché; (b) their ability to appeal to mutual knowledge; (c) the unquestionable truth of their literal meaning. First, when the speaker uses tautologies as clichés with expressions such as “as they say”, he makes his personal opinion look like the common wisdom of the linguistic community. Next, when the speaker emphasizes that he appeals to mutual knowledge, he presents the hearer as a like-minded person, so the hearer’s possible disagreement is regarded as a refusal of (expected) support and solidarity and requires more effort. Finally, the fact that the literal meaning of tautologies is undeniable helps the speaker escape responsibility for a false implicature, defend his opinion using so-called deep tautologies, and close the discussion whenever it is more convenient for him.
The paper presents the rationale for the decisions taken in the set-up and further development of a learner corpus of student texts written in English by Russian learners of English, the only Russian learner corpus in open access. The tool of manual expert annotation is the focus of the present observations; after introducing the categorization of errors applied in annotation, the complicated cases that arose in annotation practice are examined, followed by a comparison of annotation statistics over the three stages of the corpus development. For that purpose, texts annotated by different groups of participants in the course of two experiments were used to spot the problematic areas in annotation. The main pedagogical applications of the learner corpus in teaching EFL, namely the opportunities to create automated training exercises as well as placement and progress tests custom-made for specific groups of students, are outlined in the concluding part of the paper.
This paper investigates two Russian pitch accents not previously discussed in Russian linguistics: the gradual rise found in one of the Russian regional varieties, namely Odessa Russian, and the prosody of breaking information into portions in standard literary spoken Russian. The gradual rise has a rising tone on the tonic syllable and gradually rising pitch on the post-tonic syllables. The prosody of breaking information into portions has a falling tonic syllable and slightly rising post-tonics; it also has a prolonged time of articulation. The tonal and temporal parameters of the pitch accents under consideration, their functions in discourse, and their phonological status are discussed. The criterion for a pitch accent to be viewed as an autonomous phonological unit of a language is whether it has permanent means of expression and a stable function, or a limited set of functions, in discourse. For describing the newly introduced pitch accents, a transcription based on Pierrehumbert’s autosegmental notation of prosody was used. For the investigation, a small working corpus of Russian speech recordings was set up. It comprises two components. The first component consists of short stories about Odessa told by its residents, jokes, and funny stories. The second component includes recordings of friendly talks and radio conversational programs in standard Russian. The software program Praat was used to analyze the sound data. The results presented here are exemplified by frequency tracings of recordings taken from the corpus.
The paper outlines the principles of creating a Database of Russian language-specific units and their French equivalents and the possibilities of its use as a tool of linguistic analysis. An entry of the Database is a monoequivalence (ME), i.e. a dyadic tuple consisting of a Russian sentence that includes a language-specific unit and its French translation (automatically extracted from the Russian-French subcorpus of the Russian National Corpus), including a functionally equivalent fragment (FEF) of the Russian language-specific unit. Both constituents of the ME are annotated with two-level characteristics, ensuring their faceted classification: “basic type” and “additional feature”. The paper indicates relevant quantitative parameters that can be extracted from such a database and accounted for in the analysis of language-specific units; it demonstrates that quantitative methods can be used effectively only in combination with proper methods of semantic analysis. The reliability of the statistical data will increase as the volume of the parallel corpus grows.
The discussion of the so-called conative pairs in the Russian language, i.e. pairs including an imperfective and a perfective verb where the former expresses the idea of an attempt to reach a goal while the latter designates a successful fulfillment of that goal, has a long tradition.
However, the aspectual status of pairs of verbs semantically related in this way remains unclear. In particular, the aspectual status of the verbs iskat’ (ipf) – najti (pf) (‘to search’ – ‘to find’) remains a matter of debate among scholars of the Russian language. The paper provides answers to the following questions: Why do the verbs iskat’ – najti not function as an aspectual pair in Russian? Why, nevertheless, do Russian speakers perceive this pair as an aspectual one? How is the non-aspectual pair iskat’ – najti related to the conative aspectual pairs lovit’ – pojmat’ and reshat’ – reshit’? How is it related to the telic aspectual pairs razyskivat’ – razyskat’ and otyskivat’ – otyskat’?