Proceedings 2011

Contents

Alexeev A. A. Loukachevitch N. V. Lomonosov Moscow State University

Automatic detection of near-synonyms in news clusters

The paper presents a method for extracting alternative names of a concept or named entity mentioned in a news cluster. The method is based on the structural organization of news clusters and exploits comparison of various contexts of words. Word contexts are used as a basis for multiword expression extraction and detection of alternative names. As a result of cluster processing we obtain groups of near-synonyms, in which the central synonym of each group is determined.

Avgustinova T. DFKI GmbH & Saarland University

Parallel construction of Slavic grammatical resources

We present the idea of parallel construction of HPSG-based grammatical resources for Slavic languages using a common Slavic core module in combination with language specific extensions and corpus-based grammar elaboration at all stages of the project.

Baranov A. N. Dobrovol'skij D. O. Russian Language Institute, Russian Academy of Sciences

Semantic relations in phraseology

Traditionally, the following types of semantic relations in the lexical system are distinguished: synonymy, antonymy, polysemy, hyponymy, conversion, and causativity. In the field of phraseology, these phenomena display some specific properties. The focus of our paper is on revealing and discussing some of these properties. The starting point of the discussion is the category of semantic field. It provides the theoretical framework for considering semantic relationships between idioms. The semantic field is defined as a set of lexical units which are connected with each other by some salient semantic features. The totality of semantic fields along with the conceptual links between them constructs the thesaurus of a given language, which can be represented in the form of a semantic network. The most important type of semantic relations within the semantic field is synonymy. Full synonymy is a rare phenomenon in phraseology, because the meaning of an idiom contains additional semantic features, namely the so-called image component. Idioms with identical actual meanings often reveal differences in their image components, and are perceived as near-synonyms, rather than full synonyms. Antonymy is not typical of phraseology because in most cases it is impossible to single out the central semantic feature that could be considered responsible for meaning contrasts. Although traditionally idioms were mainly regarded as monosemous units of the lexicon, the results of our recent research prove that idioms’ polysemy is a quite typical phenomenon.

Belikov Vladimir I. Russian Language Institute, Russian Academy of Sciences

What are sociolinguists and lexicographers lacking in a digitized world?

It is a common belief that text corpora provide the best testing ground for solving any kind of linguistic problem. As far as grammar is concerned, this may be true, but if we focus on investigating the lexicon the results often appear to be rather superficial. The Web contains some relatively homogeneous arrays of texts formed independently of linguists, in some cases emerging quite spontaneously. Text arrays with the most prominent social characteristics of their authors are regarded as independent Internet segments (digitized classical literature and 2010 teenager blogs are the most contrasting examples). Frequencies of the same lexical items differ greatly from one segment to another, and these statistics are highly significant for sociolinguistics. The main problem in applying the method of segmental statistics is the lack of a suitable instrument for automatic data processing. Several case studies are presented, and the results of segmental statistics seem to be more indicative than those obtained from the Russian National Corpus.

Benigni V. Università Roma Tre, Cotta Ramusino P. State University Milan

Italian constructions with support verb “fare” in comparison with Russian

The paper deals with Support Verb Constructions (SVC) in Italian that are formed by the verb fare ‘to make’ and its nominal object (V+NOBJ), viewed in a cross-linguistic perspective against Russian SVCs with the verb delat’. The study was carried out for Italian on ITWac (gathered by Baroni) and, for Russian, on the Russian Web Corpus (gathered by Serge Sharoff, University of Leeds); both are available as pre-loaded corpora within The Sketch Engine corpus query system (http://the.sketchengine.co.uk). About 280 types of SVC with a token frequency ≥200 resulted from the query in the Italian corpus. The Italian SVCs have been classified into lexical-semantic patterns on the basis of the semantic features of Nsubj and Nobj and the lexical-semantic meaning of the support verb. Subsequently, the patterns have been grouped into the well-known actional classes of accomplishments, achievements, semelfactives, activities and states (Vendler 1967, Comrie 1976). The overall classification shows that most SVCs go hand in hand with the features of telicity (as regards verbs) and of concreteness and referentiality (as regards NOBJ), and in these classes (accomplishments, achievements) there is a partial parallelism with Russian, whereas fewer Russian SVCs can be found in the activity and state verb classes. Moreover, the presence of a high number of SVCs in the Russian corpus may be considered further evidence of the typological shift towards the analytic type that contemporary Russian is apparently undergoing (see e. g. the simplification of noun declension, the expansion of invariable words and the increasing number of bi-aspectual verbs).

Berdicevskis A. University of Bergen, Norway

E-mail vs. Chat: The Influence of the communication channel on the language

Does the mere change of the communication channel, unaccompanied by any other changes in situational characteristics, affect the language? Quantitative analysis of two corpora of Russian texts that differ solely by the communication channel from which they originate (e-mail vs. chat) proves that it does.

Bergelson M. B. MSU, Moscow

Modern Russian public discourse: do changes in information technology lead to new discourse strategies, or to new worldview?

This study examines various formats of modern Russian-language internet communication in order to discover changes in the sociocultural patterns and models of discourse behavior that characterize the values and norms of contemporary Russian public life. A specific public discourse genre, high officials’ internet blogs, is analyzed with special emphasis on whether the public discourse represented in modern electronic modes differs in its language from traditional official discourse. This analysis should allow us to better understand the ideas and beliefs prevailing in Russian public opinion and to trace its changes and emerging linguistic patterns.

Bocharov Victor Mathlingvo, Bichineva Svetlana Granovsky Dmitry Ostapuk Natalia Stepanova Maria OpenCorpora

Quality assurance tools in the Opencorpora project

OpenCorpora is a project that aims at creating an annotated corpus of Russian texts, which will be fully accessible to researchers, the annotation being crowd-sourced. The article deals with annotation quality assurance tools.

Bogdanova N. V. Osmak N. A. Faculty of Philology, Saint-Petersburg State University

Some lexical “discoveries” on the material of Russian spontaneous speech, a corpus study

The article presents results of the first attempt at lexicographical description of Russian spontaneous speech. Analysis is based on the material from the Corpus of Spoken Russian "One Speech Day". New linguistic units (words and phrases) not represented in dictionaries yet, new meanings and definitions or connotations of “old” words are described along with the trends of use in everyday speech. It is shown that a new area of lexicography, which could be called “speech lexicography”, is emerging. Its overall principles have not been completely determined yet, although some of the directions can already be specified: 1) creation of a dictionary of common Russian colloquial speech, which should reflect linguistic units used in everyday speech; 2) creation of a dictionary of context-dependent expressive units; 3) creation of a dictionary of discursive units, and 4) collection of a corpus of aphetic and reduced units. The paper outlines controversial problems for each direction and provides linguistic examples.

Bolshakov Igor Independent Researcher, Gelbukh Alexander National Polytechnic Institute, Mexico

A large electronic dictionary as a polythematic guide and shaper of queries to the Web

A large Russian electronic dictionary is presented. It contains both fundamental information on the Russian language (grammatical and combinatory properties of words, semantic and paronymic relations between words) and ample encyclopedic information on geographical objects, famous people, organizations, and artifacts. The dictionary includes technical terms and basic concepts of science, humanities, business, and economy. Among its applications is the possibility to form queries for Internet search engines on medicine, commerce, tourism, and other topics.

Boriskina O. O. Voronezh State University

A corpus-based study of noun cryptotypes in English

We develop a method of identifying noun cryptotypes in English, relying primarily on the Corpus of Contemporary American English (COCA) and the results of typological studies. The study uses data-oriented and theory-oriented approaches to linguistic description. A cryptotype is understood as the principle by which nouns are distributed among classes in accordance with a certain semantic feature and with reference to the typological principle of contrastive grammar. The class membership of a noun is revealed in syntax, particularly in collostructions which bear the classifying function of the noun class. The semantic, morphological and syntactic criteria for identification of a noun class are discussed. The study of cryptotypes concerns the issues of grounding, recognition, and reasoning. An adequately formalized description of cryptotypes can be used in computational modeling and text processing.

Borisova E. G. The Moscow City Teachers’ Training University, Ovchinnikova T. E. MSLU

Parameter of nearness in the metaphorical space

This work develops the idea that deictic means, as indicators of spatial relations, can function as modal (intensifying) particles. The link with the deictic function implies that spatial relations are metaphorised and their parameters transferred to relations within a discourse, to the speaker’s and listener’s general knowledge, and to subtler semantic relations connected with the degree of importance for Speaker and Listener. These relations can be metaphorised through the concepts “speaker’s space” and “speaker’s and listener’s common space”. The modal particles VOT and VON are clearly connected with the index (spatial) particles. Their modal meanings, as distinct from the index ones, include approximation, intensification and a number of others. We start from the opposition between the spatial index particles VOT and VON: “indicating a near object” versus “indicating a distant object”. As identification is a widespread operation in metaphorical space, VOT is often used to indicate an object or phenomenon, so that we can speak of a metaphorical “sense space” (the thesaurus of the conversation participants) or “speech space” (i. e. the semantic network of the events discussed). The situation is quite different with VON, which serves as a sign (or an instruction) to search for the required object. Naturally, the question of searching usually arises when the object is far away; however, it is also possible to search in a rather near space. Therefore the difference in the use of VOT and VON in modal functions can be compared with the concepts of nearness and distance, but only taking into account the described distinctions in their semantics. It turns out that the features of the metaphorical use of VOT and VON are connected with their indexical meanings, and not just with the opposition in degree of distance but with the ways of indication. Therefore the opposition is metaphorised, too. We can say that the particles specify different ways of searching for objects in metaphorical spaces: sense spaces (the speaker’s and listener’s thesaurus) and the space of the interaction, that is, the meaning of the messages exchanged. In all cases the concept of nearness and distance (as a derivative of the identification and searching operations) is connected not with the participants of the interaction but with the distance between the representations currently active in the interaction and the new concepts, objects and properties brought in for semantic, emotional and other purposes.

Braslavski Pavel Yandex, Kiselev Yuri Ural Federal University

To find out or to buy? Product review vs. Web shop classifier

We examine two categories of search results retrieved in response to product queries. This classification reflects the two main kinds of user intents — product reviews and online shops. We describe the training and test samples, classification features, and the classifier structure. Our findings demonstrate that this method has good quality and performance, suitable for real-world applications.

Bylinina E. G. Institute for Linguistics OTS, Utrecht University

“Functional” standard in Russian and English degree constructions

We develop a notion of functional standard, which refers to the ‘functional standard degree construction’ (John is a little bit too tall for this job). The construction involves a ‘purpose’ proposition parameter that determines the set of degrees compatible with the purpose. The maximal degree belonging to this set serves as a standard in the construction. We argue against contextual and comparative analyses either explicitly or implicitly assumed in the literature. Instead, we propose that the purpose is an argument of (certain) gradable adjectives, and the whole construction is a positive construction. We try to pinpoint the difference between Russian and English functional standards.

Chetviorkin I. Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, Loukachevitch N. Research Computing Center, Lomonosov Moscow State University

Three-way movie review classification

We consider a three-way classification approach for Russian movie reviews. All reviews are divided into three groups: “thumbs up”, “so-so” and “thumbs down”. To solve this problem we use various sets of words together with such features as word weights, punctuation marks and polarity influencers that can affect the polarity of the subsequent words. We also estimate the upper limit of automatic classification quality in this task.

Davydov A. G. Kiselev V. V. Kochetkov D. S. Speech technologies LLC, Minsk, Belarus

Voice emotion classification: problems and solutions

An algorithm for automatic emotion recognition from the speaker's voice has been developed. A number of tests were performed using the widely known Berlin Database of Emotional Speech (Emo-DB). The classification efficiency for different acoustic features was estimated, and a very small set of the most reliable characteristics was extracted in order to obtain robust and quick emotional state classification. Using an SVM classifier with a quadratic kernel and this feature set provides a recognition accuracy of approximately 96 % between the «anger» and «neutral» emotional states. A GMM classifier was less effective and showed a classification error of up to 6 %. A brief comparison of this feature set and SVM kernel was performed using the Munich openEAR toolkit: a recommended set of 384 features and a linear-kernel SVM were used to solve the same problem. The classification efficiency of this algorithm reached 98 %, which is only ~2 % higher than the respective value for the designed feature set and classifier. Under certain conditions, such as obtaining a decision support factor in real-time speech analytics systems, the simplified classification scheme would be preferable to the complex one.

Hovy Eduard Information Sciences Institute University of Southern California

A new semantics: merging propositional and distributional information

Despite hundreds of years of study on semantics, theories and representations of semantic content (the actual meaning of the symbols used in semantic propositions) remain impoverished. The traditional extensional and intensional models of semantics are difficult to actually flesh out in practice, and no large-scale models of this kind exist. Recently, researchers in Natural Language Processing (NLP) have increasingly treated topic signature word distributions (also called ‘context vectors’, ‘topic models’, ‘language models’, etc.) as a de facto placeholder for semantics at various levels of granularity. This talk argues for a new kind of semantics that combines traditional symbolic logic-based proposition-style semantics (of the kind used in older NLP) with (computation-based) statistical word distribution information (what is being called Distributional Semantics in modern NLP). The core resource is a single lexico-semantic ‘lexicon’ that can be used for a variety of tasks. I show how to define such a lexicon, how to build and format it using tensors, and how to use it for various tasks. I discuss some of the recent work on composing vectors and tensors in attempts to produce statistically-based compositional semantics. Combining the two views of semantics opens many fascinating questions that beg study, including the operation of logical operators such as negation and modalities over word (sense) distributions, the nature of ontological facets required to define concepts, and the action of compositionality over statistical concepts.

Erekhinskaya T. N. Titova A. S. Okatiev V. V. Dictum Ltd., Nizhny Novgorod, Russia

Syntax parsing for texts with misspellings in DictaScope Syntax

The paper deals with syntax parsing of natural language texts with misspellings and misprints in DictaScope Syntax. We propose a method for integrating a spellchecker and a parser, which allows us on the one hand to correct typographical errors considering the context and on the other hand to increase the robustness of the parser. We start by outlining various types of misprints and ways to correct them, taking account of the specific character of keyboard typing and typical mistakes. To correct the misspellings and misprints we propose to use a modified Levenshtein algorithm, in which each pair of characters involved in calculation of the Levenshtein distance is assigned a specific weight from the interval. This accounts for keyboard typing, phonetically similar characters, similarity between Russian and Latin alphabet symbols, numbers and other symbols. The paper states the need to take into account the lexical context of the words to be corrected in order to achieve the maximum accuracy of correction, which helps correct words used in an unusual context. As a result we get a number of correction options for the words. The final choice is made by the DictaScope parser. Based on the modified Eisner algorithm, the parser builds a dependency tree for the sentence. The modification includes punctuation checking and some additional linguistic limitations. In our model several vertices of interpretations correspond to one word, and variants of spell correction can be processed in the same way as morphological interpretations. The integration of misprint correction and syntactic analysis is illustrated by a simple case (correcting a single word) and a more complex case: splitting a word in two or merging two words into one. The proposed method of integrating the parser and the spellchecker modules was implemented in the DictaScope Syntax system. This made it possible to considerably increase the stability of the parser and provided an opportunity to use it as a component of an opinion mining system for monitoring blogs and forums.
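
The weighted Levenshtein distance mentioned in this abstract can be illustrated with a minimal sketch. It is not the DictaScope implementation: the function names, the toy keyboard-neighbour table, and the weights 0.5 and 1.0 are all assumptions made for illustration, since the abstract does not give the actual weights or their interval.

```python
# Minimal sketch of a Levenshtein distance with per-pair substitution weights.
# All names and weights below are hypothetical, chosen only for illustration.

def weighted_levenshtein(source, target, substitution_cost, indel_cost=1.0):
    """Edit distance where substituting a for b costs substitution_cost(a, b)."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    # Turning a prefix into the empty string (or back) costs one indel per character.
    for i in range(1, rows):
        dist[i][0] = i * indel_cost
    for j in range(1, cols):
        dist[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            sub = 0.0 if a == b else substitution_cost(a, b)
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,   # delete a
                dist[i][j - 1] + indel_cost,   # insert b
                dist[i - 1][j - 1] + sub,      # substitute a -> b (or match)
            )
    return dist[-1][-1]


# Toy table of adjacent keys (assumption): a slip between neighbours is cheaper.
KEYBOARD_NEIGHBOURS = {frozenset("qw"), frozenset("we"), frozenset("er")}


def toy_cost(a, b):
    """Hypothetical weights: 0.5 for neighbouring keys, 1.0 otherwise."""
    return 0.5 if frozenset((a, b)) in KEYBOARD_NEIGHBOURS else 1.0


if __name__ == "__main__":
    print(weighted_levenshtein("wuick", "quick", toy_cost))  # 0.5
    print(weighted_levenshtein("zuick", "quick", toy_cost))  # 1.0
```

With such a cost function, a correction candidate reached through a plausible keyboard slip ranks above one that needs an arbitrary substitution. The same hook could carry weights for phonetically similar characters or for look-alike Russian and Latin letters.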

Fedorova O. V. Uspenskaja A. M. Lomonosov Moscow State University

Experimental analysis of discourse: the impact of a potential referential conflict on the choice of the referring expression (on the material of Russian)

The paper describes an experiment carried out to study referential choice in a situation of potential referential conflict. The results showed that in such a situation participants choose full NPs. The results confirmed that referential choice depends on the participants’ working memory and made some additions to the model of referential choice.

Frolova Tatiana Podlesskaja Olga Laboratory of Computational Linguistics, Kharkevich Institute For Information Transmission Problems, RAS

Tagging lexical functions in Russian texts of SynTagRus

The paper describes the process and the results of tagging with Lexical Functions the texts of SynTagRus (Syntactically tagged Russian corpus available at www.ruscorpora.ru). The work, begun in 2009, is still in progress. The lexical items which are identified as values and arguments of collocate Lexical Functions (LFs) are tagged in syntactically annotated Russian sentences. So far, about 4,300 sentences (5,500 LF collocations) have been supplied with LF annotation. Examples of possible linguistic and educational uses for the corpus with LF tagging are given.

Gilyarova K. A. Russian State University for the Humanities, Moscow

Characteristics of student-professor e-mail communication

We analyse student-professor e-mail interaction in Russian universities in terms of Field, Tenor and Mode [Halliday 1978]. According to their content, we classify all e-mail messages into three types: “container e-mails”, “organizational e-mails” and “essential e-mails”. Even though e-mail correspondence is a variety of the written communication mode, many speech-like features are present in organizational e-mails. They contain temporal and spatial deixis, anaphora and references to common ground. The word order is typical of colloquial speech, which makes organizational e-mails closer to phone calls. E-mail series resemble oral dialogues. Both students and professors use different discourse styles: formal, informal, slang, etc. The mode of writing depends more on the authors’ age and computer skills than on their social status. However, differences in tenor between the e-mails of students and professors do exist. They are explained by different perceptions of the norms of social communication and politeness. The analysis of opening and closing formulae also shows that there is no significant difference between the mode of writing e-mails by students and professors. Nevertheless, some specific traits can be found.

Grashchenkov Pavel Institute of Oriental Studies, Ionov Maxim MSU, Malyutina Svetlana MSU

Semi-tagged corpora method exemplified with a study of Ossetic nominalization

We propose the method of Semi-Tagged Corpora (STC) for grammar research in languages that are not expected to have corpora in the near future. We exemplify this method with an STC study of the internal structure of nominalization in Ossetic. The research was implemented in three major steps: 1) a set of valid surface structures was established; 2) theoretical predictions were made; 3) the initial hypothesis was tested on the text corpora. The corpora were created in two steps. First we selected a significant amount of texts available for Ossetic and merged them into a single text collection. Then we supplied the collection with specific search tools. The initial hypothesis was confirmed, which made our field results more accurate and allows further elaboration of the syntactic structure that we proposed for Ossetic nominalizations.

Corbett Greville G. Surrey Morphology Group

Lexical splits and morphological complexity

A key notion in understanding and modelling language is ‘possible word’. While some words (lexemes) are internally homogeneous and externally consistent, we find others with splits in their internal structure (morphology) and inconsistencies in their external behaviour (syntactic requirements). I begin with the characteristics of the simplest lexemes, adopting the approach of Canonical Typology. In this approach, we push our definitions to the logical limit, in order to establish a point in the theoretical space from which we can calibrate the real examples we find. Defining canonical inflection allows us to schematize the interesting phenomena which deviate from this idealization. These include suppletion, syncretism, deponency and defectiveness. I then look at the different ways in which lexemes are ‘split’ by these phenomena.

Grishina E. Institute of Russian Language

Multimodal clusters in spoken Russian

The paper introduces the notion of multimodal cluster (MMC). An MMC is a multicomponent spoken unit, which includes the dyads “meaning + gesture” or “meaning + phonetic phenomenon” (double MMC) or the triad “meaning + gesture + phonetic phenomenon” (triple MMC). All components of the same MMC are synchronized in speech, with the gestural and phonetic components conveying the same idea as the semantic component (naturally, with the means available to them). To put it another way, an MMC is a combination of speech phenomena of different modes (semantic, visual, sound), which are closely connected in the spoken language and, roughly speaking, mean the same, i. e. convey the same idea by their own means. The paper describes some examples of double and triple MMCs specific to spoken Russian.

Iomdin Boris IRL RAS, Piperski Alexander Moscow State University, Russo Maksim Somin Anton

How different languages categorize everyday items

Classifications of everyday items (category words for clothing, stationery, personal hygiene, beauty products etc.) are studied. A survey of 41 languages was performed. Several results are reported, in particular: 1. Speakers of some languages provide generic terms relatively easily, while speakers of other languages often find this task difficult. 2. Some items (such as keys, ear plugs, umbrellas) are virtually unclassifiable in all languages. 3. All languages have covert classes without well-established names (such as personal hygiene or data storage), and people either resort to awkward official phrases like Russian предметы личной гигиены or highly colloquial occasional words like Russian мыльно-рыльное. For items belonging to such classes, high variation of category words was observed. 4. Classes existing in several languages often overlap and include different items. So, посуда in Russian corresponds to dishes, cookware and cutlery in English. Possible areas of further research are discussed, including studies of language acquisition and bilingualism and comparisons with folk biology and folksonomies.

Iomdin L. L. A. A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Lobanov B. Hetsevich Y. United Institute of Informatics Problems, National Academy of Sciences of Belarus

The talking ETAP. Using the ETAP parser in Russian speech synthesis

The paper presents an attempt to create an experimental hybrid system of Russian speech synthesis, which makes use of surface-syntactic analysis of the text to be read. The syntactic structure of the sentence, a labeled dependency tree formed by the parser, provides better speech parameters as compared to the classical system of speech synthesis, which does not take explicit account of the information on how words are related in a sentence. The hybrid algorithm works as follows: the text to be read is sent to the parser of the ETAP-3 linguistic processor sentence by sentence; the ready syntactic structure of each sentence is treated by a number of specially designed rules that mark certain elements of the sentence as prosodically salient, specifying several element types like sentence head, last element of noun phrase etc. The Multiphone speech synthesis module uses this information to produce intrasentential pauses and emphasize certain words or word groups.

Karpenko M. P. Protasov S. V. Rambler

Some methods for language model pruning

This paper describes a system for pruning statistical language models. We present a method for pruning the internal vocabulary, which is built completely automatically from user requests and texts drawn from the Internet. The spellchecker is one of the components of a search engine and uses this dictionary for language modeling. The described methods can significantly reduce the size of a language model and open the possibility of improving spellchecker quality. Experimental results show an improvement in the efficiency of the spellchecker. In our tests the pruning method removed 48 % of the language model without sacrificing quality (in fact, quality went up by 2.7 %). This reduction resulted in a speed increase of 87 %. Pruning the model allows a greater volume of query logs to be used when the amount of available RAM is fixed. This in turn can improve the quality of the spellchecker.

Karpova O. S. RSUH, Rakhilina E. V. IRL RAS, Reznikova T. I. VINITI RAS

Meaning of estimation in semantic shifts of rebranding type in adjectives and adverbs (on the material of the Database of semantic shifts in Russian adjectives and adverbs)

The article focuses on describing the meaning of positive and negative estimation of rebranding type in qualitative adjectives and adverbs (for example, bezumnyj ‘mad’: chelovek ‘man’ / plat’e ‘dress’; blestyashchij ‘shining’: pugovica ‘button’ / obrazovanie ‘education’; dikij ‘wild’: zver’ ‘animal’ / pricheska ‘hairdo’; zolotoj ‘golden’: slitok ‘bar’ / detstvo ‘childhood’; uzhasnij ‘terrible’: zver’ ‘animal’ / vkus ‘taste’ etc.). The investigation is carried out on the material of the Database of semantic shifts in Russian adjectives and adverbs. The work analyses different aspects of the functioning of estimation meanings derived by the semantic shift “re-branding”: semantic zones as sources of estimation meanings, the mechanism of their generation, and lexical combinatorics. We also discuss the interaction of estimation meanings with other meanings of re-branding type: combinations of estimation meaning with meanings of intensity, quantity, size, and variety.

Kibrik A. E. MSU

The basis of natural human language and its main parameters

The paper discusses some, although not all, basic properties of language. I discuss language and sign systems (symbolic signs, indexical signs, iconic signs), as well as the functions of language, including the primary (epistemic, cognitive, and communicative) and the secondary ones (the functions of: social solidarity, individuation, support of social comfort, getting in contact [phatic]; the aesthetic function, the fascination function, the emotional function, and the metalinguistic function). I also treat the main social registers of language (idiolect, subdialect, dialect, language, literary language) and the issues of language death, language change, and linguistic diversity.

Kisseleva X. Vinogradov Institute of Russian Language of the Russian Academy of Sciences

Antonyms in phraseology: formal similarity as a condition of the semantic oppositeness

The paper deals with semantic oppositeness on various levels of the phraseological system. The data come from an attempt to look at Russian phraseology from one particular perspective, i. e. to investigate the role that oppositeness plays between and within idioms. We propose that the semantic oppositeness between idioms can be formed by lexical, contextual and grammatical means. The paper argues that strict antonymy emerges when two idioms have similar structures and are based on the same image. The paper focuses, in particular, on the cases when this oppositeness is formed by negation. Some ways to represent different degrees of negative polarity in the phraseological dictionary are discussed. Different semantic effects related to the negative particle ne in idioms, as in blizhnij svet vs. neblizhnij svet, k licu vs. ne k licu, are examined. Finally, we account for the oppositeness as a regular model of the inner form that manifests itself in series of idioms like k mestu i ne k mestu, star i mlad, ni sest’ ni vstat’ etc.

Kotov A. A. National Research Centre “Kurchatov Institute”, Moscow, Russia

Types of simulated emotional expressive states in the Russian emotional corpus

People often simulate expression of emotions in communication without actual emotional arousal. We suggest that such simulation is forced by other hidden reactions and propose an initial classification. We also extend the architecture of a computer agent to make it able to produce simulated emotions.

Kozerenko A. Russian Language Institute RAS

Gesture idioms and gestures: types of correspondence

The paper considers semantic analysis of Russian idioms, depicting gestures in their inner form. The relationship between the meaning of a gesture and that of the corresponding idiom is examined, as well as polysemy and synonymy relations between idioms, corresponding to the same or different gestures. Definitions of some idioms of the semantic field SADNESS, REGRET, DESPONDENCY are demonstrated. Statements made on the semantics of idioms are illustrated with examples of idiom usage in contemporary texts.

Kozerenko E. B. Institute for Informatics Problems, Moscow

Linguistic motivation for statistical translation models

The paper deals with the problems of parallel texts alignment for enhancing the accuracy and adequacy of translation. Statistical and heuristic models of alignment and transfer are given. The solutions are proposed on the basis of a hybrid grammar, which includes linguistic rules and probabilities of language structures. The goal of the current development is the establishment of matches at the level of meaning, i. e. semantic matches. The meaning can be “packed” in different language structures, so the establishment of cross-language matches and inter-structural synonymy is of prime importance.

Kreydlin G. E. Russian State University for the Humanities

Nonverbal dialog in the history of kinesics

The traditional distinction between synchronic and diachronic gesture studies, which has been the cornerstone of nonverbal semiotics and kinesics, is partly erased when one considers the reflection of gestures in fiction. This paper analyses the descriptions of several somatic signs and nonverbal forms of dialog in two books by J. Swift: "Gulliver's Travels" and "Gulliver's Erotic Adventures". It argues that these remarkable works of art embody both some of the most common 17th and 18th century gestures and Swift's philosophical and scientific ideas concerning public morals, social and personal activities, communication, and notions of language and gestures as lingua franca. I discuss nonverbal acts of that time, purposely performed or uncontrollably leaked, that enhance, improve or disguise verbal messages in the texts. Nonverbal behavior can replace, multiply, or complement language, and Swift's books demonstrate these primary functions of nonverbal sign units vividly and convincingly. In Swift's time gestural, or body, language, as opposed to natural languages, was considered common, plain, comprehensible, pure, forthright, and therefore the most effective in human communication. Face-to-face dialogs of the author's characters and their corporeal activity incorporate many nonverbal signs that most of his contemporaries regarded as universal. However, Swift mocks and sometimes even jeers at these prevalent viewpoints because he did not believe in the uniqueness and universality of body language.

Kryuchkova O. Y. Goldin V. E. Saratov State University

A corpus of Russian dialectal speech: the concept and parameters of evaluation

The concept and parameters of evaluation of a corpus of Russian dialectal speech are discussed based on the comparative assessment of two dialect corpora — dialect corpus within the National Corpus of the Russian Language (DC NCRL) and the Saratov Dialect Corpus (SarDC): the principles of selection of the dialect materials and the criteria of the dialect corpus representativeness; the principles of the speech continuum partition in the corpus; the parameters of textual fragments return; the forms of representation of the dialect texts in the corpus; the types and rules of annotation of the corpus textual basis; the parameters of the dialect texts meta-marking; the representation of nonlinguistic information in the corpus; the possibilities of retrieval queries, optimal for dialect research. The paper proves that the dialect corpus cannot be based on the same model as the corpus of standard language because of the specific character of the dialect material. The dialect corpus must be modeled as a system of corpora of different dialects, representing the main dialect types of the Russian speech. According to the proportionality principle, the textual basis of the corpus of a separate dialect must be aimed at the modeling of communication in this specific dialect, reflecting the main types and forms of the dialect speech, as well as social differentiation of the dialect native speakers and genre and theme structure of the dialect communication.

Kudinov A. S. Voropaev A. A. Kalinin A. L. Project Search@Mail.Ru, Moscow

A high precision method for the recognition of sentence boundaries

We present a machine-learning method of sentence boundary recognition. The approach successfully identifies punctuation marks, such as periods or question marks, that are not sentence boundary markers. In spite of a relatively small initial learning set (which was prepared manually), the accuracy of this approach appears to be no less than 99 % when applied to an average web document. The method is based upon the decision tree technique combined with a tiny set of manually constructed rules that play the role of classification features. The rules are built using a dedicated declarative language, which is briefly described. A comparison of the accuracy of the approach with two freely accessible software products is provided. According to our estimates, the algorithm performs well enough to be used in a real-time environment such as the indexer component of a web search engine. It can also be used to produce large learning sets to train faster machine learning models such as the maximum entropy model.
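
The general idea of rule-based features feeding a tree-like decision can be illustrated with a minimal Python sketch. It is not the authors' system: the abbreviation list, the feature names and the hand-written `is_boundary` function (which stands in for a learned decision tree) are all assumptions made for illustration, and the abstract does not disclose the actual rules or the declarative language.

```python
import re

# Hypothetical abbreviation list; the paper's actual rules are not given in the abstract.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "e.g", "i.e", "etc", "vs"}


def candidate_features(text, i):
    """Return boolean features for the punctuation mark at text[i]."""
    before = text[:i]
    after = text[i + 1:]
    prev_match = re.search(r"(\S+)$", before)
    prev_token = prev_match.group(1) if prev_match else ""
    next_match = re.match(r"\s*(\S)", after)
    next_char = next_match.group(1) if next_match else ""
    return {
        "prev_is_abbreviation": prev_token.lower() in ABBREVIATIONS,
        "prev_is_initial": len(prev_token) == 1 and prev_token.isupper(),
        "followed_by_space": after[:1].isspace() or after == "",
        "next_is_upper": next_char.isupper() or next_char == "",
        "between_digits": before[-1:].isdigit() and after[:1].isdigit(),
    }


def is_boundary(features):
    """Hand-written stand-in for a learned decision tree over the features."""
    if features["between_digits"]:
        return False  # decimal point, e.g. "3.5"
    if not features["followed_by_space"]:
        return False  # punctuation inside a token, e.g. a URL
    if features["prev_is_abbreviation"] or features["prev_is_initial"]:
        return False  # "Dr." or "A." before a surname
    return features["next_is_upper"]


def split_sentences(text):
    """Split text at every punctuation mark classified as a boundary."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in ".?!" and is_boundary(candidate_features(text, i)):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

In a trained system the body of `is_boundary` would be replaced by a decision tree induced from the manually prepared learning set, with the boolean features as its inputs.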

Kustova G. I. Moscow State Pedagogical University

Constructions with abstract nouns in an electronic database

The paper discusses the types of abstract noun constructions and the types of information in an electronic dictionary (lexical database). The electronic dictionary includes “non-nominative” items which are used as predicates (e. g. X в плену, в обмороке, в отчаянии, на тренировке, под арестом), as sentence modifiers (В заключении он научился шить рукавицы), as adverbial modifiers (спрыгнул на ходу; ушел со службы под предлогом болезни), as parentheses (во всяком случае, он нам ничего не обещал). The electronic dictionary includes such types of information on abstract noun constructions as the formal structure, the syntactic function, and the semantic type.

Kuznetsov I. P. Institute for Informatics Problems of the Russian Academy of Sciences, Moscow

Identifying role functions of people on the basis of knowledge structures

The paper considers a linguistic processor that extracts knowledge structures (information objects and their links) from natural language texts. The development of the processor is connected with extracting implicit information, e. g. role functions of people. The proposed extraction methods are based on the analysis of knowledge structures. The methods are used for identification of role functions of people involved in criminal cases reported in law texts.

Letuchiy A. B. Russian Language Institute of Russian Academy of Sciences, Moscow Higher School of Economics

Pronominalization of sentential arguments in Russian

The article deals with the distribution of the three Russian pronouns that refer to a sentential argument, namely eto, tak and takoe (e. g. Vasja ne priedet. Ja eto znaju ‘Vasja will not come. I know it’). Each of them has its particular distribution, including contexts where neither of the other two pronouns can be used. I show that the pronoun takoe is usually used in the context of negation and modal operators, but only rarely occurs in affirmative sentences with a verb in the indicative mood. The pronoun eto, contrary to tak and takoe, can be used with concrete descriptions of speech acts, including supplementary characteristics of speech, such as loudness, whereas tak is incompatible with these characteristics. The individual properties of the pronouns are reflected in their distribution in the corpus data. For instance, the proportion of infinitive clauses among the uses of the pronoun takoe is much greater than for the two other pronouns, which agrees with the tendency of takoe to be used in modal contexts. Finally, I show the difference between the uses of takoe where the pronoun refers to an NP and those where it refers to a sentential argument. In the former case, takoe always denotes a class of entities, whereas in the latter case, takoe can denote one particular content of a speech act. This difference has to do with different referential properties of NPs (objects) vs. propositions.

Levontina Irina Russian Language Institute

On some non-assertive verbs

The meaning of a word is determined not only by the components it consists of but also by the status of each component in the logical structure of the word’s meaning. The paper deals with a group of Russian verbs with a very peculiar logical structure and unusual syntactic properties. Their meaning is confined to non-assertive components, while the assertion is conveyed by the subordinate verb. In their semantic structure they are therefore similar to some discourse markers (particles, etc.). The verbs in question are udat’sia ‘manage’, ugorazdit’, udosuzhit’sja, spodobit’sja, zablagorassudit’sja, soizvolit’, soblagovolit’, posmet’ [≈dare], imet’ smelost’, vzjat’ (vzjal i sdelal) etc., most of them hard to translate. Some such phrases can be approximately translated into English with the verb to do [On spodobilsia prijti ≈ He did come]. Thus the meaning of the sentence On soblagovolil prijti is confined to the message ‘He came’ and a combination of the speaker’s attitudes and expectations. Some of these verbs are negative polarity items (e. g. udosuzhit’sja), while others have positive polarity (e. g. ugorazdit’). Special attention is given to the verbs udosuzhit’sja and potrudit’sja, which express the idea of being ready to make efforts. Interestingly, the meaning of these two verbs, including its logical structure, has been changing during the last 200 years. The paper demonstrates how their present meaning has taken shape.

Litvinenko Alla M. V. Lomonosov Moscow State University

Speech reporting strategies in Russian comics-based stories

The paper considers the factors that influence the choice of speech-reporting strategies in Russian spoken discourse. Ten speakers were asked to produce stories based on a series of pictures that included empty speech ‘bubbles’. The experiment resulted in 2 sets of stories, the first produced while looking at the pictures, and the second several hours later, without using the pictures. In order to be able to analyze the matching instances of reported speech from different speakers, we marked 10 positions in the pictures where speech was possible. We show that not all such positions are actually used by the speakers to produce reported speech; that direct speech does not seem to be the prevailing type, at least in this case; and that there is no significant difference between telling and retelling a story as regards the choice of speech-reporting strategies. We argue that the importance of an episode for the story, the need to portray the characters, and personal preferences in style should be considered significant factors for a speaker choosing the most adequate form of speech reporting.

Lobanov Boris Hetsevich Yury United Institute of Informatics Problems NAS Belarus

Statistical characteristics of syntagmatic segmentation of utterances from the viewpoint of expressive text-to-speech synthesis

We describe the results of a statistical study of text segmentation into phrases that occurs during expressive reading of Russian fiction by a professional speaker (actor). The purpose is to find out whether part-of-speech tags could be used to predict breaks between phrases in a sentence. The experimental material was Anton Chekhov's story "A Hunting Drama", presented in text (54 thousand words) and sound formats (an audio book with 7 hrs playing time). This material was divided into two parts: the initial segment of the tagged text of the story containing 420 sentences (ca. 6000 words) and the rest of the text (untagged). The untagged part was used for model evaluation. Prosodic phrases were manually tagged by a professional auditor (a phonetician) who listened to the text. The total number of tagged phrases in the initial 420 sentences was 1516 (of which 710 had pauses no longer than 100 msec and 380 had longer pauses). The average number of phrase breaks in a sentence was 3.6, while the average length of a phrase was 4 words. Pairs consisting of words belonging to 11 different parts of speech or POS-like morphological classes were investigated: adjective, adverb, conjunction, gerund, interjection, parenthetical word, noun, numeral, participle, pronoun, and finite verb. In addition to POS information, the statistical analysis takes account of punctuation marks appearing in the sentence (commas, hyphens, dashes, colons, semicolons and parentheses). Quantitative distributions have also been obtained for phrase breaks occurring in the pairs "punctuation mark, part of speech", "part of speech, punctuation mark", "space, part of speech", and "part of speech, space". The potential of using this data in an expressive text-to-speech synthesis system is considered.

Logacheva V. K. Klyshinsky E. S. Keldysh IAM RAS

Non-stochastic learning of cross-language transliteration rules from a small dataset

We present a language-independent method of generating rules for machine transliteration. The generation of rules is based on the analysis of a test dataset, which contains names written in the source language and their transliterations into the target language.

Loukachevitch Natalia V. Lomonosov Moscow State University; Dobrov Grigory B. Lomonosov Moscow State University; Kibrik Andrej A. Institute of Linguistics, Russian Academy of Sciences; Linnik Anastasia S. Lomonosov Moscow State University; Khudyakova Mariya V. Lomonosov Moscow State University

Factors of referential choice: computational modeling

Referential choice between various referential expressions, such as descriptions, proper names, and pronouns, depends on a variety of factors. We present recent results of our modeling study into referential choice, based on the RefRhet corpus. The account of additional factors and the employment of mixed machine learning techniques enabled an improvement of referential choice prediction. This applies both to the two-way choice between full NP and pronoun and to the three-way choice “descriptive full NP vs. proper name vs. pronoun”. We have demonstrated that the great majority of the factors taken into account are significant for modeling the referential choice.

Lukashevich N. Yu. Kobozeva I. M. Moscow State University

Character nominations in ontological perspective

The focus of this research is on ways to represent the meaning of character nominations, i. e. words naming either a person according to the person’s traits of character, or the characteristic itself, and providing an insight into naive psychology. An important feature of this lexical semantic group is that we attribute characteristics denoted by them to a person by generalizing from specific cases of the person’s behaviour. Therefore the meaning of such words can be understood correctly only when both linguistic and extralinguistic information is taken into account. The paper analyses how knowledge in this sphere can be represented in an ontology.

Lyudovyk T. V. Robeyko V. V. Pylypenko V. V.

Automatic recognition of spontaneous Ukrainian speech based on the Ukrainian broadcast speech corpus

The paper focuses on automatic recognition of spontaneous Ukrainian speech, introducing the Acoustic Corpus of Ukrainian Media Speech (ACUMS). Three configurations of a speech recognition system are considered. Special attention is paid to training basic and thematic acoustic and linguistic models as well as to the lexicon that contains word transcriptions reflecting spontaneous pronunciation. The basic acoustic model was trained on recordings from approximately 2,000 speakers (52 hours). The basic language model was trained on ACUMS texts and on texts taken from the Internet (400 Mb). Spontaneous variants of word transcriptions were obtained automatically based on standard Ukrainian pronunciation. Experimental results show that clear normative speech is recognized 50 % better than less intelligible speech with hesitations and reductions. Errors are due mainly to erroneous speech corpus annotation, non-vocabulary words (proper names in particular), spontaneous manner of pronunciation, short reduced words (conjunctions and prepositions), and a strong impact of the language model on the algorithm searching for the best word sequence.

McCarthy Diana Lexical Computing Ltd.

Exploiting distributional similarity for lexical acquisition

Lexical acquisition has been dubbed the bottleneck of large scale robust natural language processing applications for at least two decades. There is now a substantial body of research dedicated to this important subfield of computational linguistics. Since the 1990s, researchers have turned to corpora for automatic lexical acquisition, rather than rely on extraction from existing online lexical resources. This allows for coverage of new domains, genres and languages without existing resources and where available resources do not provide sufficient coverage or require tailoring to the specific text type. A large body of lexical acquisition from corpora uses distributional similarity whereby the similarity between two words is calculated from the extent that the words have similar contexts of occurrence. Distributional similarity approaches are used for smoothing unseen events using data from seen events. They are also used as an approximation of semantic similarity since there is a strong tendency for words that exhibit similar distributional behaviour to share in their underlying semantics. This paper provides a summary of research that I, along with various collaborators, have conducted using distributional similarity to automatically acquire sense frequency information, selectional preferences and estimates of semantic non-compositionality of putative multiwords.

Mikheev M. Ju. Moscow Lomonosov State University

Multiple narrators in Varlam Shalamov’s texts

I examine the author’s point of view in 135 prose texts taken from the Kolyma Tales (KT) by Varlam Shalamov. I consider certain characteristics of non-trivial cases that might be called I-narration and He-narration (first- and third-person narration), considering not just the narrator’s perspective alone, but also whether that person is called by another name by others or remains nameless. The result: the stories in the primary cycle contain a few types of narration at different levels, but by the end of the KT, the multiple incarnations of the author start to decrease and the text gradually approaches traditional autobiography.

Nikolaeva Y. MSU

Illustrative gestures as markers for discourse macrostructure

The paper explores interrelations between discourse structure and gestures accompanying oral narration. It shows how illustrative gestures reveal discourse macrostructure. Certain issues of speech production and comprehension are discussed with regard to the role played by the gesture.

Paducheva E. V. Russian Academy of Science, VINITI

Meanings, diatheses and ontological categories of the Russian word vpechatlenie ‘impression’

The Russian word vpechatlenie ‘impression’ is usually included in the class of emotions, as well as the verb vpechatljat ‘to make impression’. But the derivational relationship between the noun and the verb remains unclear: dictionaries explicate the meaning of the verb vpechatljat with the help of the verb phrase proizvodit’ vpechatlenie ‘produce impression’, which does not help. The noun vpechatlenie is characterized by an idiosyncratic combinability (not attested for other nouns of emotion) and an irregular polysemy. In this paper vpechatlenie is treated as motivated not by the verb vpechatljat, but by the verb vpechatlet’ ‘to produce an imprint’, which existed in the Russian language up to the beginning of the 19th century but later disappeared. This verb belongs to the class of image creation verbs, such as depict (something as something), represent (something as something), etc. It used to have an uncommon diathesis: A vpechatlel na/v Y-e obraz Z X-a = ‘A created on/in Y the image Z of X’. Or, take a non-agentive variant: X vpechatlel na/v Y-e svoj obraz Z = ‘X created on/in Y its image Z’. The participant X is, as a rule, the consciousness of a human being. The verb vpechatlet’ makes all the relationships transparent. It becomes possible (i) to reveal the derivational patterns corresponding to the different meanings of vpechatlenie and to assign ontological categories to these meanings; (ii) to describe combinability of the word as an effect of its ontological categories; (iii) to uncover semantic relationships between different meanings. In this way we get an account of the unique position of the word vpechatlenie among the nouns of emotion. Still, the language of the Internet demonstrates that the word vpechatlenie experiences pressure from its neighbors and gradually acquires the combinability characteristic of prototypical nouns of emotion, namely, of the names of states. In particular, the verb phrase ispytat’ vpechatlenie, lit. ‘experience impression’, becomes frequent, by analogy with ispytat’ udovol’stvie ‘pleasure’, ispytat’ radost’ ‘joy’, etc.

Pazelskaya A. Solovyev A. “I-Teco”, Moscow, Russia

A method of sentiment analysis in Russian texts

This paper presents an overview of methods of sentiment analysis. It also describes our experience of building a system for detecting sentiment in natural Russian texts (mass media). The system uses a rule-based approach, calculating sentiment within a simple clause on the basis of word sentiment, the output of a Natural Language Processing (NLP) module, and rules of sentiment combination. Word sentiment is determined in sentiment dictionaries created and regularly updated by experts (more than 15000 words and collocations by now). The system uses separate dictionaries for different parts of speech: nouns, verbs, adjectives, adverbs, verbal and non-verbal collocations. Every word and collocation in the dictionary is marked for its sentiment polarity and sentiment strength. The NLP module provides morphological and syntactic information (NPs, complex verbs, syntactic roles, clause types and boundaries, etc.). This information is further used to combine word sentiment and to identify the sentiment of subject and object within a clause, as well as of the clause as a whole and of the monitored object within the clause. The system is regularly tested by experts on new mass media texts; it shows about 80 % recall and 90 % precision.
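
How word-level polarity and strength might be combined by rules into a clause score can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the system described: the lexicon entries, the negator and intensifier lists and the linear left-to-right combination are assumptions, whereas the abstract states that the actual system relies on syntactic information from an NLP module and on expert-built Russian dictionaries.

```python
# Hypothetical mini-lexicon: word -> (polarity, strength).
LEXICON = {
    "good": (1, 1.0),
    "excellent": (1, 2.0),
    "bad": (-1, 1.0),
    "terrible": (-1, 2.0),
}
NEGATORS = {"not", "never"}                     # assumed negation markers
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}   # assumed strength multipliers


def clause_sentiment(tokens):
    """Combine word sentiment within one clause using two simple rules:
    a negator flips the next sentiment word, an intensifier scales it."""
    total = 0.0
    negate = False
    scale = 1.0
    for token in tokens:
        word = token.lower()
        if word in NEGATORS:
            negate = True
        elif word in INTENSIFIERS:
            scale *= INTENSIFIERS[word]
        elif word in LEXICON:
            polarity, strength = LEXICON[word]
            score = polarity * strength * scale
            total += -score if negate else score
            negate, scale = False, 1.0  # modifiers apply to one word only
    return total
```

For example, `clause_sentiment("the service was not very good".split())` yields a negative score because the negator flips the intensified positive word.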

Piperski A. Ch. Moscow State University

Generic terms in everyday vocabulary as a sphere of subtle differences between Serbian and Croatian

There are significant differences in the everyday vocabulary of Serbian and Croatian. The speakers are aware of diverging specific terms (e. g., words for ‘spoon’, ‘glasses’, ‘passport’), but they fail to notice some diverging generic terms (words for ‘kitchenware’, ‘cutlery’, ‘writing supplies’). This is explained by the fact that generic terms show considerable amount of variation even within one language and cannot serve as markers of identity.

Podlesskaya V. I. Russian State University for the Humanities

Relative clauses in spoken Russian and elsewhere: a corpus approach

The paper addresses the problem of discrepancy between syntactic and prosodic grouping in Russian relative clauses. Based on oral corpora systematically annotated for prosodic details, the paper demonstrates the structural and prosodic “autonomy” of relative clauses from their heads, which has previously remained unnoticed in the literature on relativization, since that literature is mainly based on written data.

Potemkin S. B. Kedrova G. E. Philological faculty, Moscow State University, Moscow, Russia

Exploring semantic orientation of adverbs

Sentiment analysis often relies on a semantic orientation lexicon of positive and negative words. Determining the semantic orientation of words is necessary for correct estimation of the content of statements in the media, on the Internet, and in writing and speech. Qualitative adverbs expressing evaluation, intensity, and direction of action are important as modifiers of the main sentence predicate. In this paper we propose a method for extracting a seed set of adverbs from a collection of antonym pairs. We also demonstrate a model that represents a set of synonyms from Russian lexicons as a graph and determines the semantic orientation of adverbs along the three main dimensions of the semantic differential. An assessment of the method's performance against dictionary data shows that the method is effective.

Renkovskaya E. ABBYY

Some peculiarities of the syntactic structure of Russian proverbs: a study of one-predicate sentences

The paper discusses some peculiarities of the syntactic structure and word order in Russian proverbial sentences with one verb as a predicate. We argue that the syntactic structure of proverbs is dependent on their general pragmatic purposes. The paper focuses on the syntactic features that make proverbs a specific type of Russian sentences.

Romanov Aleksandr S. Mescheriakov Roman V. Tomsk State University of Control Systems and Radioelectronics

Gender identification of the author of a short message

Gender identification of the author of a short message (20–200 characters) is studied. The paper describes a set of experiments with short message texts performed using a support vector machine approach. The task is viewed as a classification problem with two possible alternatives: male and female. Important features of short messages to be considered when determining the author’s gender are singled out. The database of electronic communications collected for research included 41780 posts by 15 men and 15 women. Experiments used a software system, Avtoroved, developed by the paper’s authors. Altogether, about 50 text attributes at the level of symbols, words, sentences and their combinations were studied. As a result, relevant characteristics of short messages were identified: unigrams and trigrams of symbols, function words, punctuation and emoticons. The total accuracy of gender identification was 0.74.
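
The kinds of attributes named above (symbol unigrams and trigrams, function words, punctuation, emoticons) can be extracted with a short Python sketch. It covers feature extraction only and is not the Avtoroved system: the function-word list, the emoticon pattern and the feature naming scheme are assumptions, and the resulting counts would still have to be passed to a classifier such as a support vector machine.

```python
import re
from collections import Counter

FUNCTION_WORDS = {"and", "but", "the", "of", "in"}  # hypothetical list
EMOTICON_RE = re.compile(r"[:;]-?[()DP]")          # assumed emoticon shapes


def char_ngrams(text, n):
    """Count all overlapping character n-grams in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def message_features(text):
    """Build a sparse feature vector (feature name -> count) for one message."""
    features = Counter()
    for gram, count in char_ngrams(text, 1).items():
        features["c1=" + gram] = count
    for gram, count in char_ngrams(text, 3).items():
        features["c3=" + gram] = count
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if word in FUNCTION_WORDS:
            features["fw=" + word] += 1
    features["emoticons"] = len(EMOTICON_RE.findall(text))
    features["punctuation"] = sum(1 for ch in text if ch in ".,!?;:")
    return features
```

Note that in this sketch the colon of an emoticon is also counted as punctuation; a real feature set would need to decide how such overlaps are handled.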

Savchuk S. O. Institute for Russian Language RAS

A corpus-based study of morphological variability: variation in gender forms of Russian nouns

The paper presents the results of a corpus-based study on gender variation in Russian nouns. The list of variants was composed by analyzing textbooks and dictionaries compiled at the beginning and in the second half of the 20th century. The total number of gender variants, including outdated and substandard ones, exceeds 600. The variants are classified according to their morphological and semantic features. The next stage of the research is focused on gender variants within the group of indeclinable nouns. The usage of every lexeme from the list was analysed in the texts of the Russian National Corpus, all gender variants were registered in the database, and the correlation between variants was determined. The comparison of corpus data with the data derived from dictionaries made it possible to trace the changes in correlation between variants within the studied period and to formulate some trends in how the variants function.

Seriy A. S. Sidorova E. Institute of Informatics Systems SB RAS

Object identification in problem of automatic document processing

The paper presents an approach to automating the population of an information system with data obtained from automatic document processing. The extracted data must be standardized as a network of information objects of a certain format. The backbone of the technique is building a so-called focus set for every information object found in a text. The focus set for a single information object consists of all the relations between this object and other input entities. There are several separate data processing stages: the search for duplicates, direct search, the search for similar objects, and the search via the focus set technique. A degree of data reliability is also provided. Thus the obsolescence of information, the occurrence of inexact and duplicated data, and conflicts between new data and legacy information are taken into consideration.

Sharoff Serge University of Leeds, UK, Nivre Joakim Uppsala University, Sweden

The proper place of men and machines in language technology. Processing Russian without any linguistic knowledge

The paper describes several experiments aimed at designing tools for processing Russian texts, namely for Part-Of-Speech tagging, lemmatisation and syntactic parsing, exploiting exclusively statistical approaches without coding any linguistic rules specifically for Russian. While not claiming any new ground for machine learning research, the results demonstrate the possibility to create state-of-the-art tools for Russian in very short time using only machine learning and no hard-coded linguistic knowledge. One of the results of this study is a set of publicly available resources which can be used in standard pipelines for processing Russian. However, they also demonstrate hidden costs associated with the use of purely statistical methods and the need to integrate linguistic parameters into statistical procedures.

Shmeleva E. Ya. Shmelev A. D. Vinogradov Institute of Russian Language, Russian Academy of Sciences

Interlingual puns in Russian jokes

The paper deals with a certain mechanism of verbal humor used in Russian jokes, namely, interlingual puns, that is, the contrast between two linguistic expressions of different languages that sound alike. The interlingual puns, entailing the interplay between languages, are based on interlingual homonymy or paronymy and are the products of a transaction between languages. The paper describes various types of interlingual puns.

Sizov V. G. Podlesskaja O. Y. Laboratory of Computational Linguistics, Kharkevich Institute for Information Transmission Problems, RAS

Reflecting accentuation in the Russian morphological dictionary of the multifunctional linguistic processor ETAP-3

Our work is aimed at the introduction of accentual information into the morphological dictionary of the multifunctional linguistic processor ETAP-3. A special formal description language has been created, and special rules for most of the basic accentual schemes have been designed. Special algorithms have been written for morphological analysis and synthesis.

Skatov Daniel S. Liverko Sergey V. Dictum Ltd, Nizhny Novgorod, Russia

Anaphora resolution of the third-person pronoun in texts from narrow subject domains with grammatical errors and mistypings

Third-person pronoun anaphora resolution in texts from Internet sources (forum comments, opinions) belonging to specific subject domains (cars, household appliances etc.) is discussed. A concrete solution is offered. High precision with acceptable recall (and vice versa) is illustrated by an example of opinions on cell phones.

Smirnova N. Chistikov P. Speech Technology Center

Software for automated statistical analysis of phonetic units frequency in Russian texts and its application for speech technology tasks

Currently the development of most speech technology applications is based on the use of pre-recorded speech data produced by one or several speakers. The principal requirement for the speech corpus is sufficient coverage of the speech units involved in a specific task. The type of units may differ depending on the approach adopted. The most popular way of obtaining speech material is to have speakers read some text, since read speech allows strict control over unit coverage (phonetic, prosodic and the like). For the purpose of automating and facilitating the acquisition of text corpora of desired phonetic composition and coverage, a special tool, “TextAnalyser”, has been developed. The software is primarily intended for the development of automatic speech recognition and synthesis systems. It makes use of an electronic dictionary containing 180 000 Russian word forms and is based on an automatic transcription tool developed for the Russian TTS system. It allows the generation of texts with required phonetic coverage, the assessment of several types of phonetic unit frequencies in Russian texts (monophones, diphones, triphones, syllables) and the reduction of data redundancy. TextAnalyser was applied for statistical analysis of a large text corpus in Russian comprising 460 965 words (2 500 288 phonemes). As a result of text processing, frequencies of occurrence were obtained for all relevant kinds of Russian-language phonetic units. In the paper we present ordered monophone and diphone frequency lists. The obtained monophone statistics are compared to previously published data.
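The abstract does not describe how TextAnalyser computes its frequency lists, so the following is only a minimal sketch of the general idea. It assumes that texts have already been transcribed into phoneme sequences and treats diphones and triphones simply as runs of two and three consecutive phonemes. The function names and the toy transcriptions are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Unit = Tuple[str, ...]


def ngram_counts(phonemes: Sequence[str], n: int) -> Counter:
    """Count every run of n consecutive phonemes in one transcribed utterance."""
    return Counter(tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1))


def unit_frequencies(utterances: Sequence[Sequence[str]], n: int) -> List[Tuple[Unit, int]]:
    """Merge counts over all utterances and return them ordered by frequency."""
    total: Counter = Counter()
    for phonemes in utterances:
        # Assumption of this sketch: units never span an utterance boundary.
        total.update(ngram_counts(phonemes, n))
    return total.most_common()


# Toy, hand-made transcriptions; not data from the paper.
utterances = [["m", "a", "m", "a"], ["m", "a", "k"]]
for n, name in [(1, "monophones"), (2, "diphones"), (3, "triphones")]:
    print(name, unit_frequencies(utterances, n))
```

Ordered lists of this kind correspond to the monophone and diphone frequency lists mentioned in the abstract; syllable counts would additionally require a syllabification step, which is not sketched here.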

Sokolova E. G. Semenova S. Yu. Russian State University for Humanities, Moscow, Zagorulko Yu. A. Kononenko I. S. A. P. Ershov Institute of Informatics Systems SB RAS, Novosibirsk, Zakharov V. P. Saint-Petersburg State University, St. Petersburg, Krivnova O. F. Lomonosov Moscow State University, Moscow

Selection and preparation of terms for the Russian-English thesaurus of computational linguistics

The initial phase of the development of a Russian-English thesaurus of terminology in the field of computational linguistics is described. One of the first tasks is the choice of candidate sources of terms allowing for the bilingual nature of the electronic resource. Other problems to be solved are those of terminology extraction and selection of a basic term list, as well as the study of peculiarities of representation of terms and relations between them. The diversity of the field of computational linguistics, its interdisciplinary nature and the lack of Russian terminological sources and term definitions, due to a certain lag of the field in Russia as compared to the English-speaking countries, together explain the kind of decisions made at this stage. One of them concerns the use of the Russian-language corpus of papers presented at the International Conference “Dialogue” (2000–2010). This corpus proved to be a helpful source of terms in real use. Besides, dictionaries as well as indices and glossaries of textbooks and manuals have been examined in order to derive definitions. As an additional source of terms for the Russian part of the thesaurus, the English-language terminological sources have been utilized and their terms and definitions translated into Russian. This is especially important for the terms in some empirical and technologically advanced subfields, such as speech technologies.

Testelets Y. G. Russian State University for Humanities

Case as a characteristic of identity under ellipsis in Russian

In Russian, all elliptical operations except N'-ellipsis require the identity of case values in NPs of the antecedent and the elliptical gap, and identity of role or grammatical relation does not suffice when cases are different. Ellipsis follows the six-case model: the peripheral cases, like partitive or ‘the case of expected object’, pattern with their more standard counterparts. With direct and indirect object NPs, however, case values may be different in the antecedent and the gap, e.g. with recipients, addressees, and the genitive of negation.

Trub V. M. The Institute of Ukrainian language of the National Academy of Ukraine, Kyiv, Ukraine

On the dynamic semantics of the word миф

One of the main problems of modern semantic research is polysemy. One way to explain polysemy is to represent the meaning of a polysemantic word as a result of semantic transitions. The derivative meanings can be explained as a result of shifting attention to one of the components of the initial meaning and suppressing another meaning. We illustrate this principle by the Russian polysemantic word миф ‘myth’. It is shown that different meanings of this noun can be interpreted as a result of focusing attention on the different components of its initial definition, such as the content of the myth, the disproof of this content, and the positive axiological evaluation of different aspects of this content.

Uryson E. V. V. V. Vinogradov’s Russian Language Institute, RAS, Moscow

Concessive conjunction khotja ‘though’ and “cancelled expectation”

The paper is focused on the semantics of the concessive conjunction khotja ‘though’; cf. (1) Khotja pogoda byla ochen’ plokhaja (P), oni kazhdyj den’ kupalis’ (Q) ‘Though the weather was very bad (P), they bathed every day (Q)’. Different definitions of the meaning ‘khotja’ are found in the literature. In typology it is generally agreed that its main components are implication and negation: Though P, Q = ‘usually if P, then not-Q; in this case P and Q’. In traditional Russian grammar it is commonly supposed that the basic semantic component of khotja ‘though’ is “cancelled expectation”: situation P in the subordinate clause induces the expectation not-Q, and this expectation fails in the main clause. Both definitions are adequate for examples like (1). Some new material makes it possible to choose between them. I analyze sentences like (2) Khotja bol’shinstvo potukhshikh vulkanov — eto gory konusoobraznoj formy (P), ne vsjakaja takaja gora — byvshij vulkan (Q) ‘Though most of the extinct volcanoes are conical mountains (P), not every conical mountain is an extinct volcano (Q)’; (3) Khotja Fjodor zarabatyvaet bol’she Ivana (P), on tozhe ne mozhet soderzhat’ semju (Q) ‘Though Fjodor earns more than Ivan (P), he cannot provide for his family either (Q)’. I show that the main semantic component of ‘khotja’ is “cancelled expectation”. In cases like (1) “cancelled expectation” is due to our knowledge of usual causal relations between situations (fixed in frames or scenarios). In cases like (2)–(3) “cancelled expectation” is not due to any frames or scenarios but, rather, to some properties of human consciousness. For instance, (2) presupposes a stable association between volcanoes and conical mountains. This association is the basis of our “expectation”, and the concessive conjunction khotja marks that this expectation fails. (3) presupposes a comparison of two people, and in such a context situation P induces our “expectation”, which fails in the main clause (cf. Grice’s informativity postulate). Thus, in these cases “cancelled expectation” is due to our standard line of reasoning. The concessive conjunction khotja is a means for marking that reasoning of this type is wrong. Some linguistic problems of description of the concessive conjunction khotja are discussed.

Voznesenskaya M. M. V. V. Vinogradov Russian Language Institute, Russian Academy of Sciences

Enantiosemy in Russian phraseology

The phenomenon of enantiosemy in Russian phraseology, its sources and types are considered. Enantiosemy is shown to be connected with different interpretation of an inner form (antiphrasis, implications of different kinds etc.). In some cases enantiosemy arises as a result of pragmatic interpretation of a corresponding situation (emotive connotations).

Yagunova E. V. Pivovarova L. M. Saint-Petersburg State University

A study of the news text structure as a consequence of connected segments

The main object of this study is connected segments (collocations, compound nominations, predicative constructions, multiword expressions, etc.) extracted from the text by different statistical measures and during experiments with native speakers. This paper deals with news texts: i) 2010 news from lenta.ru (40000 texts, 9.5 million tokens); ii) a small highly homogeneous corpus that deals with some particular event: Schwarzenegger in Moscow (360 texts, 110 thousand tokens) and The appointment of Sobyanin (660 texts, 170 thousand tokens); iii) three individual texts about Schwarzenegger and two individual texts about Sobyanin. These texts are part of both the small homogeneous corpus and the large news corpus. In this paper we use an open-source “Cosegment” system (http://donelaitis.vdu.lt/~vidas/tools.htm). The program cuts the text into strongly connected segments depending on the corpus. We study different types of context using overlapping corpora as the input of the system. We also compare results based on the whole corpus and on individual texts from this corpus. During the experiments with native speakers we ask 18 students to put a number from 0 to 5 between every two words in the text. 5 means that these two words are strongly connected, 0 that there is no connection at all. Then we use a cutoff of 3.7 to divide a text into connected segments. Our results are the following: i) Longer connected segments are found in the more homogeneous corpus; ii) Frequent connected segments in highly homogeneous corpora (as opposed to the lenta.ru corpus) are mostly predicative constructions; iii) The computer processing data are very close to the native speakers’ data; iv) Native speakers tend to extract longer segments; they also prefer predicative constructions to collocations.
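The native-speaker procedure described above can be sketched in a few lines. The abstract gives only the 0 to 5 scale and the cutoff of 3.7, so two points here are assumptions: that the annotators' scores for each gap are averaged, and that a mean at or above the cutoff keeps the two words in one segment. The function name and the toy data are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def segment_by_ratings(
    words: Sequence[str],
    ratings: Sequence[Sequence[float]],
    cutoff: float = 3.7,
) -> List[List[str]]:
    """Split words into connected segments.

    ratings[i] holds all annotators' scores (0 to 5) for the gap between
    words[i] and words[i + 1].
    """
    if not words:
        return []
    if len(ratings) != len(words) - 1:
        raise ValueError("need exactly one list of scores per gap between words")
    segments = [[words[0]]]
    for word, gap_scores in zip(words[1:], ratings):
        # Assumption: scores are averaged, and mean >= cutoff means "connected".
        if mean(gap_scores) >= cutoff:
            segments[-1].append(word)
        else:
            segments.append([word])
    return segments


# Toy example with three annotators instead of eighteen.
words = ["a", "b", "c", "d"]
ratings = [[5, 4, 5], [1, 2, 0], [4, 4, 4]]
print(segment_by_ratings(words, ratings))
```

With these toy scores the only gap whose mean falls below the cutoff is the one between "b" and "c", so the text is divided there.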

Yanko T. E. Institute for Linguistics, Russian Academy of Sciences

Accent placement principles in Russian

The basic constituents of intonation structure are pitch accents. Pitch accents designate topic-focus distinctions, contrast, and discourse structure. The question arises as to what phonetic words the accents are placed on. This paper gives an account of various accent placement principles in modern Russian.

Zalizniak A. A. Institute of Linguistics, Russian Academy of Sciences, Mikaelian I. L. Penn State University

Rvat’ zuby and myt’ den’gi: one use of simple imperfectives in Russian

The article discusses a specific stylistic effect produced by the use of simple (non-prefixed) imperfective verbs when, under special conditions, they substitute for prefixed imperfective verbs, as in myt’ den’gi used instead of otmyvat’ den’gi (“to launder money”), varit’ truby instead of svarivat’ (“to weld pipes”), etc. There are three main features that characterize this effect: (1) The verb and its complement constitute a more or less idiomatic expression. (2) The simple imperfective belongs to a lower or substandard register. Thus žeč’ gorški is a low-register use with respect to obžigat’ gorški (“to fire pots”); myt’ den’gi is cruder than otmyvat’ den’gi, and pisat’ pulju is even lower than raspisyvat’ pulju (“to play a game of preference”). (3) Simple imperfectives signal hermetic or jargon-like expressions: they are used in the speech of professional communities or communities of interest, cf. varit’ truby instead of svarivat’, gruzit’ film instead of zagružat’ (“to download a film”). Such pairs of imperfective verbs are not very numerous, but they constitute an open list, which proves that the Russian verb system possesses a mechanism that generates this kind of “quasi” non-prefixed verbs.

Zevakhina Natalia Lomonosov Moscow State University

Exclamatives in Russian: a corpus study

The paper investigates the exclamative use of wh-words in the National Corpus of Russian Language. Exclamation is a verbally uttered speaker-oriented emotion occurring when the state of affairs in the real world violates the speaker’s expectations. The paper shows that, basically, Russian wh-words as exclamatives can be split into four groups according to four criteria: (i) independent exclamative use; (ii) 'anaphoric' use (i.e., with reference to phrases within a particular discourse); occurrence in special syntactic constructions (iii) with the particle tol’ko, (iv) with the particle vot. The first group of wh-exclamatives satisfies all of the criteria, the second group complies with (ii)-(iv), etc. We also discuss the exclamative use of Russian wh-words in sentential arguments and the lexico-grammatical properties of matrix predicates allowing for such contexts, which exhibit a continuum rather than clear-cut classes.

Zimmerling A. V. Moscow State University for the Humanities, MGGU

Scrambling types in the Slavic languages

The paper discusses the types of scrambling in the Slavic languages and in Universal Grammar. It is argued that all kinds of scrambling may be explained as instances of optional movement. Scrambling types are classified on the basis of final and initial movement domains in the clausal complex where sentence categories move. Slavic languages have all four theoretically possible scrambling types of non-clitic elements and both types available for clitic elements. The diagnostic features of clitic scrambling are described for the first time.
