Proceedings 2014

A
Antonova A., Misyurev A.
AUTOMATIC CREATION OF HUMAN-ORIENTED TRANSLATION DICTIONARIES
This paper addresses the issue of automatic acquisition of a human-oriented translation dictionary from a large parallel corpus. Automatically generated dictionary entries can enrich the output of a statistical machine translation system. We describe an automatic approach to the extraction of translation equivalents, and dictionary entry construction: grouping of synonymic translations, selection of illustrative context examples. The extraction of possible translations is based on statistical machine translation methods. The selection of lemmatized and linguistically motivated phrases is done with the help of morpho-syntactic analysis. In contrast to human-built dictionaries, an automatic dictionary usually contains a certain amount of noisy translations, as a consequence of systematic alignment mistakes and corpus imperfections. A noise reduction approach is proposed. We also provide the result of an evaluation experiment and the comparison of frequency distribution of words in the queries to the dictionary and the frequency distribution of words in plain text.
Apresjan V. Yu.
IDIOMATIZATION AND GRAMMATICALIZATION IN NON-STANDARD CONSTRUCTIONS
The paper presents a corpus study of the Russian construction wh-word + negative particle X, P (as in Kak ni trudno, nado starat'sja 'However difficult, one has to try'; Cto on ni prosil, vse emu davali 'Whatever he asked for, he was given') as a typical representative of a certain class of syntactic objects, namely, non-standard constructions, which reveals the following properties: 1) only one or several lexemes ("favorites") account for up to a half of all encountered realizations; 2) non-standard constructions are non-compositional; 3) realizations with certain "favorites" result in idiomatization and grammaticalization of particular expressions which become separated from the "mother" construction. The choice of "favorites" is triggered by the process of mutual semantic attraction: the interaction of the construction semantics and the semantics of filler lexemes. This choice is also influenced by the linguistic worldview typical of a particular language.
Astrakhantsev N. A., Fedorenko D. G., Turdakov D. Y.
AUTOMATIC ENRICHMENT OF INFORMAL ONTOLOGY BY ANALYZING A DOMAIN-SPECIFIC TEXT COLLECTION
The core part of an entity linking system, in particular one oriented to wikification, is an ontology, which is often informal and supports semantic relatedness as the only type of relation. Most of these systems suffer from the problem of ontology incompleteness. The problem is especially acute for specific domains, since often the only source of extractable knowledge is plain text. This paper formulates the incompleteness problem as a task of ontology enrichment from domain-specific texts and presents a novel approach that combines state-of-the-art methods for terminology enrichment, our own ML-based method for homonymy detection, and methods for relation extraction adopted from a related field. Experimental evaluation shows that the bottleneck is the terminology enrichment step: its average precision is about 35%, which is too low for fully automatic use, especially taking into account the strict requirements for ontology correctness; however, recall is high enough to support semi-automatic terminology enrichment. We also show that the best features for terminology enrichment differ from those for the classic terminology recognition task.
B
Baranov A. N.
ACTIVITY OF PARTICIPANTS IN A CONVERSATION: METHODS OF LINGUISTIC ANALYSIS
The paper deals with the phenomenon of activity of dialogue participants. Analysis of participants' activity in a conversation is of great importance for theoretical linguistics as well as for applied linguistics. In forensic linguistics, analysis of activity can be used as an objective parameter for the qualification of real communicative goals of participants. The paper introduces three major methods of analysis of the phenomenon discussed: 1) the method of communicative activity, i.e. the number of illocutionarily independent speech acts of a participant in a dialogue or in its relevant part; 2) the method of thematic activity, which makes it possible to detect exactly which participant independently introduces the main themes in a conversation; 3) the method of quantitative activity, based on calculating the number of words associated with a specific theme in a conversation. We discuss the different types of correlation between the three methods.
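As a rough illustration of the third method, quantitative activity can be sketched as a per-participant count of theme-related words. The transcript format, the theme lexicon and the function name below are hypothetical and are not taken from the paper; real analysis would presumably also need lemmatization.

```python
import re
from collections import Counter


def quantitative_activity(turns, theme_words):
    """Count, per speaker, the tokens that belong to a theme lexicon.

    turns: list of (speaker, utterance) pairs (hypothetical transcript format).
    theme_words: set of lowercase word forms associated with the theme.
    """
    counts = Counter()
    for speaker, utterance in turns:
        tokens = re.findall(r"\w+", utterance.lower())
        counts[speaker] += sum(1 for token in tokens if token in theme_words)
    return dict(counts)


# Toy dialogue: speaker A uses theme words twice, speaker B once.
turns = [
    ("A", "We should discuss the payment today."),
    ("B", "Which payment?"),
    ("A", "The money for the contract."),
]
activity = quantitative_activity(turns, {"payment", "money"})
```

Comparing such counts across participants gives one simple, reproducible indicator of who carries a given theme in the conversation.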
Baroni M.
MULTIMODAL AND CROSS-MODAL DISTRIBUTIONAL SEMANTICS: TOWARDS COMMON SEMANTIC SPACE FOR WORDS AND THINGS
Distributional semantic models (DSMs) capture various aspects of word meaning with vectors summarizing their patterns of co-occurrence in large text corpora, under the assumption that the contexts in which words occur are good cues of what they mean. DSMs have been very successful empirically, and they have been used to model increasingly sophisticated linguistic and cognitive phenomena. However, current DSMs account for linguistic meaning entirely in terms of linguistic signs (the "meaning" of a word is a summary of the linguistic contexts in which the word occurs). This leads to two big conceptual problems: lack of grounding and lack of reference. Concerning the former, cognitive scientists have accumulated plenty of evidence that, for human beings, meaning is strongly embodied in the sensory-motor system, so a semantic theory that completely dissociates meaning from perception and action is, a priori, a rather implausible model of how humans work. This also has empirical consequences: DSMs perform surprisingly badly on simple tasks requiring perceptual information. Lack of reference is perhaps an even more serious problem. A theory that has no way to connect the semantic representation of a linguistic expression to states of the world is clearly missing something fundamental about language, as it has no way to explain how we can talk about things! Interestingly, in the last decade, it has become common in computer vision to represent images through vectors recording the distribution of automatically extracted discrete visual features in them, a representation that is very similar to the one that DSMs assume for words. This suggests that we might be able to free DSMs from their textual cage by establishing a connection with the visual world by means of such vector-based image-representation techniques.
In my talk, after a brief general introduction to distributional semantics, I will discuss experiments we carried out in the last few years in which we tackle the grounding problem (DSMs with richer multimodal semantic representations that combine linguistic and visual features), and recent work in which we started dealing with the reference issue (how to map images and linguistic expressions across modalities to a common space, in order to link language to the world out there). The case studies I will present include simulating human semantic similarity judgments, predicting the color of objects, modeling brain data and learning names and verbally-expressed attributes of objects present in pictures from indirect evidence.
Belikov V., Kopylov N., Selegey V., Sharoff S.
VARIATIONAL CORPUS STATISTICS USING AUTHOR PROFILES
This paper is based on research carried out in the framework of our project on the General Internet Corpus of Russian (Geekrya). The need to use large-scale corpora automatically collected from the Web was first recognized in computational linguistics. Recently, the lack of data in "manually-built" corpora led to recognition of the importance of Web-derived corpora in traditional linguistic research.
Blinov P. D., Kotelnikov E. V.
USING DISTRIBUTED REPRESENTATIONS FOR ASPECT-BASED SENTIMENT ANALYSIS
The article is focused on aspect-based sentiment analysis, which is a specific version of the general sentiment analysis task. Its goal is to detect the opinions expressed in the text on the level of significant aspects of the specified entity. An overview of the existing approaches and previous work is presented. The main result of our work is a new method of aspect-based sentiment analysis based on the distributed representations of words. Such representations are obtained by using deep learning algorithms. The method includes the well-known algorithm of training distributed representations of words, two new techniques for constructing the aspect and sentiment lexicons, and an algorithm for calculating aspect scores. Examples of aspect and sentiment terms are given. The vectors of resulting terms are visualized using the t-SNE method. The article presents the results of experiments on a test corpus for three aspects ("food", "interior" and "service"), which yield an F1-measure increase of 11 to 16% compared to the baseline.
Bogdanova-Beglarian N. V.
ONE OF THE MOST FREQUENT ITEMS IN RUSSIAN SPONTANEOUS SPEECH: БЛИН FROM LINGUISTIC AND SOCIOLINGUISTIC POINTS OF VIEW
The paper is dedicated to some peculiar functions of one of the most frequent items in Russian spontaneous speech, "блин" (which, while formally a word, is in fact more of a functional item). Using a Russian speech corpus (the 'One Speech Day' sub-corpus), we explored the historical change of the item: from an interjectionally used euphemism for an extremely rude slang word meaning 'whore', through an acceptable colloquialism, to an almost meaningless clitic. So the evolution of this word begins at the point of being absolutely unacceptable in everyday speech, continues through being common and existing in any kind of neutral speaking, and ends as an ornamental word that has probably lost the connection with its first meaning completely. The final item does not have any meaning, lacks grammatical categories, is not marked by intonation and has almost no emotional connotation. Normally such words are mostly used by men; but in this particular case gender does not play any role.
Bogdanov A. V., Dzhumaev S. S., Skorinkin D. A., Starostin A. S.
ANAPHORA ANALYSIS BASED ON ABBYY COMPRENO LINGUISTIC TECHNOLOGIES
This paper presents an anaphora analysis system that was an entry for the Dialog 2014 anaphora analysis competition. The system is based on ABBYY Compreno linguistic technologies. For some of the tasks of this competition we used basic features of the Compreno technology, while others required building new rules and mechanisms or making adjustments to the existing ones. Below we briefly describe the mechanisms (both basic and new) that were used in our system for this competition.
Borisova E. G.
THE DISCOURSE WORDS AND REFERENCE IN THE PROCESS OF UNDERSTANDING
The article addresses some aspects of the process of understanding, namely the reference of the components of utterances. The referential activity of the Hearer is regarded as a part of his actions in analyzing the sentences. These actions of the Hearer can be corrected by the Speaker with the help of discourse markers, modal particles etc. These entities are not markers of the referential status of nouns, but they can help reveal this status in some complicated cases, as follows. A non-actualized (though definite) name is used as a topic of an utterance: the Russian particle -to, which marks such topics, can help reveal the definite status of the name. Some other complicated cases of topic formation can be marked by the particles vot and von. A word that denotes something known (maybe mentioned in the previous context) can be revealed with the help of the particles vot, imenno, kak raz. This concerns not only nouns but also predicates. The indefinite status of a noun can be demonstrated with the help of the particle tam, which is used to denote the unimportance of some fact or a noun.
Borschev V. B., Partee B. H.
ONTOLOGY AND INTEGRATION OF FORMAL AND LEXICAL SEMANTICS
Formal and lexical semantics can be integrated if they speak the same language. We claim that a substantial part of lexical semantics can be incorporated into formal semantics without adding to the latter any new mechanisms. This talk continues the authors' work on the ontology and the semantics of measure constructions in Russian. The work concerns expressions like dva stakana moloka, polkorziny gribov, tri meshka muki (two glasses of milk, half a basket of mushrooms, three bags of flour), etc., describing various kinds of containers, or corresponding measures based on them, and their contents—portions of substances. In our previous works, describing ontological information, including sorts of things and the words and expressions that designate sorts, we did not include those sorts in our formal semantic analyses. We do that in the present work, declaring sorts as types and thereby significantly expanding Montague's system of types. On the one hand this gives us the means for specifying various aspects of the ontology, and on the other hand it lets us more fully specify the semantics of the constructions under consideration. The substantive goals of this research are, in part, to be able to describe and explain co-occurrence constraints and ideally to be able to formally distinguish well-formed from ill-formed expressions in this domain.
D
Dikonov V. G., Poritski V. V.
A VIRTUAL RUSSIAN SENSE TAGGED CORPUS AND CATCHING ERRORS IN A RUSSIAN ↔ SEMANTIC PIVOT DICTIONARY
There are areas in computational linguistics where a word-sense tagged corpus is a necessary prerequisite or gives a significant boost to research. Unfortunately, publicly available corpora of this kind are extremely rare and making them from scratch is a very long and costly process. No corpus of Russian with unambiguous word-sense tags has been published so far. This paper describes an experimental approach to creating a virtual equivalent of a Russian sense-tagged corpus and putting it to some real use. The virtual corpus was created using two public resources: the English SemCor corpus and our free multilingual semantic pivot dictionary, called the "Universal Dictionary of Concepts". The dictionary provides information sufficient to find sense-specific translations for nearly all sense-tagged words in SemCor. However, the pivot dictionary itself is under development and we are looking for ways to improve it. We used the existing Russian volume of the pivot dictionary to calculate lexical context vectors for individual senses of 13,832 Russian words, supposedly equivalent to the vectors that could be obtained from a real Russian translation of SemCor. Another set of vectors representing real usage of the same Russian words was extracted from a medium-size corpus of Russian without any semantic markup. The vector similarity score proved to be a useful factor in judging the correctness of links between Russian words and word senses similar to the ones registered in the Princeton WordNet. It helped to rank over 21,000 such links out of 56,000 known ones and significantly reduced the amount of manual work required to proofread the dictionary.
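The abstract does not say which similarity measure was used. Purely as an assumption, the comparison between a sense vector derived from the virtual corpus and a usage vector from an untagged corpus can be sketched with cosine similarity over sparse context vectors. The function name and the toy vectors below are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {context_word: weight}."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector carries no evidence either way
    return dot / (norm_u * norm_v)


# Hypothetical context vectors for one word sense and for real usage of the word.
sense_vector = {"river": 3.0, "water": 2.0}
usage_vector = {"river": 3.0, "water": 2.0, "money": 0.0}
unrelated_vector = {"loan": 1.0, "credit": 4.0}

same = cosine_similarity(sense_vector, usage_vector)
different = cosine_similarity(sense_vector, unrelated_vector)
```

Under this sketch, links whose sense and usage vectors score low would be the first candidates for manual proofreading.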
Dobrovol'skij D. O., Levontina I. B.
DISCOURSE WORDS IN GENERAL QUESTIONS: RUSSIAN-GERMAN NEAR-EQUIVALENTS
The paper discusses Russian discourse words such as razve, neuzheli, chto, chto li, kak, etc. Cf. Ty chto / chto li / kak, s nami idesh'? ~ 'What about you, are you coming along?', and their German near-equivalent etwa (cf. Gehst du etwa mit?). Our data show that translatability and semantic equivalence are different phenomena. Both Russian and German possess a rich inventory of question particles, which makes it possible to find a suitable translation for nearly every utterance, even a translation containing a particle. However, this does not imply that the corresponding particles are semantically equivalent. The analysis shows that such particles, being functionally equivalent, i.e. interchangeable in particular utterances, display rather remote semantic resemblance. The German particle etwa is conceptually based on the idea of approximateness. That is why it weakens the illocutionary force of the utterance, whereas the Russian particles chto, chto li, kak directly appeal to the interlocutor and, therefore, reinforce the speaker's attitude. However, both German etwa and Russian chto, chto li, kak stress the speaker's involvement in the situation. This property determines their functional similarity.
Dobrushina N. R.
MODALS AND THE SUBJUNCTIVE
I consider constructions that involve the modal verb moch or the modal adjective dolzhen and the subjunctive particle by. I argue that, with respect to the subjunctive, these modals behave differently from regular verbs. Their subjunctive is often functionally identical to the indicative; in contexts where other verbs obligatorily take the subjunctive form, these two predicates may use the indicative. The main factor that controls omissibility of the subjunctive particle is shown to be an epistemic interpretation. I consider some typical cases where the subjunctive and the indicative are synonymous for these predicates, and those where they are not. Thus, in the apodosis of conditional constructions the particle is often omitted, although, in general, Russian prefers a symmetrical use of the subjunctive in both protasis and apodosis. On the other hand, when in the protasis, the particle is not omitted. The subjunctive is often used with the modals for pragmatic purposes, such as politeness. The paper is based on the data from the Russian National Corpus.
E
Ermakova L. M., Mothe J., Ovchinnikova I. G.
QUERY EXPANSION IN INFORMATION RETRIEVAL: WHAT CAN WE LEARN FROM A DEEP ANALYSIS OF QUERIES?
Information retrieval aims at retrieving relevant documents that answer a user's need expressed through a query. Users' queries are generally shorter than three words, which makes it difficult to return a correct answer. Automatic query expansion (QE) improves precision on average, even though it can degrade the results for some queries. We propose a new automatic QE method that estimates the importance of expansion candidate terms by the strength of their relation to the query terms. The method combines local analysis and global analysis of texts. We evaluate the method using international benchmark collections and measures. On average, our results are comparable to those of the Bo2 method. However, we show that a deep analysis of initial and expanded queries brings interesting insights that could help future research in the domain.
F
Fedorova O. V., Potanina Ju. D.
WORKING MEMORY AND RUSSIAN LANGUAGE: FROM COMPREHENSION TO PRODUCTION
Working memory and long-term memory differ in many ways. One difference is in the storage capacity of each. Traditionally, the capacity of working memory has been measured by a memory span task in which the individual hears a series of items and must repeat them. Most of the research has focused on individual differences in working memory capacity. Daneman and Carpenter (1980) developed the Reading span test, which they interpret as providing a measure of an individual's working memory capacity. The subject is given a series of sentences to read, and then must recall the last word from each of the preceding sentences. Span is calculated as the maximum number of sentences on which the subject can perform this task perfectly. In 1986 Daneman and Green developed the Speaking span test. Most of the research has been done on English-speaking individuals. The main goal of this paper is to provide and describe the Verbal span tests on Russian material. The present study shows how the use of the notion of verbal working memory contributes to our understanding of individual differences in language comprehension and language production mechanisms. Using Russian adaptations of the working memory reading span and speaking span tests, we demonstrated that working memory capacity is indeed correlated with some referential processes and is a predictor of verbal fluency.
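The span calculation described above can be sketched as follows. This is a simplified illustration: the trial format and function name are hypothetical, recall is assumed to be order-sensitive, and actual scoring procedures for span tests may differ (for example, by requiring several perfect trials per level).

```python
def reading_span(trials):
    """Return the largest set size at which all sentence-final words were recalled.

    trials: list of (target_words, recalled_words) pairs, one pair per set of
    sentences (hypothetical format). A trial counts as perfect only if the
    recalled list equals the target list exactly, including order.
    """
    perfect_sizes = [
        len(targets) for targets, recalled in trials if recalled == targets
    ]
    return max(perfect_sizes, default=0)


trials = [
    (["dog", "tree"], ["dog", "tree"]),                  # perfect, set size 2
    (["sun", "car", "book"], ["sun", "car", "book"]),    # perfect, set size 3
    (["sea", "pen", "hat", "key"], ["sea", "hat", "pen", "key"]),  # order error
]
span = reading_span(trials)
```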
G
Grishina E. A.
RING AND GRAPPOLO: FINGERTIP CONNECTIONS IN RUSSIAN GESTICULATION AND THEIR MEANINGS
The study analyzes the main types of Russian gestures that are based on the connection of one's fingertips (configuration exactly, feather, bunch). We distinguish five semantic groups, which correspond to these configurations ('exactness', 'small object', 'object', 'center', 'connection'). We also compare the linguistic functions of fingertip connections and of physical contact of the hands.
I
Iomdin B. L., Lopukhina A. A., Nosyrev G. V.
TOWARDS A WORD SENSE FREQUENCY DICTIONARY
Analyzing several Russian nouns denoting everyday life objects, we explain why a word sense frequency dictionary is necessary. Techniques of calculating the approximate frequencies are proposed, based on the analysis of native speaker surveys and the annotation of the most frequent collocations in a large text corpus (we used the huge RuTenTen11 corpus integrated into the Sketch Engine system). A word sense dictionary could be used in a variety of NLP tasks, in particular for a probabilistic word sense disambiguation without available context, in creating second language learning resources, as well as in academic lexicography. Besides, studies of sense sets of polysemous words and their comparative frequencies are important for the linguistic theory, because they shed light on the evolution of the lexical system.
Iomdin L. L., Iomdin B. L.
VALENCIES OF RUSSIAN PREDICATE NOUNS AND MICROSYNTACTIC CONSTRUCTIONS
The paper discusses valency realizations of Russian predicate nouns in certain types of syntactic constructions (mainly, existential ones like Mne net neobxodimosti sdavat ekzamen 'There is no need for me to take the exam'; lit. 'to me there is no necessity...') where these realizations are not directly linked with the nouns concerned. In these cases, subcategorization frames of nouns are insufficient to account for the correct semantic interpretation of the construction in text analysis, or the adequate choice of valency implementation in text generation. For every word, detailed information on how its valencies are implemented within particular constructions should be supplied in the dictionary.
Ionov M., Kutuzov A.
THE IMPACT OF MORPHOLOGY PROCESSING QUALITY ON AUTOMATED ANAPHORA RESOLUTION FOR RUSSIAN
The paper deals with the problems of creating and tuning a system of automated anaphora resolution for Russian. We introduce such a system, which combines rule-based and machine learning approaches and achieves an F-measure of 0.51 to 0.59. Freeling serves as the underlying morphological layer; we give an account of its quality and of its influence on the anaphora resolution workflow. The anaphora resolution system itself is available to download and use, and comes with an online demo.
K
Kamenskaya M. A., Khramoin I. V., Smirnov I. V.
DATA-DRIVEN METHODS FOR ANAPHORA RESOLUTION OF RUSSIAN TEXTS
The paper considers two data-driven methods for anaphora resolution of Russian texts. These methods are based on machine learning with annotated corpora and use no additional information beyond linguistic features. The first method uses a Support Vector Machine as the learning and classification algorithm; the second uses a decision tree inducer. We evaluate the performance of the methods with several feature sets and corpora. Feature sets included morphological, syntactic and semantic features. In this paper we also evaluate how semantic features, namely semantic roles, impact the performance of anaphora resolution in Russian. We used our manually annotated corpus as well as a corpus provided by the organizing committee of the forum for the evaluation of linguistic text analysis systems, an event of Dialogue 2014. Experiments showed that the precision of the SVM is higher on the experimental data in almost all cases. It was shown that semantic features enhance the performance of the methods for anaphora resolution of Russian texts. We have also calculated the optimal distance between the anaphor and the hypothetical antecedent and used it in our methods.
Kononenko I. S.
PRAGMATIC ASPECTS OF INTERNET COMMUNICATION: TOWARDS WEBSITES GENRE MODELS
A two-level multifaceted genre classification is proposed to cover pragmatic aspects of communication on the Web. Genre categories of websites and genre types of site constituents (pages and structural blocks) are represented as vectors of relevant pragmatic features. Praxeological parameters (activity subject, beneficiary, product, environment) are involved to represent human activity that underlies communication and manifests itself in the site structure, content and form of site constituents. Communicative parameters encompass the hierarchy of communicative tasks (including anticipated reactions of the target audience), functionality of site constituents, and the affordances of the communication channel (interactivity, multimodality, and dynamics of content). Functions of site constituents together with medium features are exemplified to determine genre types of pages. The type of a textual page corresponds to a certain genre schematic structure composed of content blocks. The extraction of genre schemata is possible using so-called genre markers (cue words and constructions) that are formalized as lexico-grammatical patterns provided with format conditions.
Kravchenko A., Pivovarov V., Zharikov A.
PRACTICAL ASPECTS OF LONG-TERM ONTOLOGY-BASED INFORMATION EXTRACTION
Ontology-based information extraction (OBIE) is a subfield of information extraction where ontologies play an essential role in the process, shaping both system input and target output. There are many different approaches to creating and maintaining an ontology, and little work has been done to evaluate and compare the effectiveness of those approaches. In addition, the practical applications of those systems differ drastically from theory. An architecture that shows good performance in a single test does not necessarily perform as well in the long term. We conducted an experiment to explore the issues that arise during practical application of OBIE methods and to describe the behavior of ontologies maintained over a long period of time. In this article we discuss emerging problems and propose working solutions for them, as well as a way to evaluate OBIE systems. Those solutions were successfully implemented in the scan-interfax.ru project and have provided sufficient quality for the commercial use of an advanced entity-based search engine extracting information from news.
Kreydlin G. E., Pereverzeva S. I.
HUMAN BODY IN A DIALOG: THE ORIENTATION OF SOMATIC OBJECTS IN ITS CONNECTION WITH HUMAN RELATIONS
The main objective of the paper is to examine relations between corporeal, or somatic, objects and some psychological aspects of human behavior, namely the relations between the communicators in a dialog. Somatic objects have been investigated from different points of view. Mostly, linguists and specialists in nonverbal semiotics have described names, features and significant actions performed by or with different somatic objects, virtually leaving aside sign manifestations of correlations between physical (corporeal) and psychological (ethical, aesthetical, etc.) aspects of human behavior. It is well known that if a man is very tall or extremely short, or if he is too fat or scrawny, he feels bad about his deficiency. Also, it is known that many corporal defects impede or complicate proper communication. Here we undertake a few preliminary steps in solving the problem of the systematic description of the correlations between physical and psychological properties of humans. We consider one corporal feature, "spatial orientation", that many body parts possess and describe its relations with the psychological characteristics of interlocutors. The explication of the notion "orientation of somatic object" is given and two Russian linguistic representations of spatial orientation are discussed. The linear representation corresponds to the linguistic construction X V Y-Instr Prep Z, where X is an oriented object, V is a verb of orientation (e.g. smotret' na chto-libo 'to face smth.' (said, e.g., of buildings), byt napravlennym na chto-libo 'to be directed at smth.'), Y is a so-called salient part of the object X, Prep is a preposition and Z is an orienting point. The angular representation accords with the construction X V Prep Z pod uglom 'at an angle' Q (where Q is the degree of the angle).
The basic part of the paper is devoted to the correspondence between the corporal orientations which are computed by these representations and which are expressed either verbally or non-verbally (based on the Russian body language) and ethical features of humans participating in an actual dialog. Thus, different types of bows conform regularly to the features 'respect to the addressee', 'veneration of the addressee' or just 'warm feeling to him / her'.
Kruzhkov M. G., Buntman N. V., Loshchilova E. Ju., Sitchinava D. V., Zalisniak A. A., Zatsman I. M.
A DATABASE OF RUSSIAN VERBAL FORMS AND THEIR FRENCH TRANSLATION EQUIVALENTS
The paper presents the results of a project aimed at the development of methodology and information technology for the creation of a corpus-based linguistic database of verbal forms with their translation equivalents (with bilingual grammatical search functions). Within the scope of the project the following results have been achieved: 1. Methodology and information technology for the creation of linguistic databases based on bilingual parallel corpora have been developed (including corpora with multiple translation variants). 2. The polyvariant parallel subcorpus which includes Russian literary works with French translations has been created within the Russian National Corpus (RNC). Some of the parallel texts in the subcorpus include multiple translation variants. 3. On the basis of the polyvariant Russian-French corpus a database of Russian verbal lexico-grammatical forms (LGFs) and their French translation equivalents has been created. Equipped with bilingual grammatical search functions, the database is a unique resource that can be used for investigating a wide range of various cross-linguistic problems. 4. A number of concepts in the areas of Russian verbal categories and Russian-French contrastive grammar have been refined.
Kudinov M. S., Romanenko A. A., Piontkovskaja I. I.
CONDITIONAL RANDOM FIELD IN SEGMENTATION AND NOUN PHRASE INCLINATION TASKS FOR RUSSIAN
We propose solutions to several NLP problems for Russian that make use of the conditional random fields (CRF) framework: shallow parsing (chunking), temporal expression extraction and noun phrase inflection. Each of the three problems is important in speech generation, data mining and spoken dialog system design. The purpose of shallow parsing is to extract syntactically related word forms (e.g. noun phrases) from the text without full parsing. It may be useful in data mining applications. Temporal expression extraction is important for natural language understanding modules of spoken dialog systems. Usually rule-based methods are used to address this problem. Noun phrase inflection is needed for speech generation modules. The main problem is to detect word forms for inflection. For all three problems a statistical approach was taken. We use a simple version of CRF, the linear-chain CRF. In shallow parsing and temporal expression extraction, state-of-the-art results were achieved. In noun phrase inflection, the F1-measure exceeded 95%.
Kustova G. I.
CONSTRUCTIONS WITH THE CONJUNCTION CHTOBY: RESOURCES AND CORRELATIONS
The article deals with complex sentences with a noun in the main clause and the conjunction chtoby in the subordinate clause. The construction «desirable feature» (Gde najdesh sidelku, chtoby xorosho ladila s Vasej?) and the construction «functional inconsistency» (On ne dama, chtoby emu cvety darit') are compared with the resource construction (U nas est' vremy'a, chtoby sxodit' v kino).
L
Leontiev A. P., Petrova M. A.
THE DESCRIPTION OF LOCATIVE DEPENDENCIES IN A NATURAL LANGUAGE PROCESSING MODEL
The paper suggests semantic and syntactic descriptions of locative dependencies in an NLP model and focuses on the problems which locative adjuncts evoke for a system aimed at different tasks based on semantic analysis, especially at machine translation. A formal description of locative groups faces several problems. The first is the definition of locative semantic relations between words, as locative dependencies can have different meanings, such as the meanings of initial and final points (walk [from/to the door]), route (walk [across the room]), and others. Second, one has to define the set of words that can fill locative adjuncts, and the border between the locative and non-locative groups is not always distinct: in the street is definitely a locative, but what about on the Internet or in a meeting? Third, the syntactic realizations of locative senses are rather numerous. On the one hand, locative adjuncts include many prepositions with different semantics—like on, in, under, above, etc. On the other hand, different nouns combine with different prepositions to denote the same meaning, like in the country, but on the island. The current paper suggests a formal approach appropriate for dealing with all these difficulties.
Lobanov B. M., Okrut T. I.
UNIVERSAL MELODIC PORTRAITS OF INTONATION PATTERNS IN RUSSIAN SPEECH
We proceed from the model of intonation patterns by Elena Bryzgunova, which is widely used in the teaching of Russian speech intonation. This model includes seven patterns: IP1 (the falling tone), IP2 (the falling tone with a certain prosodic emphasis), IP3 (the rising tone with subsequent fall), IP4 (the falling-rising tone), IP5 (combination of the rising, smooth and falling tones), IP6 (combination of the rising and smooth tones), IP7 (combination of the rising tone with the glottal stop). We present a model of intonation portraits of accentual units (the PAU model), proposed by one of the authors of this paper and effectively used in the practice of Russian speech synthesis for a long time. The PAU model assumes that, for a certain intonation type, the topological properties of the melodic contour are independent of the quantitative and the qualitative characteristics of the pre-nucleus, the nucleus and the post-nucleus of accentual units. The methodology of an experiment on integrating the two models into a unified model of Universal melodic portraits of intonation patterns (UMP-IP) is discussed. The new model is shown to effectively represent the tonal structure of Elena Bryzgunova's intonation patterns and ensure the invariance of the quantitative and the qualitative constituents of the sentence pronounced as well as the pitch and the range of the speaker's voice. The obtained results are discussed from the viewpoint of applicability to the practice of teaching Russian as a second language.
Loukachevitch N. V., Dobrov B. V., Chetviorkin I. I.
RUTHES-LITE, A PUBLICLY AVAILABLE VERSION OF THESAURUS OF RUSSIAN LANGUAGE RUTHES
The paper presents RuThes-lite, a publicly available version of the RuThes linguistic ontology, which has been under development for more than fifteen years and is intended for automatic document processing. RuThes has considerable similarities with WordNet: inclusion of concepts based on senses of real text units, representation of lexical senses, detailed coverage of word senses. At the same time the differences include attachment of different parts of speech to the same concepts, formulation of concept names, attention to multiword expressions, intentional inclusion of terms of the sociopolitical domain, and a different set of conceptual relations. RuThes-lite was generated from RuThes on the basis of the most frequent words in a contemporary news collection. In addition, we describe data that have been specially prepared for the RuThes-lite publication: morpho-syntactic labeling of thesaurus text entries and assignment of glosses to concepts.
Lukashevich N. Ju., Kobozeva I. M.
DESIGNING "HUMAN CHARACTERS" LEXICAL DATABASE
The paper discusses a general layout of "Human Characters" lexical database specifically developed to study the meanings of words from the semantic field of human character traits. It is intended as a resource providing a format for a comprehensive analysis of character words usage in different languages. A database with contexts from large modern corpora is considered a convenient tool for semantic analysis which offers such advantages as facilitating data storage and presentation, and keeping the analysis consistent while making changes possible at the same time. It is shown how several issues which significantly influence the analysis procedures are resolved in the pilot database version. These include identifying relevant contexts, describing features of a typical situation in which the character trait in question is exhibited, and comparing contextual meanings of the studied words. The suggested technique provides a more flexible tool for capturing similarities and differences between contexts within one language on the one hand, and gives ground for comparing the usage of translation equivalents on the other.
Lyashevskaya O. N., Kashkin E. V.
EVALUATION OF FRAME-SEMANTIC ROLE LABELING IN A CASE-MARKING LANGUAGE
The paper discusses evaluation techniques for semantic role labeling in Russian. It has been shown that the quality of FrameNet-style semantic role labeling largely depends on the quantity of roles and may decrease if the inventory of roles in the training set differs from that in the output resource. Our study is the first step towards a 'smart' evaluation tool which would introduce linguistically relevant criteria to evaluation, be able to put the mistakes on a scale from minor to critical ones, and make evaluation easier in case the grid of roles varies. We run an experiment based on the data from the Russian FrameBank, a FrameNet-oriented open access database which includes a dictionary of Russian lexical constructions and a corpus of tagged examples. The semantic role is one of the parameters that define the predicate-argument patterns in FrameBank. The inventory of roles is modeled hierarchically and forms a graph. We explore the cases when the role induced by the system and the answer of the gold standard do not match. We analyze the statistical criteria of distribution of roles in the patterns and the distance between the source and the target in the graph of roles as a means to assess the goodness of fit.
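The graph-distance criterion mentioned in the abstract above can be sketched as a shortest-path search between the predicted role and the gold-standard role. This is a minimal sketch under stated assumptions: the role names and the hierarchy in `PARENT` are invented for illustration and are not the FrameBank inventory, and the paper may weight or normalise distances differently.

```python
from collections import deque

# Hypothetical role hierarchy (child -> parent); NOT the actual FrameBank graph.
PARENT = {
    "Agent": "Actor",
    "Causer": "Actor",
    "Patient": "Undergoer",
    "Theme": "Undergoer",
    "Actor": "Role",
    "Undergoer": "Role",
}


def build_graph(parent):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, par in parent.items():
        graph.setdefault(child, set()).add(par)
        graph.setdefault(par, set()).add(child)
    return graph


def role_distance(graph, predicted, gold):
    """Number of edges on the shortest path between two roles (None if unreachable)."""
    if predicted not in graph or gold not in graph:
        return None
    seen = {predicted}
    queue = deque([(predicted, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == gold:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


GRAPH = build_graph(PARENT)
# Confusing two sibling roles is a smaller error than confusing distant ones.
print(role_distance(GRAPH, "Agent", "Causer"))
print(role_distance(GRAPH, "Agent", "Patient"))
```

Under this sketch a mismatch between sibling roles scores as a minor error and a mismatch across branches as a more serious one, which is the kind of scale from minor to critical that the abstract describes.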
M
Magomedova V. D., Slioussar N. A.
INTERNET DATA IN THE STUDY OF LANGUAGE CHANGE: A CASE STUDY OF ALTERNATIONS IN RUSSIAN COMPARATIVES AND A PROGRAM TO WORK WITH SUCH DATA
The Internet is a unique source of non-standard forms, which gives us a novel opportunity to analyze fine-grained dynamics of language change. We used this opportunity to study the decay of historic consonant alternations in Russian. In standard Russian, these alternations are present in some verb forms and in comparatives (e.g. suxoj 'dry' — sushe 'drier', ljubit' 'to love' — ljublju 'I love'), as well as before certain derivational suffixes. Verb forms have been recently studied by Slioussar and Kholodilova (2013), and we looked at comparatives. Two groups of adjectives were selected: ones that have normative comparatives with alternations and ones that do not, but native speakers still try to generate such forms. In the first group, some adjectives like ubogij 'poky' have up to 30 % of comparatives without alternations, but, unlike with verbs, no significant correlation with adjective frequency or its other characteristics was found. The second group consisted primarily of compound adjectives ending in -gij, -kij, -xij. Here, the most important factor is whether the second part of the compound is used as an independent adjective. If it is not (e.g. as in dlinnorukij 'long-armed'), most comparatives lack alternations. Searching for forms on the Internet, we faced many problems. The counts provided by search engines are extremely inaccurate, only the first thousand results are shown, they cannot be downloaded in a convenient format, contain a lot of typos and other irrelevant data etc. We present a program called Lingui-Pingui that we developed to solve these and some other problems.
McShane M.
A MULTI-FACETED APPROACH TO REFERENCE RESOLUTION IN ENGLISH AND RUSSIAN
This paper argues that the detection and resolution of referring expressions can be profitably distributed across modules of a language processing system, rather than being bunched at the end of a text analysis pipeline. The approach is being implemented within the OntoAgent cognitive architecture, which supports the development of multi-functional, language-endowed agents that can collaborate with people in task-oriented applications. Although current development within OntoAgent orients around English, the architecture itself and most of its knowledge bases are language-independent. Drawing upon my past descriptive work on reference and ellipsis in Russian, I will suggest how the same reference resolution strategies might be applied to this and other languages. More generally, I will motivate the need to approach linguistic phenomena in a holistic paradigm, rather than as highly compartmentalized subtasks, which has become the norm for natural language processing applications.
Mikheev M. Ju.
DUSHISIRENEVAJA CVET'... OR JUST A NONSENSE (KAKAJA-TO KHREN')? NOUNS WITHOUT SUFFIXES IN THE TEXTS OF RUSSIAN AUTHORS
Nouns without suffixes, feminine of the 3rd declension (e.g. ludskaya molv') and masculine of the 2nd declension (e.g. konski top), were honored by Alexander Pushkin as legitimate root Russian words. His friend Vladimir Dal managed to «expand» his dictionary precisely thanks to these words. At the turn of the XIX-XX centuries these words, especially the first group, became very frequent in Russian poetry and prose. Some of them were recreated. We can find many interesting examples in Sergei Yesenin's and Mikhail Sholokhov's texts. The latter author made dialect and colloquial words into distinct markers of his style.
Milichevich J., Timoshenko S.
TOWARDS A FINE-GRAINED DESCRIPTION OF INTENSIFYING ADJECTIVES FOR TEXT PROCESSING
We address collocations of the type "Intensifying Adjective + NOUN", such as heavy RAIN and complete DISAGREEMENT, known as Magn type collocations. Such a collocation can be represented as a functional dependency: Magn(RAIN) = heavy, where Magn is a (lexical) function responsible for the meaning 'very' / 'high degree', and heavy is the value that Magn has with RAIN, its keyword. The formalism of lexical functions has proved its usefulness in various NLP tasks, but on close inspection its semantic granularity turns out to be insufficient. We propose a refinement of the notion of Magn by distinguishing Magn's semantic subtypes. Our description, which proceeds from the assumption that the choice of a Magn type collocate is not arbitrary, takes into account the following factors: • semantic class of the keyword (= its semantic label, corresponding to the generic semantic component of its definition) and/or its actants; • semantic component(s) in the keyword's definition targeted by intensification; • semantic contrasts observed among Magn type collocates of a given keyword. We tested our approach on data from the Russian and English explanatory-combinatorial dictionaries developed for the multi-purpose language processing system ETAP-3. As our results show, the semantic subtypes of Magn that we have identified allow for the encoding of lexicographic information in a way that is not only precise but also has predictive power.
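The functional-dependency view Magn(RAIN) = heavy can be pictured as a lookup from a keyword to its intensifying collocates, optionally filtered by a semantic subtype. This is a rough sketch only: the subtype labels ("quantity", "completeness") and the table layout are hypothetical and are not the subtypes proposed in the paper or the ETAP-3 dictionary format.

```python
# Keyword -> list of (collocate, subtype). The two collocations come from the
# abstract; the subtype labels are hypothetical placeholders.
MAGN = {
    "RAIN": [("heavy", "quantity")],
    "DISAGREEMENT": [("complete", "completeness")],
}


def magn(keyword, subtype=None):
    """Return the Magn collocates of a keyword, optionally restricted to one subtype."""
    entries = MAGN.get(keyword.upper(), [])
    return [word for word, kind in entries if subtype is None or kind == subtype]


print(magn("rain"))
print(magn("disagreement", subtype="completeness"))
```

Storing a subtype next to each collocate is what would let a generator choose between several intensifiers of the same keyword instead of treating them as interchangeable.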
Muravyev N. A., Panchenko A. I., Obiedkov S. A.
NEOLOGISMS ON FACEBOOK
In this paper, we present a study of neologisms and loan words frequently occurring in Facebook user posts. We have collected a dataset of over 573 million posts written during 2006-2013 by Russian-speaking Facebook users. From these, we have built a vocabulary of the most frequent lemmatized words missing from the OpenCorpora dictionary (http://opencorpora.org/dict.php), the assumption being that many such words have entered common use only recently. This assumption is certainly not true for all the words extracted in this way; for that reason, we manually filtered the automatically obtained list in order to exclude non-Russian or incorrectly lemmatized words, as well as words recorded by other dictionaries or those occurring in pre-2000 texts from the Russian National Corpus (http://www.ruscorpora.ru). The result is a list of 168 words that can potentially be considered neologisms. We present an attempt at an etymological classification of these neologisms (unsurprisingly, most of them have recently been borrowed from English, but there are also quite a few new words composed of previously borrowed stems) and identify various derivational patterns. We also classify words into several large thematic areas, "internet", "marketing", and "multimedia" being among those with the largest number of words. We consider our results preliminary, but believe that, together with the word base collected in the process, they can serve as a starting point in further studies of neologisms and lexical processes that lead to their acceptance into the mainstream language.
Muzychka S. A., Romanenko A. A., Piontkovskaja I. I.
CONDITIONAL RANDOM FIELD FOR MORPHOLOGICAL DISAMBIGUATION IN RUSSIAN
We consider the problem of morphological disambiguation in Russian using statistical methods; specifically, we apply conditional random fields (CRF). We propose a new modified linear-chain CRF model, which demonstrates results close to the state of the art. We also propose a new statistical approach to the text normalization problem using CRF. Namely, we solve the problem of normalizing numerals written as digits. Our approach allows for the consideration of both cardinal and ordinal numbers. In order to train and test our models we used Russian text corpora. For morphological disambiguation, we used data from OpenCorpora and the SynTagRus linguistic corpus. For number normalization we used the Russian National Corpus (RusCorpora). A brief overview of the CRF model is given, followed by a detailed description of the applied algorithm, assumptions on the training and test sets, and a description of features for each particular issue.
N
Nedoluzhko A. Yu., Khoroshkina A. S.
"VCHERA NASOCHINYALSYA VOROH STROK": PRODUCTIVE CIRCUMFIXAL INTENSIFYING PATTERNS IN RUSSIAN
The current paper addresses verbal circumfixal derivation patterns in modern Russian. The discussion is focused on a series of circumfixes which trigger the intensified usage of the basic verb (~ 'keep doing P too much'). Derivatives built up by adding a prefix and the reflexive -sja to an imperfective verb are examined. Although each prefix adds specific shades of meaning to the verb, such patterns are claimed to share common features at different levels of linguistic analysis, such as morphology, syntax, and semantics. Furthermore, such patterns are highly productive in the modern language; once certain constraints are fulfilled, an intensified derivative can be formed from any imperfective verb. This fact, along with the common features shared by the patterns in question, allows us to argue that they can be considered inflectional, rather than derivational.
O
Osminin P. G.
A SUMMARIZATION MODEL BASED ON THE COMBINATION OF EXTRACTION AND ABSTRACTION
We suggest a model of automatic summarization for scientific and technical texts. This model combines extractive and abstractive approaches to summarization and was developed on the basis of a comparative analysis of authors' summaries and the full texts of the corresponding papers. The model consists of three main components: a keyword extractor, a domain- and task-oriented static knowledge base and a summarization algorithm. The keyword extractor is the off-the-shelf tool LanAKey_Ru, adapted to the application. Static knowledge includes stop lexicons, a conceptual net, templates for summary content selection and rules for summary generation. Stop lexicons are used for removing text segments irrelevant to the document summary. The conceptual net is used for semantic analysis of the document text, which helps content selection. Templates for information extraction are frame structures. Their slots are to be filled with extracted fragments of document sentences. Rules for summary generation define the grammar of summary sentences and their order. The summarization algorithm consists of four top-level procedures: preprocessing, analysis, content selection and summary text generation. The model is described using the example of Russian scientific papers in the mathematical modeling domain.
P
Paducheva E. V.
SUSPENDED ASSERTION AND NONVERIDICALITY
Two notions are compared: suspended assertion and nonveridicality. It is argued that these notions, though used in the frameworks of different linguistic theories, are applied to similar linguistic phenomena. In this paper the notion of nonveridicality is applied to one group of Russian indefinite pronouns — namely, to negative polarity pronouns (NPP). Four groups of non-referential indefinite pronouns are differentiated in Russian: negative pronouns (ni- series), non-specific indefinite (-nibud' series), free choice (ugodno series and ljuboj) and negative polarity pronouns (-libo and by to ni bylo series). Following Giannakidou 1998, I reject the hypothesis that NPPs are licensed in the context of downward entailment operators only. I also argue against what is claimed in Giannakidou 2011, that NPPs are licensed in the three types of environments: negative, downward entailing and nonveridical: all contexts of the Russian NPPs can be demonstrated to be nonveridical, and the context of negation is one of them. The list of contexts licensing all the four classes of non-referential pronouns is suggested. Each of the four classes of pronouns chooses its own subset of contexts from the list.
Panchenko A. I.
SENTIMENT INDEX OF THE RUSSIAN SPEAKING FACEBOOK
A sentiment index measures the average emotional level in a corpus. We introduce four such indexes and use them to gauge the average "positiveness" of a population during some period based on posts in a social network. This article is the first to present a text-based, rather than word-based, sentiment index. Furthermore, it presents the first large-scale study of the sentiment index of the Russian-speaking Facebook. Our results are consistent with prior experiments for the English language.
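The difference between a word-based and a text-based index can be illustrated with a toy corpus. The abstract does not define the four indexes, so the two formulas below are assumptions chosen for illustration: the word-based index averages lexicon scores over all tokens, while the text-based index first assigns a polarity to each post and then averages over posts. The lexicon `LEXICON` and the function names are hypothetical.

```python
# Hypothetical polarity lexicon: +1 positive, -1 negative, absent = neutral.
LEXICON = {"good": 1, "nice": 1, "bad": -1}


def word_based_index(texts, lexicon):
    """Mean lexicon score per token over the whole corpus."""
    tokens = [tok for text in texts for tok in text]
    if not tokens:
        return 0.0
    return sum(lexicon.get(tok, 0) for tok in tokens) / len(tokens)


def text_based_index(texts, lexicon):
    """(positive texts - negative texts) / all texts, with text polarity = sign of its score."""
    if not texts:
        return 0.0
    polarity = 0
    for text in texts:
        score = sum(lexicon.get(tok, 0) for tok in text)
        polarity += (score > 0) - (score < 0)  # +1, 0 or -1 per text
    return polarity / len(texts)


POSTS = [["good", "day"], ["bad", "bad", "good"], ["nice"]]
print(word_based_index(POSTS, LEXICON))
print(text_based_index(POSTS, LEXICON))
```

In the toy corpus a long negative post pulls the word-based value down more than the text-based one, because the latter counts each post once regardless of length.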
Pestova A. R.
GOVERNMENT OF THE BORROWED NEOLOGISMS DENOTING OBJECTS OF FILM INDUSTRY
The present paper deals with the government of borrowed neologisms denoting objects of the film industry: трейлер 'trailer', тизер 'teaser', ремейк 'remake', сиквел 'sequel', приквел 'prequel', триквел 'trequel', квадриквел 'quadriquel', мидквел 'midquel' and интерквел 'interquel'. Dictionaries do not give any information about the syntactic features of these words. The study shows that the government of these nouns is variable and that the attested constructions are synonymous and redundant. As language tends to eliminate redundancy, we tried to find the most popular variant for each word. Statistics from the Internet resources "Yandex.Novosti" (news segment) and "Yandex.Blogi" (blogosphere) and from the RuTenTen corpus were analysed. All listed nouns tend to govern the non-prepositional genitive. The method used can be applied to other borrowed neologisms, for example to nouns referring to the music scene (ремикс 'remix', кавер 'cover version', (видео)клип 'video clip' and (видео)ролик 'video clip'). They prefer the prepositional construction na + accusative.
Piperski A. Ch., Somin A. A.
PRAGMATICS OF STRIKETHROUGH: NORMS OF COMMUNICATION AND OPTIMALITY THEORY
The paper presents a description of intentional strikethrough on the Web using a combination of theories from pragmatics and phonology, namely the theory of implicature by Grice (1975), the politeness theory by Brown and Levinson (1978, 1987), and Optimality Theory. We argue that the study of this phenomenon can shed light on some more general aspects of communication theory, such as the mechanism of choosing one viewpoint among many options. We also describe graphical and verbal substitutes for strikethrough in blogs and literary works.
Podlesskaya V. I.
"THEY SHOT HIM DEAD, OH, NO, THEY KNIFED HIM DEAD WITH A SABER": SELF-REPAIRS IN ORAL STORIES
The paper introduces a discourse-oriented classification of repair types in Russian by addressing, inter alia, the following questions: (i) whether or not self-repairing entails speech disfluency; (ii) whether or not the fragment under repair and its repaired correlate are structurally isomorphic; (iii) whether the speaker revises the lexical, morpho-syntactic, or phonological shape of the reparandum. Based on the data from the Prosodically Annotated Corpus of Spoken Russian, the established classes of repairs were analyzed qualitatively and quantitatively. Fluent isomorphic repairs appeared to be the most frequent in the corpus, although fluent non-isomorphic repairs, as well as disfluent isomorphic and disfluent non-isomorphic repairs, are also attested.
Protopopova E. V., Bodrova A. A., Volskaya S. A., Krylova I. V., Chuchunkov A. S., Alexeeva S. V., Bocharov V. V., Granovsky D. V.
ANAPHORIC ANNOTATION AND CORPUS-BASED ANAPHORA RESOLUTION: AN EXPERIMENT
The paper describes noun phrase and anaphora annotation in OpenCorpora and compares it to that in other corpora. We discuss the choice of representative texts for anaphoric annotation and the basic principles of syntactic annotation. For noun phrase annotation we followed the scheme introduced earlier for morphological annotation, which was carried out in two stages: first, all noun phrases and some other syntactic units were annotated by a heterogeneous group of people; then a linguist compared all markup results and chose the best one or corrected mistakes. We present some annotation results and cases of annotator disagreement and proceed to introduce our data-driven anaphora resolution system based on decision trees. We then list the features used to fit the classifier and discuss their relevance and some changes which improved the classifier's performance. We also present our rule-based approach to automated noun phrase extraction using the Tomita parser. A baseline for anaphora resolution is introduced and compared with our results.
S
Schütze H.
RECENT ADVANCES IN (DEEP) REPRESENTATION LEARNING
Traditionally, natural language processing (NLP) systems have made use of resources compiled by (computational) linguists based on linguistic theory that provide rich information about linguistic objects. For example, computational lexica specify morphological paradigms and sub-categorization frames of verbs. In contrast, statistical NLP systems frequently start out with no explicit representation of linguistic objects and instead learn what they need from training data on a task-by-task basis. A third approach—which has gained much interest recently—is to learn generic representations of linguistic objects and then reuse them for a wide variety of tasks. Its premise is that giving an NLP system non-task-specific generic information about words and other linguistic objects will help it in performing well at a particular task. Examples of such generic representation models include the vector space model, dimensionality reduction, clustering and deep learning. I will review recent research results in representation learning and discuss benefits and drawbacks of the three approaches.
Semenova S. Ju.
ON THE CLASS OF RUSSIAN PARAMETRIC ADVERBS
The paper deals with Russian parametric adverbs, i.e. those revealing the values of quantitative parameters (gluboko [deeply], chasto [often / thickly / frequently], redko [rarely / seldom], bystro [rapidly / quickly], izdaleka [from afar] etc.). The characteristics of parametric adverbs seem to be much less investigated (in particular, from the perspective of information extraction) than those of parametric nouns, adjectives, and verbs. A number of grammatical and semantic groups of adverbs are presented. The parametric meaning is found to be distributed among various traditional grammatical and semantic classes of adverbs. For parametric adverbs morphologically derived from adjectives, we discuss semantic priority or lack thereof with respect to adjectives. The parametric meaning can occur as a secondary sense of an adverb, so that ambiguity, connotations, and implication are essential in descriptions aimed at information extraction. The correspondence between the quantitative meaning of the adverb and the name of the physical value (izdaleka → rasstojanie [distance]) is considered. Corpus examples of various types of parametric data coded with the help of parametric adverbs are presented.
Shaikevich A. Y., Savchuk S. O.
DISTRIBUTIONAL-STATISTICAL ANALYSIS OF REGIONAL PRESS (NEWSPAPERS OF GRODNO REGION OF BELARUS)
The paper is an application of distributional-statistical analysis (DSA) to the sub-corpora of a corpus of Grodno region newspapers. The sub-corpora under study are district newspapers, "The Evening Grodno" and commentaries to the latter. With the help of DSA hundreds of keywords have been elicited for each sub-corpus. The linguistic interpretation of those three lists showed that the keywords grouped into clusters reflect both thematic and stylistic features of the sub-corpora. The district newspapers are specific in the choice of domains (mostly of local interest) and stylistic flavor (mostly official and bookish, to some extent resembling Soviet use). "The Evening Grodno" is more colloquial stylistically; its domains are naturally connected with day-to-day city life and with some unexpected topics, such as a large cluster of words denoting places of interest for tourists and inhabitants of the city. The keywords of the commentaries bring the stylistic trend of "The Evening Grodno" to its logical end. The method may be used for comparative analysis of other corpora, which might bring about new results depending on the composition of the corpus.
Shatunovskiy I. B.
PERLOCUTIONARY SPEECH ACTIONS AND PERLOCUTIONARY VERBS
Perlocutionary verbs like ubezhdat' 'to convince / persuade', nastaivat' 'to insist', ugovarivat' 'to persuade', uspokaivat' 'to calm', objasn'at' 'to explain', xvastatsy'a 'to boast' etc. are verbs denoting perlocutionary actions. Perlocutionary actions, as defined in the paper, are unconventional actions performed by means of conventional illocutionary acts. Perlocutionary actions are aimed at achieving certain effects or goals, but they do not necessarily achieve them. Perlocutionary verbs such as preduprezhdat' 'to warn', nastaivat' 'to insist', uveryat' 'to assure' can turn into illocutionary verbs. In this case the perlocutionary text is contracted and some parts of it are taken into the meaning of the verb, which becomes a sign of that contraction. Perlocutionary actions and verbs can be divided into several groups according to the supposed goals and effects of a perlocutionary action. They are: (1) perlocutionary actions having a clear aim which is embedded, fixed in the meaning of the verb denoting that action; this aim can be achieved or not; (2) perlocutionary actions that do not have a clear aim, but have a bundle of possible aims that are not fixed in the meanings of the corresponding perlocutionary verbs; (3) perlocutionary (and some illocutionary) actions that have a clear aim, and that aim is achieved any time the speaker performs that action. These groups differ with respect to the meaning of their perfective forms. In the paper these differences are described and explanations for semantic peculiarities of the perfective forms are proposed.
Shelmanov A. O., Smirnov I. V.
METHODS FOR SEMANTIC ROLE LABELING OF RUSSIAN TEXTS
The paper introduces two methods for semantic role labeling of Russian texts. The first method is based on a semantic dictionary that contains information about predicates, roles and the syntaxeme features that correspond to the roles. It also uses heuristics and integer linear programming to find the best joint assignment of roles. The second method is data-driven semantic-syntactic parsing, which was implemented using MaltParser. It performs transition-based data-driven parsing, simultaneously building a syntactic tree and assigning semantic roles. It was trained with various feature sets on the SynTagRus treebank, which was automatically enriched with semantic roles by the dictionary-based parser. We managed to automatically reduce mistakes in the training corpus using the output of the data-driven parser. We evaluated the performance of the parsers on a subcorpus of SynTagRus, which we manually annotated with semantic information. The dictionary-based parser and the data-driven semantic-syntactic parser showed close performance. Although the data-driven parser did not outperform the dictionary-based parser, we expect that it can be beneficial in some cases and has potential for further improvement.
Sitchinava D. V., Kachinskaya I. B.
THE DIALECTAL SUBCORPUS WITHIN THE RUSSIAN NATIONAL CORPUS: TODAY AND TOMORROW
The main results of the project aimed at developing the dialectal subcorpus of the RNC were the creation of a pilot corpus and a change of the markup principles to encompass many dialectological parameters. A workplace program was created and many texts were marked up using the new technology. The present goal of our team is a considerable increase in the size of the corpus, its representativeness and the depth of linguistic processing. The dialectal texts available for search in the RNC (www.ruscorpora.ru/search-dialect.html) will be considerably updated, with the overall corpus size reaching 1 mln tokens. The texts, mainly unpublished or published in rather obscure editions, are to be made available to a wider circle of dialectologists. Some texts are to be accompanied by video and audio. Along with word-by-word grammatical markup with resolved homonymy, the texts are to be tagged extensively on the metalevel (date of creation, dialect, overall phonetic properties and others). The accumulation of dialectal texts will be continued; dialectologists who have collected valuable texts are invited to share their results with the professional community.
Solovyev A.
USING LATENT SEMANTIC ANALYSIS FOR SIMULATING OF CHILDREN'S COGNITIVE DEVELOPMENT
In the 20th century Noam Chomsky formulated the so-called Plato's problem: why is the amount of our knowledge much greater than we can extract from our everyday experience? For example, the vocabulary of preschool children (aged 6-7) averagely increases by 3-8 words every day, and not every word refers to any reality or action (for example, abstract concepts, words carrying "phatic" or uninformative assignment, etc.). How does the child recognize each new meaning of the word and its relation to others, or why are new "meanings" formed? We propose a method to simulate associative-semantic relations between words. On the one hand. it eliminates rigid binding of a lexical unit to any cluster, and on the other it presents a complete system of relationships between words. The paper presents the results of three experiments with cognitive development of 4-7-year-old children using a Latent Semantic Analysis (LSA) that permits comparisons of semantic similarity between pieces of textual information. We used a technique developed by G. Denhiere and B. Lemaire. The principal distinctions of our research are that for the first time, the experiments were performed 1) on the Russian language; and on pre-school children. The children were grouped into two categories: 4-5 and 6-7 years, which corresponds to age variability of cognitive development. Two experiments describe semantic and associative similarity between LSA models and the children's cognitive development. The third experiment describes using LSA to measure the children's semantic memory. The results are compared to children's model data and adults' model data. The computational models are built from the LSA of a multisource child corpus and of an internet mass media corpus. 
Our findings confirm that: 1) LSA can be used to simulate a variety of children's cognitive processes; 2) LSA models represent the development of cognitive processes in children of different age groups, in particular associative-semantic processes and the workings of short-term and long-term memory; 3) this method may be recommended for the comparative study of children's cognitive development, in particular the development of associative-logical thinking, verbal discourse, and memory.
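As a rough illustration of how LSA compares words, the sketch below builds a term-by-document count matrix from a tiny invented corpus, keeps the two strongest latent dimensions, and measures cosine similarity between the resulting word vectors. Everything here is an assumption for illustration only: the corpus `TOY_DOCS`, the function names, the use of raw counts without weighting, and the pure-Python power iteration that stands in for a library SVD. The abstract does not describe the authors' preprocessing or dimensionality.

```python
import math

# Invented toy corpus; not taken from the child or mass media corpora.
TOY_DOCS = [
    "cat dog pet",
    "dog cat animal",
    "cat dog",
    "car road engine",
    "road car wheel",
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]

def top_eigenpairs(sym, k, iters=300):
    """Leading k eigenpairs of a symmetric PSD matrix (power iteration with deflation)."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        v = [1.0 / math.sqrt(n)] * n          # deterministic unit start vector
        for _step in range(iters):
            w = matvec(work, v)
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = dot(v, matvec(work, v))
        pairs.append((lam, v))
        for i in range(n):                     # remove the component just found
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs

def lsa_term_vectors(docs, k=2):
    """Map each word to its k-dimensional LSA vector (rows of U_k * Sigma_k)."""
    tokens = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokens for t in doc})
    counts = [[float(doc.count(term)) for doc in tokens] for term in vocab]
    gram = [[dot(r1, r2) for r2 in counts] for r1 in counts]   # A A^T, terms x terms
    pairs = top_eigenpairs(gram, k)   # eigenvalues of A A^T are squared singular values
    return {
        term: [math.sqrt(max(lam, 0.0)) * vec[i] for lam, vec in pairs]
        for i, term in enumerate(vocab)
    }

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0

if __name__ == "__main__":
    vectors = lsa_term_vectors(TOY_DOCS, k=2)
    print("cat~dog:", cosine(vectors["cat"], vectors["dog"]))
    print("cat~car:", cosine(vectors["cat"], vectors["car"]))
```

In this toy setting, words that occur in the same documents end up close together in the latent space, while words from unrelated documents end up nearly orthogonal. Similarity scores of this kind are what an LSA model offers for comparison with human association data.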
Sorokin A., Katinskaya A., Sharoff S.
ASSOCIATING SYMPTOMS WITH SYNDROMES. RELIABLE GENRE ANNOTATION FOR A LARGE RUSSIAN WEBCORPUS
The paper describes several experiments aimed at establishing the parameters for genre annotation of potentially any text which can be collected from the Russian web. We started with a set of text classification parameters, refined them iteratively in several studies and established a reliable framework, which was further subjected to clustering analysis. Overall, we obtained a level of agreement, measured by Krippendorff's α, in the range 0.51 < α < 0.84. We have also discovered the most common combinations of parameters in the test corpus, which should form the basis for classifying very large samples of the Russian web.
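For readers unfamiliar with the agreement measure, the following is a minimal sketch of Krippendorff's α for nominal labels, computed from the coincidence matrix as α = 1 - D_o / D_e (observed over expected disagreement). The function name and the toy annotations are invented for illustration. The abstract does not state which distance metric (nominal, ordinal or interval) the authors used, so the nominal variant here is an assumption.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: one sequence of labels per annotated text; None marks a missing judgement."""
    coincidences = Counter()
    for labels in units:
        labels = [x for x in labels if x is not None]
        m = len(labels)
        if m < 2:
            continue                      # a unit with a single label cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - (n - 1) * observed / expected

if __name__ == "__main__":
    # Two hypothetical annotators labelling ten texts with a binary parameter.
    annotator_1 = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
    annotator_2 = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
    print(krippendorff_alpha_nominal(list(zip(annotator_1, annotator_2))))
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance.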
Starostin A. S., Smurov I. M., Stepanova M. E.
A PRODUCTION SYSTEM FOR INFORMATION EXTRACTION BASED ON COMPLETE SYNTACTIC-SEMANTIC ANALYSIS
The article presents a mechanism for information extraction from unstructured natural language data. The key feature of this mechanism is that it relies on deep syntactic and semantic analysis of the text. The system takes a collection of syntactic-semantic dependency trees as input and, after processing them, outputs an RDF graph consistent with a certain domain ontology. The mechanism was implemented within a deployable information extraction system, which is a part of ABBYY Compreno technology, a powerful tool for a broad range of NLP tasks that include machine translation, semantic search and text categorization. The description of the extraction algorithm and the results of the system performance evaluation are given. Evaluation tests were conducted on the MUC-6 corpus. The overall F-measure we achieved using Compreno technology was 0.83, which is lower than the best results claimed by researchers using machine learning approaches. Our system is still under development and we hope to improve its performance in the future. One of the advantages of Compreno technology is that, unlike many statistical approaches, it does not show an abrupt performance drop if the test corpus is changed. Thus Compreno demonstrates little dependence on the exact textual data it receives and therefore might be seen as a more universal and less domain-dependent solution. Our tests on the CoNLL corpus yielded an F-measure of 0.75 with no prior adjustments made.
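As a reminder of what the reported F-measure summarizes, the sketch below computes precision, recall and their harmonic mean (F1) over sets of extracted items, counting only exact matches. The function name and the toy gold and predicted sets are hypothetical. The official MUC and CoNLL scorers apply their own matching rules, which this simplified sketch does not reproduce.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over two collections of extracted items."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

if __name__ == "__main__":
    # Invented (entity type, text) pairs, for illustration only.
    gold = {("PERSON", "A"), ("PERSON", "B"), ("ORG", "C"), ("ORG", "D"), ("LOC", "E")}
    predicted = {("PERSON", "A"), ("ORG", "C"), ("LOC", "E"), ("LOC", "F")}
    print(precision_recall_f1(gold, predicted))
```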
Strebkov D. Y., Hilal N. R., Redjaimia A., Skatov D. S.
THE EXPERIENCE OF BUILDING INDUSTRIAL-STRENGTH PARSER FOR ARABIC
We present the propagation of a hybrid approach to natural language parsing to Semitic languages, using Arabic as an example. The hybrid approach offers a way of acquiring dependency and constituency parses simultaneously at every step of the analysis. The result of the propagation is a syntactic parser for Arabic; the scientific novelty of this article lies in the fact that the parser is rule-based and shows quite satisfactory results. We give a short review of Arabic Natural Language Processing (NLP) technologies and their current state and then describe the steps required for our propagation: the choice of a morphological analyzer, the morphological index compression scheme, the rule base system used by the parser, and the modifications needed to tune the core parsing algorithm. We also outline the problems we faced during the propagation and the results we finally achieved. Finally, we provide the results of a brief evaluation of the parser and give information on its current usage.
T
Toldova S. Ju., Roytberg A., Ladygina A. A., Vasilyeva M. D., Azerkovich I. L., Kurzukov M., Sim G., Gorshkov D. V., Ivanova A., Nedoluzhko A., Grishina Y.
RU-EVAL-2014: EVALUATING ANAPHORA AND COREFERENCE RESOLUTION FOR RUSSIAN
The paper reports on the recent forum RU-EVAL, a new initiative for the evaluation of Russian NLP resources, methods and toolkits. The first two events were devoted to morphological and syntactic parsing respectively. The third event was devoted to anaphora and coreference resolution. Seven participating IT companies and academic institutions submitted their results for the anaphora resolution task and three of them presented results for the coreference resolution task as well. The event was organized in order to estimate the state of the art for this NLP task in Russian and to compare various methods and principles implemented for Russian. We discuss the evaluation procedure. The anaphora and coreference tasks are specified in the present work. The phenomena taken into consideration are described. We also give a brief overview of similar evaluation events whose experience we draw upon. In our work we formulate the guidelines for constructing the training and Gold Standard corpora and present the measures used in evaluation.
U
Uryson E. V.
ON DERIVED PREPOSITIONS: ADVERBAL PREPOSITIONS
The object of this paper is so-called adverbial prepositions in Russian, such as VOKRUG (kostra) 'around smth.', DALEKO OT (doma) 'far from smth.', etc. By definition, an adverbial preposition coincides with an adverb (cf. VOKRUG) or contains an adverb and a preposition (cf. DALEKO OT). In most cases, an adverbial preposition and the underlying adverb have the same meaning and the same semantic actants. The only difference between an adverbial preposition and the underlying adverb is the mode of expression of the main semantic actant. Cf. GOREL KOSTER, VOKRUG (preposition) KOSTRA STOJALI LIUDI 'A fire was burning, people were standing around it' vs GOREL KOSTER, VOKRUG (adverb) STOJALI LIUDI 'A fire was burning, people were standing around'. Both the adverbial preposition VOKRUG and the adverb VOKRUG have a semantic actant 'reference point', and in both examples the word 'fire' expresses this actant. But the adverbial preposition governs this noun, predicting its case form and its linear position in a sentence. The adverb does not govern the noun; the only requirement is that this object must already have been mentioned (so the noun must be somewhere in the preposition to the adverb). In this regard the adverbs under discussion are similar to connectors. Adverbial prepositions are easily described within the framework of valency theory. I argue that some refinements of valency theory are necessary for representing the syntactic properties of the underlying adverbs. I also demonstrate that it is more convenient to represent so-called adverbial prepositions as adverbs rather than as prepositions.
V
Vorontsov K. V., Potapenko A. A.
REGULARIZATION OF PROBABILISTIC TOPIC MODELS TO IMPROVE INTERPRETABILITY AND DETERMINE THE NUMBER OF TOPICS
Probabilistic topic modeling is a rapidly developing branch of statistical text analysis. A topic model uncovers the hidden thematic structure of a text collection. Learning a topic model from a document collection has an infinite set of solutions. This nonuniqueness results in weak interpretability and instability of the solution. To tackle these problems we use a new multi-objective approach, Additive Regularization of Topic Models (ARTM). ARTM is a non-Bayesian framework free of redundant probabilistic assumptions, which dramatically simplifies the inference of topic models and makes topic models easy to design, infer, and explain. With ARTM we combine four regularizers to concentrate common vocabulary words in background topics, to make domain topics sparse and distinct, and to eliminate insignificant topics. In our experiments the combination of the regularizers improves the sparsity, coherence, purity, and contrast criteria at once, with almost no loss in perplexity.
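To make the idea of additive regularization more concrete, here is a small sketch of a PLSA-style EM loop in which the M-step for the word-topic matrix subtracts a constant before normalizing. It assumes the commonly cited form of the regularized update, φ_wt ∝ max(n_wt + φ_wt ∂R/∂φ_wt, 0), with a single sparsing regularizer whose contribution is the constant -τ. The function name, the toy documents and the value of τ are invented for illustration. This sketch covers only one of the four regularizers mentioned in the abstract and is not the authors' implementation.

```python
import random

# Invented toy documents (lists of tokens), for illustration only.
TOY_DOCS = [
    ["cat", "dog", "cat", "pet", "dog"],
    ["dog", "pet", "cat", "dog", "cat"],
    ["car", "road", "car", "engine", "road"],
    ["road", "engine", "car", "car", "road"],
]

def _normalize(values):
    total = sum(values)
    return [v / total for v in values] if total > 0 else [0.0] * len(values)

def artm_sparse_plsa(docs, num_topics, tau, iters=30, seed=0):
    """PLSA EM with a sparsing term in the M-step for phi (assumed ARTM-style update)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    counts = [{w: doc.count(w) for w in set(doc)} for doc in docs]
    # phi[t][w] approximates p(w | t); theta[d][t] approximates p(t | d).
    phi = [dict(zip(vocab, _normalize([rng.random() + 0.1 for _w in vocab])))
           for _t in range(num_topics)]
    theta = [_normalize([rng.random() + 0.1 for _t in range(num_topics)]) for _d in docs]
    for _step in range(iters):
        n_wt = [dict.fromkeys(vocab, 0.0) for _t in range(num_topics)]
        n_td = [[0.0] * num_topics for _d in docs]
        # E-step: distribute each word count over topics in proportion to phi * theta.
        for d, doc_counts in enumerate(counts):
            for w, n_dw in doc_counts.items():
                joint = [phi[t][w] * theta[d][t] for t in range(num_topics)]
                z = sum(joint)
                if z == 0.0:
                    continue              # word was zeroed out in every topic
                for t in range(num_topics):
                    share = n_dw * joint[t] / z
                    n_wt[t][w] += share
                    n_td[d][t] += share
        # M-step: subtract tau (the sparsing term), clip at zero, then normalize.
        for t in range(num_topics):
            phi[t] = dict(zip(vocab, _normalize([max(n_wt[t][w] - tau, 0.0) for w in vocab])))
        theta = [_normalize(row) for row in n_td]
    return phi, theta

if __name__ == "__main__":
    phi, theta = artm_sparse_plsa(TOY_DOCS, num_topics=2, tau=0.3)
    for t, topic in enumerate(phi):
        print(t, {w: round(p, 3) for w, p in topic.items() if p > 0})
```

With τ = 0 the loop reduces to plain PLSA. A positive τ drives small word-topic counts to exactly zero, which is the mechanism by which a sparsing regularizer makes topics sparser.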
W
Waldenfels R. von, Daniel M., Dobrushina N.
WHY STANDARD ORTHOGRAPHY? BUILDING THE USTYA RIVER BASIN CORPUS, AN ONLINE CORPUS OF A RUSSIAN DIALECT
The paper describes a corpus of dialectal Russian speech under development. The corpus relies on interviews conducted by a joint Swiss-Russian team in the summer of 2013 in a small cluster of North Russian villages with the goal of studying the local dialect from a sociolinguistic and dialectological perspective. The interviews are transcribed into standard Russian and thus do not involve a detailed phonetic representation. The text is then lemmatized and grammatically annotated with standard tools and fed into a corpus. The corpus can be queried via a web-based interface which provides the user with access to the original sound recordings on a per-utterance level. This design, the paper argues, allows for a rapid development of the corpus without a major loss in usability, since the audio data are readily available. Future plans include more field trips as well as a more convenient interface providing, among other features, for user correction of the transcription.
Y
Yanko T. E.
CORPUS AND INSTRUMENTAL METHODS IN ANALYSING FICTION AUDIO RECORDINGS
This paper aims at analyzing the communicative structure of sentences with new information placed sentence-initially. The point of departure is the analysis of the examples from Pushkin and L. Tolstoy discussed in [Kovtunova 1979]. Russian linguists traditionally used Russian classical texts for verifying and exemplifying their scientific hypotheses. Currently, a vast amount of fiction read by the best actors and the development of convenient systems of spoken speech analysis, such as Praat or Speech Analyzer, provide easy access to the prosodic structure of a sentence. The prosodic structure, in turn, allows the communicative structure to be modelled, since prosody is the main means of manifesting the communicative division of a sentence into theme and rheme. Modern corpora and investigation tools reveal the prosodic and the corresponding communicative structures actually employed by the speakers who voiced the texts of Pushkin and L. Tolstoy, and who presumably never read I. I. Kovtunova's papers. For the investigation, a small working corpus of sounding texts was set up. The new tools fully confirmed I. I. Kovtunova's basic findings, obtained in the second half of the 20th century by the method of introspection. Nevertheless, some additions obtained through the new sources of material and the corpus techniques correct the results achieved in [Kovtunova 1979] and substantially widen the variety of theme-rheme structures applicable to sentences with new information placed sentence-initially.
Z
Zangenfeind R., Sonnenhauser B.
RUSSIAN VERBAL ASPECT AND MACHINE TRANSLATION
Rule-based machine translation still offers some very beneficial facets for linguistic theory, because by implementing rules on the computer, linguistic theory can be verified in practice. One of the most intricate problems for machine translation is grammatical aspect in Russian when it has to be translated into a language either lacking aspect or having a different aspect system. On the categorical level, aspect has only approximate equivalents in non-Slavic languages, such as the progressive form in English. In addition, language-internally, its semantics and interpretation cannot be sufficiently captured with only one specific characteristic feature. In this paper, we aim at establishing a basis for the machine translation of Russian aspect. To do so, we discuss an approach to describing the interaction of verb and aspect semantics in a systematic way. Moreover, we describe a possible annotation for the aspectual information that is provided by further lexical components contributing to the meaning computation. This allows for the formulation of rules for machine translation into target languages where the grammatical category of aspect is realized differently or not present at all.
Zimmerling A. V.
SENTENTIONAL ARGUMENTS AND EVENT STRUCTURE
This paper addresses the interaction of subject marking and event structure in languages that allow sententional arguments in the subject position. In Russian and other Slavic languages sententional subjects share a number of formal and semantic properties with zero subjects with role-and-reference features and with so-called oblique subjects, i.e. subject-like arguments marked with an oblique case. I argue that sententional subjects represented by bare that-clauses (Rus. cto-clauses) cannot have the roles of Agent/Causer, while zero subjects can. I also argue that the capacity of taking to, cto P-clauses, i.e. that-clauses headed by a correlative pronoun to, serves as a diagnostic for a number of verbal classes. Causative predicates like vynudit', zastavit', sklonitj k cemu-l. only take to, cto P-clauses, but not bare cto P-clauses, as surface subjects. Factive predicates like znat', razdrazat' etc. take to, cto P-clauses, but not bare cto P-clauses, as surface subjects, while non-factive predicates like dumat', merescit'sa only take bare cto P-clauses. Nominal predicatives forming Dative-Predicative-Structures (DPS) with an oblique subject marked with dative case and specified as {+ animate; + referential} split into two groups. Russian DPS predicatives from the stydno, dosadno, protivno, vse ravno group only take bare cto P-clauses and invariably behave as non-factive verbs in all contexts with an overt oblique subject. Russian DPS predicatives from the izvestno, neizvestno, stranno, bezrazlicno group take both bare cto P-clauses and headed to, cto P-clauses, i.e. can be used in factive contexts as well. That means that their sententional argument can get both the status of a fact, i.e. verified proposition P, logical truth, and that of an intentional situation, e.g. subjective evaluation of P, inner vision of P etc. Russian has two expletive elements, eto and to, but their syntax is different.
Eto behaves as surface subject of the matrix clause and alternates with oblique subjects and sententional arguments in the subject position while to cannot be separated from the complement clause and reaches the subject position only in combination with the CP.
Zelenkov Yu. G., Zobnin A. I., Maslov M. Yu., Titov V. A.
ILYA SEGALOVICH AND DEVELOPMENT OF IDEAS OF COMPUTATIONAL LINGUISTICS TO YANDEX
The article considers the most important and interesting linguistic projects led by Ilya Segalovich (1964–2013), one of the founders of the Yandex search engine, who also took part in their development. Among them are: the development of morphological analysis and synthesis of Russian words, with the possibility of processing «new» words not included in the dictionary; solving the problem of morphological ambiguity for the Russian language with the help of normalizing substitutions; practical transcription of foreign, individual and common words; automatic positioning of stresses and the analysis of poetic texts; creation of efficient methods of recognizing fuzzy duplicates of textual documents; development of the information retrieval system «The National Corpus of Russian», etc. Key ideas and approaches connected with the search for solutions to complicated linguistic problems are described, and Ilya's role in the invention of these approaches and their further development is stated. Examples of non-trivial linguistic algorithms developed by Ilya in collaboration with his colleagues are given.