Proceedings 2012

Anisimovich K. V., Druzhkin K. Ju., Minlos F. R., Petrova M. A., Selegey V. P., Zuev K. A.

Syntactic and semantic parser based on ABBYY Compreno linguistic technologies

The paper presents the ABBYY Syntactic and Semantic Parser, which took part in the Dialog 2012 Syntactic Parsers Testing Forum. We will refer to the parser technology (both parsing algorithms and linguistic model) as Compreno technology. We do not touch on any evaluation issues, as they are tackled by the Forum panel. Instead, the paper makes public some underlying principles of the parser. With regard to the testing itself, we describe only those features of the project that are both relevant to the comparison of our results with the “gold standard” adopted by the panel and important for the whole architecture of our technology.

Antonova A. A., Misyurev A. V.

Russian dependency parser SyntAutom at the DIALOGUE-2012 parser evaluation task

Apresjan V.Yu.

Tut ‘here’ as a temporal proximity marker

While space-time metaphor is a source of regular prepositional and adverbial polysemy in many languages, spatial deictic words develop temporal meanings much more rarely. However, space-time metaphor in deixis is not as exotic as it might seem, as demonstrated by the Russian spatial proximity marker tut ‘here,’ which develops a meaning of temporal proximity. Its meaning, though, is different from the meanings of classical deictic markers of temporal proximity, such as sejchas ‘now.’ Tut develops a synthetic meaning of actuality, which comprises the following semantic elements: (a) a time period which includes the moment of speech, and such moments preceding and following it that are sufficiently close to the moment of speech to retain connection with it; (b) physical or mental space that includes the speaker; (c) a situation where the speaker is either a participant or an observer. Besides semantic differences, tut has a special semi-grammaticalized status (it is a particle turning into a semi-formed grammatical marker), communicative peculiarities (it cannot be part of a rheme or contrastive theme), and prosodic properties (it is a clitic).

Arkhipov A. V., Zakharov L. M., Krivnova O. F., Kodzasov S. V., Lebedev A. A.

Russian intonation corpus: a preliminary report

The paper presents the Russian Intonation Corpus (RINCO), a multimedia speech corpus designed for the study of prosody in Russian, together with its history, purpose, and functionality. RINCO started as a series of MS Access databases developed in 2004–2009 by a research team at the philological faculty of Moscow State University led by S. V. Kodzasov. The databases contained descriptions of the most important prosodic characteristics (several tonal features, tempo, loudness, phonation types) for quasi-natural dialog utterances, as well as for a selection of spontaneous and prepared spoken narrative texts of different genres. These descriptions have been transferred to a multilayer time-aligned markup format, suitable for local use in the ELAN annotator and for online browsing, playback and search using the LAT server platform. The corpus in its current state should be fully operational at http://languedoc.philol.msu.ru/rinco by the end of 2012. Various possibilities of enhancing the corpus are discussed, as well as some findings from a sample query for the “IK-5” intonational construction.

Baranov A. N., Dobrovolskii D. O., Kiseleva K. L., Kozerenko A. D., Voznesenskaia M. M.

Towards a frequency dictionary of Russian idioms

Up to now we have not seen any frequency dictionary of Russian idioms. There are multiple reasons why this task seems so challenging. First of all, idioms appear in texts relatively rarely compared with “normal” words; next, their frequency depends heavily on the type of discourse as well as on the speaker’s (or writer’s) linguistic preferences. Our task is to compile a frequency dictionary of Russian idioms based on the list of idioms developed by our team for the Thesaurus of Modern Russian Idioms (2007). This list amounts to 8000 items and is being constantly revised and updated. We also have 4 text corpora built especially for the needs of idiom analysis. They cover the spheres of modern Russian relevant for idiomatic expressions: prose, drama, journalism, and detective fiction. We are going to take into account the individual style of different authors: only those who use idioms regularly will be part of the text sample. The main type of information that the dictionary will provide is the frequency of each idiom in the corpus, taken as a whole, as well as in various genres. We will also make a list of word forms that are part of the idioms and provide statistical information on the productivity of various semantic classes and grammatical forms in Russian phraseology. This parameter, quite novel for this field, was labeled phraseological activity. The results of this project can link together phraseology and other aspects of the language system and facilitate contrastive idiom analysis.

Belikov V. I., Selegey V. P., Sharoff S. A.

Preliminary considerations towards developing the General Internet Corpus of Russian

This talk presents the project for creating the General Internet Corpus of Russian (GICR). We start with analysing technological, structural, functional and content problems of existing Russian corpora. Then we discuss the need and the possibility for creating a new corpus which is based on the Russian Internet. Finally, we define the principles for text collection, classification and annotation, as well as the necessary parameters and functions of the interface of GICR.

Bocharov V. V., Alexeeva S. V., Granovsky D. V., Ostapuk N. A., Stepanova M. E., Surikov A. V.

Text segmentation in the OpenCorpora project

By segmentation we mean text tokenization and sentence splitting. By default most text processing pipelines start from these two jobs before any morphology or syntax is involved. Moreover, later analysis stages depend on the way tokenization and sentence splitting are performed. In the OpenCorpora project we have done manual tokenization and sentence splitting work on about 600K text forms. The paper presents a machine learning technique trained on this data, the results obtained, a segmentation standard and its motivation.
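For orientation only, the two segmentation jobs named in this abstract can be illustrated with a naive rule-based baseline. This is not the machine learning technique the paper trains on OpenCorpora data; the regular expressions and the function names `tokenize` and `split_sentences` are assumptions made for this sketch.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation marks (naive rule)."""
    # \w+ matches a run of letters/digits (Unicode-aware, so Cyrillic works);
    # [^\w\s] matches any single character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(tokenize("Привет, мир."))
print(split_sentences("Это тест. А это второй!"))
```

Rules of this kind break on abbreviations such as "т. е.", where a period followed by a space does not end a sentence; cases like this are one reason to learn segmentation from manually annotated data instead.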

Bogdanov A. V.

Description of gapping in a system of automatic translation

In this paper we discuss a description of one type of ellipsis in a Russian-English and English-Russian automatic translation system. A short definition of this type of ellipsis (gapping) is given, as well as an overview of the architecture of the automatic translation system. The method of describing gapping is then considered, and its advantages and disadvantages are discussed in detail. Examples from real texts illustrating both the advantages and the disadvantages of the description are given.

Bogdanova N. V.

A dictionary of discourse elements of Russian speech: project description (based on corpora material)

The paper is devoted to the creation of the Dictionary of Discourse Elements of Russian Speech. The author considers expanded hesitation constructions to be markers of spontaneous speech. These constructions may consist of one or several words which actually fill pauses; we consider them to be discourse elements of a specific nature, typical of speech production. Our conclusions are based on observations of the Russian Speech Corpus (the balanced annotated text collection and its component «One speech day»). The first glossary version of the dictionary and the structure of the dictionary entry are given.

Bolshakov I. A., Bolshakova E. I.

An automatic morphological classifier of noun phrases in Russian

A morphological classifier of Russian noun phrases of almost arbitrary composition and length is described. Our study covers all peculiarities of Russian declension: fleeting vowels, multiplicity of declension patterns, animacy / inanimacy, specific additional cases, simultaneous occurrence of declinable nouns, adjectives and/or numerals in a phrase. The formula of Russian noun declension is a mere concatenation: word-form = pseudo-stem + pseudo-ending. The declension class is usually determined by the final 1 to 5 letters of the nominative. However, exceptions are numerous and they are collected in many built-in lists. We consider separately isolated nouns, groups with no more than two declinable words, and extra long groups. The classifier is a part of CrossLexica, a large electronic dictionary of Russian, and is tested on 115,000 noun phrases of 1 to 6 words included in the CrossLexica dictionary. Based on the built-in classifier, an autonomous module is also constructed, which additionally declines extra-CrossLexica noun phrases.
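The concatenation formula above (word-form = pseudo-stem + pseudo-ending, with the class chosen from the final letters of the nominative) can be sketched as follows. The two toy paradigms, the case list and the names `PARADIGMS`, `classify` and `decline` are assumptions for illustration; they are not CrossLexica's actual classes or exception lists.

```python
# Hypothetical toy paradigms: nominative suffix -> six singular pseudo-endings.
CASES = ["nominative", "genitive", "dative",
         "accusative", "instrumental", "prepositional"]
PARADIGMS = {
    "а": ["а", "ы", "е", "у", "ой", "е"],   # e.g. лампа
    "":  ["", "а", "у", "", "ом", "е"],     # e.g. стол (inanimate, hard consonant)
}

def classify(nominative):
    """Pick the class from the final letters, trying the longest suffix first."""
    for suffix in sorted(PARADIGMS, key=len, reverse=True):
        if nominative.endswith(suffix):
            return suffix
    raise ValueError(f"no declension class for {nominative!r}")

def decline(nominative):
    """Build every form as pseudo-stem + pseudo-ending."""
    suffix = classify(nominative)
    stem = nominative[:len(nominative) - len(suffix)]
    return {case: stem + ending
            for case, ending in zip(CASES, PARADIGMS[suffix])}

print(decline("лампа"))
print(decline("стол"))
```

A word such as книга would come out wrong under this toy table (it would yield книгы instead of книги), which shows why a real classifier needs many more classes and the built-in exception lists the abstract mentions.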

Borisova E. G.

Discourse markers used for governing understanding of texts

The article deals with speech particles (mostly modal particles and adverbs) that can be used by the Speaker to guide the Hearer in understanding speech. The units used for correcting or denying possible implications or presuppositions of the Hearer are investigated. These are particles like prosto (just), adverbs like sobstvenno (actually), voobsche (to sum it up) and some others. The investigation argues for the necessity of including discourse implications in the representation of the sense of an utterance or other discourse unit.

Chetviorkin I. I., Braslavski P. I., Loukachevitch N. V.

Sentiment analysis track at ROMIP 2011

Russian Information Retrieval Seminar (ROMIP) is a Russian TREC-like IR evaluation initiative. In 2011 ROMIP launched a new track on sentiment analysis. Within the track we prepared a training collection of user reviews along with ratings for movies, books, and digital cameras. Additionally, we compiled a test collection of blog posts with reviews in the same domains and labeled them according to expressed sentiment. The paper describes the collections' characteristics, track tasks, the labeling process, and evaluation metrics. We summarize the participants’ results and make suggestions for future editions of the track.

Chetviorkin I. I.

Testing the sentiment classification approach in various domains - ROMIP 2011

We offer a review of sentiment classification experiments in various domains using different training sets. In the movie domain we studied the impact of opinion word weights on the quality of classification. We selected the best feature set and ran it on each task-domain pair. In several tasks our algorithm achieved high classification quality.

Chistikov P. G., Korolkov E. A.

Data-driven speech parameter generation for Russian text-to-speech system

We propose a speech parameter generation approach for Russian based on hidden Markov models. The speech parameter sequence is generated from HMMs whose observation vectors contain speech characteristics. As a baseline we use the spectrum represented by mel-frequency cepstral coefficients (MFCC), pitch and duration parameters. All of them can be easily complemented by any other parameters, improving the quality. For the creation of the voice model we use linguistic and prosodic features which are the observations of every allophone in the utterance. This paper also presents the results of research into selecting the most effective features to characterize an allophone. Experimental results show that Russian speech can be successfully parameterized and an arbitrary utterance can be synthesized from the generated parameters.

Daniel M. A., Zelenkov Yu. G.

Russian National Corpus as a playground for sociolinguistic research. Episode IV. Gender and the length of the utterance

The paper discusses sociolinguistic applications of statistical analysis of the spoken subcorpus of the Russian National Corpus. Given the considerable size of the corpus (about 10 mln tokens), an analysis of co-variation of various linguistic parameters with one of the few sociolinguistic parameters available (the speaker’s gender) may give rich and interesting results. One specific example of co-variation is considered in detail: the mean length of the utterance (in tokens). Comparing this parameter in public communication shows a statistically significant difference between the speech of men and women (men talk more), while the same difference is absent in private communication. Another important parameter is the gender of the addressee. Again, co-variation is quite different in public and private discourse. In private communication, the utterances are longer when addressing someone of the same sex, and the difference between men and women is not statistically significant. In public communication, the utterances are longer when addressing a woman, whether the speaker is a man or a woman. These conclusions are consistent with the results of sociolinguistic gender studies obtained elsewhere and by other methods. Linguistic differences between men and women are not absolute but depend on the communicative situation (public vs. private). Public discourse is a playground for linguistic competition in which men are the winning party. In private discourse, competition dissolves.

Davydov A. G., Kiselev V. V., Kochetkov D. S., Tkachenia A. V.

Optimal feature selection for speaker’s emotional state classification

The research focus of the present work is the optimization of a feature set for voice emotion recognition. The first part of the article contains a brief review of the speech features most widely used in emotion recognition tasks. It is shown that acoustic characteristics of a voice can be divided into five categories: prosodic, dynamic, qualitative, spectral and power. A number of the most effective features and statistical functionals derived from these features are also discussed. After that, the two most widespread techniques of robust feature selection are explained. In the experimental part of the article, a feature selection algorithm developed on the basis of the sequential feature selection (SFS) approach is presented. Further, the cross-validation procedure used in our studies is described. The recognition rate obtained on the Berlin database of emotional speech using the optimized feature set (20 features) with the 10-fold cross-validation procedure is approximately 85%. In the conclusion we discuss some properties of the derived feature set and the confusion matrix of the developed SVM classifier.
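The general idea behind sequential feature selection can be sketched as a greedy forward search. This is a generic textbook variant, not the authors' specific algorithm; the function name, the stopping rule and the toy scoring function are assumptions. In a setting like the one described, `score` would be a cross-validated recognition rate of a classifier trained on the candidate subset.

```python
def sequential_forward_selection(features, score, max_features):
    """Greedily add the feature that most improves score(subset)."""
    selected = []
    best_score = score(selected)
    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Evaluate every one-feature extension of the current subset.
        scored = [(score(selected + [f]), f) for f in candidates]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no candidate improves the score: stop early
        selected.append(top_feature)
        best_score = top_score
    return selected

# Hypothetical stand-in for a cross-validated recognition rate.
UTILITY = {"pitch_mean": 0.5, "energy_std": 0.3, "noise": -0.2}

def toy_score(subset):
    return sum(UTILITY[f] for f in subset)

print(sequential_forward_selection(list(UTILITY), toy_score, max_features=3))
```

With the toy utilities above, the search keeps the two helpful features and stops before adding the harmful one.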

Delikishkina E. A., Fedorova O. V.

The effect of the syntactic role of the antecedent on ambiguous pronoun resolution in Russian

According to a common view, referential choice is determined by the degree of activation of the referent in the cognitive system. It is even argued that our choice of referential device can be predicted solely on the basis of the activation factor. However, sometimes anaphora resolution is complicated by the presence of two or more competing referents, each of which enjoys activation high enough to be coreferent to the pronoun. If neither the context nor the grammar can help us identify the antecedent, we are confronted with referential ambiguity. It has long been the subject of debate what determines the choice of referent in such cases. Various factors have been shown to affect pronoun resolution. According to one account, the preferred antecedent of an ambiguous pronoun is the first-mentioned NP. The first-mention advantage is attributed to language-independent, general cognitive processes. Alternatively, it is argued by some researchers that pronoun resolution is affected by specific linguistic factors such as the antecedent’s grammatical role. According to this account, the most favored antecedent is the grammatical subject. However, distinguishing between the two hypotheses using the data from languages such as English is impossible as the first mentioned referent is also usually a subject of the clause. In the present study we describe an experiment that has shown that speakers of Russian are more likely to corefer ambiguous pronouns to grammatical subjects regardless of whether they are mentioned first or not. We presented our subjects with ambiguous texts consisting of two sentences. The first one contained two referents, and the second one started with a pronoun that could refer to both of these referents. The first sentence had two versions: each of the referents could be either a subject or an object, which was attained by changing the diathesis of the transitive verb. 
Each of the 32 subjects that participated in the experiment was assigned to one of the two experimental lists that contained only one version of each stimulus text. The results showed that speakers of Russian considered the subject of the sentence to be the antecedent of the pronoun in 58.4% of cases, which is statistically significant (binomial test, p<0.01), thus demonstrating the importance and independence of the syntactic role factor. In conclusion, we reason that the apparent universality of the subject-preference strategy (similar results were obtained for other languages, e.g. Finnish) supports the idea that the choice of grammatical structure is conditioned by the constraints on one’s cognitive system, or rather working memory.
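For readers unfamiliar with the binomial test cited above, a minimal sketch of an exact two-sided version against chance level (p0 = 0.5) follows. The abstract does not report the number of responses, so the counts in the example are purely hypothetical, and the function name is an assumption.

```python
from math import comb

def binomial_test_two_sided(successes, trials, p0=0.5):
    """Exact two-sided binomial test: total probability, under p0, of all
    outcomes that are no more likely than the observed one."""
    def pmf(k):
        return comb(trials, k) * p0**k * (1 - p0)**(trials - k)

    observed = pmf(successes)
    # Small relative tolerance guards against floating-point ties.
    total = sum(pmf(k) for k in range(trials + 1)
                if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts, NOT taken from the paper: 8 subject choices out of 10.
print(binomial_test_two_sided(8, 10))
```

The same proportion can be significant or not depending on the number of trials, which is why the test needs the raw counts rather than the percentage alone.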

Dobrovolskii D. O., Levontina I. B.

Synonymous focus particles in German and Russian

It is well known that both Russian and German contain a considerable number of particles of various types. In this paper, we discuss two German focus particles, "eben" and "gerade", as well as their Russian near-equivalents "imenno" and "kak raz". Whereas "eben" and "imenno" can function as independent sentences, this use is blocked for "gerade" and "kak raz". We show that this blocking goes back to some relevant semantic features of the given particles. "Gerade" and "kak raz" mark the point of intersection of two situations, two mental schemas and the like. Although both "eben" and "imenno" may be used as separate utterances, this ability goes back to different semantic properties. The meaning of "imenno" is based on the idea of confirmation, of agreement with the interlocutor, while the German particle "eben" points to the most important, central element of the situation.

Dobrushina N. R.

Subjunctive in complement clauses

Irrealis forms, such as subjunctive or conditional, are cross-linguistically widespread in complement clauses. Which verbs take irrealis complement clauses is subject to typological variation. I investigate which Russian verbs license subjunctive complementation. The list of predicates is compiled on the basis of the occurrences of subjunctive forms in the Russian National Corpus. As a result, two types of subjunctive complement clauses are distinguished. For one group of verbs, subjunctive complementation is similar to purpose clauses. The verbs in the second, much smaller group are used in constructions with low-probability epistemic status.

Gerassimenko O.

Functions of feedback items а, ага and гм in Russian phone conversation

Functions of three feedback items in spontaneous speech are discussed on the basis of the preceding and following dialogical context. The analysis of a 45-minute corpus of outpatient department phone calls shows that speakers use гм to signal the inappropriateness of the previous turn, а to signal the need for clarification and ага to signal the completeness and sufficiency of the previous turn.

Grishina E. A.

Autodeixis: types and meaning

The study analyzes the main types of autodeictic gestures, i.e. the gestures which point to the speaker. The main configurations of the pointing hand are described and the main linguistic factors affecting these configurations are specified. We distinguish between thematic autodeixis with the index finger and performative autodeixis with the open hand. In addition, three types of portmanteau autodeixis, which combine deictic and non-deictic components, are described. The study makes use of the data of the Multimodal Russian Corpus (MURCO).

Gurin G. B., Belikova A. E.

A procedure for evaluating degree of conventionality of metaphor expressions: from intuition to operational criteria

The paper seeks to improve existing approaches to the quantitative analysis of metaphors in discourse and attempts to develop a new classification for metaphorically used lexical items based on the degree of their conventionality, i.e. on the extent of integrity of the original image element. The novelty of the approach lies in its consistent avoidance of reference to intuitive and introspective criteria, which are widely used in metaphor studies. The measurement of metaphorical creativity of the source and the target is based on a technique involving a set of objectively established features, referring mainly to external sources (dictionaries, corpora) and only to a minor extent to the intuition and subjective language competence of the researcher. The proposed classification is paired with a verifiable algorithmic procedure for identifying the metaphor type. The method allows quantitative evaluation of the degree of metaphoricity for separate texts and text collections, as well as assessment of the metaphoricity of original texts and their translations. Russian is used as a source of illustrations.

Hetsevich Yu. S., Hetsevich S. A., Lobanov B. M.

Belarusian and Russian linguistic processing modules for the system NooJ as applied to text-to-speech synthesis

This paper describes the program NooJ, which provides a platform for the construction of modules for resolving linguistic problems in the area of text-to-speech synthesis. Belarusian and Russian are chosen as target languages. Basic and comprehensive electronic grammatical dictionaries for NooJ are described. We present the entire algorithm of converting dictionaries of the two languages into the form acceptable in NooJ, which retains all lexical, grammatical and accentual information. The dictionaries developed for NooJ help solve the problems of annotating words with lexical and grammatical categories and syllabic accents, as well as the problems of searching texts for a definite sequence of words according to their grammar and word forms. The grammars developed for NooJ are notable for their clarity and the speed at which they can be built. The structure and algorithm of the morphological grammar working on the text are given, including the localization of words written in Cyrillic and Roman letters. The structure and algorithm of the work of two syntactic grammars on a text are described: one aimed at searching for phonetic words and the other at searching for syntagms with a particular number of phonetic words. NooJ output produced after dictionaries or grammars have been applied to a text is exported to text format. Future work includes the completion of the transfer of accentual information into Belarusian and Russian NooJ dictionaries, and the construction of grammars for identifying accentual units in syntagms and grammars aimed at learning rhythmic structures of texts.

Iomdin B. L., Lopukhina А. А., Piperski A. Ch., Kiselyova M. F., Nosyrev G. V., Rikityanskiy А. М., Vasilyev P. K., Kadykova A. G., Matissen-Rozhkova V. I.

Thesaurus of Russian everyday life terminology: new problems and new techniques

The paper addresses various issues associated with the development of a new encyclopedic thesaurus of Russian everyday life terminology (a current project by a group of researchers in V. V. Vinogradov Russian Language Institute, Russian Academy of Sciences). As new lexical material comes into the picture, lexicographers are faced with exciting challenges which require new approaches. Handling Internet data is one important issue. Frequencies of spelling variants and synonyms in texts of various genres, in particular, in blogs and user query logs, are determined using new Yandex-based tools specially designed for the project. The layout and the results of several kinds of speaker surveys and experiments are discussed in detail. The results of the paper may prove useful for a variety of lexicographic projects as well as for theoretical linguistics. The data under discussion, which has never been systematically studied before, appears to possess peculiarities worth further in-depth semantic research.

Iomdin L., Petrochenkov V., Sizov V., Tsinman L.

ETAP parser: state of the art

The state of the art of the ETAP-3 syntactic parser, which took part in a recent competition of Russian parsers, is presented. The paper gives an outline of the main linguistic resources involved in the parser’s operation, describes the main features and steps of the algorithm, and briefly discusses the applications in which the parser is used, including a machine translation system, a software environment for the creation of a syntactically tagged corpus of Russian, and a hybrid system of Russian speech synthesis. Special attention is given to concrete scientific approaches and solutions that determine the functioning of the parser, including methods of lexical and syntactic disambiguation.

Kashkin E. V., Reznikova T. I., Pavlova E. K., Luchina E. S.

Verbs describing sounds of inanimate objects: towards a typology

The article deals, in a typological perspective, with verbs describing sounds of inanimate objects (cf. the noise of a door being opened, of coins in somebody’s pocket, of a river, etc.). The analysis is based on the data from four languages (Russian, German, Komi-Zyrjan, Khanty), which were obtained from dictionaries, text corpora and field investigation. We discuss the primary meanings of these verbs and identify the parameters that underlie semantic distinctions between them (type of sound source and its features, type of situation causing the emission of a sound, acoustic properties of sounds). Then we concentrate on the semantic shifts undergone by sound verbs. First, we consider their metonymic changes, focusing on morphological and syntactic processes accompanying these shifts. Second, we analyze metaphoric uses of sound verbs, bringing out typical patterns of their derivation. These results should form the basis for a future large-scale typological investigation of sound verbs.

Kibrik A. A., Linnik A. S., Dobrov G. B., Khudyakova M. V.

Optimizing a machine learning based model of referential choice

In this paper we discuss different ways in which a machine learning based system of referential choice prediction may be optimized. Compared to the previous studies, we have improved and extended the annotation scheme. At the next step a “cheaper” set of parameters was applied in order to achieve faster and less knowledge-rich processing. Our results demonstrate that the best prediction accuracy can be reached only by using the maximum of the available data, though it is possible to eliminate some of the highly “expensive” factors. Genre affiliation has been added to the system as one of the parameters and proved to increase the accuracy score. Finally, we have started a series of psycholinguistic experiments in order to explore the categorical vs. probabilistic character of the choice the speaker/writer makes. Our first results are promising in that in those instances in which the algorithms fail to make a precise prediction, more than one referential option is actually available, according to human judgments.

Kiuseva M. V., Ryzhova D. A., Kholkina L. S.

Adjectives ‘heavy’ and ‘light’ on the typological background

The report is devoted to the study of the Russian adjectives ‘heavy’ and ‘light’. The unexpected symmetry of these lexemes is discussed: on the one hand, they are antonyms in practically all of their meanings (internal symmetry); on the other hand, this semantic area has the same structure in the languages that served as the typological background for our research: Serbian, French, English and Chinese (external symmetry). Yet thorough research shows that the similarity of the lexemes is superficial. The following essential differences are revealed: 1. the adjective ‘heavy’ is used in its direct meaning considerably more frequently than ‘light’; 2. the adjective ‘light’ is used more frequently in metaphoric contexts, which becomes particularly apparent in the expression of degree: the downtoner meaning of ‘light’ is better developed than the intensifier meaning of ‘heavy’; 3. the adjective ‘heavy’, when used with certain nouns, can involve the meaning component ‘slow’, while the adjective ‘light’ can, on the one hand, involve the antonymous component ‘fast’ and, on the other hand, through its downtoner meaning, involve the component ‘slow’; 4. an analogous phenomenon for adjectives meaning ‘light’ can be seen in the evaluative component: on the whole, the situation ‘lightly’ is rated positively, but there are contexts in which an adjective meaning ‘light’ has negative connotations. The adjective meaning ‘heavy’ can only have negative connotations in Russian, but it can also develop positive connotations in other languages (e.g. ‘important’ in Chinese) if its original meaning is not ‘difficult to lift or move’ but ‘(objectively) weighing a lot’.

Klygina E. A., Kreydlin G. E.

The database “human body and corporeality in natural language and culture (structure, ideology and content)”

A fundamental notion of “the semiotic conceptualization of the human body” has been proposed and investigated in order to describe the typical views of ordinary people on the human body, different phenomena of corporeality and interaction between verbal and nonverbal (corporeal) sign codes in an oral dialog. The authors present an original computer database holding information about diverse corporal (i. e. somatic) objects, their structural, physical and functional features and actions performed by the objects and on the objects. Internal structure of the database, its substantial online resources and interface that deal with different aspects of the Russian semiotic conceptualization of the human body and with the sign codes described are outlined and characterized. The database can answer different types of questions that are presented in the form of requests. The system has a user-friendly interface through which users can quickly and easily find answers to their questions and get results in a convenient form.

Klyueva N. M.

Some differences between Czech and Russian: a parallel corpus study

We present a comparative study of some constructions in Czech and Russian. Though Czech and Russian are closely related Slavic languages, they have a few differences at the level of syntax, morphology and their semantics. We discuss incongruities that we found in a parallel Czech-Russian corpus, mainly reflecting differences in the sentence structure. The linguistic evidence presented in the paper will be used in constructing the transfer module of a rule-based machine translation system between Czech and Russian.

Kobozeva I. M., Lukashevich N. Iu.

Human characters through the prism of adverbs

The present paper deals with adverbs describing the actions of a person from the character trait perspective: words like carelessly, tactfully, bravely, etc., derived from adjectives that ascribe some character trait to a person (careless, tactful, brave). The aim was to see what information such adverbs provide for defining the patterns of behaviour that constitute the meaning of the corresponding character trait adjectives. It is argued that the analysis of adverbs’ contexts helps to outline more precisely the range of situations relevant for the manifestation of a certain character trait, the range of actions which a person with the described character trait is inclined to perform, and their motivations. This is demonstrated by the analysis of contexts with sincerely, frankly and candidly from the British National Corpus.

Kostyrkin A. V., Panina A. S., Reznikova T. I., Bonch-Osmolovskaia A. A.

Constructing a lexico-typological database (for a study of pain predicates)

We present a database developed for lexico-typological study of expressions of pain (demo version available at http://orientling.ru/bolit/). Its design implements the non-relational, NoSQL approach, where data is organized into a flexible tree not limited in size and depth rather than presented as a table. Linguistic annotation is placed directly into the text of example sentences and their translations, so that in effect the database is structured as an annotated corpus. This formalism gives much freedom both to the developers in their task of annotating examples and to users in their queries, since it allows them to vary the level of detail according to how much information is available or needed. Linguistic annotation includes tags for syntactic roles, some syntactic constructions and their components (relative clauses, light verbs, formal subjects, parts of compound words), morphological information (tags for case, number, aspect, etc.), as well as semantic tags specific to the domain of pain (semantic roles and types of metaphoric shift).

Kotelnikov E. V., Klekovkina M. V.

Sentiment analysis of texts based on machine learning methods

We present the methods of text processing and machine learning used to fulfill the tasks of the sentiment analysis tracks at the ROMIP-2011 seminar. The issues of choosing the optimal variant of the text vector model and the most suitable machine learning method are addressed. Unsupervised and supervised TF.IDF methods of text representation are used. We apply the following classification methods: Naive Bayes, Rocchio’s method, k-Nearest Neighbors, Support Vector Machines (SVM), a method based on keywords, and a method which combines SVM and the keywords method. The experiments showed that the best text representation is an unsupervised binary model with cosine normalization. The combination of SVM and the keywords method showed the best results for classification. The authors analyze the results in comparison with those of other participants of ROMIP-2011.
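
As a rough illustration of the text representation named in this abstract, the sketch below builds a binary bag-of-words vector and applies cosine (L2) normalization. The function names, the toy vocabulary and the whitespace tokenization are assumptions made for illustration only; the abstract does not describe the authors' implementation.

```python
import math


def binary_cosine_vector(tokens, vocabulary):
    """Return a binary term vector scaled to unit Euclidean length."""
    present = set(tokens)
    # Binary weighting: 1.0 if the term occurs at all, regardless of its count.
    raw = [1.0 if term in present else 0.0 for term in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    # Cosine normalization; an all-zero vector is returned unchanged.
    return [x / norm for x in raw] if norm > 0 else raw


def cosine_similarity(u, v):
    """Dot product, which equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy data, not taken from ROMIP-2011.
vocabulary = ["good", "bad", "film"]
review = binary_cosine_vector("good film good".split(), vocabulary)
```

With this weighting a repeated word contributes no more than a single occurrence, so document length and repetition affect only the normalization factor.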

Kotov A., Budyanskaya E.

The Russian emotional corpus: communication in natural emotional situations

The Russian Emotional Corpus (REC) includes annotated video recordings of natural communication in tense emotional situations: oral university exams and talks between clients and officers at a municipal office regarding utility bills. Annotation of the corpus describes speech (text, syntagmatic structure, speech acts, face threatening acts and the cases of irony), facial expression, gaze direction and hand gestures. In the recorded emotional situations informants show diverse emotional cues and numerous rational and emotional communicative strategies. The annotation allows us: (a) to search for a specific cue (like smiles or squints) and describe its usage and functions in natural communication, or (b) to search for specific patterns, like behavior markers of hesitation or facial cues at the end of utterances with face threatening acts. These observations help us to animate emotional computer agents, which receive semantic trees at their input, simulate rich emotional dynamics and produce semantic trees, ready-made phrases and gestures for the output.

Krasnova E. V., Smirnova N. S.

Regional pronunciation preferences for certain Russian and foreign words

The aim of the reported study is to find out if any regional pronunciation preferences exist for certain words that allow multiple pronunciations in Russian. In total 240 respondents from 12 Russian cities participated in the study. All of them were men aged from 20 to 60. Their pronunciations of 13 Russian words were recorded and analysed. Most of the pronunciation variants are at least tolerated by the Russian orthoepic norm: in orthoepic dictionaries they are marked as either recommended or permissible or outdated. Auditory analysis was applied to the recorded speech data. It consisted in repetitive selective listening to speech segments of interest. Acoustic evidence was addressed in a limited number of controversial instances. As a result, pronunciation distribution statistics were obtained for the analysed words. On the whole, while designating some more or less clear tendencies, the results did not reveal any unambiguously regional pronunciation choices for the target words. Notably, some pronunciation options preferred by the overwhelming majority of speakers do not correspond to the recommended variant. Also of interest is the fact that there are speakers, although very few in number, who may pronounce the same word differently. The obtained statistics of frequency, regional distribution and within-speaker variability of analysed pronunciations can help in forensic comparisons of speech samples and can be taken into account in orthoepic dictionary compilation.

Kravchenko A. N.

Automatic generation of extraction patterns for subjective expressions from untagged text

The goal of opinion mining is to extract and summarize opinionated contents from news, blogs, comments and reviews. One of the main tasks in opinion mining is detecting the boundaries of opinionated expressions and distinguishing between subjective expressions and factual information. High lexicon diversity for different domains excludes the possibility of formulating universal extraction rules that would work for any area of knowledge. In this paper we suggest a solution for this problem, reviewing a classification of subjective expressions in Russian and proposing an algorithm for automatic generation of extraction patterns for subjective expressions from untagged text based on label sequential rules (LSR). The algorithm also includes automatic tagging of the training corpora and result filtering to minimize the need for human participation. At first the proposed algorithm uses an assortment of domain-independent pivots to distinguish opinionated sentences from the factual ones, which makes it possible to avoid manual tagging. Possible subjective expressions are then extracted from selected sentences using a set of syntactic patterns. The applicability of this method is based on the fact that the syntactic structure of subjective expressions is domain-independent as well. The resulting subjective expressions are, on the contrary, domain-specific. After that, the expressions are filtered using a probabilistic algorithm, increasing precision and therefore minimizing the need for human participation. The effectiveness of the proposed approach was evaluated on a collection of approximately 300 000 sentences gathered from three different domains: user reviews of movies, headphones and photo cameras. The best results (80% precision) were shown on domains with existing objective criteria and low lexical variability, such as reviews of cameras and headphones. For movie reviews precision reached 64.3% after filtering.

Krylov S. A.

The general corpus of the modern Mongolian language and its structural-probabilistic model

The paper describes a General Corpus of the Modern Mongolian language (GCML), which contains 966 texts, 1 155 583 words. We also report a morphological analyzer for Modern Mongolian language (MML), a grammatical dictionary for 63 071 lexemes, a general table of morphological homonymy. The processor analyzes effectively 97 % of textual word forms which correspond to 76 % word forms from the inputs of the concordance to the GCML. MML can be described in its quantitative aspect, according to a structural-probabilistic model (SPM) of MML. SPM contains frequency dictionaries (FDs) of MML of different types: FDs of word forms, lexemes, grammatemes, root morphemes and allomorphemes, affixal morphemes and allomorphemes, flexionemes, grammemes. SPM allows describing behavior of various language units in the written text from the quantitative point of view: their frequency, distribution in texts, compatibility with other units etc. It is possible to transform the usual structural model into an SPM, which is based on statistical analysis of texts (in this model units of language are considered as possessing "the weight", the language oppositions and relations are being measured). The paper reports the top lists of some FDs: i. e. ranging FD of word forms (top-list of the upper 44 word forms having frequencies higher than 1700 ipm), ranging FD of lexemes (top-list of the upper 44 lexemes having frequencies higher than 2050 ipm) and ranging FD of grammatemes (top-list of the upper 44 grammatemes having frequencies higher than 2909 ipm).
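
To make the frequency thresholds in this abstract easier to read, the sketch below converts between raw counts and ipm values. It assumes that "ipm" stands for instances per million words and that the stated corpus size of 1 155 583 words is the denominator; the function names are hypothetical and the abstract does not say how the frequency dictionaries were actually computed.

```python
def to_ipm(count, corpus_size):
    """Convert a raw count to instances per million words (ipm)."""
    return count / corpus_size * 1_000_000


def from_ipm(ipm, corpus_size):
    """Convert an ipm value back to an approximate raw count."""
    return ipm * corpus_size / 1_000_000


GCML_SIZE = 1_155_583  # corpus size in words, as stated in the abstract

# Raw count corresponding to the 1700 ipm threshold for word forms.
min_count = from_ipm(1700, GCML_SIZE)
```

Under these assumptions, a word form at the 1700 ipm threshold would occur a little under two thousand times in the corpus.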

Krylova T. V.

Words with the meaning of “medicinal substance”: today and yesterday

The paper deals with Russian words denoting medicinal substance: лекарство, препарат, средство. Our goal was to analyze the semantic relations in this sublexicon. We have found that the sense of лекарство, the dominant and the oldest word of this group, is narrower than the sense of препарат and средство. (In particular, лекарство denotes a medicinal substance intended for the treatment of a disease, which is not obligatory for препарат or средство.) We conclude that препарат and средство are beginning to play the leading part in the semantic group of medicinal substances, which reflects changes in the naive world model.

Kustova G. I.

Semantic types and semantic functions of the adjectivized participles

The paper explores correlations between an adjectivized participle and the corresponding verbal event, cf. priglashennyj professor (visiting professor), and the implications which qualitative meanings of the adjectivized participles are based on, cf. povyshennyj interes (heightened interest), produmannoe reshenie (thought-out decision). Participials derived from the passive past participles are the most numerous class, which includes a number of subclasses, such as: 1) Motivating event, cf. razreshennyj miting (permitted meeting). The event to which the participial refers does not affect the “visible” state of the object. It is impossible to reconstruct such an event; it is necessary to know about it. Thus, the motivating event gives a sign sufficient to identify a subclass of the general class. 2) Perfect, cf. okrashennye volosy (dyed hair). This is the traditionally understood perfect: the state as a result of a prior event. 3) Comparative model, cf. povyshennye trebovanija (increased demands). This is a large class of participials, consisting of two main subclasses. The first subclass includes participials from verbs, in their turn formed from adjectives (sometimes from comparatives like luchshe ‘better’, men’she ‘less’): zamedlennoje razvitie (delayed development), i.e. ‘slower than normal’, zatemnennye stekla (tinted glasses). Participials of the second subclass denote a modified shape: zakruglennyj konchik (rounded tip), iskrivlennyj palec (twisted finger). Such participials express the idea of a relatively small shift in the attribute or parameter scale. 4) Quantitative model (“sufficient degree”), cf. nasyshchennyj rastvor (saturated solution). This group includes participials with a component ‘high degree’. This component turns a participle into a qualitative adjective. 5) Qualitative model (“weak evaluation”), cf. prisposoblennoe pomeshchenie (adapted premises). Participials of this group have a component ‘good’ or ‘bad’, so that prisposoblennoe pomeshchenie implies ‘well adapted premises’, and zapreshchennyj priem (forbidden method) implies ‘a foul, bad method’. Adjectives such as prekrasnyj (beautiful), uzhasnyj (terrible), etc. clearly belong to the class of estimates; their lexical meaning is reduced to the evaluation. However, participials like prisposoblennyj (adapted) or zapreshchennyj (forbidden) are better classified as positive / negative attributes than as real estimates. Although participials often have semantic counterparts among ordinary verb adjectives, they contain important interpretation models not found in other classes. Participials fill in the gaps of characteristics of objects.

Kutuzov A. B., Kunilovskaya M. A., Oschepkov A. Y., Chepurkova A. Y.

Russian learner parallel corpus as a tool for translation studies

The paper presents a project aimed at the development of a Russian Learner Parallel Corpus, discusses the existing analogues, describes the current status and the tasks in which it could be used. The existing parallel corpora contain (comparatively) “correct” translations; whereas the aim of the present project is to create a sufficiently large corpus of imperfectly translated Russian and English texts together with their sources and use it as a tool for translation studies, especially those related to translation mistakes. The new corpus will be a valuable resource for computational linguistics as it provides another way of getting data for evaluation which could be used to improve machine translation systems. As of now, the corpus is available online, it already contains nearly half a million word tokens and is growing. The main source of material is translations made by student translators in Russian universities.

Lashevskaya O. N., Mitrofanova O. A., Grachkova M. A., Romanov S. V., Shimorina A. S., Shurygina A. S.

Building the inventory of Russian nominal constructions

The paper presents experimental results of automatic construction identification performed on the Russian National Corpus (RNC). For this purpose we developed a toolbox which allows extraction and processing of co-occurrence data from RNC samples. Russian nouns are chosen as target words. Lists of constructions were built for each target word. By constructions we mean frequent word combinations which include a target word and frequent lexical-semantic tags (context markers of certain meanings of a target word), as well as frequent lemmas representing the given lexical-semantic tags. E.g.: ВИД (kind, sort, type) + r:abstr t:sport: спорт (sport), футбол (football), биатлон (biathlon), etc. Extracted constructions are grouped according to their structure and lexical-semantic content. In conclusion we verify the experimental results, which implies comparison of lists of constructions with lists of collocations, idioms, etc. registered in various linguistic resources (bigram search engines, dictionaries).

Liudovyk T. V.

The impact of speech and speaker characteristics on the accuracy of automatic speech recognition

The paper reports an investigation of effects caused by speech style and speaker characteristics on speech recognition accuracy. Results of Ukrainian read and spontaneous speech recognition are analyzed. The speech material consists of broadcast news (15 %), talk shows (29 %), and real court reports (31 %). The test corpus contains 17 000 words and includes speech of 9 male and 9 female speakers. The speakers are TV news presenters, politicians, journalists, lawyers, and “ordinary” members of trials. Two experiments have been conducted. The first one consists in automatic speech recognition, while the second one is based on annotations of speech made by experts. Different speech characteristics are investigated, namely speaking rate, speech disfluencies (breathing, hesitation fillers, restarts, fragmented words, reduction and prolongation of words), slang and colloquial words. The most error-prone speech characteristics are restarts and fragmented words, slang and colloquial words. Breathing and hesitation fillers are less error-prone because they are successfully modeled in the speech recognition system. Fast read speech is recognized more accurately than slower spontaneous speech of the same speaker. The most accurate recognition (word error rate 10–15 %) is achieved for speakers who have experience in oral presentations. The obtained results allow a rough prediction of the accuracy of speech recognition based on such speaker experience. Ukrainian speech recognition accuracy achieved so far allows using the speech recognition system for automatic transcription (subtitling) of broadcast news.
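
The word error rate cited in this abstract is conventionally defined as the word-level edit distance (substitutions, deletions and insertions) between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch of that standard definition follows; the function name is an assumption, and the abstract does not state which scoring tool the author used.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # Row for the empty reference prefix: reaching hypothesis[:j] takes j insertions.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        # Matching reference[:i] against an empty hypothesis takes i deletions.
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(reference)
```

Because insertions are counted against the reference length, this measure can exceed 1.0 for very noisy output.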

Liusina V. S.

Oral speech and instant messaging: the impact of the communication channel on dialogue structure

Correlation between oral and virtual dialogues is analyzed. It is known that dialogues consist of minimal dialogic units (MDU). Contrary to a wide-spread opinion that an MDU cannot include another MDU, there is evidence that this is possible when one of the dialogues is not intentional. The structure of 60 Russian Internet dialogues is analyzed. If these dialogues satisfy the “close-to-oral-conversation” conditions (i.e. the interlocutors are not involved in any other mental activities while talking), a mutation of dialogue structure occurs. There are five steps of mutation that lead from an overlooking of a remark to a communicative failure. In such “mutated” dialogues, two interlaced “intentional” MDU are present. However, the communicative failure occurs very rarely due to multitasking. Normally, people are involved in several activities while talking via Internet, which provides necessary pauses in Internet conversation. It can be concluded, therefore, that Internet conversation is a wholly new type of communication.

Malkova A. S.

On the structure of a thesaurus of proverbs

The semantics of a proverb corresponds to some class of situations in which this proverb can be used. We argue that the investigation of the structure of proverb situations can give some productive ideas of how to set up the proverb thesaurus (i.e. a collection of proverbs where each item is supplied with multiple cross-references to relevant texts). We suggest using a special meta-language in order to represent proverb situations. The main items of this meta-language are concepts from the domains of a) human acts and characteristics like labor, help, knowledge etc. and b) outer circumstances caused by human acts and characteristics like success, profit, pleasure etc. Each concept has a dual structure and forms an opposition with a positive and a negative part (such as labor-idleness, help-harm, knowledge-ignorance; success-failure, profit-loss, pleasure-suffering etc.). Two oppositions set together make an elementary statement, where a) one opposition is seen as a result of the other or b) the priority of one opposition over the other is set up. To specify the priority, the following pairs can be used: unreal-real, wanted-accessible, temporary-constant, mundane-hollow etc. If each proverb in the collection is provided with such a formal description, cross-references are created automatically, since semantic similarities will be evidenced by coincidences in the formal structure. The suggested procedures were verified on an experimental proverb corpus, which was not large (500 items) but included most of the generally used texts. Project web-site: http://metaphora2.ru/.

Manicheva E. S., Dreyzis Yu. A., Selegey V. P.

Development of Chinese language lexical-semantic dictionary for the multi-language NLP system

This paper deals with a bilingual lexical-semantic dictionary of Mandarin Chinese designed for NLP purposes. This dictionary of Chinese core vocabulary has been compiled according to the principles of the model-based universal multi-language linguistic technology Compreno, developed at ABBYY. Nowadays most lexical databases of Mandarin Chinese are based on WordNet principles. Our work shows that the Chinese language can just as successfully fit into an alternative universal lexico-semantic database. Here we present an overview of major methodological challenges and solutions for integrating Chinese language data into the Compreno framework. At the moment the lexical-semantic dictionary of Mandarin Chinese covers more than 8000 meanings with well-structured comprehensive information on the deep semantic and syntactic model of a meaning and its lexical and grammatical co-occurrence restrictions, and further work on the dictionary is still going on. This paper focuses on typological differences between Chinese and European languages in terms of the basic unit for a dictionary entry, the grammar paradigm of a word and its meanings, and differences in syntactic realizations of the deep semantic model. The paper also gives reasons why certain theoretical approaches prevailing in the Chinese linguistic tradition were revised to better serve application needs, and what principles of meaning definition were taken into consideration to provide detailed and complete lexicographic descriptions and to compare and contrast Chinese with Russian or any other language.

Mikheev M. Yu.

Sholokhov, or Kryukov after all? Nonformal procedures to identify the authorship of “Tikhij don”

The article deals with an authorship identification issue. A variety of texts by Sholokhov and Kryukov are analyzed by context comparison. For each hypothetical coincidence of key words a formal sample is chosen. Only context structures of Sholokhov and Kryukov correspond to these samples. The search for coincidences in the Russian National Corpus and in the full texts using Google helps verify the unique character of variation. Then we examine some keywords, isolated expressions and whole fragments from K. and Sh. texts, which represent the closest duplicates of one another. As a result, we get different types of coincidences, single and punctual or multiple (consisting of a set of structures), but also those distributed among various places in the texts of both authors. In each case we evaluate the probability that the original author of the expression could be Kryukov.

Mihkla M., Hein I., Kalvik M., Kiissel I., Tamuri K., Sirts R.

Estonian speech synthesis: applications and challenges

In the 21st century Estonian speech synthesis has been developed using the more widespread methods and freeware development systems (MBROLA, Festival, eSpeak, HTS). The applications have hitherto been developed mainly in view of the needs of the visually impaired (audio system for reading electronic texts, voicing of subtitles, creation of audiobooks). The major challenges currently facing the Estonian specialists are naturalness of the output speech and expressive speech synthesis. The article is concerned with the issues of statistical modelling of the prosody of synthesized speech and the relations of prosody with other language levels as well as with extralinguistic features. Analysis of the emotion-bound acoustic parameters (pauses, speech rate, formants, intensity and pitch) enables one to model emotions for speech synthesis. In addition, speech synthesis interfaces are discussed. By means of such interfaces users could control the process of speech synthesis, monitor text-to-speech transformation, follow text structure and vary the parameters (voice loudness, speech rate, voice pitch) of the synthetic voice in various voice applications.

Moldovan D.

Representing and reasoning for explicit, implicit and implicated textual information

Mukhin M., Braslavski P.

What do people ask the community question answering services and how do they do it in Russian?

In our study we surveyed different approaches to the study of questions in traditional linguistics, question answering (QA), and, recently, in community question answering (CQA). We adapted a functional-semantic classification scheme for CQA data and manually labeled 2,000 questions in Russian originating from Otvety@Mail.Ru CQA service. About half of them are purely conversational and do not aim at obtaining actual information. In the subset of meaningful questions the major classes are requests for recommendations, or how-questions, and fact-seeking questions. The data demonstrate a variety of interrogative sentences as well as a host of formally non-interrogative expressions with the meaning of questions and requests. The observations can be of interest both for linguistics and for practical applications.

Nekhay I. V.

Application of N-grams and other letter- and word-level statistics to semantic classification of unknown proper nouns

Automatic semantic classification of unknown proper nouns is a significant problem in the field of automatic text analysis. We investigate the ability to classify proper noun phrases frequent in the Russian language using machine learning approaches and n-gram statistics, as well as other letter- and word-level feature statistics such as word capitalization, number of words, presence of abbreviations and numbers, etc. We use only internal information, but, as we expect to use external information in future, we test scenarios that can arise when the former is utilized for classification. We show that such simple features make it possible to achieve accuracies of up to 80-99.9% in 1-1 trials, 89-99% in 1-other trials and 85% accuracy in a 5-way trial in 5 pre-selected categories, using a training set of about 2000 examples in each category. These results come very close to the results of (Patel & Smarr, 2001), achieved by a similar system designed to classify English proper nouns.
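
The kinds of internal features listed in this abstract (letter n-grams, capitalization, number of words, presence of abbreviations and numbers) can be sketched as follows. The feature names, the boundary markers `^` and `$`, and the rule treating an all-uppercase word as an abbreviation are illustrative assumptions; the abstract does not specify the author's exact feature set.

```python
from collections import Counter


def letter_ngrams(phrase, n=3):
    """Count letter n-grams in a phrase, with boundary markers added."""
    padded = "^" + phrase.lower() + "$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def surface_features(phrase):
    """A few word-level features of the kind listed in the abstract."""
    words = phrase.split()
    return {
        "num_words": len(words),
        # True when every word starts with an uppercase letter.
        "all_capitalized": all(w[:1].isupper() for w in words),
        "has_digit": any(ch.isdigit() for ch in phrase),
        # Assumption: a multi-character all-uppercase word counts as an abbreviation.
        "has_abbreviation": any(len(w) > 1 and w.isupper() for w in words),
    }
```

Feature dictionaries of this kind could then be passed to any standard classifier; which learner the author used is not stated in the abstract.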

Nokel M. A., Bolshakova E. I., Loukachevitch N. V.

Combining multiple features for single-word term extraction

The paper describes experiments on automatic single-word term extraction based on combining various features of words, mainly linguistic and statistical, by machine learning methods. Since single-word terms are much more difficult to recognize than multi-word terms, a broad range of word features was taken into account, among them widely-known measures (such as TF-IDF), some novel features, as well as proposed modifications of features usually applied for multi-word term extraction. A large target collection of Russian texts in the domain of banking was taken for experiments. Average Precision was chosen to evaluate the results of term extraction, along with the manually created thesaurus of terminology on banking activity that was used to verify extracted terms. The experiments showed that the use of multiple features significantly improves the results of automatic extraction of domain-specific terms. Logistic regression was found to be the best machine learning method for single-word term extraction; the subset of word features significant for term extraction was also revealed.
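
For readers unfamiliar with the evaluation measure named in this abstract, the sketch below computes Average Precision for a ranked list of term candidates against a reference set (such as a thesaurus). It uses one common variant that averages over the reference terms actually found in the list; the abstract does not say which variant the authors applied, and the function and argument names are hypothetical.

```python
def average_precision(ranked_candidates, reference_terms):
    """Mean of precision@k taken at each rank k where a reference term appears."""
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked_candidates, start=1):
        if candidate in reference_terms:
            hits += 1
            # Precision over the top `rank` candidates at this hit.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

The measure rewards rankings that place true terms near the top: moving a reference term from rank 3 to rank 2 raises the score even though the set of extracted terms is unchanged.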

Orekhov B. V., Gallyamov A. A.

Bashkir internet: the vocabulary and pragmatics in the quantitative aspect

The paper deals with a quantitative aspect of the Bashkir language segment of the Internet. We analyze the results of a special crawler’s work. Our crawler has indexed Bashkir sites and collected linguistically valuable data on word form frequencies. This data differs from word frequencies in Bashkir printed texts or in the Russian Internet. Most of the frequent words are marked as official. Obscene words are almost nonexistent. This means that the Bashkir Internet is not designed for any kind of communicative needs; the main goal seems to be to convey the message of the existence of the Bashkir language and its presence on the Web. Internet terms like “site” and others are rare in the Bashkir web segment. Such popular words in Runet as “job”, “mobile phone” and others are not as frequent in the Bashnet.

Os’mak N. A.

Methods of lexicographic description of Russian spontaneous speech (a corpora research)

The paper presents the results of an analysis of possible methods of lexicographical description of Russian spontaneous speech aimed at the creation of a General Russian Conversational Dictionary. The analysis is based on the material of the Corpus of Spoken Russian "One Speech Day", which presents unique linguistic material, enabling fundamental research in many aspects: study of real spontaneous speech, phonetics and grammar of spoken language, psycholinguistics, communication studies, etc. The General Russian Conversational Dictionary can be created through the description of the most frequent units or through the analysis of lexical semantic groups. Advantages and disadvantages of the two methods are analyzed. A tentative lexicographical description of a lexical semantic group representing nominations of humans is given.

Paducheva E. V.

Communicative perspective interpretation: basic structures and linear-accentual transformations

Communicative structure (C-structure), which is also called communicative perspective of a sentence, relies upon linear-accentual structure (LA-structure), which is, basically, a linear sequence of tonal groups, each with its syntactic and, possibly, lexical characteristics. C-structure identifies each group as a constituent of a certain kind, such as theme or rheme, and posits semantic relationships between these constituents. These relationships can be hierarchical; for example, a constituent theme provides a place for theme-rheme division of the lower level. The problem is to create a calculus of possible C-structures of Russian sentences and provide each C-structure with the description of its contribution to communicative semantics of the corresponding sentence. A kind of transformational approach to the problem is accepted. Several basic (or neutral) C-structures are revealed, which most naturally correspond to lexico-syntactic structure of the sentence, and a set of LA-transformations is presented. For each transformation its contribution to communicative meaning of the sentence is described.

Pak A., Paroubek P.

Language independent approach to sentiment analysis (LIMSI Participation in ROMIP’11)

Sentiment analysis is a challenging task for computational linguistics. It poses a difficult problem of identifying user opinion in a given text. In this paper, we describe the participation of LIMSI in the sentiment analysis track of the Russian annual evaluation campaign (ROMIP’11). The goal of the track was classification of opinions expressed in blog posts into two, three, and five classes. Our system, based on SVM with dependency graph and n-gram features, was placed 1st in the 5-class task on all three datasets (movies, books, cameras), 3rd in the 2-class task on the movies dataset, and 4th in the 3-class task on the cameras dataset, according to the official results.

Polyakov A. E.

Problems and methods in analysis of Russian texts in prereform spelling

The existing linguistic processors (spellcheckers, lemmatizers, OCR programs) are not suitable for analysis of pre-reform Russian texts because of numerous graphical, morphological and lexical differences from the modern Russian language. Some lemmatizers have a restricted support of pre-reform spelling, but they are closed source and cannot be modified or extended. We have developed a lemmatizer which can properly analyze pre-reform texts and has a facility for flexible adaptation to other spelling systems. This paper discusses the problems in the analysis of pre-reform Russian texts (obsolete forms, spelling variants) and the methods of their solution (normalization, modification of the grammatical model, etc.).

Polyakov P. Yu., Kalinina M. V., Pleshko V. V.

Research on applicability of thematic classification methods to the problem of book review classification

The paper examines the different approaches to forming the training set, methods for extracting classification features, as well as methods of constructing classifiers regarding the problem of book review sentiment analysis. The tasks were to divide book reviews into 2 groups (positive, negative) and into 3 groups (positive, negative, neutral). Several methods were tested in the solution of the two tasks. It was shown that good results could be obtained by using common document categorization methods. The obtained figures approach the best results of the Web-site and regulatory document classification track achieved by participants of ROMIP seminar. A method for enrichment of classification features within the linguistic approach using evaluative vocabulary dictionaries was proposed. It was established that this method gives a slight improvement in the results for the binary classification. We plan to explore in more detail the possibility of using expert-linguistic approaches to the construction of classification features.

Poroshin V.

Proof of concept statistical sentiment classification at ROMIP 2011

In this paper we present a simple statistical classification method that predicts whether the opinion expressed by text in natural language is positive or negative. There are two main approaches in the sentiment or opinion detection: linguistic rule based systems and statistical algorithms. While statistical methods are easier to build when sufficient training data is available, it is widely perceived that a linguistic system can deliver better results. Our work was intended to prove the concept that a simple Naive Bayes based statistical classification algorithm with a minor language dependent adaptation is able to perform well in a binary sentiment classification task. In order to prove the hypothesis, we participated in Russian Information Retrieval Seminar (ROMIP) 2011 sentiment classification track [1], and achieved quite competitive results in sentiment prediction of Russian blog posts. This paper contains a detailed description of our classification method, including a feature extraction and normalization process, training and test data, evaluation metrics; and presents our official ROMIP results.

Savchuk S. O.

Variation in gender forms in the group of pluralia tantum in the Russian language

The paper presents the results of a corpus-based study on variation in Russian pluralia tantum nouns. Most of them vary in gender and/or genitive plural forms. The list of variants was composed by analyzing textbooks and dictionaries compiled at the beginning and the second half of the 20th century. The variants are classified according to their morphological and semantic features. The usage of every lexeme from the list was analyzed in the texts of the Russian National Corpus, all variants were registered in the database and the correlation between variants was determined. The comparison of corpus data with the data derived from dictionaries made it possible to detect the changes in correlation between variants within the studied period and to formulate some trends in variants functioning. The results of the research are used to correct morphologic annotation; they can also be regarded as a material for the lexicographic description of variants.

Schumann A.-K.

Towards the automated enrichment of multilingual terminology databases with knowledge-rich contexts

This paper describes ongoing PhD thesis work dealing with the extraction of knowledge-rich contexts (KRCs) from specialized Russian and German text corpora for the semantic enrichment of terminological resources. In recent years, automatically extracted KRCs have been proposed as a means for deriving empirically grounded concept descriptions for terminography while keeping the time and cost of acquiring such descriptions at a reasonable level. KRCs have been studied for a number of European languages ranging from English through French and Spanish to Catalan; however, not much effort has yet been put into widely spoken, but typologically different languages such as Russian or German. This paper, therefore, describes research efforts aiming at the extraction of KRCs in Russian and German for the purpose of termbase enrichment. Section 1 of this paper presents a brief introduction to KRC research and the motivation for this study. Section 2 gives an overview of related work. Section 3 describes the KnowPipe KRC extraction framework, whereas section 4 outlines ranking experiments with KnowPipe on Russian and German data. Section 5 summarizes the results and describes future work.

Semenova S. Iu.

On the metaphora and the metonymy of the Russian parametric noun

The paper is concerned with ambiguity and semantic transfers of the Russian parametric noun. The semantic processes are explicated through changes in the argument taxonomy of this predicate noun. We consider noun phrases with the genitive case of the parametric noun argument. Metaphoric transfers of the parametric noun associated with disappearance or alteration of the quantitative meaning and measurability are essential for NLP tasks, namely, for automatic extraction of the parametric data. The concepts of weak and strong metaphor are introduced. The weak metaphor that maintains the measurability seems to be one of the principal ways of terminology formation. Metonymic shifts of the parametric noun and its argument should be fixed and ranked in the NLP-oriented dictionary.

Sharapov R. V., Sharapova E. V.

Research of plagiarism in the student works

In this paper we study plagiarism in student works. We consider as plagiarism a direct unattributed copy of a text. We describe how students try to hide plagiarism. We have analyzed a collection of tests, course papers, and student working papers. When writing their own work (a course paper or a student working paper), students usually copy material from just one source, while when writing a test, they usually use more sources. We found that the methods used to hide the plagiarism usually include changing of text; reduction of text; changing of certain pronouns, adjectives and verbs; replacement of Russian letters by similar Latin ones. Among the less frequently used methods are replacements of parts of text, changed punctuation, substitution of invisible characters for spaces, manual and automatic substitution of synonyms, copying texts from some sources and changing them. Methods of hiding plagiarism may change over time if they become less effective.

Shmelev A. D.

Corpus or experiment?

The paper discusses general distinction between observation and experiment as applied to linguistic research. It claims that the resolution of certain issues has to be based on observation, that is, on corpus data (e. g., studies of ancient languages) while other issues require experiment (in particular, issues concerning the difference between common usage and linguistic standards). Two main varieties of experiment may be distinguished, namely, experiment on the researcher’s linguistic competence and interrogation of informants. The latter is the only accessible experimental method if the researcher has no linguistic competence of the language or dialect under investigation. To illustrate the point, the paper uses the example of regional linguistic standards (in particular, norms of pronunciation). It discusses various pitfalls on the way to an accurate account of regional linguistic standards. To avoid falling into one of those pitfalls, the researcher should be clear in his/her mind about the ultimate objective of the investigation and word questions to the informants in a clear form if s/he is going to use a questionnaire.

Sokolova E. G., Kononenko I. S.

Russian-English thesaurus on computational linguistics

This paper summarizes the experience in the construction of a Russian-English information retrieval thesaurus on Computational Linguistics (CL). The need for relating thesaurus terms to the subareas of CL and adjacent sciences is substantiated and the hierarchical structure of subareas is discussed. The kinds of information given in the thesaurus term entry are outlined. A number of terminology description issues are discussed with regard to the specific features of the constructed thesaurus such as bilinguality and insufficient development of Russian CL. Terminological problems are analysed using classification parameters.

Solomennik A. I., Chistikov P. G.

Automatic generation of text corpora for creating voice databases in a Russian text-to-speech system

This paper deals with the problem of speech database design for the needs of unit selection text-to-speech synthesis. An obligatory condition for the naturalness and intelligibility of synthesized speech is a high quality speech database. We propose a computer program developed specifically for the Russian language which creates a phonetically balanced text corpus of given size. We present a description of the program and a comparison of an automatically constructed corpus and some arbitrary corpora. The automatic text corpus generation program is part of a new voice building system for VitalVoice Russian TTS. It helps to supplement a text corpus with missing phonetic units. Further possible improvements of the algorithm are also discussed. We consider several ways to take into account intonational variation of units in a database at the stage of the preparation of a text corpus.

Solovyev A. N., Antonova A. Ju., Pazelskaia A. G.

Using Sentiment-analysis for text information extraction

This article aims to demonstrate the relationship between sentiment analysis and information extraction from text. To illustrate the idea, we propose a sentiment-based summarization prototype instrument. The performance of the method in comparison with three other summarization methods is tested on a small corpus of mass media and blog posts.

Toldova S. Ju., Sokolova E. G., Astaf’eva I., Gareyshina A., Koroleva A., Privoznov D., Sidorova E., Tupikina L., Lyashevskaya O. N.

NLP evaluation 2011–2012: Russian syntactic parsers

NLP Evaluation forum RU-EVAL started in 2010 as a new initiative aimed at independent evaluation of the methods used in Russian language resources and linguistic tools. The second evaluation campaign (2011–2012) is focused on syntactic parsing. It is open both to academic institutions and industrial companies, and its general objective is to assess the current state of the art in the field and promote the development of syntactic technologies. The paper presents the principles and design of two tracks, which were organized thematically, namely, Main track and News. The seven participants followed either a rule-based or a statistical approach; all of them submitted runs to both tracks. The training set consisted of 100 sentences, the dataset for annotation included ca. one million words, and the test set was composed of ca. 800 sentences (500+ sentences for the Main track and 300+ sentences for the News track). The test set was annotated manually as a Golden Standard by two annotators. We describe how the outputs were compared and discuss common pitfalls for evaluation as well as some cases that are still problematic for parsing. As a side effect of the evaluation campaign and benchmarks for future development, the test data including Golden Standard and three automatically annotated answers are available to the NLP community at http://testsynt.soiza.com.

Uryson E. V.

Conjunctions, connectors, and valency theory

According to traditional grammar, conjunctions are words connecting clauses (or words / word combinations). Still, many conjunctions can also connect sentences and longer fragments of texts, being in fact connectors (these are “soft” conjunctions), while some conjunctions are able to connect only clauses (these are “hard” conjunctions). Many Russian conjunctions, both coordinating and subordinating, are soft, but there are also hard coordinating and subordinating conjunctions. Hard conjunctions are easily described in the framework of valency theory as predicates with (at least) two semantic and syntactic actants. Soft conjunctions and connectors also have two semantic actants, but at least one of these actants cannot be revealed in a text by an algorithm. It is argued that some refinements of valency theory are necessary for representing syntactic properties of connectors and soft conjunctions.

Valiakhmetova A. R.

Government of the borrowed nouns denoting professions in the modern Russian language

Linguists who study borrowed words concentrate either on sociolinguistic problems or on the process of foreign word adaptation. Their syntactic features are usually left unconsidered. The present paper deals with the semantics and government of the borrowed nouns denoting professions in the modern Russian language, namely: ‘manager’, ‘agent’, ‘freelancer’, ‘outsourcer’, ‘designer’, ‘couturier’, ‘stylist’, ‘coach’ and ‘tutor’. These nouns can be divided into three semantic groups: ‘employees’ (‘manager’, ‘agent’, ‘freelancer’ and ‘outsourcer’), ‘artists’ (‘designer’, ‘couturier’ and ‘stylist’) and ‘teachers’ (‘coach’ and ‘tutor’). There are two government variants of these words: non-prepositional genitive and prepositional phrase po + dative. Different factors affecting the usage of these variants are considered, such as semantic analogy, word formation analogy and the semantics of the borrowed word. Meaning extension is claimed to cause the widespread usage of the po + dative construction. It can be generalized that ‘employees’ and ‘teachers’ tend to govern the latter construction in accordance with the linguistic drift affecting the Russian language since the end of the 19th century. As for ‘artists’, they prefer the non-prepositional genitive by analogy with the semantically similar word ‘modeller’.

Vasilyev V., Khudyakova M., Davydov S.

Sentiment classification by fragment rules

In this paper approaches to sentiment classification based on using fragment rules are described. Rules are constructed manually by experts and automatically by using machine learning procedures. Training sets, evaluation metrics and experiments are used according to ROMIP 2011 sentiment analysis track.

Yagunova E. V.

Experiments in the generation of information structure of the text

In this paper, an experimental approach is developed to generate information structure of the text. The experimental texts are spoken as opposed to written ones and business texts as opposed to fiction. The texts are analyzed using different experimental settings, viz. (a) soliciting partially predetermined answers from a team of 13 experienced linguists who listened to the spoken texts, (b) soliciting answers from a team of 59 naive subjects who read the written texts, and (c) applying special computer programs to the written texts. The experiments were used to generate a generalized information structure of a text in terms of its constituent utterances viewed as belonging to Topics or Comments. There is an interesting correlation between the classification of the utterances into Topic substructures of the text and the distribution of key words throughout the text. Another result which deserves attention concerns the problem of differentiating between semantic changes induced by specific syntactic shifts and those brought about by information restructuring due to the contextual effects.

Yanko T. E.

Extracting subjective information from spoken discourse: the case for prosodic emphasis

The aim of this paper is to develop a method for extracting subjective information from spoken discourse. The example of subjective information discussed here is the attitude of the speaker to the reported events as rare or abnormal. A disruption of the routine course of life that causes the speaker to express his/her strong feelings relates to the concept of emphasis. Emphasis basically has a prosodic expression. The emphatic prosody designates: ‘it is great that he did it by himself because generally he needs assistance’, ‘he came to the meeting which is very unusual because he never comes’, ‘it was the president who congratulated them which is highly honorable because generally it is the vice-president who does it’. Instances of emphasis are highly frequent in everyday speech as well as in mass media language, especially in breaking news. Reports of achievements and losses, accidents, severe meteorological phenomena, such as hurricanes, earthquakes, swollen streams, and numerical scores of any kind are generally made with the use of emphatic prosody. The instrumental analysis of the physical parameters of emphasis can serve as a means for extracting subjective information from spoken texts. The results presented here are exemplified by records from oral corpora. The frequency tracings are generated by the software program Speech Analyzer.

Zagorulko M. Ju., Kononenko I. S., Sidorova E. A.

System for semantic annotation of domain-specific text corpora

A system for universal annotation of text corpus by an expert is presented that contributes to extraction of domain knowledge within the development framework of information systems in specific domains. The technique and software tools for annotation of text corpora allow the expert to carry out two types of semantic annotation: 1) identify text fragments in which the domain concepts represented by special terms actually appear (term annotation) and 2) identify text fragments (often discontinuous) that correspond to domain relations or situations including their participant structure (event annotation). The general principles and schemes of term and event annotation have been formulated and tested for the domain of heterogeneous catalysis on the basis of the hierarchy of term classes chosen in advance. The system, its functional architecture, and user interface are described. Two main directions of usage of semantically annotated texts are discussed: automatic construction of domain lexicons that associate terms with their linguistic and semantic properties; semi-automatic generation of semantic-syntactic patterns for event extraction.

Zalizniak A. A., Shmelev A. D.

Russian razočarovanie in the European linguistic context: the past and the present

The paper discusses the semantic evolution of the word razočarovanie (as well as other words of the same family: razočarovat’, razočarovat’sja, razočarovan) in Russian. It was disclosed that razočarovanie being a translation loan-word from the French désenchantement had become different from the modern “European” concept encoded in the English disappointment, the French déception, the German Enttäuschung, etc., which refer to the feeling resulting from the failure of expectations. In contrast, the Russian razočarovanie refers to the discovery that something highly valued is not as good as one believed it to be. In modern Russian, the meaning of razočarovanie and other words of the same family has become less specific and closer to the English disappointment; accordingly, their government pattern has been modified. This semantic shift occurred in accordance with the general tendency of loan translation from English, but it is also supported by internal systemic factors. The paper suggests a semantic description of the standard use of the Russian words razočarovanie, razočarovat’(sja), razočarovan and reveals the semantic shift in question, which is happening before our eyes, but is not readily identified by the speakers of Russian.

Zanegina N. N.

Improvised-temporary-compounds as a new expressive mean in Russian

The article describes improvised temporary compounds, a new means of expressiveness and a pattern of word formation novel to Russian. Such compounds are created in writing by hyphenating a phrase or an entire sentence (cf. we had to invite a woman-who-knew-how-to-mop-the-floor). A corpus-based analysis permitted us to categorize these compound lexemes according to the type of their original syntactic structure (we used a corpus of texts available online in blogs). The following functions of such compounds can be distinguished: 1) logical accentuation, putting an emphasis on a certain meaning which is usually conveyed by several words (car marker lights are usually switched on in twilight, I-haven't-seen-sun-for-days weather); 2) filling in a lexical gap (in some cases, the absence of the sought word can be verbalized: I-do-not-know-how-to-call-that-four-wheeled-thing); 3) isolation of a special class of objects (he normally wears his favorite pants and any-T-shirt-he-found-in-the-closet); 4) citing a direct quote as an example of a typical reaction to an object or an event, or as a telling illustration of an event (he called to say he-was-already-on-his-way-to-the-office); 5) underlining the fact that the complement objects depend on this specific main word (me-by-the-window pic taken by Vasya), and some others. We compared hyphenated temporary compounds to compound lexemes in which 1) all elements are fused (written as one word); 2) all elements are connected by an underscore (_); 3) each element is capitalized. Their functions partially coincide with the functions of the compounds in question.

Zangenfeind R.

Towards a system of syntactic dependencies of German

A set of syntactic relations in a dependency grammar for German is proposed which can be used for machine translation and other applications. The basis for this set is the formal model of Russian syntax used in the linguistic processor ÈTAP, which is built on the basis of Meaning ⇔ Text theory. Using a terminology as close to ÈTAP as possible facilitates a potential computational implementation. At this stage of development the system of German dependency syntax comprises 58 syntactic relations, which were used to analyze manually several hundred German sentences. 18 of them have definitions that are similar to those of their counterparts in Russian syntax. Seven relations have the same definitions as their Russian counterparts except for the concrete German lexemes that are part of the definitions. For another 30 relations identical definitions as for the corresponding Russian relations can be used. Three German relations do not have a Russian counterpart.

Zhila A., Gelbukh A.

Exploring context clustering for term translation

Many tasks in natural language processing, such as machine translation, word sense disambiguation, word translation disambiguation, require analysis of contextual information. In case of supervised approaches this analysis is performed by human experts, which is very costly. Unsupervised approaches offer fully automatic methods to fulfill these tasks. Yet these methods are not robust, their results are very parameter-dependent and difficult to interpret. Context clustering is an unsupervised technique for analysis of context similarities. In this work we explore dependencies of context clustering results on various clustering parameters. We also explore suitability of the context clustering for word translation disambiguation by evaluating the clustering results against known classes that are classes of translation candidates.

Zimmerling A.

A unified analysis of clitic clusters in world’s languages

The paper proposes a unified analysis of complex syntactic objects defined as clusters. A cluster is by definition a string of elements {a,b,c…n} which can function without combining with each other but are arranged in a rigid order when they assume a contact position, so that for each pair (a, b) the linear order a > b, i.e. ‘a precedes b’ is fixed. Elements conforming to this definition are called clusterizing. Rules ordering elements in clusters are called Template Rules. In the first section I analyze Template Rules as empiric generalizations made on text corpora representing the normative usage of world’s languages from the class of languages with clusterizing clause-level elements. In the final section I analyze Template Rules as linearization algorithms. The general conclusion is that clusters ordered by Template Rules are normally non-homogenous regarding their morpho-syntactic and prosodic values. I furthermore argue that a theory of clusters can be built with little or no recourse to the prosody of the clitic elements.
