Proceedings 2010

A
Antoshina S.A., Murmansk State Pedagogical University, Lyashevskaya O.N. University of Tromsø
NOMINAL CASE PATTERNS FROM THE VIEWPOINT OF CONSTRUCTION GRAMMAR
Present-day descriptions of nominal government proceed from the hypothesis that case patterns are inherited within a word-formation cluster (cf. udarit’.V molotkom.NP.INS ‘to hammer’ – udar.N molotkom.NP.INS ‘hammer blow’, udarit’.V po golove.NP.DAT ‘to hit on the head’ – udar.N po.PREP golove.NP.DAT ‘blow to the head’). The article studies the emergence of new patterns associated with a noun or an adjective which are not attested with their motivating correlate, cf. zaslon.N pornografii.NP.DAT ‘barrier to pornography’, komitet, analogichnyj.A nobelevskomu.NP.DAT ‘institution similar to the Nobel Committee’ etc., and the same patterns in the light verb construction, cf. postavit’.V zaslon.N pornografii.NP.DAT ‘to put up a barrier to pornography’. We consider two hypotheses: that the noun has its own case pattern (Grimshaw 1990) and that the pattern is imposed by the construction “light verb + noun”.
Apresjan V.Ju. V.V.Vinogradov Russian Language Institute, Russian Academy of Sciences
SEMANTIC STRUCTURE OF WORDS AND THEIR INTERACTION WITH NEGATION
The paper considers factors that influence the ability of predicates to fall into the scope of negation. Among the factors that shape the way a semantic element interacts with negation are its semantic weight and its depth in the semantic structure, which determine its place in the logical structure of the meaning. The semantic weight and depth of an element are influenced by the degree of its semantic specificity and its status in the linguistic worldview. Semantic characteristics of predicates which trigger their ability to interact with negation also determine their government patterns, some of which only allow wide or narrow scope of negation. Sometimes synonyms demonstrate different properties with respect to their interaction with negation ("povezti" 'to get lucky' vs. "poschastlivit'sia" 'to hit the jackpot'). Linguistic differences between these synonyms are determined by the fact that the former is an interpretative verb, and the latter an evaluative verb.
B
Baranov A.N. V.V.Vinogradov Russian Language Institute, Russian Academy of Sciences
ONCE MORE ON FACTORS OF IDIOMATICITY: TAUTOLOGY AND ONYMIZATION
Two new parameters of idiomaticity are discussed: tautology in the inner form of an idiom and onymization. Tautology is the repetition of the meaning of one component of an idiom (or a part of this component) in the inner form of this idiom. Onymization is a process in which properties of a common noun are transferred to a proper noun, accompanied by structural modification of the common noun. Some evidence suggests that tautology and onymization can be considered distinct factors of idiomaticity in the field of phraseology.
Baranov A.N. Dobrovol’skij D.O. V.V.Vinogradov Russian Language Institute, Russian Academy of Sciences
SEMANTICS OF IDIOMS: A HIERARCHY OR A SEMANTIC NET?
We consider two basic approaches to the ideographic description of phraseology: inductive and deductive. We argue in favour of the inductive approach, exemplifying our ideas with data from the “Thesaurus of Current Russian Idioms”. Attempts to describe the semantic fields of idioms applying the deductive approach turn out to be ineffective because a hierarchical logical scheme does not fit in with phraseological data. Many taxa which have to be postulated from a purely logical point of view would remain empty because idioms do not cover the semantic net of language as a whole and are concentrated in certain semantic domains while ignoring many others. Although the inductive approach does not allow for a consistent hierarchical description, the use of paradigmatic (cross-)references, as well as the division of every taxon into core and periphery, reflects, at least partly, relevant semantic relations within the phraseological system.
Mira Bergelson Anna Nekrasova Lomonosov Moscow State University
LINGUISTIC ANALYSIS OF STEREOTYPES: A BALANCE BETWEEN TEXTS AND MEANINGS
The paper deals with Anglo-French ethnic stereotypes found in the publications in the British and French press. Stereotypes are viewed as instances of linguacultural schemas. Various ways of introducing and interpreting information in the media discourse are the subject of the linguistic analysis.
Bogdanova N. V. Faculty of philology and art, Saint-Petersburg State University, Saint-Petersburg, Russia
THE CORPUS OF SPOKEN RUSSIAN: NEW ADDITIONS AND FIRST RESULTS OF RESEARCH
New additions to the balanced part of the Sound Corpus of the Russian Language are presented: an array of life monologues sharing the same linguistic, sociolinguistic and psycholinguistic criteria. New blocks of this corpus and the first results of analysis are described.
Igor M. Boguslavsky Leonid L.Iomdin A.A.Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences
ON VALENCY PROPERTIES OF A WIDE CLASS OF NOUNS
The valency structure of a number of lexicographic types of Russian predicate nouns is considered. The defining property of these nouns is the fact that one of the actants of any such noun can be referred to as the noun itself. For example, in the sentence The train covers a distance of 650 kilometers in 4 hours, the phrase 650 kilometers instantiates a valency slot of the word distance; at the same time, 650 kilometers is the distance.
Bolshakov, I.A. Independent researcher, Moscow Gelbukh A.F. National Polytechnic Institute, Mexico City, Mexico Bolshakova E.I. Moscow State University, Moscow
AN ASSOCIATIVE NETWORK OF CONCEPTS OCCURRING IN INTERNET QUERIES
Based on a vast set of coordinated noun pairs taken from the user query databases of the Google and Yandex search engines, we have built an associative network of concepts that frequently occur in Russian Internet queries. It is shown that the pairs found in the queries also occur in Internet texts with significant frequencies. The resulting associative network is briefly investigated.
Noah Bubenhofer Roman Schneider Institute for German Language Mannheim/Germany
USING A DOMAIN ONTOLOGY FOR THE SEMANTIC-STATISTICAL CLASSIFICATION OF SPECIALIST HYPERTEXTS
In this feasibility study we aim to contribute to the practical use of domain ontologies for hypertext classification by introducing an algorithm that generates potential keywords for specialist hypertext nodes. The algorithm uses structural markup information and lemmatized word lists from the Grammis/ProGr@mm web information system on German grammar, as well as a terminological ontology covering the domain of linguistics. We present the calculation and ranking of keyword candidates based on ontology relationships, word position, frequency information, and statistical significance as evidenced by log-likelihood tests. Finally, the results of our machine-driven classification are validated empirically against manually assigned keywords.
C
Chatviorkin Ilya MSU Faculty of Computational Mathematics and Cybernetics Lukashevich Natalia Research Computing Center of Moscow State University
AUTOMATIC EXTRACTION OF DOMAIN-SPECIFIC OPINION WORDS
To extract opinions effectively, knowledge of a large number of domain-specific opinion words and expressions is required. We present a new method of automatic opinion word extraction. The method combines computational characteristics of word usage in several domain-specific text collections.
Chernenkov D.M. Moscow Institute of Electronics and Mathematics
YET ANOTHER STATISTICAL METHOD FOR NON-VOCABULARY WORD FLEXION PREDICTION BASED ON TEXT CORPORA
The paper describes an algorithm for predicting flexion models of new words occurring in a text corpus. The algorithm makes use of a number of probabilistic models for word flexion and a machine learning based method for selection and ranking. Statistical properties of the corpora and training results are analyzed.
D
Dolozova O.N. Saint Petersburg State University
A SOUND LEXICO-GRAMMATICAL DATABASE OF THE ITELMEN LANGUAGE: CREATION AND LINGUISTIC ANNOTATION
The creation and linguistic annotation of the sound lexico-grammatical database of the Itelmen language is described. The main purpose and structure of the database are presented. Search queries to the database are illustrated. The linguistic annotation makes it possible to search for lexemes and word forms using such parameters as alphabetical order, part-of-speech classification, and some grammatical and phonetic characteristics.
E
Epifanov M.E. Antonova A.J. Batalina A.M. Kobzareva T.J. Lakhuti D.G. Russian State University for the Humanities, Moscow
ITERATIVE APPLICATION OF PART-OF-SPEECH DISAMBIGUATION ALGORITHMS FOR RUSSIAN TEXT
An iterative version of part-of-speech disambiguation algorithms for Russian text is validated. The respective expansion of the instrumental environment for experimentation with linguistic algorithms (ESLA) is described.
N. E. Efremova E.I. Bolshakova A.A. Noskov V.U. Antonov Lomonosov Moscow State University
ANALYSIS OF TEXT TERMINOLOGY BASED ON LEXICOSYNTACTIC PATTERNS
The paper describes lexicosyntactic patterns that specify term occurrences within scientific and technical Russian texts. An experimental study of automatic term recognition procedures based on the patterns is discussed.
F
Fedorova O.V. Delikishkina E.A. Maljutina S.A. Fein A.A. Moscow Lomonosov State University
AN EXPERIMENTAL APPROACH TO REFERENCE IN DISCOURSE: EFFECTS OF RHETORICAL STRUCTURE ON PRONOUN INTERPRETATION
Psycholinguistic experiments were carried out in order to show that rhetorical distance (RhD) is an important activation factor which strongly influences referential choice. The results revealed the RhD effect (i.e. increasing RhD reduces the subject’s ability to identify the antecedent of an anaphoric pronoun) and the working memory (WM) effect (i.e. the higher the subject’s WM span, the more successful the subject is in answering questions).
G
Alexander Gelbukh National Polytechnic Institute, Mexico Sidorov G.O. CIC-IPN Liliana Chanona-Hernandez ESIME-IPN Eduardo Lavin-Villa CIC-IPN
AUTOMATIC DETECTION AND CLASSIFICATION OF TERMS IN A SPECIFIC DOMAIN CORPUS USING LOG-LIKELIHOOD FOR ONTOLOGY CONSTRUCTION
The development of ontologies from plain text documents is very important because manual construction of ontologies is labor-intensive and expensive, while a lot of textual information is already available in electronic format. The paper presents an improved method for automatic detection of terms in plain text documents and their classification using a variety of statistical methods, including term extraction using the Log-Likelihood algorithm, the cosine similarity measure and clustering via the K-Means algorithm. Various experiments were conducted for Spanish using a corpus in the domain of informatics.
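As an illustration of the log-likelihood step mentioned in this abstract, the following minimal sketch scores candidate terms by comparing their frequency in a domain corpus with their frequency in a general reference corpus. It uses the common two-corpus G2 formulation and is not taken from the paper; the function names, the toy counts and the rule of keeping only words that are relatively more frequent in the domain corpus are all assumptions made for the example.

```python
import math


def log_likelihood(a, b, c, d):
    """G2 score for a word seen a times in a domain corpus of c tokens
    and b times in a reference corpus of d tokens."""
    # Expected counts if the word were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2.0 * g2


def rank_terms(domain_counts, ref_counts, domain_total, ref_total):
    """Return (word, score) pairs for words over-represented in the
    domain corpus, sorted by descending G2 (hypothetical helper)."""
    scored = []
    for word, a in domain_counts.items():
        b = ref_counts.get(word, 0)
        # Keep only words relatively more frequent in the domain corpus.
        if a / domain_total > b / ref_total:
            scored.append((word, log_likelihood(a, b, domain_total, ref_total)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy counts, invented for illustration only.
domain = {"kernel": 50, "the": 100}
reference = {"kernel": 5, "the": 100}
ranked = rank_terms(domain, reference, 1000, 1000)
```

In this toy example only "kernel" survives as a term candidate, because "the" is equally frequent in both corpora and therefore carries no domain-specific signal.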
Gilyarova K.A. Russian State University for the Humanities, Moscow
SUCH A GIRL-GIRL. SEMANTICS OF NOUN REDUPLICATION IN COLLOQUIAL RUSSIAN AND THE INTERNET LANGUAGE
This paper presents a semantic analysis of noun reduplication in colloquial Russian. We consider syntactic reduplication, i.e., repetition of a word within the same prosodic unit as in "takaja devochka-devochka" ('such a girl-girl'), "prjam leto-leto" ('a real summer-summer'). Drawing on a corpus of examples gathered from natural speech and Internet texts, we categorize the semantics of noun reduplication into seven types: (1) strict, literal meaning, (2) prototype, (3) connotation, (4) intensified meaning, (5) appreciation, (6) specifying the meaning of polysemous words, (7) referent determiner.
Grishina Elena V.V.Vinogradov Russian Language Institute, Russian Academy of Sciences
THE VOCAL GESTURE AH IN SPOKEN RUSSIAN
The paper analyzes the usage of the vocal gesture Ah according to the data of the Multimodal Russian Corpus (MURCO). The investigation is based on the analysis of the body and face movements, which accompany this vocal gesture in the process of oral speech. As a result, four meanings have been isolated: 1) Ah as an interrogative particle, 2) Ah as a negative particle, 3) Ah as an interjection, and 4) Ah as a physiological exclamation. The vocal gestures Ah and Oh are compared.
I
Boris Iomdin Russian Language Institute n.a. V.V.Vinogradov, Russian Academy of Sciences Aleksandr Piperski Moscow State University n.a. M.V.Lomonosov
PRAGMATICS OF FOOD. CONNOTATIONS IN RUSSIAN AND GERMAN NUTRITION VOCABULARY
Russian and German vocabulary of food (over 250 words in each language) are used to study persistent associations and discuss the problems of their lexicographic description within the connotation theory. Certain new lexicographic solutions are proposed.
K
Anton Kazennikov Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences
A COMPARATIVE ANALYSIS OF MACHINE LEARNING DEPENDENCY TREE-BASED PARSING ALGORITHMS
The paper provides a comparative analysis of machine learning dependency tree-based parsing algorithms. We compare two different machine learning approaches to dependency parsing with the rule-based parser of the ETAP-3 system. One algorithm views the parsing task as the construction of a maximum spanning tree, and the other as a path-finding task. The comparison is made over link accuracy instead of tree accuracy, as the analyzed algorithms behave differently on nonprojective links. The difference from the existing implementations lies in the type of the underlying classifier used.
Karpova O.S. ABBYY Press Arkhangelskiy T.A. MSU Kyuseva M.V. MSU Rakhilina E.V. IRL RAS Reznikova T.I. VINITI RAS Ryzhova D.A. MSU Tagabileva M.G. MSU
THE DATABASE ON RUSSIAN POLYSEMOUS ADJECTIVES AND ADVERBS
The paper reports on a research project aimed at semantic analysis and formalization of polysemy patterns in Russian adjectives and adverbs. The research is based on the data from the Russian National Corpus. For each polysemous adjective or adverb we describe a set of its possible senses, assign each of the senses to a corresponding taxonomic class, identify types of semantic shifts between individual senses, and offer context conditions of these shifts. The results gained from this analysis are implemented in a database, which allows for various generalizations on the regularities of change in adjective meaning.
Kachinska I. Moscow Lomonosov State University, Philological faculty; Krylov S. Institute of Oriental Studies of the Russian Academy of Sciences, Moscow
DIALECT LEXICOGRAPHY: AN ELECTRONIC CATALOGUE OF THE ARKHANGELSK REGION DIALECT DICTIONARY
The Arkhangelsk Region Dialect Dictionary (ARDD) is the largest dialect dictionary of a single region; its corpus totals approximately 180 thousand words. An important problem for the Electronic Corpus is the transition from the phonetic transcription to the grammatical representation of the word form and further on to lemmatization with the main vocabulary form. The results of automatic processing are illustrated by the conversion of the text of an expedition notebook from the rtf format into the database format (dbf), alphabetically arranged word forms, and hypothetical lemmas created by the automatic analyzer in the StarLing environment.
Khomitsevich O. Solomennik M. Speech Technology Center, Saint Petersburg
AUTOMATIC PAUSE PLACEMENT IN A RUSSIAN TEXT-TO-SPEECH SYSTEM
The paper describes an algorithm for automatic pause placement used in the Russian Text-to-Speech system "VitalVoice", which was developed by Speech Technology Center (STC). The algorithm uses POS sequence detection to find appropriate places for breaks in a sentence. We show the system performance with texts of different types and compare it to the baseline TTS system developed earlier.
Khoroshevsky V.F. Dorodnicyn Computing Centre of RAS, Moscow
INFORMATION EXTRACTION AT DIALOGUE CONFERENCES: A NEIGHBOUR’S VIEW
The paper presents a retrospective analysis of publications in the information extraction domain published in the Proceedings of the International Conference “Computational Linguistics and Intellectual Technologies” (Dialogue) in 2000-2009. Statistical methods, ontology engineering, and semantic clustering/classification methods are used to outline the Dialogue conference “world picture”. Significant semantic spaces and “shadow groups” of authors presented in these Proceedings are extracted using the information extraction system OntosMiner/SG developed within the Ontos project powered by the Russian IT company Avicomp Services.
Andrej A. Kibrik Institute of Linguistics RAN Grigoriy B. Dobrov Lomonosov Moscow State University Dmitriy A. Zalmanov Anastasia S. Linnik Moscow Lomonosov State University Natalia V. Loukachevitch Moscow Lomonosov State University
REFERENTIAL CHOICE AS A MULTI-FACTOR PROBABILISTIC PROCESS
One of the major processes involved in discourse production is referential choice, i.e. the choice of a linguistic expression when mentioning a person or an object. Referential choice depends on a great number of discourse factors. A model is proposed that is based on the methods of machine learning and describes referential choice in an annotated corpus of English texts.
Klyshinsky E.S. Kochetkova N.A. Litvinov M.I. Moscow State Institute for Electronics and Mathematics Maximov V.Yu. Keldysh IAM RAS
AUTOMATIC CONSTRUCTION OF WORD COMBINATION DATABASE USING A HUGE TEXT CORPUS
Automatic compilation of a word combination database (verb or verbal adverb + noun, adjective + noun, participle + noun) is discussed. The database is extracted with the help of a very big annotated corpus counting more than a billion words.
Kobzareva T.J. Russian State University for Humanities
SEARCH OF THE SYNTACTIC HEAD FOR A PREPOSITIONAL PHRASE IN RUSSIAN
The search for the syntactic head of a prepositional phrase in automatic surface-syntactic analysis of Russian sentences in a variety of text genres, as performed in the system of modular analysis of Russian syntax (MARS), is described.
Kobozeva I.M. Marushkina A.S. Lomonosov Moscow State University
AN ONTOLOGY OF FORCE INTERACTIONS
The paper is aimed at showing the principles of building text meaning representation (TMR) based on an ontology of force interactions. The theoretical foundation for the ontology is the Force Dynamics theory proposed by Leonard Talmy. The TMR is twofold: it includes both a dynamic scenario of force transitions and a formalized lexical description.
Kozerenko A.D. V.V.Vinogradov Russian Language Institute, Russian Academy of Sciences
ALPHA AND OMEGA, FROM A TO Я: FROM THE PRIMARY METAPHOR TO CONTEMPORARY MEANING
The paper offers a semantic analysis of Russian idioms having the first and the last letters of various alphabets in their inner form, such as Russian alpha i omega, ot alphy do omegi, ot A do Я, ot A do Z etc. A semantic connection is traced between the primary metaphor and contemporary meanings of these idioms. Definitions of all meanings of the idioms under consideration are proposed.
Elena B. Kozerenko Igor P. Kuznetsov Institute for Informatics Problems of the Russian Academy of Sciences, Moscow, Russia
EVOLUTION OF LINGUISTIC SEMANTIC PRESENTATIONS IN THE INTELLIGENT SYSTEMS BASED ON THE EXTENDED SEMANTIC NETWORKS
The paper deals with the design and development of syntactic semantic and lexical semantic presentations in linguistic processors of the systems based on the Extended Semantic Networks (ESN) mechanism. The emphasis is laid on the language engineering solutions employed for constructing an integral linguistic model which can be modified depending on the specific task, and which range from the "heavy" form based on the specific deep presentations to the reduced shells focused on a particular subject area and (or) a controlled language. Special attention is given to the techniques of describing the distributional and transformational features of language objects.
Kopotev M. V. University of Helsinki, Finland
DETECTING ERRORS IN A CORPUS USING MTE-ANNOTATION
The paper opens with a short introduction to MTE (MulText-East), a multilingual dataset for language engineering research and development. The main part of the paper describes a frequency-based technique which detects errors in the morphologically annotated corpus HANCO.
Kotov, A. A. Institute of Linguistics, Russian State University for the Humanities, Moscow, Russia
CONTINUOUS SIMULATION OF EMOTIONAL COMMUNICATIVE BEHAVIOR BY A COMPUTER AGENT
We present the architecture of an emotional computer agent which reacts to incoming semantic representations and continuously demonstrates diverse communicative behavior: utterances, gestures and communicative actions such as scratching, changes of pose and gaze direction, etc.
Kreydlin G.E. Russian State University for the Humanities
HUMAN BODY IN A DIALOG: SEMIOTIC CONCEPTUALISATION OF THE BODY. I
The paper summarizes the research findings of “Body parts in the Russian language and culture”, the project conducted by a group of researchers from RSUH and some academic institutes. The author introduces an important concept of semiotic conceptualization of the human body, which reflects the ideas of unsophisticated native Russian speakers concerning body and other somatic objects. Comparing the multiplicity of the tools in Russian (and of the ways they are used) with the instruments of the Russian body language, the author considers the semantic structure of the Russian word telo ‘body’ and describes structural, physical and functional properties of various somatic objects reflected in these languages. The investigation of specific features of somatic objects allows accounting for certain aspects of human behavior in communication.
Kreydlin G.E. Pereverzeva S.I. Russian State University for the Humanities
HUMAN BODY IN A DIALOG: SEMIOTIC CONCEPTUALIZATION OF THE BODY. II
The paper sums up the discussion of the main results of “Body and body parts in the Russian language and culture”, the collaborative project conducted by a group of researchers from RSUH and some academic institutes. The paper presents two lines of research: (1) classification of features and properties of somatic objects and values thereof and (2) design of an electronic database aimed at integrating and presenting the results of the research.
Krugliakova V.A. Russian State University for Humanities Rakhilina E.V. Institute for Russian Language
VERBS OF ROTATION: LEXICAL TYPOLOGY
The paper presents research into the semantics of verbs of rotation based on the material of 15 languages. A semantic map for this lexical domain is proposed. Strategies of meaning merge and formation of lexical systems of different types are discussed.
Krylov Sergey A. Institute of Oriental Studies of Russian Academy of Sciences, Moscow & Institute of System Analysis of Russian Academy of Sciences, Moscow
WHAT KIND OF ELEMENTS DOES THE METALANGUAGE OF LINGUISTICS CONSIST OF?
Types of metalinguistic databases are discussed. An attempt to classify objects of metalinguistic systematization is made. Ways of arranging metalinguistic data are listed. Types of semantic elements in metalinguistic models are investigated in detail.
Kuznetsov I.P. Somin N.V. Russian Academy of Sciences, Institute of Informatics Problems
PECULIARITIES OF LEXICAL-GRAMMATICAL ANALYSIS FOR OBJECT EXTRACTION FROM NATURAL LANGUAGE TEXTS
The paper shows how the linguistic processor is used for extracting knowledge (information objects and their links) from natural language texts. A significant part of the processor is the procedure of lexical-grammatical analysis, which has been subject to many modifications in the course of tuning to various subject fields.
Kustova G.I. Moscow State Pedagogical University
ADJECTIVES IN NOMINATIONS OF HUMANS
The paper discusses the mechanisms of incorporation of an adjective into a noun phrase. The adjective can be connected (semantic inclusion, cf. chestnyj ispolnitel’ ‘an honest executor’) or not connected (free inclusion, cf. boltlivaja prodavshchitsa ‘a talkative saleswoman’) with the internal predicate of a noun.
L
Lande D.V. Braichevsky S.M. Darmokhval A.T. Zhigalo V.V. Information centre ElVisti, Kiev, Ukraine
DESIGNING A SYSTEM OF MANAGING INFORMATION LINKS BETWEEN MONITORED OBJECTS
An approach to the construction of an innovative full-text information retrieval system is presented. The basic elements of the system are relations between concepts extracted from text documents rather than individual terms. The basic modules, architectural solutions, and user interface are described.
Levontina I.B. V.V.Vinogradov Russian Language Institute, Russian Academy of Sciences
QUOTATION AND RENDERING MARKERS IN RUSSIAN
The category of evidentiality is well known in the grammar of some languages, such as Turkish, Bulgarian, etc. Certain markers of quotation and rendering are also used in Russian. Some of them, like mol and deskat', have been described earlier; some other lexical items, as well as a few prosodic means of presenting somebody else’s speech, are introduced in the paper.
Alexander Letuchiy Russian Language Institute of Russian Academy of Sciences
OBJECT OMISSION AND SIMILAR PROCESSES IN ARABIC IN COMPARISON WITH RUSSIAN (BASED ON CORPUS DATA)
We analyze object omission and similar processes (lability, variation in the form of the object) in Arabic as compared to Russian. The corpus data are instrumental in revealing some tendencies not addressed explicitly in dictionaries and grammatical descriptions. First of all, we see that in Arabic, in contrast to Russian, omission of a definite object is rare. Second, sometimes the form of the object depends on the grammatical form of the verb. Finally, the precise class of verbs admitting object omission varies between the two languages. These distinctions point to a more general difference. In Arabic, object omission is primarily related to semantic characteristics of the verb and the object itself, whereas in Russian, pragmatic properties are more relevant.
Boris M. Lobanov United Institute of Informatics Problems, National Academy of Sciences of Belarus
THE PUNCTUATION STRUCTURE OF FICTION AND ITS ROLE IN THE SYNTHESIS OF EXPRESSIVE SPEECH FROM TEXT
Statistical data on the frequency of various punctuation marks and on the distribution of simple and complex sentences in fiction of various genres are presented. These data helped reveal the most frequent punctuation marks and the conditions of their use, define the frequency and quantitative structure of simple and complex sentences and formulate rules of how to use the information on punctuation marks for prosodic enrichment of the speech generated from text.
Lukashevich Natalia, Research Computing Center of Moscow State University
NEAR-SYNONYMS IN LINGUISTIC ONTOLOGIES
The problem of representing near-synonyms in linguistic ontologies is discussed. We argue that it is important to introduce distinguishable concepts having their unique features into linguistic ontologies. We use attested multiword expressions for better representation of closely related senses of near-synonyms with distinguishable concepts.
Lyudovyk T.V. International Research/Training Center for Information Technologies and Systems, Kyiv, Ukraine
AN ANALYSIS OF SMS TEXTS AIMED AT IMPROVING THEIR AUTOMATIC READING
The paper deals with SMS text properties that hamper high-quality automatic reading of them. The main problems are related to incorrect identification of the language of the message, as well as difficulties in recognizing surzhyk, slang expressions, non-standard transliteration and spelling used by SMS senders.
Astaf'eva I., Bonch-Osmolovskaya A., Garejshina A., Grishina Ju., D'jachkov V., Ionov M., Koroleva A., Kudrinsky M., Lityagina A., Luchina E., Sidorova E., Toldova S., Moscow State University, Lyashevskaya O., Savchuk S., Institute of Russian Language RAS, Koval' S.
NLP EVALUATION: RUSSIAN MORPHOLOGICAL PARSERS
The NLP Evaluation forum (http://ru-eval.ru) is a new initiative aimed at independent assessment of the methods that are used in Russian-oriented linguistic resources. The paper describes the first contest of morphological parsers, its participants, data and test collections, and reports the design and results of the evaluation as well as problematic cases.
Lyashevskaya O.N. University of Tromsø
GENITIVE AND INSTRUMENTAL CONSTRUCTIONS OF SHAPE: SIMILARITY AND SPECIFICITY
Two Russian constructions are studied that are used to characterize the shape of one object through the shape of another: the Genitive construction (issohshie pleti ruk) and the Instrumental construction (ruki povisli plet’mi). The constructions differ in their syntactic properties and the lexical filling of their nominal slots. On this basis, we build a hypothesis about the semantic differences in the profiling of the spatial situation.
M
Markasova E.V. Vorobieva S.A. Saint Petersburg University
THE RUSSIAN WORD KONECHNO ‘OF COURSE’ IN EVERYDAY COMMUNICATION (ACCORDING TO THE ‘ONE SPEECH DAY’ ORAL SPEECH CORPUS DATA)
In everyday communication, the word konechno ‘of course’ has a variety of functions. The data of the ‘One Speech Day’ corpus were used to study the distribution of uses of this word in real speech. Individual speaker-dependent features of this word are discussed. Emphasis is laid on the identification of the speaker's communicative intentions. A perception experiment was carried out in which participants were asked to characterize the word extracted from the speech flow.
Meelis Mihkla Indrek Hein Mari-Liis Kalvik Indrek Kiissel Institute of the Estonian Language
SPEECH RATE PERCEPTION AND SOME FINDINGS OF MODELLING SPEECH RHYTHMICITY IN ESTONIAN
The perception of different speech rates in the blind and the sighted is investigated. The aim is to find out what the preferable rates are for synthetic and for natural speech, and to which degree it could be compressed. According to the results, the blind, who are everyday users of screen readers and listeners of synthetic speech, prefer a considerably higher speech rate than the sighted. The knowledge of speech rhythmicity helps to model the temporal structure of synthetic speech, which, in turn, improves the naturalness of output speech. Our study of the parameters of the three degrees of phonetic quantity which form the stress structure observed in standard Estonian reveals that in the recognition of quantity opposition the durational ratio of the vowels of stressed and unstressed syllables is a more adequate distinctive feature than the durational ratios of adjacent phones.
Mikheev M.Ju. Moscow Lomonosov State University
COMPILATION OR … LANGUAGE CLICHES? COMPARING SETS OF WORD COLLOCATIONS CHARACTERISTIC FOR AUTHORS’ STYLES
The article establishes a way to determine what makes two authors comparable based on the correspondence of word combinations in their respective literary idiolects. Comparing the degree to which their word combinations correspond makes it possible to see precisely how far or near they are to each other stylistically, ranging from identical (attribution) to completely unrelated (therefore perhaps belonging to unrelated realms or independent of each other or even unknown to each other personally or culturally). With any degree of correspondence between the two sets of ready-made word combinations, there is then a way to determine the degree to which a later author might have been influenced by the former. The word combinations compared are highly marked (as opposed to the way Geir Kjetsaa uses the same approach). For example, "the work of hell", which is not a sociolect cliche in Russian, in contrast to the fairly common English "a job from hell", may make it possible to establish a fairly marked connection between Mayakovsky and Platonov’s Chevengur.
N
Nedoluzhko A. Charles University, Prague, Czech Republic
COREFERENTIAL RELATIONS IN THE TEXT – A COMPARATIVE ANALYSIS OF ANNOTATED DATA
The research is based on an ongoing project that aims to annotate nominal coreference and bridging anaphora in the syntactically annotated corpus of Czech texts, PDT 2.0. In the process of annotating coreferential and bridging relations it became evident that the relatively low inter-annotator agreement is, to a large extent, due to the fact that a text may have a variety of legitimate objective interpretations, rather than to the annotators’ mistakes or carelessness. A classification of discrepancy types and the possible causes of their emergence is presented. Typical examples of multiple interpretations of coreferential and bridging relations in a text are given.
O
Okatiev V.V. Erekhinskaya T.N. Ratanova T.E. DICTUM Ltd., Nizhny Novgorod, Russia
SECRET PUNCTUATION MARKS
The paper describes the functional ambiguity of punctuation marks in the Russian language. A formal model of isolated constituents and series of coordinated members is presented. A mathematical problem statement for the use of punctuation in syntactic parsing and an algorithm for this task are proposed.
Ostapova I.V. Shyrokov V. A. Ukrainian Lingua-Information Fund, NAS of Ukraine, Kiev, Ukraine
A VIRTUAL LEXICOGRAPHIC LABORATORY FOR EXPLANATORY DICTIONARIES
Principles of constructing lexicographic systems in a digital environment are discussed. The explanatory Ukrainian Language Dictionary is considered as an example of an integrated lexicographic system that makes use of a number of lexicographic solutions. A computer network environment that supports the explanatory dictionary structure is described in the form of a virtual lexicographic laboratory. This laboratory ensures the coordinated work of a geographically distributed team of linguists working on a large-scale lexicographic project.
P
Pavlov A.S. Lomonosov Moscow State University Dobrov B.V. Research Computer Center of M.V. Lomonosov Moscow State University
A METHOD OF DETECTING MASS GENERATED UNNATURAL TEXTS
We explore a fast method of detecting automatically generated texts. The method uses multiple statistical features to distinguish mass-generated web spam from normal texts. We demonstrate the capability of the method to detect documents created by Markov chains text generators in Russian and English. We also compare the importance of different statistical features depending on the language of the document.
Paducheva E.V. Russian Institute of Scientific and Technical Information
TOWARDS INTERPRETATION OF TENSE-ASPECT FORM IN THE NARRATIVE REGISTER: PRAESENS HISTORICUM
Praesens historicum is treated as present narrative, i.e. as the present tense of the narrative register. It is demonstrated that praesens historicum is, substantially, a relative use of the grammeme of tense, when the form of the present expresses simultaneity with the moment singled out in the context, rather than with the moment of speech.
Max Petrenko RiverGlass, Inc.
LEXICON MANAGEMENT IN ONTOLOGICAL SEMANTICS
The paper describes how the lexicon – a static knowledge resource – is managed by a human acquirer. The study draws on the methodology, theory and strategy of lexical acquisition outlined in [NR 2004] and takes into account the ongoing implementation experience in various applications, as well as recent revisions/improvements. After a brief outline of the lexicon, the general strategy of lexical acquisition will be introduced, and techniques of acquisition described. An example will then illustrate how complex cases are handled through lexical acquisition within the framework of Ontological Semantic Technology (OST).
Potemkin S.B. Philological faculty, Moscow State University
CONCEPT LATTICE IMPLEMENTATION IN SEMANTIC STRUCTURING OF ADJECTIVES
The paper considers methods of formal concept analysis (FCA) applied to the construction of ontological relations in the class of Russian adjectives, using the computer thesaurus WordNet. The approach is illustrated with adjectives characterizing human appearance, whose semantic paradigm is analyzed. The structure of the hierarchical relations among these adjectives is revealed on the basis of a formal context constructed with the help of a bilingual dictionary.
Prodan A.I. Talanov A.O. Chistikov P.G. Speech Technology Center, St.Petersburg, Russia
A VOICE BUILDING SYSTEM FOR THE HYBRID VITALVOICE RUSSIAN TTS SYSTEM
The paper presents a new voice building system for the hybrid Russian TTS system “VitalVoice” developed at Speech Technology Center. VitalVoice is a Unit Selection TTS system complemented with a triphone inventory. The paper describes all steps of building the Unit Selection database, including text selection, automatic segmentation, tuning of voice-dependent parameters, etc.
R
Raskin Victor Hempelmann Christian F. Taylor Julia M. Purdue University & RiverGlass Inc, USA
GUESSING VS. KNOWING: THE TWO APPROACHES TO SEMANTICS IN NATURAL LANGUAGE PROCESSING
The paper deals with the distinction between the statistical/machine-learning approach to meaning in NLP and the one based on direct and comprehensive meaning-access. The crucial difference is in the acquisition of semantic resources, which the first approach declares unimplementable and the second implements. The theoretical roots of the opposition are followed back to the representative vs. non-representative traditions in theoretical linguistics, philosophy of science and language, and AI, starting from Peirce and the structuralists, most notably Hjelmslev's opposition of the planes of expression and content, and the commutation between them. After a brief review of each approach, the difference in evaluation techniques is discussed. The authors weigh in heavily on the side of the meaning-based approach and develop and improve one such approach, the Ontological Semantics Technology.
Alexandr Rosen Charles University, Prague
HARMONIZING TAGSETS FOR MULTILINGUAL CORPORA VIA CONCEPT LATTICE
Comparison of the existing morphosyntactic tagsets often reveals different assumptions, obscuring similarities and distinctions across languages. To overcome the formal and conceptual mismatches, we build an abstract interlingual tagset as a hierarchy of categories, using Formal Concept Analysis.
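As a rough illustration of how Formal Concept Analysis can arrange tags into a hierarchy of categories, the sketch below enumerates the formal concepts of a tiny context. The tag names (`en:NN`, `cs:N-S`, etc.) and their features are hypothetical and are not taken from the paper; the brute-force enumeration is only suitable for toy inputs and is not claimed to be the authors' procedure.

```python
from itertools import combinations


def formal_concepts(context):
    """Enumerate all formal concepts of a small context.

    context maps each object (here: a tag) to the set of attributes it has.
    Returns a set of (extent, intent) pairs of frozensets.
    Brute force over object subsets: fine for toy data only.
    """
    objects = list(context)
    all_attrs = set().union(*context.values())

    def intent(objs):
        # Attributes shared by every object in objs (all attributes if objs is empty).
        sets = [context[o] for o in objs]
        return frozenset(set.intersection(*sets)) if sets else frozenset(all_attrs)

    def extent(attrs):
        # Objects that carry every attribute in attrs.
        return frozenset(o for o in objects if attrs <= context[o])

    concepts = set()
    for r in range(len(objects) + 1):
        for objs in combinations(objects, r):
            shared = intent(objs)
            concepts.add((extent(shared), shared))
    return concepts


# Hypothetical tags from two tagsets, described by shared abstract features.
tags = {
    "en:NN": {"noun", "singular"},
    "en:NNS": {"noun", "plural"},
    "cs:N-S": {"noun", "singular", "case"},
    "cs:N-P": {"noun", "plural", "case"},
}
concepts = formal_concepts(tags)
```

Each resulting pair groups the tags (extent) that share exactly a given set of features (intent); ordering the concepts by extent inclusion gives the lattice, in which, for example, the two singular tags from different tagsets fall under one abstract category.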
R.I. Rozina
THEORY AND REALITY: NOMINALIZATION OF VERBS IN SPOKEN LANGUAGE
The article addresses the problem of the inheritance of a verb’s meaning and stylistic colouring by a verbal noun. The focus is on verbs labeled “colloquial” and “slang” in dictionaries of modern Russian, and the notion of the degree of colloquiality is introduced. It is argued that stylistic colouring of verbal nouns is the result of the interaction of several factors, such as the degree of a verb’s colloquiality, the nature of the word-forming base and of the suffix. Polysemy of verbal nouns and the difference between semantic derivation of their slang and standard meanings are considered. All observations are based on the data from the Russian National Corpus.
Romanov Aleksandr S. Meshcheriakov Roman V. Tomsk State University of Control Systems and Radioelectronics
IDENTIFICATION OF AUTHORSHIP OF SHORT TEXTS WITH MACHINE LEARNING TECHNIQUES
The problem of identifying authors of short texts is studied. The creation of the author’s model and the text processing algorithm are described. The experiments with short text authorship identification based on two types of technique, artificial neural networks and support vector machines, are described.
Rubashkin V.Sh. Bocharov V.V. Pivovarova L.M. Chuprin B.Ju. St. Petersburg State University
AN APPROACH TO ONTOLOGY LEARNING FROM MACHINE-READABLE DICTIONARIES
The paper presents the current state of ontology learning technology developed by the authors. The technology is based on linguistic and semantic analysis of definitions in Russian encyclopedic and explanatory dictionaries.
S
Savchuk S.O. V.V.Vinogradov Russian Language Institute, Russian Academy of Sciences
A CORPUS-BASED STUDY OF MORPHOLOGICAL VARIABILITY: VARIANTS OF GENITIVE PLURAL MASCULINE IN RUSSIAN
The paper presents the results of a corpus-based study of variation in genitive plural forms of masculine nouns in Russian. These results are compared with the data from sociolinguistic research conducted in the 1960s. The main trends in the correlation between variants within the studied group are specified.
Salomatina N.V. Gusev V.D. Sobolev Institute of Mathematics, RAS Iljina L.Yu. Kuzmin A.O. Parmon V.N., Boreskov Institute of Catalysis SB RAS
POSSIBILITIES OF AUTOMATIC RELATIONSHIP IDENTIFICATION AMONG SUBJECT-DOMAIN TERMS (CATALYSIS EXAMPLE)
Within the problem of automatically constructing a subject-domain thesaurus from a text collection, three approaches to identifying relations between terms are considered: 1) construction of clustering profiles of the most significant elements of the texts, 2) formation of specific templates (patterns with variables), and 3) use of cue words that signal the relations.
Serdobolskaya N.V. Russian State University for the Humanities Zimmerling A.V. Moscow State Pedagogic University Arkadiev P.M. Institute of Slavonic Studies of Russian Academy of Sciences
A PROJECT OF THE TYPOLOGICAL DATABASE ON WORD ORDER AND SYNTACTIC CONSTRAINTS
The paper presents a pilot project of a typological database on word order and syntactic constraints. The goal of the database is to summarize and record the knowledge acquired by different researchers who have worked on word order and syntactic constraints in languages of various genetic origins and areal distributions.
Sokirko Alexey Yandex
BYSTROSLOVAR’: MORPHOLOGICAL PREDICTION OF NEW RUSSIAN WORDS USING VERY LARGE CORPORA
The paper reports the results of using very large corpora to model morphological guessing (morphological features and inflexion model). We use 4 billion Russian web pages and 400 million search queries to build prediction factors that contribute greatly to the process of machine learning. Compared to a system with fewer prediction factors, our system achieves better results.
Sokolova E.G. Russian State University for the Humanities, Moscow
CORPUS-BASED EVIDENCE FOR LEXICO-SEMANTIC RELATIONS FOR 6 RUSSIAN WORDS DESIGNATING PERMANENT STRUCTURES
Lexico-semantic relations between words, such as synonymy and hyponymy, are usually examined through diagnostic contexts or other methods that involve the researcher’s introspection or a psycholinguistic experiment. The main idea of this paper is that there must exist syntactic constructions in which these relations are manifested in texts. Syntactic constructions that include two semantically homogeneous concepts represented by six Russian words (approximately: house, building, construction, erection, pavilion) are extracted from the Russian National Corpus and classified. Some syntactic constructions that appear to manifest quasi-synonymy and hyponymy relations are identified and discussed.
A.S. Starostin N.V.Arefyev M.G. Malkovsky Lomonosov Moscow State University
«TREEVIAL» SYNTAX PARSER. PARADIGM OF THE DYNAMIC HYPOTHESIS RANKING
This article is devoted to the principle of dynamic ranking of hypotheses on which the “Treevial” parser is based. The formal apparatus used in Treevial to describe grammar is discussed. A separate section is devoted to the mechanism of penalties, which operates in conjunction with the basic formalism. The architecture of the analyzer is set forth, and the advantages and disadvantages of the proposed approach are described. The last section describes the tools that the analyzer uses to impose penalties.
Sharapov R.V. Sharapova E.V. Murom Institute of Vladimir State University
RESEARCH OF WEB SPAM PLACED BY LINK BROKERS
The article considers the characteristics of link spam placed by link brokers. We investigate the lifetime and rotation of links and analyze the thematic proximity of links and pages.
Shemanaeva O.Yu. Institute for Information Transmission Problems of the Russian Academy of Sciences
THE PECULIARITIES OF THE RUSSIAN IDIOMATIC CONSTRUCTION “PRAZDNIK NE V PRAZDNIK”
The paper demonstrates the peculiarities of the Russian construction prazdnik ne v prazdnik. The syntax and semantics of the construction are analyzed. Its meaning is ambivalent: it may be evaluated as either positive or negative.
Alexei Shmelev Moscow Pedagogical State University; Institute of Russian Language, Russian Academy of Sciences
ASPECTUAL CORRELATION IN AN EXPLANATORY DICTIONARY
The paper gives a lexicographic account of aspectual correlates and the related question of inflectional vs. derivational nature of the Russian verbal aspect. In addition, it discusses a general functional mechanism that forces the speaker to replace any perfective verb with an imperfective one when perfective verbs are not allowed.
Shmeleva E.Y. Shmelev A.D. Institute of Russian Language, Moscow
RUSSIAN JOKES IN WRITTEN FORM
The paper deals with Russian jokes in their written form. It distinguishes between scripts of oral jokes, rudimentary jokes, and Internet humor.
T
Tagabileva M.G. Berezutskaya Yu. N. Lomonosov Moscow State University
WORD-FORMATION ANNOTATION OF THE RUSSIAN NATIONAL CORPUS: AIMS AND METHODS
The paper presents a project aimed at creating a full-fledged word-formation annotation of the Russian National Corpus (http://ruscorpora.ru). The first phase of the work is compiling a word-formation database oriented to corpus annotation. Important theoretical problems concerning different approaches to word formation in Russian are discussed. Possible approaches to automating the annotation process are outlined.
Aleksandra V. Ter-Avanesova Institute of Russian Language of Russian Academy of Sciences, Moscow Sergej A. Krylov Institute of Oriental Studies of Russian Academy of Sciences, Moscow & Institute of System Analysis of Russian Academy of Sciences, Moscow
LEXICO-GRAMMATICAL DATABASES AND COMPARATIVE STUDY OF RUSSIAN DIALECTAL ACCENTUATION SYSTEMS
A database of Russian dialects whose phonological systems distinguish two o-phonemes has been enriched with material from several dialects. This has made possible a database-assisted comparative analysis of the dialects’ accentuation systems. The paper deals with the comparative study of the accentuation of masculine nouns and the correlation between accentuation (synchronic and Common Slavic accentual paradigms of words) and the timbre of o in their roots.
U
E.V.Uryson V.V.Vinogradov Russian Language Institute, Russian Academy of Sciences
COMPOUND CONJUNCTION VS WORD COMBINATION (SEMANTICS OF DAZHE ESLI ‘EVEN IF’)
The Russian equivalent of IF (ESLI) and the equivalent of EVEN IF (DAZHE ESLI) are analyzed. In grammars and typology, EVEN IF (DAZHE ESLI) is regarded as a concessive-conditional conjunction. I demonstrate that the meaning of this lexical unit is the sum of the meanings of its constituents: EVEN IF = ‘even’ + ‘if’. The point is that the conjunction IF (ESLI) in combination with EVEN (DAZHE) is represented in its secondary modification, which can be realized only in a limited range of contexts.
V
Vitaly Vasilyev LAN-PROJECT
CLASSIFIER TRAINING BY PASSAGE RECOGNITION
A new method for training classifiers is described, which uses the results of passage extraction and classification to iteratively retrain the classification model. We also discuss approaches to passage extraction and propose some new methods. Experimental evaluation of the new approach shows significant improvements in the quality of text classification on standard English and Russian test collections.
Maria Voznesenskaya V.V.Vinogradov Russian Language Institute, Russian Academy of Sciences
A TOPOLOGICAL ASPECT OF POLYSEMY IN SEMANTICS OF IDIOMS
Hierarchical structure of polysemy in semantics of idioms is considered. Three topological types of polysemy are discussed: radial, chain-like and combined (radial-chain). It is shown that polysemy in idioms has specific features as compared to that of ordinary words.
Voskresenskiy A.L. independent researcher, Ilyin S.N. Academy of Phantasy Milos Zelezny University of West Bohemia, Plzeň, Czech Republic
ABOUT RECOGNITION OF SIGN LANGUAGE GESTURES
Problems of perception and recognition of gestures of Russian Sign Language in a system of automated sign language translation are discussed. A new approach to the morphology of gestures and a method for isolating separate gestures in signed utterances are proposed. A working definition of "text understanding" is offered.
W
Yorick Wilks, University of Oxford
IS A COMPANION A DISTINCTIVE KIND OF RELATIONSHIP WITH A MACHINE?
I start from the perspective of the EC COMPANIONS project, and set out its aim to model a new kind of human-computer relationship based on long-term interaction, with some tasks involved although the Companion is not inherently task-based, since there need be no stopping point to its conversation. Some demonstration of its functionality will be given, but the main purpose here is an analysis of what it is people might want from such a relationship and what evidence we have for whatever we conclude. Is politeness important? Is an attempt at emotional sympathy important or achievable? Does a user want a consistent personality in a Companion or a variety of personalities? Should we be talking more in terms of a "cognitive prosthesis (or orthosis)", something to extract, organize, and locate the user's knowledge or personal information, rather than attitudes?
Y
Tanya Yanko Institute of Linguistics, Russian Academy of Sciences
PROSODY OF SENTENCES WITH NO ILLOCUTIONARY FORCE
In oral speech with no illocutionary distinctions marked prosodically, the prosody serves as a means of segmenting speech into phonetic groups, lines, or texts. In Orthodox liturgical reading, the prosody marks the beginning of a line and the end of a prayer; in Muslim prayers it marks the beginning and the end of a line; and in Joseph Brodsky’s verse-reading it marks the beginning of a phonetic word and the end of a text.
Z
Zalizniak Anna A. Institute of Linguistics, Russian Academy of Sciences Irina Mikaelian The Pennsylvania State University, USA
ASPECTUAL TRIPLETS IN CONTEMPORARY RUSSIAN ASPECTUAL SYSTEM
Far from being marginal, as they are often considered, aspectual triplets represent a highly productive phenomenon in Russian. Indeed, they are generated by the same functional mechanism that generates aspectual correlates for virtually any perfective verb. This does not contradict the fact that the Russian aspectual system is organized as a binary correlation and aspectual pair remains its basic unit.
V. Zakharov M. Khokhlova St.Petersburg State University, Institute for Linguistic Studies, St.Petersburg, Russia
A STUDY OF EFFECTIVENESS OF STATISTICAL MEASURES FOR COLLOCATION EXTRACTION ON RUSSIAN TEXTS
The paper deals with the results of a corpus-based study of collocational behaviour of the most frequent Russian verbs. Association measures (MI, t-score, and log-likelihood) are suggested as instruments to extract collocations. The paper discusses the results obtained and the applicability of the statistical measures. Future prospects of development are outlined.
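The three association measures named in the abstract above can be sketched from raw frequencies as follows. This is a minimal illustration using the standard textbook definitions (pointwise MI, t-score, and Dunning's log-likelihood ratio over a 2x2 contingency table); the function name, parameter names, and sample counts are assumptions for illustration and are not taken from the paper.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Return (MI, t-score, log-likelihood) for a candidate word pair.

    f_xy -- number of co-occurrences of the two words (must be > 0)
    f_x, f_y -- individual frequencies of the two words
    n -- total number of observations (e.g. bigrams) in the corpus
    """
    # Co-occurrence count expected if the two words were independent.
    expected = f_x * f_y / n

    mi = math.log2(f_xy / expected)
    t_score = (f_xy - expected) / math.sqrt(f_xy)

    # Log-likelihood ratio (G2) over the 2x2 contingency table.
    observed = [
        f_xy, f_x - f_xy,                # x with y, x without y
        f_y - f_xy, n - f_x - f_y + f_xy,  # y without x, neither
    ]
    rows = [f_x, n - f_x]
    cols = [f_y, n - f_y]
    expected_cells = [rows[i] * cols[j] / n for i in range(2) for j in range(2)]
    g2 = 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected_cells) if o > 0
    )
    return mi, t_score, g2


# Hypothetical counts: a pair seen 30 times in a corpus of 100,000 bigrams.
mi, t, g2 = association_scores(f_xy=30, f_x=100, f_y=200, n=100_000)
```

MI tends to favour rare pairs, while t-score and log-likelihood give more weight to frequent ones, which is one reason such measures are usually compared side by side.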
Anton Zimmerling Moscow State Pedagogical University
NOMINAL PREDICATIVES AND SYNTACTIC STRUCTURES WITH A DATIVE MARKING ON THE SEMANTIC SUBJECT IN THE EUROPEAN LANGUAGES
The paper discusses the correlation between a class of nominal word forms with the characteristic semantics of stage-level predicates (= lexical statives, Predicatives) and syntactic structures with dative marking on the semantic subject (Dative structures). Most European languages that have copular Dative structures have a lexical class of nominal Predicatives. Some languages with a class of nominal Predicatives use them outside Dative structures. The typology of Predicatives is determined by the opposition of three classes of adjectival stems: stems of the first class yield only designations of properties, stems of the second class yield only stage-level predicates, and stems of the third class are ambivalent. The number of adjectival Predicatives and their lexical meanings depend on the proportion of these three classes in a given language.