The paper discusses certain features of mobile phone conversations. The research is based on the materials presented in a corpus of mobile phone conversations. Typical words and remarks used in the opening, body and closing of a conversation are considered.
Present-day descriptions of nominal government proceed from the hypothesis that case patterns are inherited within a word-formation cluster (cf. udarit’.V molotkom.NP.INS ‘to hammer’ – udar.N molotkom.NP.INS ‘hammer blow’, udarit’.V po golove.NP.DAT ‘to hit to the head’ – udar.N po.PREP golove.NP.DAT ‘blow to the head’). The article studies the emergence of new patterns associated with a substantive or an adjective which are not attested with their motivating correlate, cf. zaslon.N pornografii.NP.DAT ‘barrier to pornography’, komitet, analogichnyj.A nobelevskomu.NP.DAT ‘institution similar to the Nobel Committee’ etc., and the same patterns in the light verb construction, cf. postavit’.V zaslon.N pornografii.NP.DAT ‘to put the barrier to pornography’. We consider two hypotheses: that the noun has its own case pattern (Grimshaw 1990) and that the pattern is imposed by the construction “light verb + noun”.
The paper considers factors that influence the ability of predicates to fall into the scope of negation. Some of the factors instrumental in the way a semantic element interacts with negation are as follows: its semantic weight and its depth in the semantic structure, which determine its place in the logical structure of the meaning. Semantic weight and depth of an element are influenced by the degree of its semantic detail and its status in the linguistic worldview. Semantic characteristics of predicates which trigger their ability to interact with negation also determine their government patterns, some of which only allow wide or narrow scope of negation. Sometimes synonyms demonstrate different properties with respect to their interaction with negation ("povezti" 'to get lucky' vs. "poschastlivit'sia" 'to hit the jackpot'). Linguistic differences between these synonyms are determined by the fact that the former is an interpretative verb, and the latter an evaluative verb.
Two new parameters of idiomaticity are discussed: tautology in the inner form of an idiom and onymization. Tautology is the repetition of the meaning of one component of an idiom (or a part of this component) in the inner form of this idiom. Onymization is a process in which properties of a common noun are transferred to a proper noun, accompanied by structural modification of the common noun. Some evidence suggests that tautology and onymization can be considered different factors of idiomaticity in the field of phraseology.
We consider two basic approaches to the ideographic description of phraseology: inductive and deductive. We argue in favour of the inductive approach, exemplifying our ideas with data from the “Thesaurus of Current Russian idioms”. Attempts to describe the semantic fields of idioms applying the deductive approach turn out to be ineffective because a hierarchical logical scheme does not fit in with phraseological data. Many taxa which have to be postulated from a purely logical point of view would remain empty because idioms do not cover the semantic net of language as a whole and are concentrated in certain semantic domains while ignoring many others. Although the inductive approach does not allow for a consistent hierarchical description, the use of paradigmatic (cross-)references, as well as the division of every taxon into core and periphery, reflects, at least partly, relevant semantic relations within the phraseological system.
The paper deals with Anglo-French ethnic stereotypes found in the publications in the British and French press. Stereotypes are viewed as instances of linguacultural schemas. Various ways of introducing and interpreting information in the media discourse are the subject of the linguistic analysis.
New additions to the balanced part of the Sound corpus of the Russian language are presented. This part is an array of live monologues sharing the same linguistic, sociolinguistic and psycholinguistic criteria. New blocks of this corpus and the first results of analysis are described.
The paper presents new additions to the ORD speech corpus of Russian everyday communication. Certain research results based on the ORD material are described.
The valency structure of a number of lexicographic types of Russian predicate nouns is considered. The defining property of these nouns is the fact that one of the actants of any such noun can be referred to by the noun itself. For example, in the sentence The train covers a distance of 650 kilometers in 4 hours, the phrase 650 kilometers instantiates a valency slot of the word distance; at the same time, 650 kilometers is the distance.
Based on a vast set of coordinated noun pairs taken from the user query databases of the Google and Yandex search engines, we have built an associative network of concepts that frequently occur in Russian queries on the Internet. It is shown that the pairs in the queries also occur in Internet texts with significant frequencies. The associative network obtained is briefly investigated.
In this feasibility study we aim at contributing to the practical use of domain ontologies for hypertext classification by introducing an algorithm generating potential keywords for specialist hypertext nodes. The algorithm uses structural markup information and lemmatized word lists from the Grammis/ProGr@mm web information system on German grammar, as well as a terminological ontology covering the domain of linguistics. We present the calculation and ranking of keyword candidates based on ontology relationships, word position, frequency information, and statistical significance as evidenced by log-likelihood tests. Finally, the results of our machine-driven classification are validated empirically against manually assigned keywords.
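The log-likelihood component of such a ranking can be illustrated with a minimal sketch. It assumes the commonly used two-term form of the log-likelihood (G2) statistic for comparing a word's frequency in one text against a reference corpus; the function names, the toy counts and the over-representation filter are hypothetical, and the ontology and word-position signals mentioned in the abstract are not modelled here.

```python
import math

def log_likelihood(a, b, c, d):
    """G2 score for a word seen a times in a node of c tokens
    and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the node
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

def rank_keywords(node_counts, node_size, ref_counts, ref_size):
    """Rank words that are over-represented in the node by their G2 score."""
    scored = []
    for word, a in node_counts.items():
        b = ref_counts.get(word, 0)
        # Keep only words relatively more frequent in the node than in the reference.
        if a / node_size > b / ref_size:
            scored.append((word, log_likelihood(a, b, node_size, ref_size)))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical counts: a specialist term versus a function word.
ranked = rank_keywords({"valenz": 8, "und": 10}, 100,
                       {"valenz": 2, "und": 1000}, 10000)
```

In this toy input the function word has the same relative frequency in both collections and is filtered out, so only the specialist term remains as a keyword candidate.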
In order to effectively extract opinions, knowledge of a large number of domain-specific opinion words and expressions is required. We present a new method of automatic opinion word extraction. The method combines computational characteristics of word usage in several domain-specific text collections.
The paper describes an algorithm for predicting flexion models of new words occurring in a text corpus. The algorithm makes use of a number of probabilistic models for word flexion and a machine learning based method for selection and ranking. Statistical properties of the corpora and training results are analyzed.
The creation and linguistic annotation of the sound lexico-grammatical database on the Itelmen language are described. The main purpose and structure of the database are presented. Search queries to the database are illustrated. The linguistic annotation makes it possible to search for lexemes and word forms using such parameters as alphabetical order, part-of-speech classification, and some grammatical and phonetic characteristics.
Russian State University for Humanities, Moscow. An iterative version of part-of-speech disambiguation algorithms for Russian text is validated. The respective expansion of the instrumental environment for experimentation with linguistic algorithms (ESLA) is described.
The paper describes lexicosyntactic patterns that specify term occurrences within scientific and technical Russian texts. An experimental study of automatic term recognition procedures based on the patterns is discussed.
Psycholinguistic experiments were carried out in order to show that rhetorical distance (RhD) is an important activation factor which strongly influences referential choice. The results revealed the RhD effect (i.e. increasing RhD reduces the subject’s ability to identify the antecedent of an anaphoric pronoun) and the working memory (WM) effect (i.e. the higher the subject’s WM span, the more successful the subject is in answering questions).
The development of ontologies taking plain text documents as input is very important because manual construction of ontologies is labor-intensive and expensive, while a lot of textual information is already available in electronic format. The paper presents an improved method for automatic detection of terms in plain text documents and their classification using a variety of statistical methods, including term extraction using the Log-Likelihood algorithm, the cosine similarity measure and clustering via the K-Means algorithm. Various experiments were conducted for Spanish using a corpus from the informatics domain.
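The similarity and clustering steps named above can be sketched as follows. This is an illustrative toy, not the authors' implementation: the vector representation of terms, the function names and the naive initialisation (the first k points serve as starting centroids) are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def kmeans(points, k, iterations=20):
    """Plain K-Means (Lloyd's algorithm) with squared Euclidean distance."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic start
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical two-dimensional context vectors for four terms.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels = kmeans(vectors, 2)
```

In practice the initial centroids would be chosen more carefully (for example at random with several restarts), since K-Means is sensitive to initialisation.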
This paper presents a semantic analysis of noun reduplication in colloquial Russian. We consider syntactic reduplication, i.e., repetition of a word within the same prosodic unit as in "takaja devochka-devochka" ('such a girl-girl'), "prjam leto-leto" ('a real summer-summer'). Drawing on a corpus of examples gathered from natural speech and Internet texts, we categorize the semantics of noun reduplication into seven types: (1) strict, literal meaning, (2) prototype, (3) connotation, (4) intensified meaning, (5) appreciation, (6) specifying the meaning of polysemous words, (7) referent determiner.
The data of the associative dictionaries are used to reveal the complex of parameters with which the picture of the world in the human mind could be built.
The paper analyzes the usage of the vocal gesture Ah according to the data of the Multimodal Russian Corpus (MURCO). The investigation is based on the analysis of the body and face movements, which accompany this vocal gesture in the process of oral speech. As a result, four meanings have been isolated: 1) Ah as an interrogative particle, 2) Ah as a negative particle, 3) Ah as an interjection, and 4) Ah as a physiological exclamation. The vocal gestures Ah and Oh are compared.
The paper examines constraints on the anaphoric relation between a quantifier expression or a quantified noun phrase and the 3rd person pronoun on ‘he’ in Russian.
Russian and German food vocabularies (over 250 words in each language) are used to study persistent associations and discuss the problems of their lexicographic description within the connotation theory. Certain new lexicographic solutions are proposed.
The paper provides a comparative analysis of machine learning dependency tree-based parsing algorithms. We compare two different machine learning approaches to dependency parsing with a rule-based parser of the ETAP-3 system. One algorithm views the parsing task as the construction of a maximum spanning tree, and the other as a path-finding task. The comparison is made over link accuracy instead of tree accuracy, as the analyzed algorithms behave differently on nonprojective links. The difference from the existing implementations lies in the type of the underlying classifier used.
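The distinction between link accuracy and tree accuracy can be made concrete with a small sketch. It assumes that each sentence is encoded as a list of head indices (0 for the root) and that links are compared without labels; both the encoding and the function names are illustrative assumptions, not taken from the paper.

```python
def link_accuracy(gold, predicted):
    """Share of tokens whose predicted head equals the gold head."""
    correct = 0
    total = 0
    for gold_heads, pred_heads in zip(gold, predicted):
        correct += sum(1 for g, p in zip(gold_heads, pred_heads) if g == p)
        total += len(gold_heads)
    return correct / total

def tree_accuracy(gold, predicted):
    """Share of sentences whose predicted tree matches the gold tree completely."""
    exact = sum(1 for g, p in zip(gold, predicted) if g == p)
    return exact / len(gold)

# Two hypothetical sentences; the first has one wrongly attached token.
gold = [[2, 0, 2], [0, 1]]
predicted = [[2, 0, 1], [0, 1]]
link_score = link_accuracy(gold, predicted)  # 4 of 5 links are correct
tree_score = tree_accuracy(gold, predicted)  # 1 of 2 trees is fully correct
```

A single wrong link makes a whole tree count as wrong, which is why tree accuracy penalizes long sentences far more heavily than link accuracy does.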
The paper reports on a research project aimed at semantic analysis and formalization of polysemy patterns in Russian adjectives and adverbs. The research is based on the data from the Russian National Corpus. For each polysemous adjective or adverb we describe a set of its possible senses, assign each of the senses to a corresponding taxonomic class, identify types of semantic shifts between individual senses, and offer context conditions of these shifts. The results gained from this analysis are implemented in a database, which allows for various generalizations on the regularities of change in adjective meaning.
The Arkhangelsk Region Dialect Dictionary (ARDD) is the largest dialect dictionary of one region; its corpus totals approx. 180 thousand words. An important problem for the Electronic Corpus is the transition from the phonetic transcription to the grammatical representation of the word form and further on to lemmatization with the main vocabulary form. The results of automatic processing are illustrated by the conversion of the text of the expedition notebook from the rtf format into the database format (dbf), alphabetically arranged word forms, and hypothetical lemmas created by the automatic analyzer in the StarLing environment.
The paper describes an algorithm for automatic pause placement used in the Russian Text-to-Speech system "VitalVoice", which was developed by Speech Technology Center (STC). The algorithm uses POS sequence detection to find appropriate places for breaks in a sentence. We show the system performance with texts of different types and compare it to the baseline TTS system developed earlier.
The paper presents a retrospective analysis of publications in the information extraction domain published in the Proceedings of the International conferences “Computational Linguistics and Intellectual Technologies” (Dialogue) in 2000-2009. Statistical methods, ontology engineering, and semantic clustering/classification methods are used to outline the Dialogue conference “world picture”. Significant semantic spaces and “shadow groups” of authors presented in these Proceedings are extracted using the information extraction system OntosMiner/SG developed within the Ontos project powered by the Russian IT company, Avicomp Services.
One of the major processes involved in discourse production is referential choice, i.e. the choice of a linguistic expression when mentioning a person or an object. Referential choice depends on a great number of discourse factors. A model is proposed that is based on the methods of machine learning and describes referential choice in an annotated corpus of English texts.
Automatic compilation of a word combination database (verb or verbal adverb + noun, adjective + noun, participle + noun) is discussed. The database is extracted from a very large annotated corpus of more than a billion words.
The search for the syntactic head of a prepositional phrase in automatic surface-syntactic analysis of Russian sentences in a variety of text genres, as performed in the system of modular analysis of Russian syntax (MARS), is described.
The paper is aimed at showing the principles of building a text meaning representation (TMR) based on an ontology of force interactions. The theoretical foundation for the ontology is the Force Dynamics theory proposed by Leonard Talmy. The TMR is twofold: it includes both a dynamic scenario of force transitions and a formalized lexical description.
The paper offers a semantic analysis of Russian idioms having the first and the last letters of various alphabets in their inner form, such as Russian alpha i omega, ot alphy do omegi, ot A do Я, ot A do Z etc. A semantic connection is traced between the primary metaphor and contemporary meanings of these idioms. Definitions of all meanings of the idioms under consideration are proposed.
The paper deals with the design and development of syntactic semantic and lexical semantic presentations in linguistic processors of the systems based on the Extended Semantic Networks (ESN) mechanism. The emphasis is laid on the language engineering solutions employed for constructing an integral linguistic model which can be modified depending on the specific task, and which range from the "heavy" form based on the specific deep presentations to the reduced shells focused on a particular subject area and (or) a controlled language. Special attention is given to the techniques of describing the distributional and transformational features of language objects.
The paper opens with a short introduction to MTE (MulText-East), a multilingual dataset for language engineering research and development. The main part of the paper describes a frequency-based technique which detects errors in the morphologically annotated corpus HANCO.
We present the architecture of an emotional computer agent which reacts to incoming semantic representations and continuously demonstrates diverse communicative behavior: utterances, gestures and communicative actions – scratching, changes of pose and gaze direction, etc.
The paper summarizes the research findings of “Body parts in the Russian language and culture”, the project conducted by a group of researchers from RSUH and some academic institutes. The author introduces an important concept of semiotic conceptualization of the human body, which reflects the ideas of unsophisticated native Russian speakers concerning body and other somatic objects. Comparing the multiplicity of the tools in Russian (and of the ways they are used) with the instruments of the Russian body language, the author considers the semantic structure of the Russian word telo ‘body’ and describes structural, physical and functional properties of various somatic objects reflected in these languages. The investigation of specific features of somatic objects allows accounting for certain aspects of human behavior in communication.
The paper sums up the discussion of the main results of “Body and body parts in the Russian language and culture”, the collaborative project conducted by a group of researchers from RSUH and some academic institutes. The paper presents two lines of research: (1) classification of features and properties of somatic objects and values thereof and (2) design of an electronic database aimed at integrating and presenting the results of the research.
The paper presents research into the semantics of verbs of rotation based on the material of 15 languages. A semantic map for this lexical domain is proposed. Strategies of meaning merge and formation of lexical systems of different types are discussed.
Types of metalinguistic databases are discussed. An attempt to classify objects of metalinguistic systematization is made. Ways of arranging metalinguistic data are listed. Types of semantic elements in metalinguistic models are investigated in detail.
The paper shows how the linguistic processor is used for extracting knowledge (information objects and their links) from natural language texts. A significant part of the processor is the procedure of lexical-grammatical analysis, which has been subject to many modifications in the course of tuning to various subject fields.
The paper discusses the mechanisms of incorporation of an adjective into a noun phrase. The adjective can be connected (semantic inclusion, cf. chestnyj ispolnitel’ ‘an honest executor’) or not connected (free inclusion, cf. boltlivaja prodavshchitsa ‘a talkative saleswoman’) with the internal predicate of a noun.
An approach to the construction of an innovative full-text information retrieval system is presented. The basic elements of the system are relations between concepts extracted from text documents rather than individual terms. Basic modules, architectural solutions, and the user interface are described.
The influence of the hesitation pause on the understanding of «before» and «after» sentences is investigated. In difficult sentences, the pause decreases both the rate of mistakes and the time normal speakers need for processing, whereas in simple sentences it increases both. No influence on aphasic speakers was found.
The category of evidentiality is well known in the grammar of some languages, such as Turkish, Bulgarian, etc. Certain markers of quotation and rendering are also used in Russian. Some of them, like mol and deskat', have been described earlier; some other lexical items, as well as a few prosodic means of presenting somebody else’s speech, are introduced in the paper.
We analyze object omission and similar processes (lability, variation of form of the object) in Arabic as compared to Russian. The corpus data are instrumental in revealing some tendencies not addressed explicitly in dictionaries and grammar descriptions. First of all, we see that in Arabic, in contrast to Russian, omission of a definite object is rare. Second, sometimes the form of the object depends on the grammatical form of the verb. Finally, the precise class of verbs admitting object omission varies between the two languages. These distinctions point to a more general difference. In Arabic, object omission is primarily related to semantic characteristics of the verb and the object itself, whereas in Russian, pragmatic properties are more relevant.
Statistical data on the frequency of various punctuation marks and on the distribution of simple and complex sentences in fiction of various genres are presented. These data helped reveal the most frequent punctuation marks and the conditions of their use, define the frequency and quantitative structure of simple and complex sentences and formulate rules of how to use the information on punctuation marks for prosodic enrichment of the speech generated from text.
The problem of representing near-synonyms in linguistic ontologies is discussed. We argue that it is important to introduce distinguishable concepts having their unique features into linguistic ontologies. We use actually existing multiword expressions for better representation of closely related senses of near-synonyms with distinguishable concepts.
The paper deals with SMS text properties that hamper high-quality automatic reading of them. The main problems are related to incorrect identification of the language of the message, as well as difficulties in recognizing surzhyk, slang expressions, non-standard transliteration and spelling used by SMS senders.
NLP Evaluation forum (http://ru-eval.ru) is a new initiative aimed at independent assessment of the methods that are used in Russian-oriented linguistic resources. The paper describes the first contest of morphological parsers, its participants, data and test collections and reports the design and results of evaluation as well as problematic cases.
Two Russian constructions are studied that are used to characterize the shape of one object through the shape of another: Genitive construction (issohshie pleti ruk) and Instrumental construction (ruki povisli plet’mi). The constructions differ in their syntactic properties and the lexical filling of their nominal slots. On this basis, we build the hypothesis about the semantic differences in the profiling of the spatial situation.
In everyday communication, the word konechno ‘of course’ has a variety of functions. The data of the ‘One Speech Day’ corpus were used to study the distribution of uses of this word in real speech. Individual speaker-dependent features of this word are discussed. Emphasis is laid on the identification of the speaker's communicative intentions. A perception experiment was carried out where participants were asked to characterize the word taken out from speech flow.
The perception of different speech rates in the blind and the sighted is investigated. The aim is to find out what the preferable rates are for synthetic and for natural speech, and to which degree it could be compressed. According to the results, the blind, who are everyday users of screen readers and listeners of synthetic speech, prefer a considerably higher speech rate than the sighted. The knowledge of speech rhythmicity helps to model the temporal structure of synthetic speech, which, in turn, improves the naturalness of output speech. Our study of the parameters of the three degrees of phonetic quantity which form the stress structure observed in standard Estonian reveals that in the recognition of quantity opposition the durational ratio of the vowels of stressed and unstressed syllables is a more adequate distinctive feature than the durational ratios of adjacent phones.
The article establishes a way to determine what makes two authors comparable based on the correspondence of word combinations in their respective literary idiolects. This comparison of the degree to which their word combinations correspond makes it possible to see precisely how far or near they are to each other stylistically – ranging from identical (attribution) to completely unrelated (therefore perhaps belonging to unrelated realms or independent of each other or even unknown to each other personally or culturally). With any degree of correspondence between the two sets of ready-made word combinations, there is then a way to determine the degree to which a later author might have been influenced by the former. The word combinations compared are highly marked (as opposed to the way Geir Kietsaa uses the same approach). For example, "the work of hell", which is not a sociolect cliche in Russian, in contrast to the English, fairly common, "a job from hell", may make it possible to establish a fairly marked connection between Mayakovsky and Platonov’s Chevengur.
The research is based on a developing project that aims to annotate nominal coreference and bridging anaphora in the syntactically annotated corpus of Czech texts, PDT 2.0. In the process of annotating coreferential and bridging relations it became evident that the relatively low inter-annotator agreement is, to a large extent, due to the fact that a text may have a variety of legitimate objective interpretations, rather than the annotators’ mistakes or carelessness. A classification of discrepancy types and possible causes of emergence thereof is presented. Typical examples of multiple interpretations of coreferential and bridging relations in a text are given.
The paper describes functional ambiguity of punctuation marks in the Russian language. A formal model of isolations and series of coordination members is presented. A mathematical problem statement for the use of punctuation in syntactic parsing and an algorithm for this task are suggested.
Principles of constructing lexicographic systems in digital environment are discussed. The explanatory Ukrainian Language Dictionary is considered as an example of an integrated lexicographic system that makes use of a number of lexicographic solutions. A computer network environment that supports the explanatory dictionary structure is described in the form of a virtual lexicographic laboratory. This laboratory ensures coordinated work of the geographically distributed team of linguists that work on a large-scale lexicographic project.
We explore a fast method of detecting automatically generated texts. The method uses multiple statistical features to distinguish mass-generated web spam from normal texts. We demonstrate the capability of the method to detect documents created by Markov chain text generators in Russian and English. We also compare the importance of different statistical features depending on the language of the document.
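For orientation, the kind of generator that such a detector targets can be sketched in a few lines. This is an illustrative first-order (bigram) Markov chain generator, not the detection method itself; the statistical features used for detection are not specified in the abstract, and all names below are hypothetical.

```python
import random
from collections import defaultdict

def build_chain(tokens):
    """Map each token to the list of tokens that follow it in the training text."""
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, seed=0):
    """Walk the chain from start, sampling each next token from observed successors."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    output = [start]
    # Stop early if the last token was never followed by anything in training.
    while len(output) < length and chain.get(output[-1]):
        output.append(rng.choice(chain[output[-1]]))
    return output

tokens = "the cat sat on the mat and the cat ran".split()
chain = build_chain(tokens)
text = generate(chain, "the", 8)
```

Every adjacent word pair in the output is attested in the training text, so the result looks locally plausible while lacking any longer-range coherence.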
Praesens historicum is treated as present narrative, i.e. as the present tense of the narrative register. It is demonstrated that praesens historicum is, substantially, a relative use of the grammeme of tense, when the form of the present expresses simultaneity with the moment singled out in the context, rather than with the moment of speech.
The paper describes how the lexicon – a static knowledge resource – is managed by a human acquirer. The study draws on the methodology, theory and strategy of lexical acquisition outlined in [NR 2004] and takes into account the ongoing implementation experience in various applications, as well as recent revisions/improvements. After a brief outline of the lexicon, the general strategy of lexical acquisition will be introduced, and techniques of acquisition described. An example will then illustrate how complex cases are handled through lexical acquisition within the framework of Ontological Semantic Technology (OST).
Based on first-hand corpus data, the paper investigates self-repairs in Japanese spontaneous narrative discourse and compares types of self-repairs and their frequency in Japanese and Russian.
Methods of the formal concept analysis (FCA) in application to construction of ontological relations in the class of Russian adjectives using computer thesaurus WordNet are considered. The approach is illustrated by the adjectives characterizing human appearance, whose semantic paradigm is analyzed. The structure of hierarchical relations existing in these adjectives is revealed on the basis of formal context constructed with the help of a bilingual dictionary.
The paper presents a new voice building system for the hybrid Russian TTS system “VitalVoice” developed at Speech Technology Center. VitalVoice is a Unit Selection TTS system complemented with triphone inventory. The paper describes all steps of Unit Selection database building, including text selection, automatic segmentation, tuning voice dependent parameters, etc.
The paper deals with the distinction between the statistical/machine-learning approach to meaning in NLP and the one based on direct and comprehensive meaning-access. The crucial difference is in the acquisition of semantic resources, which the first approach declares unimplementable and the second implements. The theoretical roots of the opposition are followed back to the representative vs. non-representative traditions in theoretical linguistics, philosophy of science and language, and AI, starting from Peirce and the structuralists, most notably Hjelmslev's opposition of the planes of expression and content, and the commutation between them. After a brief review of each approach, the difference in evaluation techniques is discussed. The authors weigh in heavily on the side of the meaning-based approach and develop and improve one such approach, the Ontological Semantics Technology.
Comparison of the existing morphosyntactic tagsets often reveals different assumptions, obscuring similarities and distinctions across languages. To overcome the formal and conceptual mismatches, we build an abstract interlingual tagset as a hierarchy of categories, using Formal Concept Analysis.
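To illustrate how Formal Concept Analysis yields a hierarchy of categories, the sketch below enumerates all formal concepts of a tiny context by brute force. The three tags and their features are invented for illustration, and the exhaustive enumeration over object subsets is only practical for very small contexts.

```python
from itertools import combinations

def concepts(context):
    """Return all formal concepts (extent, intent) of a context.

    context maps each object (here: a tag) to its set of attributes."""
    objects = list(context)
    all_attributes = set().union(*context.values())
    found = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            # Intent: the attributes shared by every object in the subset.
            intent = set(all_attributes)
            for obj in subset:
                intent &= context[obj]
            # Extent: every object that has all attributes of the intent.
            extent = frozenset(o for o in objects if intent <= context[o])
            found.add((extent, frozenset(intent)))
    return found

# Hypothetical tags decomposed into features.
tagset = {
    "Ncms": {"noun", "masc", "sing"},
    "Ncfs": {"noun", "fem", "sing"},
    "Afms": {"adj", "masc", "sing"},
}
lattice = concepts(tagset)
```

Ordering the resulting concepts by inclusion of their extents gives the hierarchy: here, for example, the category shared by both noun tags sits below the category shared by all singular tags.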
The article addresses the problem of the inheritance of a verb’s meaning and stylistic colouring by a verbal noun. The focus is on verbs labeled “colloquial” and “slang” in dictionaries of modern Russian, and the notion of the degree of colloquiality is introduced. It is argued that stylistic colouring of verbal nouns is the result of the interaction of several factors, such as the degree of a verb’s colloquiality, the nature of the word-forming base and of the suffix. Polysemy of verbal nouns and the difference between semantic derivation of their slang and standard meanings are considered. All observations are based on the data from the Russian National Corpus.
The problem of identifying authors of short texts is studied. The creation of the author’s model and the text processing algorithm are described. The experiments with short text authorship identification based on two types of technique, artificial neural networks and support vector machines, are described.
The paper presents the current state of ontology learning technology developed by the authors. The technology is based on linguistic and semantic analysis of definitions in Russian encyclopedic and explanatory dictionaries.
The paper presents the results of a corpus-based study on variation of genitive plural forms of masculine nouns in Russian. These results are given in comparison with the data derived from sociolinguistic research conducted in the 1960s. The main trends in the correlation between variants within the studied group are specified.
Within the problem of automatic subject-domain thesaurus construction based on text collection, three approaches for identification of term relations are considered: 1) clustering profile construction of the most significant elements of the texts, 2) formation of specific templates (patterns with variables), and 3) use of cue-words of the relations.
The paper presents a pilot project of a typological database on word order and syntactic constraints. The goal of the database is to summarize and record the knowledge acquired by different researchers who have worked on word order and syntactic constraints in languages of various genetic origins and areal distributions.
Rule-based lexical analysis of NL texts within the context of information extraction (named entities, hypertext transitions, terminology) is discussed. The DSTL language, proposed to solve these tasks, has a variety of expressive means and enables very compact descriptions.
The paper reports the results of using very large corpora to model morphological guessing (morphological features and inflexion model). We use 4 billion Russian web pages and 400 million search queries to build prediction factors that contribute greatly to the process of machine learning. Comparing to a system with fewer prediction factors, our system achieves higher results.
Lexico-semantic relations between words, such as synonymy and hyponymy, are usually examined through diagnostic contexts or other methods that involve the researcher's introspection or a psycholinguistic experiment. The main idea of this paper is that there must exist syntactic constructions in which these relations are manifested in texts. Syntactic constructions that include two semantically homogeneous concepts, represented by six Russian words (approximately: house, building, construction, erection, pavilion), are extracted from the Russian National Corpus and classified. Some syntactic constructions that appear to manifest quasi-synonymy and hyponymy relations are identified and discussed.
Speech disfluencies caused by interference in Russian-Belorussian bilingualism are studied using a corpus of spontaneous speech. The results are compared with the strategies for overcoming speech disfluencies in monolingual Russian speakers described in the literature.
This article is devoted to the principle of dynamic ranking of hypotheses on which the “Treevial” parser is based. The formal apparatus used in Treevial to describe grammar is discussed. A separate section is devoted to the mechanism of fines, which operates in conjunction with the basic formalism. The scheme of the analyzer is set forth, and the advantages and disadvantages of the proposed approach are described. The last section contains a description of the tools which the analyzer uses to impose fines.
An analysis of inflections in spontaneous monologues from the Russian speech corpus of everyday communication is proposed. The most frequent grammatical meanings of inflection are described, as well as their phonetic realization and reduction.
The characteristics of link spam placed by link brokers are considered. We investigate the lifetime and rotation of links and analyze the thematic proximity of links and pages.
The peculiarity of the Russian construction prazdnik ne v prazdnik is shown. The syntax and semantics of this construction are analyzed. The meaning of this construction is controversial: it may be evaluated as either positive or negative.
The paper describes the structure and methods of compilation of an electronic bilingual dictionary of metaphors in the sphere of human psychology (based on Belarusian and English nouns).
The paper gives a lexicographic account of aspectual correlates and the related question of inflectional vs. derivational nature of the Russian verbal aspect. In addition, it discusses a general functional mechanism that forces the speaker to replace any perfective verb with an imperfective one when perfective verbs are not allowed.
The paper deals with Russian jokes in their written form. It distinguishes between scripts of oral jokes, rudimentary jokes, and Internet humor.
The paper presents a project aimed at creating a full-fledged word-formation annotation of the Russian National Corpus (http://ruscorpora.ru). The first phase of the work is compiling a word formation database oriented to corpus annotation. Important theoretical problems concerning different approaches to the word formation in Russian are discussed. Possible approaches to the automation of the annotation process are outlined.
A database of Russian dialects whose phonological systems distinguish two o-phonemes has been enriched with material from several dialects, which makes a comparative analysis of the accentuation systems of these dialects possible. This paper deals with the comparative study of the accentuation of masculine nouns and with the correlation between accentuation (synchronic and Common Slavic accentual paradigms of words) and the timbre of o in their roots.
The Russian equivalent of IF (ESLI) and the equivalent of IF EVER (DAZHE ESLI) are analyzed. In grammars and typology, IF EVER (DAZHE ESLI) is regarded as a concessive-conditional conjunction. I demonstrate that the meaning of this lexical unit is the sum of the meanings of its constituents: IF EVER = ‘if’ + ‘ever’. The point is that the conjunction IF (ESLI) in combination with EVER (DAZHE) is represented in its secondary modification, which can be realized only in a limited range of contexts.
A new method for training classifiers is described, in which the results of passage extraction and classification are used to iteratively retrain the classification model. We also discuss approaches to passage extraction and propose some new methods. Experimental evaluation of our new approach shows significant improvements in the quality of text classification on the standard English and Russian test collections.
The hierarchical structure of polysemy in the semantics of idioms is considered. Three topological types of polysemy are discussed: radial, chain-like and combined (radial-chain). It is shown that polysemy in idioms has specific features as compared to that of ordinary words.
Problems of perceiving and recognizing the gestures of Russian Sign Language in an automated sign language translation system are discussed. A new approach to the morphology of gestures and a method for isolating individual gestures in signed statements are proposed. A working definition of "text understanding" is offered.
I start from the perspective of the EC COMPANIONS project, and set out its aim to model a new kind of human-computer relationship based on long-term interaction, with some tasks involved although the Companion is not inherently task-based, since there need be no stopping point to its conversation. Some demonstration of its functionality will be given but the main purpose here is an analysis of what it is people might want from such a relationship and what evidence we have for whatever we conclude. Is politeness important? Is an attempt at emotional sympathy important or achievable? Does a user want a consistent personality in a Companion or a variety of personalities? Should we be talking more in terms of a "cognitive prosthesis (or orthosis)", something to extract, organize, and locate the user's knowledge or personal information, rather than attitudes?
The paper lists some factors instrumental in syntactic ambiguity resolution which can be used in computer analysis of the text. Semantic analysis of syntactic ambiguity is discussed.
In oral speech with no illocutionary distinctions marked prosodically, the prosody serves as a means of segmenting speech into phonetic groups, lines, or texts. In Orthodox liturgical reading, the prosody marks the beginning of a line and the end of a prayer; in Muslim prayers, it marks the beginning and the end of a line; and in Joseph Brodsky’s verse-reading, it marks the beginning of a phonetic word and the end of a text.
Far from being marginal, as they are often considered, aspectual triplets represent a highly productive phenomenon in Russian. Indeed, they are generated by the same functional mechanism that generates aspectual correlates for virtually any perfective verb. This does not contradict the fact that the Russian aspectual system is organized as a binary correlation and aspectual pair remains its basic unit.
The paper deals with the results of a corpus-based study of collocational behaviour of the most frequent Russian verbs. Association measures (MI, t-score, and log-likelihood) are suggested as instruments to extract collocations. The paper discusses the results obtained and the applicability of the statistical measures. Future prospects of development are outlined.
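The three association measures named in the abstract above can be illustrated with a minimal sketch. It uses the common textbook definitions (pointwise mutual information, the t-score with the observed frequency as variance estimate, and the log-likelihood ratio over a 2x2 contingency table); these are assumptions for illustration and may differ from the exact formulas used in the paper. The function name and its parameters are hypothetical.

```python
import math

def association_scores(f_xy, f_x, f_y, n):
    """Return (MI, t-score, log-likelihood) for a word pair.

    f_xy -- co-occurrence frequency of the pair (must be > 0)
    f_x, f_y -- corpus frequencies of the two words
    n -- corpus size in tokens
    All formulas are standard textbook variants, assumed for illustration.
    """
    # Frequency expected if the two words were independent
    expected = f_x * f_y / n

    # Mutual information: log ratio of observed to expected frequency
    mi = math.log2(f_xy / expected)

    # t-score: difference from expectation, scaled by sqrt of observed
    t_score = (f_xy - expected) / math.sqrt(f_xy)

    # Log-likelihood ratio over the 2x2 contingency table
    observed = [
        f_xy,                    # x and y together
        f_x - f_xy,              # x without y
        f_y - f_xy,              # y without x
        n - f_x - f_y + f_xy,    # neither x nor y
    ]
    rows = [f_x, n - f_x]
    cols = [f_y, n - f_y]
    expected_cells = [rows[i] * cols[j] / n for i in range(2) for j in range(2)]
    log_likelihood = 2 * sum(
        o * math.log(o / e)
        for o, e in zip(observed, expected_cells)
        if o > 0  # empty cells contribute nothing to the sum
    )
    return mi, t_score, log_likelihood
```

With such a function, candidate verb collocates could be ranked by each score in turn. In general, MI tends to favour rare pairs while the t-score favours frequent ones, which is one reason the measures are often compared side by side.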
The paper discusses the correlation of a class of nominal word forms with a characteristic semantics of Stage-level predicates (= lexical statives, Predicatives) and syntactic structures with a dative marking on the semantic subject (Dative structures). Most European languages which have copular Dative structures have a lexical class of nominal Predicatives. Some languages with a class of nominal Predicatives use them outside Dative structures. The typology of Predicatives is determined by the opposition of three classes of adjectival stems: from the stems of the first class only designations of properties are derived; from the stems of the second class only Stage-level predicates are derived; the stems of the third class are ambivalent. The number of adjectival Predicatives and their lexical meanings depend on the proportion of these three classes in a given language.