This paper describes Factus, a fact mining system: a prototype oriented to a restricted domain that can be widened. It discusses the representation of the domain ontology and its extension on the basis of features extracted during text processing, aiming at a so-called “open concept frame”. Text analysis is performed by means of special structures that include syntactic arrangements of predicate arguments, with their context markers used for pattern validation.
The paper outlines a new method of cross-linguistic comparison of emotion concepts, where entire emotion “clusters” rather than individual terms are juxtaposed. The method is applied to eleven emotion clusters in Russian and English. The paper considers both universal semantic tendencies and language-specific means of expressing emotion, and proposes tentative explanations for the observed cross-linguistic similarities and discrepancies.
The paper is concerned with a project aimed at creating a production dictionary of contemporary Russian. Work on the project started in 2006 at the Russian Language Institute of RAS. The main idea of the dictionary is to present a complete and unified account of all linguistically relevant properties of each lexical unit. Apart from grammatical forms and senses they include a) regular semantic modifications of the dictionary definition in verifiable contextual conditions, b) detailed government patterns and their possible modifications, c) a list of minor type sentences specific for a given lexical unit, d) its combinatorial potential (especially as handled by the theory of lexical functions), e) its lexicalized prosody. All these make an integral part of the linguistic competence of speakers and should be characterized on the basis of the latest theoretical findings of linguistic research in the respective fields.
The paper deals with Russian regional terms describing urban real estate: names of different types of apartment buildings and flats (depending on the time of construction, layout, material, etc.). As a rule, these words are rarely included in explanatory dictionaries, except for some colloquial words that are normative in the speech of Moscow and St. Petersburg residents. The research was carried out on the materials of the Integrum database, which includes mass media publications and public documents from across the Russian-speaking space. Statistics on the mentions of these terms in regional and central public documents allow preliminary conclusions about their areal distribution.
The report discusses the problem of representing inner form in definitions of idioms. Decomposition of meaning cannot be used for the semantic description of non-discrete semantic phenomena such as metaphor and image. It is proposed to represent inner form semantically using the strategy of metaphor recognition. In the definition of an idiom, the recognition process is supported by a semantic “trigger”, which generates the necessary chain of associations.
The dictionaries of Russian proverbs are analyzed with respect to their repertory and the selection of the main variant of items.
In the paper we discuss orthography on the Internet and analyse a widespread misspelling: writing the soft sign in the ending of verb forms containing the suffix -s’a (-ся), as in delaet’s’a (делаеться). Our analysis shows that the number of such misspellings allows us to speak of a kind of new standard in the written language of the Internet.
The paper reports principles for balancing the corpus of spontaneous monologues in the Russian language collected according to shared linguistic and sociolinguistic parameters. It presents samples of collected data, benefits of multilevel analysis and perspectives of further augmentation.
This work introduces and analyzes the Russian example Ja ne byl v zale, kogda vyklučili svet ‘I wasn’t in the hall when they turned out the lights.’ This example refutes Ju.D. Apresjan’s claim that sentences of that kind cannot have a “synchronous” interpretation. Various meanings of the verb byt’ ‘be’ in locative and existential sentences are discussed.
The paper investigates and compares five methods for extracting and assembling variable-length terms. Experiments are conducted on a corpus of scientific papers on genetics and microbiology. An evaluation method combining expert and formal assessment is proposed, and the results of a comparative evaluation of the methods are presented.
The paper considers search on the Web. Questions on the users’ manners of search are formulated, with emphasis on multiple tasks execution. It is shown that multitasking is rare, usually includes only two task sessions and is formed into a temporal inclusion of an interrupting task into the interrupted one. Quantitative characteristics of search behavior in 3 classes of temporal sessions (single-task session, several tasks executed one-by-one, and multitasking session) were compared, and significant differences were revealed.
The paper analyzes the semantics of Russian instrumental construction with the meaning of shape (xvost kol’com ‘ring tail’, slozhit’ gubki bantikom ‘to make Cupid’s bow’). Spatial interpretation of this construction is described in terms of topological classes (Talmy 2000, Rakhilina 2000). Possible mirroring of topological classes in both slots X and Y is investigated as well as their predictable mutual accommodation.
Semantics of deictic words can be analysed more efficiently if we take the communicative situation of the utterance into account. Traditionally, the semantics of the German deictic elements hin and her was described as being orientated towards the speaker – towards the speaker’s place and time. However, this is true only for the canonical communicative situation, when speaker and hearer are both in the same place. In non-canonical situations (when speaker and hearer are not in the same place) and especially in contexts of hypotaxis or narrative, the speaker may be deprived of his “deictic privileges”, which are then transferred to some other persons.
An attempt is made to optimize the operation of the parser in the ETAP-3 linguistic processor. The idea is to change the parsing rules in such a way that the emerging syntactic hypotheses are ranked according to the probability of their appearing in the resulting syntactic tree of the sentence being processed. Experimental results are given.
The present report is devoted to the problems of using ontologies in text mining systems. Peculiarities of ontologies used in such systems are examined. A method for automatic ontology generation, in which terms of data domain and relations between them are initially detected by means of computer analysis of the text, is proposed.
In the paper the core group of adjectives expressing various types of judgment is discussed. The structuring of attributive meanings in the wordnet-type thesaurus for Russian (RussNet) is described. The three facets of judgment – pragmatic, aesthetic, and moral – are considered to be fundamental.
The paper discusses an unsupervised technique for automatic detection of the morpheme structure of words in inflectional languages, using Spanish as a case study. We use global optimization implemented as a genetic algorithm, without any heuristics or assumptions that affect the problem dimensions a priori. A description of the genetic algorithm is given, and preliminary evaluation results are presented. The input is a list of words compiled from a dictionary or a corpus. The output is the same list with the words segmented into morphemes. Like many other automatic methods, this algorithm does not claim to produce fully correct results and requires postprocessing. Still, it allows fast detection of tendencies in the data and yields preliminary results without manual work.
The Russian dialogue particle “aha” can express either agreement/confirmation or surprise/satisfaction. In Russian lexicography these meanings are mostly presented as being expressed by homonymous linguistic items, the particle and the interjection. The paper examines examples of “aha” in spontaneous institutional dialogues and discusses the possibility of finding a common component of meaning.
The article presents a new piece of linguistic software, the machine translation system Tilde Translator, and focuses on the treatment of lexical ambiguity and multiword expressions in the system. The improvement of machine translation quality by means of semantic filters is also described.
The paper presents a minimalist approach to the structure of the Russian Adjective Phrase (AP hereafter). The puzzling properties of predicative long form (LoF) adjectives with complements are the starting point of the paper. To explain the distribution of complement-taking adjectives, we propose a multi-layered structure of the adjectival phrase. The internal A is a lexical head that surfaces as a short form (ShF) adjective. The external small a is a functional head responsible for case concord of attributive LoFs. The chief claim concerns the properties of lexical A heads in Russian. These ShF phrases: (i) are the locus for argument merging; (ii) project their own Spec position; (iii) do not assign structural case; (iv) allow an eventive (stage-level) interpretation. At the same time, the external LoF a-shell lacks all these properties and is responsible for case concord in noun phrases. As for the constraint on complementation attested with Russian predicative LoF adjectives, we suppose that it is due to two facts: the “defective” structure of nominative predicates on the one hand and the elaborated shell structure of the adjectival phrase on the other. In such constructions the subject of the lexical AP “has not enough time” to raise to Spec, IP and activate the case feature on I, which should subsequently be transmitted to the a head through Pred. This conflict does not arise in the case of instrumentals (assigned by Pred) and ShF (no case assignment). Further, case features on LoF do not influence its complement-taking potential in attributive function and in secondary predication. We ascribe the grammaticality of attributive and secondary instrumental / nominative LoFs to the fact that such adjectival phrases are control structures and the case value does not depend on the raising of the internal subject.
The proposed analysis is supported by several other properties of LoF and ShF: distribution of symmetric predicates, stage/individual-level interpretations, properties of derived nominals and others.
The paper describes a project for constructing a “Corpus of Oral Russian”, which may be created on the basis of the Movie Sub-Corpus of the Russian National Corpus. The authors offer some solutions to problems concerning the structure of the Corpus, the types of annotation, the format of the output, the types of queries, and the variety of tasks that may be posed and solved with the help of the Corpus.
The main characteristics of the set of words recognized when a text is perceived in white noise are described in terms of text compression (in comparison with the set of key words). The results of reconstructing the text from this set of words are analyzed in order to identify the main characteristics of the set. One of the main findings is the dependence of the sense structure of a text on the following text parameters: professional vs. fiction and dynamic vs. static.
The paper deals with metalanguage lexical units that convey certain relations between the names of different objects: the Russian units одноимённый ‘of the same name, cognominal’ (and its derivatives) and так и называется ‘called exactly this way’. Such items are difficult to interpret in NLP applications. Lexicographic definitions are proposed based on a number of key senses identified by the author: the ideas of coincidence, correspondence, and simplicity.
A class of Russian syntactic idioms is considered from the theoretical and NLP points of view. The class, formed with the noun сила ‘force, power’, consists of a variety of lexical units with surprisingly individual peculiarities. Examples of this class include: (1) the preposition в силу ≈ ‘by virtue of’, as in В силу этой теории поведение в одной точке вселенной влияет на поведение в другой точке ‘By virtue of this theory, the behaviour in one point of the universe influences the behaviour in another point’; (2) the adverb of degree от силы ‘at the most’, as in от силы десять человек ‘ten people at the most’; (3) the adverbial pattern в X-овую силу ‘using such and such part of one’s force’, as in работает в полную силу ‘he works to the full extent of his power’, работает в треть силы ‘he works using a third of his force’; (4) the predicative adverb в силах 1 ≈ ‘being able’, as in старик был не в силах быстро ходить ‘the old man was unable to walk fast’; (5) the predicative adverbial pattern в (чьих-либо) силах 2 ‘within one’s power’, as in сдержать смех было не в моих силах ‘to contain laughter was beyond my powers’. Specific descriptions of several of these idioms are given using a specially designed standard layout.
A methodological tool is proposed that enhances the quality of discourse transcription in the course of preparing corpora of spoken language. Prosodic prototypes underlying discourse segmentation and the expression of phasal meanings can be identified with the help of prosodic portraits of individual speakers.
The paper discusses the linguistic basis for building connections between segments in the Russian sentence and some problems that arise when searching for the governing word of a given segment.
In the paper the German modal particle JA in constative utterances is compared to its Russian translation equivalents VED’ and ŽE on the basis of parallel samples of modern German prose and its professional translations into Russian. The analysis reveals the following differences: 1) VED’ presupposes its proposition as a fact while JA and ŽE do not, which explains the ability of the latter two to be freely used in imperatives; 2) VED’ and ŽE specify the degree of rhetorical activity (≈ intensity of illocutionary force) as normal and high respectively, while for JA this semantic feature is irrelevant, which makes the choice of its translation equivalent dependent on such pragmatic features of the context as its relation to the speaker’s interests and the interpersonal relations among the interlocutors; 3) JA can be used in responses implying yes / no answers to direct questions, while VED’ and ŽE cannot occur in this context; 4) the use of VED’ in answers requires the dictal component of its propositional content to be different from that of the question; 5) VED’ cannot occur in correcting remarks and direct answers unless it is preceded by an initial adversative particle (NO, A, DA). Its use together with one of these particles overtly marks the response as conflicting with some of the addressee’s initial assumptions and thus as violating the maxim of consent, so in some cases it may damage the semantic equivalence of the translation with respect to the interpersonal aspects of utterance meaning.
The development of a data base for intonation of oral mass-media texts is now in progress at the Philological Faculty of Moscow University. A highly detailed system for sentence prosody description is used. Great differences are found between the use of prosodic means in informal dialogues and in informational texts in TV-programs.
A brief description is given of the 2007 experiment on evaluating the performance of scientific work at the Russian Academy of Sciences. At its final stage the exercise revealed several problems. To address them, a method and an instrument are suggested: a classification method and a semantic dictionary with an integrated classification scheme, respectively.
In this paper, data from an experimental investigation of the semantics of Russian causative verbs are presented. The investigation was conducted in the framework of Force Dynamics theory. We distinguish the concepts of CAUSE, ENABLE, and PREVENT depending on the correlation of three main parameters of the causative situation: 1) the tendency of the patient for a result, 2) the presence of opposition between the affector and the patient, and 3) the occurrence of a result.
The article studies the patterns of language behavior used by Russians in the communicative situation of argumentation.
The present research is concerned with pauses at different syntactic boundaries in oral monologue speech in Japanese. It aims to find out how frequent, and therefore probable, pauses are at the boundaries of sentences and of clauses smaller than sentences, and what their “normal” length is.
The paper reports a corpus-based study of prosodic strategies employed in multiclausal structures with a postpositioned dependent clause in spoken Russian. Three main strategies are discussed: (1) the pitch direction at the primary accent in the main clause is opposite to that in the dependent clause, (2) the pitch direction at the primary accent in the main clause copies that in the dependent clause, and (3) the main clause remains non-accented. Quantitative and qualitative analysis is provided to explain the speaker’s choice between the three strategies.
We present and discuss a model for controlling the speech behaviour of a virtual computer agent (a computer game agent, an interface component or, in the future, a mobile robot). The model simulates “mood dynamics”, which control the agent’s behaviour in communication. In particular, the model uses a set of phrasal templates to construct short monologues that reveal the dynamics of the agent’s “feelings” and allow the agent to switch between several dialogues in a communication.
An academic lecture, regarded as a kind of dialogue, is a suitable experimental ground for studying general regularities and specific rules of gesture-speech interrelation and human interaction. In the first part of the research (part II A), a classification of didactic deictic gestures was compiled and some classes of these gestures were described. In this part (part II B), I intend to demonstrate that deictic gestures of each type have their own non-trivial relations with the verbal and nonverbal signs in a dialogue.
An attempt is made to evaluate the frequency of syntactic molecules (= minimal autosemantic sentence parts able to serve as answers to a question) on evidence from the Russian General Corpus (created on the basis of the Uppsala Corpus) with the help of the StarLing database processing software package.
The article examines the words благородный and великодушный. First, we describe the difference in their semantics and try to establish the connection between the meaning of благородный and its inner form. Then the polysemy of благородный is examined. Finally, we analyse the meaning of the lexemes благородный 3.1 and 3.2 (благородное лицо, благородное животное) and formulate the hypothesis that the idea of a connection between a person’s inner qualities and birth is still preserved in the modern language.
The report deals with the principles of organization and methods of building a multimedia textual dialect corpus, representing dialect as a comprehensive whole of cultural and communicative features and modeling the communication of specific speech groups in specific social and cultural environment.
In this article, we describe some useful extensions to translation-oriented terminological dictionaries using as an example two dictionaries compiled at the University of Helsinki, Palmenia Centre for Continuing Education in Kouvola, in 2003–2007. These dictionaries are mostly descriptive but they contain some elements which are usually characteristic of normative dictionaries, such as restrictive labels, strict terminological definitions, and concept charts. Special attention is paid to translator-friendly techniques, such as explicit marking of partial and artificial equivalents and explanation of the differences between concepts in the source and target languages.
The paper considers the linguistic processor “Semantix” for automatic formalization of natural language texts in several fields: criminal texts, autobiographies, and texts about terrorism. The processor extracts from texts the user objects, their links, and facts about the objects’ actions. The results are XML files that are used for knowledge base organization, semantic search, and analytical tasks.
Parts of speech syncretism in Russian and Uralic (the problem of the composition of bilingual dictionaries for inflectional and agglutinative languages). The article discusses the pros and cons of presenting the material as homonyms or as polysemous units in bilingual dictionaries.
This report deals with a project for a dictionary (lexical database) of “non-nominative” items used as adverbial modifiers (e.g. на ходу, под предлогом (чего), во всяком случае).
In this article we investigate to what extent materials available to paying subscribers are openly published on websites. We obtained the distribution of news agencies’ messages by time of delay. We also measured the relative number of reprints of news agencies’ materials on websites, as well as of Internet messages included in the agencies’ newslines.
The Russian discourse particle uzh is very difficult to describe. It produces manifold pragmatic effects, and it is unclear how these effects are connected with the components of its meaning. The paper is devoted to some of these components and the discourse effects they cause.
The construction with two NPs introduced by the preposition u, such as U men’a u dočki segodn’a den’ roždenija ‘Today is my daughter’s birthday’, is very often used in colloquial Russian. In this paper I describe the syntactic, semantic, and discourse features of this construction.
The paper is devoted to a closed class of Russian asyndetic composite sentences that require the use of colon in written language and are characterized by a special intonation in spoken language. The problems that arise while transcribing such sentences in spoken narrative are discussed.
An algorithm is proposed for segmenting text into syntactic syntagmas, based on an analysis of the stable phraseological and grammatical-semantic word combinations that make up the sentence. The point of identifying these word combinations in the sentence is that they limit the freedom of dividing it into syntagmas: a syntagma border can lie only outside word combinations, not inside them.
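The border constraint stated in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes the stable word combinations have already been found and are given as hypothetical `(start, end)` token spans with an exclusive end, and the function name `allowed_borders` is invented for the illustration.

```python
def allowed_borders(n_tokens, combination_spans):
    """Return the positions where a syntagma border may be placed.

    Border b lies between token b-1 and token b. A border is blocked
    when it falls strictly inside a word-combination span (start, end),
    where end is exclusive. The spans are assumed to be given.
    """
    blocked = set()
    for start, end in combination_spans:
        # Borders start+1 .. end-1 would split the combination.
        blocked.update(range(start + 1, end))
    return [b for b in range(1, n_tokens) if b not in blocked]
```

For a six-token sentence whose tokens 0-1 and 3-5 form stable combinations, only the borders around the free token 2 remain available.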
A computer system for cloning prosodic speech parameters is described. The system automates the creation of the complex prosodic portraits needed for TTS synthesis. It is intended to widen the inventory of prosodic portraits for personalized speech synthesis from texts of various genres.
In the paper we describe the development of an automated system for the analysis of multiword expressions that facilitates the discovery of specific features of their syntactic and semantic behaviour. The analysis is based on automatic comparison of the component structure of expressions and uses the knowledge described in a thesaurus-like linguistic resource. At present we are testing the system in the process of term acquisition for the Ontology on natural sciences and technologies.
The frequency dictionary presents the core lexicon of contemporary Russian (1950–2005); it gives information about word frequency in actual use and provides frequency comparisons between different functional styles and periods of text creation. The dictionary is based on texts of the Russian National Corpus (100 million words).
This report is devoted to rhetorical antonymy, which is recognized only in oral speech. Unlike inherent and adherent antonymy, the kind distinguished in the paper is characterized not by an inner antinomy of meanings but by the opposition of communicative purposes (constructive and destructive).
Barriers between the correlative clause and the main clause in correlative constructions in Russian are described. It is also shown that correlatives do not reconstruct in Russian. A preliminary syntactic structure of Russian correlatives is suggested that involves the position of topic and/or focus.
The paper presents results of a corpus-based study of selectional preferences of frequent Russian lexemes. Research procedure requires analysis of co-occurrence data obtained from Russian texts. It is implied that selectional preferences of a lexical item may be defined through sorting its left/right neighbours in bigrams by MI-score values. Given an ordered set of neighbours for a lexical item, it is possible to induce its context patterns. Selectional preferences are specified with respect to morphological and semantic features of co-occurring lexical items.
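Sorting a word's right neighbours by MI-score, as mentioned in this abstract, can be sketched as follows. This is an illustration under stated assumptions rather than the study's procedure: it uses the common pointwise mutual information formula log2(c(x,y)·N / (c(x)·c(y))), takes N to be the number of tokens, and the function name `right_neighbours_by_mi` and the `min_count` parameter are hypothetical.

```python
import math
from collections import Counter


def right_neighbours_by_mi(tokens, target, min_count=1):
    """Rank the right neighbours of `target` in bigrams by MI-score.

    MI is computed as log2(c(x, y) * N / (c(x) * c(y))), with N taken
    as the number of tokens (a simplifying assumption).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = []
    for (left, right), pair_count in bigrams.items():
        if left != target or pair_count < min_count:
            continue
        mi = math.log2(pair_count * n / (unigrams[left] * unigrams[right]))
        scored.append((right, mi))
    # Highest MI first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

Left neighbours can be ranked the same way by filtering on the second member of each bigram. On small corpora MI tends to favour rare pairs, which is why a frequency threshold such as `min_count` is usually applied.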
The paper presents experimental results on automatic word sense disambiguation. Contexts for Russian nouns denoting physical objects extracted from the National Corpus of the Russian Language serve as an empirical basis of the study. Optimal conditions for WSD are defined taking into account lexical markers of word meanings in contexts and semantic annotation of contexts.
The paper discusses the principles of compiling interpreting corpora, with a corpus of court interpreting as an example. Such a corpus combines a spoken corpus with a parallel corpus. The tagging should reflect communicative, prosodic, and extralinguistic information. Interpreting corpora are a valuable source of data for multidisciplinary research.
Instances of poetic devices used in the works of V. Nabokov and A. Platonov were counted on a sample of about 100 pages for each author. The comparison allows us to hypothesize an opposition between their respective poetics as the poetics of linguistic elegance and the poetics of awkwardness.
The report argues for the need to compile a new specialized dictionary reflecting changes in government in the Russian language over the period from the early 19th century to the present day. The author presents a concise list of principles underlying such a dictionary and introduces a sample dictionary entry for the verb skuchat’.
The Prague Dependency Treebank 2.0 (PDT 2.0) contains a large amount of Czech text with complex and interlinked morphological (2 million words), syntactic (1.5 MW) and complex semantic annotation (0.8 MW); in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level. PDT 2.0 is based on the long-standing Praguian linguistic tradition, adapted for the needs of current Computational Linguistics research. The corpus itself uses the latest annotation technology. Besides the large corpus of Czech, a corpus of Czech-English parallel resources (The Prague Czech-English Dependency Treebank) is being developed. English sentences from the Wall Street Journal and their translations into Czech are being annotated in the same way as in PDT 2.0. This corpus is suitable for experiments in machine translation, with a special emphasis on dependency-based (structural) translation. In the report, the basic annotation scheme is presented, with special reference to the complex semantic (tectogrammatical) level. The system of syntactic functors and the valency lexicon VALLEX are also discussed.
The study examines interlingual communication on the Estonian web portal rate.ee. First conversations between Estonians and Russians are examined in order to identify the factors behind the choice of language for the first conversational act (conversations are normally strings of picture comments). Most of these factors are related to the situation (who the participants are, which language is more comfortable to communicate in, what the purpose is), but some things are learned unintentionally via the community of practice: generally environment-related unwritten rules of politeness and polite language choices, along with suitable vocabulary.
The lexical semantics of a word can determine its weak or strong accentual position in the phrase and its tendency to play the role of topic or comment. The links between the lexical meaning of a word and its potential accentuality could help to describe the different meanings of one and the same polysemous word in more detail. The interaction between a polysemous word and its accentuality makes it possible to identify additional, more specific meanings. Subjective and evaluative, negative, and retrospective semantics are especially “appealing” for the phrase accent. But several factors can resist this accentual “appeal”, for example a specific communicative task (pure narration, explanation of cause, imperative sentences), idiomatic phrases, enumeration, and the use of numerals. If prosodic information about accentuality is included in a dictionary, it is necessary to comment at least on the potential obstacles that can destroy the anticipated accentual structure of the phrase. This comment could be presented, for instance, in the foreword of such a dictionary. In general, not all words in the vocabulary require this kind of prosodic information. On the other hand, some lexical meanings of polysemous words are closely connected with accentual emphasis, and this fact should not be neglected in lexicography.
The focus of attention in modern semantics gradually transfers from the meaning of separate linguistic entities to meaning shifts and contexts that motivate these meaning shifts. The type of communicative situation (and REGISTERS OF INTERPRETATION it engenders – such as dialogical register, narrative, hypotaxis) is one of the most relevant parameters. Examples are given of EGOCENTRICAL grammatical categories, words and constructions that have different interpretations in different registers.
It is widely recognized that the marker of text incompleteness in many languages is the rising tone. This paper argues that in German the intonational strategies for maintaining coherence can be specified and that a variety of ways to show that a statement is not text-final can be singled out.
A goal of this paper is to analyze the differences between mathematical definitions of symmetry and a concept of symmetry that would fit best with observed linguistic generalizations. This requires a closer look at some aspects of the linguistic behavior of symmetric and non-symmetric predicates.
This paper discusses modification of some syntactic rules that regulate the interaction of verbal and nonverbal semiotic codes in the dialogue. We show that there is regular correspondence between particular meanings in the semantic explanation of the gesture given and different components of the physical realization of this gesture.
Aligning parallel texts, i.e. automatically putting the sentences or words in one text into correspondence with their equivalents in a translation, is a very useful preprocessing step for a range of applications, including but not limited to machine translation, cross-language information retrieval, and dictionary creation. We present a new algorithm for aligning bilingual, linguistically unannotated parallel corpora. It enables alignment at the sentence level, using a bilingual dictionary and heuristic cues along with linguistics-based rules. The program based on the algorithm currently aligns Russian and English texts and requires no prior mark-up or other manual text pre-processing. Russian lemmas are retrieved from the grammar dictionary. The adaptive nature of the system allows experiments with a variety of fiction and non-fiction (i.e. scientific and juridical) texts. The algorithm deals with typical alignment problems such as the correct alignment of one-to-many sentence correspondences, the omission of a sentence, and texts with different syntactic patterns in the two languages. The first phase of performance tests seems promising, and we are going to develop a word and multiword alignment technique.
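One ingredient named in this abstract, a bilingual dictionary used as evidence for sentence correspondence, can be sketched in a few lines. This is only an assumed scoring step, not the algorithm presented in the paper: the function `dictionary_overlap`, the dictionary format (source lemma mapped to a set of target words), and the toy entries are all hypothetical, and the heuristic cues and linguistic rules the authors mention are not modelled.

```python
def dictionary_overlap(src_lemmas, tgt_words, dictionary):
    """Share of source lemmas with at least one dictionary translation
    present in the candidate target sentence (0.0 for empty input)."""
    if not src_lemmas:
        return 0.0
    tgt_set = set(tgt_words)
    hits = sum(1 for lemma in src_lemmas
               if dictionary.get(lemma, set()) & tgt_set)
    return hits / len(src_lemmas)


# Toy dictionary, invented for the illustration.
toy_dictionary = {"кошка": {"cat"}, "спать": {"sleep", "sleeps"}}
score = dictionary_overlap(["кошка", "спать"], ["the", "cat", "sleeps"],
                           toy_dictionary)
```

A full aligner would compare such scores across neighbouring sentence pairs, including merged pairs, in order to handle one-to-many correspondences and omitted sentences.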
In this paper, we analyse pauses in Russian Sign Language discourse. In order to describe different types of pauses, we use signed discourse transcription data, which contains information on movement phases of signs and on changes in the facial expression and body posture of the signer.
We describe an implementation of a simple probabilistic link grammar. This probabilistic language model extends trigrams by allowing a word to be predicted not only from the two immediately preceding words, but potentially from any preceding pair of adjacent words that lie within the same sentence. In this way, the trigram model can skip over less informative words to make its predictions. The underlying "grammar" is nothing more than a list of pairs of words that can be linked together. Finally, we report some experimental results using Russian corpora.
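The idea of predicting a word from any earlier pair of adjacent words in the same sentence can be sketched with plain counts. This is an illustrative simplification, not the model from the abstract above: the function names are hypothetical, there is no smoothing, and the score is simply the best relative frequency over all candidate pairs rather than a proper probability.

```python
from collections import Counter, defaultdict


def train_pair_links(sentences):
    """Count how often each adjacent word pair precedes each later word in a sentence."""
    counts = defaultdict(Counter)
    for words in sentences:
        for j in range(2, len(words)):
            # Every adjacent pair that ends before position j may predict words[j].
            for i in range(j - 1):
                counts[(words[i], words[i + 1])][words[j]] += 1
    return counts


def score_next(counts, history, candidate):
    """Best relative frequency of `candidate` given any adjacent pair in `history`."""
    best = 0.0
    for i in range(len(history) - 1):
        seen = counts.get((history[i], history[i + 1]))
        if seen:
            best = max(best, seen[candidate] / sum(seen.values()))
    return best


corpus = [
    ["the", "dog", "quickly", "barked"],
    ["the", "dog", "barked"],
]
links = train_pair_links(corpus)
# The pair ("dog", "really") was never seen, so a plain trigram has no evidence,
# but the earlier pair ("the", "dog") still supports "barked".
score = score_next(links, ["the", "dog", "really"], "barked")
```

Here the unseen word "really" is skipped over, which mirrors the abstract's point about bypassing less informative words.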
The paper is devoted to the comparison between nominalizations in Russian everyday speech and slang on the one hand and in modern standard Russian on the other hand. Derivational bases, means of derivation, meaning, argument frame and surface behavior of nominalizations are considered. The analyses suggest that, considering the intermediate position of nominalizations between nouns and verbs, Russian colloquial and slang nominalizations are less related to motivating verbs than nominalizations in standard Russian.
The development and use of an ontoeditor designed to operate on the knowledge model of the InTez ontology are presented. Browsing, input, editing and other functions are discussed. The ontoeditor is compared with similar environments developed abroad.
The paper considers multilevel linguistic annotation of the Russian Speech Corpus and its potential for description of spontaneous speech in comparison to standard language.
The paper presents a project for a new dictionary of variants in Russian, which is to be compiled on the basis of the Russian National Corpus. The paper gives a preliminary description of the dictionary word list and of the types of tasks and problems posed and solved.
A technology intended for building subject-oriented dictionaries and for solving various text analysis tasks in information systems is considered. The problem of using several dictionaries simultaneously and coordinating their contents is investigated.
In this paper we discuss problems that emerge while developing an ontology for the scientific discipline concerned with computational language, text and speech processing, that is, Computational Linguistics (CL). The problems range from defining the name and scope of the subject domain to meeting the formal requirements imposed on the ontology specification by the knowledge portal design. The difficulties arise because CL deviates from “classic” sciences such as archeology: for CL the computer is not merely a means of amplifying and intellectualizing modeling tools but an inherent part of the science. We consider the problems and the ontology organization.
The report concerns the methodological principles elaborated for creating “One Speaker’s Day”, a speech corpus of Russian everyday communication. The paper presents the main rules for data processing at the initial stages, a description of the database, and the current state of corpus formation.
Estonian human-human calls (directory inquiries) are analyzed with the further aim of developing a computer-human dialogue system that interacts with a user in natural language. The analysis is based on the Estonian Dialogue Corpus. Linguistic features of clients’ requests and agents’ grants are studied. A client’s initial request sets up a goal which will be achieved in collaboration with the agent. Agents give information briefly, using short sentences or phrases. Information-sharing sub-dialogues are initiated by both participants if either a request or a grant needs to be adjusted. A formal grammar of information dialogue is introduced in the paper. The results of the study will be implemented in two dialogue systems under development.
On the basis of Nirenburg & Raskin’s “Ontological Semantics”, formal rules are proposed for recognizing the semantic roles of Instrument and Similar-to (in form and in general) expressed by the instrumental case in Russian. The rules are needed for the correct translation of NP adjuncts whose head noun belongs to the class of artifacts within an AT system.
The article aims to define a potential set of directions in which hesitation pauses and kinetic phrases can interact. The material is a stretch of spontaneous English monologue. A multidisciplinary approach reveals seven lines along which the investigation of pause-kinetic interaction can be conducted.
The research aims to distinguish interjections and participles with the help of syntactic, semantic and pragmatic criteria. A word should be regarded as an interjection if it is syntactically autonomous and is a spontaneous, unaddressed reaction to a linguistic or extra-linguistic stimulus.
Russian verbs of going down are described in this paper. The parameters relevant for an adequate semantic description are shown, for example the control exercised by the subject, the speed of movement, and the layer into which the subject is placed. Three main metaphorical extensions are discussed: BAD IS DOWN, a large amount of something, and disappearance from sight.
Simulating Ukrainian speech and making fun of funny-sounding Ukrainian words and names are unmistakable signs that jokes about Ukrainians are produced in the Russian linguistic environment. The paper aims at revealing links between typical joke plots, “linguistic masks” of the characters, and ethnic stereotypes.
We discuss work on building the first free speech database for recognition systems. This report reviews free speech sources, processing techniques, and problems related to the collection of a large multilingual speech database.
The paper considers problems of using linguistic search methods in contemporary search engines. The features of the search engine Exactus are described. An experimental evaluation of search quality is performed. The advantages of integrating linguistic and statistical methods are shown.
This report deals with methods of word sense disambiguation (reduction) that use information about verb argument structure. Most systems based on this method require specially designed resources such as WordNet, FrameNet, etc. We explore the possibility of extracting and using the information available in standard dictionaries, including a verb-argument dictionary. As training and testing data we used a subcorpus of the Russian National Corpus that has unambiguous morphological annotation. The aim was to reduce the number of tags for verbs in the semantic annotation. The experiment has shown that the information extracted from dictionaries cannot be used as is. However, the extracted argument structure can be used as a seed set for future training. It makes it possible to remove rare meanings and can reduce the number of semantic tags for a verb. Further corpus training and enriching the argument structure with general semantic properties of nouns can improve the method further.
The paper presents an algorithm for segmenting narrative sentences into phrases and tagging their intonation. The algorithm takes into account positional and combinatory prosodic factors. Using the proposed algorithm in a TTS synthesis system eliminates the so-called “second degree of monotony” in synthesized speech.
The Russian conjunctions a to [lit.: ‘and/but that’] and a ne to [lit.: ‘and/but not that’], judging by their form, cannot be synonyms. Yet they easily substitute for one another in some contexts. To explain this fact I analyze the element TO of these conjunctions. It derives from the demonstrative/anaphoric pronoun TO(T) and, in the conjunctions under discussion, is not fully bleached. TO in A TO and A NE TO refers to certain fragments of the semantic structure of an utterance. The difference between the conjunctions lies in the scope of TO. Compositional analysis of Russian conjunctions and particles is considered.
In this paper we propose a data-driven algorithm for detecting sentence boundaries in Russian. The algorithm relies on shallow features and does not require any deep syntactic knowledge. We evaluate our approach with three publicly available machine learners: C4.5, Ripper and SVM-light. The evaluation results suggest that our algorithm significantly outperforms rule-based approaches.
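The kind of shallow features such a sentence boundary detector might rely on can be sketched as follows. The abstract above does not list its features, so every feature name here is an assumption chosen for illustration; the sketch only extracts one feature dictionary per candidate punctuation mark and leaves the classification itself to whichever learner is used.

```python
import re


def boundary_candidates(text):
    """Yield (position, features) for each '.', '!' or '?' in the text.

    The feature set is hypothetical: it looks only at the tokens
    immediately to the left and right of the punctuation mark.
    """
    for match in re.finditer(r"[.!?]", text):
        pos = match.start()
        # Word characters directly before the mark (may be empty).
        left = re.search(r"(\w*)\Z", text[:pos]).group(1)
        # First word after the mark, skipping whitespace (may be empty).
        right = re.match(r"\s*(\w*)", text[pos + 1:]).group(1)
        yield pos, {
            "left_len": len(left),
            "left_is_digit": left.isdigit(),
            "left_is_upper_initial": len(left) == 1 and left.isupper(),
            "right_capitalized": right[:1].isupper(),
            "followed_by_space": text[pos + 1:pos + 2].isspace(),
            "at_end": pos == len(text) - 1,
        }


sample = "А. С. Пушкин родился в 1799 г. в Москве. Он писал стихи."
candidates = list(boundary_candidates(sample))
```

Initials and the abbreviation "г." produce candidates that are not true boundaries, which is why a trained classifier over such features can be preferable to a fixed rule like "period followed by a capital letter".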
The report discusses the problems that arise when building automatic text classification systems. Main elements of the integrated text classification technology are described. Particular attention is given to the construction of combined decision rules for the implementation of a hierarchical classification of texts.
The paper presents the comparative lexicographic description, in the RuSLED dictionary, of Russian words and Russian Sign Language gestures with the same or similar meanings. The multimedia dictionary RuSLED is intended for studying the usage features of Russian words and Russian Sign Language gestures.
This paper seeks to connect the future of the Internet to a new, even though relatively underdeveloped, technology, that of computer speech and language and its embodiment, in a concept I shall call an Artificial Companion. Before moving to describe the integration that constitutes the Companion, we must first mention two technologies, not only in their own right but because, in each case, there have been misunderstandings about their achievements and goals. They are: language and speech technology; agents and the Semantic Web. The first of these is Berners-Lee’s [Berners-Lee et al., 2001] vision of how the Internet will change, and it is to that new Internet we intend the Companion as the human interface, on the ground that without it the Internet may get harder and not easier to use, and we shall return to the Semantic Web at the end of this paper. The second notion above is that agents will change from transitory software entities that e.g. locate a cheap camera on the internet, to more permanent social Companion entities that deal with a user through dialogue over a long period, learn his or her needs and preferences and elicit large quantities of life data through conversation.
Representing prosodic data in a dictionary raises two problems: to account for limitations on the communicative and prosodic application of words and constructions by their definitions or functions in discourse; to collect idiomatic illocutions and their prosodic parameters in a prosodic dictionary.
Anomalous phrases in A. Platonov’s texts have so far been investigated exclusively as a source of information on the author’s poetic world. The paper demonstrates that Platonov’s linguistic anomalies can be used as a source of information about the Russian language. These anomalies reveal some subtle semantic, combinatorial and categorical properties of Russian words which could hardly have been noticed otherwise. This information can be used in explanatory dictionaries of Russian, as well as in the semantic tagging of electronic corpora.
Documents of the 7th Framework Programme of the European Union, adopted for the period 2007-2013, formulate new tasks concerning the problem of knowledge representation in the digital sphere. The paper analyzes the key points of these formulations. The results of the analysis are used to define some terms proposed for describing knowledge representation processes in digital libraries.
The paper discusses word order and phrasal prosody in Russian. I claim that both phenomena can be described in terms of two successive sets of rules: local rules vs. global rules. Combinations of these two sets of rules are typical of multilayer language models and of the algorithmic generation of complex structural objects in formal grammars. Modern Russian applies a highly formalized rule for choosing the locus of the main phrasal accent: the hierarchy of potential accent bearers is a mirror image of the grammatical hierarchy of arguments and adjuncts. The order of communicative constituents in Russian is governed by 7-8 Linear-Accent Transformations (LA-transformations). LA-transformations are Movement rules, which both operate on constituent order and change the accent markings of communicative constituents. In the preceding Russian linguistic tradition (cf. Paducheva and Yanko) LA-transformations are defined as Context-Sensitive rules, which makes a word order calculus impossible. I discuss the possibility of reformulating LA-transformations as pairs of the type and offer an analysis compatible with Mildly Context-Sensitive Grammars, e.g. Stablerian Minimalist Grammars.