Noah Bubenhofer
(bubenhofer@ids-mannheim.de), Roman Schneider (schneider@ids-mannheim.de)
Institute for German Language (IDS), Mannheim/Germany
In this feasibility study we aim at contributing at
the practical use of domain ontologies for hypertext classification by
introducing an algorithm generating potential keywords. The algorithm uses
structural markup information and lemmatized word lists as well as a domain ontology
on linguistics. We present the calculation and ranking of keyword candidates
based on ontology relationships, word position, frequency information, and
statistical significance as evidenced by log-likelihood tests. Finally, the
results of our machine-driven classification are validated empirically against
manually assigned keywords.
Research into ontologies has received much attention for
the last years [16] [17] [18]. Due to its practical use for common tasks related
to knowledge sharing and publication, it has been subject of study in most different
scientific communities. Ontologies are often seen as enabling technology for
information sharing, with their ability to be easily reused being a key factor
for successful application scenarios [4] [6] [8] [15]. On the web, which represents a large universe of mostly unclassified
semi-structured hypertexts, semantic techniques and
technologies open up new strategies for information
retrieval and text classification [5].
The Institute for German Language (IDS) in Mannheim is
the central institution for research and documentation of the German language.
It hosts several specialist resources, including the hypertextual information
systems Grammis and ProGr@mm and
a terminological ontology [12] [13] [14]. Since only less than 40% of the
hypertexts are classified with manually assigned keywords, our goal is to gain insight of how ontology features can affect automatic semantic-statistical
classification. We introduce the resources as far as necessary to understand
our test-bed, and then present a self-conducted empirical
case study to verify the feasibility of our approach.
Grammis is a specialist hypertext resource that brings
together terminological, lexicographical,
and bibliographical information about German grammar. Initiated more than a
decade ago, it combines traditional description of grammatical structures with
the results of corpus-based studies and hypermedia design principles.
Considering that the grammar of human languages is a highly complex scientific
domain, the project authors use hypertext chunking and linking as well as
multimedia extensions like spoken language excerpts and graphical explanations
in order to address a broad target audience with heterogeneous foreknowledge.
Their goal is to present a comprehensive overall picture of contemporary German
grammar from a syntactic, semantic, and functional perspective. Today, Grammis[1]
is the most prominent academic information system dedicated to German Grammar,
with consistently more than 50,000 page impressions per month. ProGr@mm[2]
is an e-learning system for schools, colleges, and universities, and didactically
prepared for online learning. A special module covers selected grammatical topics
from the perspective of other European languages and is well-suited for
students and teachers of German as a foreign language. Functional add-ons are
guided tours, personal notes, and discussion forums for the educational
community.

From a technical point
of view, both Grammis and ProGr@mm can be described as XML- and database-driven
web information systems, whose semi-structured hypertext nodes (instances)
conform to the Grammis Markup Language (GrammisML). GrammisML defines detailed constraints
on the instance’s logical structure, allowing for subsequent cross-media
publishing (“one source fits all”). It provides conventional block elements
like paragraphs, lists, or tables, as well as specific markup structures for
the coding of grammatical metadata, typed hyperlinks, etc. Using a web-based
authoring frontend, arbitrary keywords and object words/phrases for retrieval
operations can be assigned manually. Parsing, analysis, and transformation of
the hypertext resources are conducted using established technology like XPath,
XQuery, XSQL, and XSL Transformation [11].
Not just since the proclamation of the Semantic Web [2],
semantic resources are among the most prominent add-ons and tools for
information retrieval. Domain ontologies, organizing specialist terms
(concepts) and their interconnections (relationships), can make a most valuable
contribution to the analysis, classification, and finding of documents on the
web – not least in the context of academic publications [3]. This is due to the
both simple and unfortunate fact that scientific terminology is often far from
being consistent. Especially in the field of linguistics, different theories,
schools of thought, or even authors not only name things differently, but even
assign varying meanings to identical terms. A semantically enriched retrieval application
for the exploration of linguistic resources should incorporate these theory-related
details so that it can offer appropriate solutions. As a consequence, we
integrated a domain ontology for linguistic/grammatical terminology. The semiautomatic
detection of concepts as well as the modeling of relationships has been conducted
using statistical methods on large general language corpora and specialist language
corpora.[3]
Broadly speaking, in order to bring together theoretical desiderata with practical
demands and limitations, we combine well-established standards of ontological
engineering – e.g., the use of ISO-2788/ANSI Z39.19 compliant hyponymy/meronymy
relationship types like Broader Term Generic (BTG) or Broader Term Partitive
(BTP) – with terminological modeling principles – e.g., termsets, expanded by
theory-related attributes and explicit linking of individual concepts belonging
to different Termsets [1].
Figure 2 illustrates our ontology model. It covers three
termsets, indicated by dotted border lines. The bottom termset contains the two
concepts Verbgruppe and Verbalphrase, recognizable by rectangles
with rounded corners. Verbgruppe is
characterized by a theory-related attribute named IDS', meaning that it is used primarily when referring to the IDS Grammar of German Language. The
concept Verbalphrase consists of four
lexical entries:
·
Verbalphrase with a marker for
Preferred Term (PT) and a language attribute (German)
·
Verbphrase linked to the former by
a synonymy relation (SYN)
·
VP linked by a abbreviation relation (AB)
·
Verb Phrase with language attribute
(English) and translation relation (TR)
The termset is linked with its hyperonym by a Broader
Term Generic (BTG) relation. In order to clarify the benefit of linking not
only termsets, but also concepts, our example illustrates the relationships
between Phrase (engl. phrase) and Satz (engl. sentence).
Basically, the corresponding termsets are connected with the help of a Broader
Term Partitive (BTP) relation (meronymy). Beyond this, since generative
grammars usually classify sentences as phrases (complementizer phrases), only these two concepts – singled out by a
theory-related attribute – are linked by a Narrower Term Generic (NTG) relation
(hyponymy). This should facilitate communication between people or computers
using different terminological systems.

Fig. 2: Grammis ontology
modeling structure
The goal of the
classification process is to find terms (keywords) describing the content of a
hypertext. We use the following information for our algorithm:
l
The hypertexts contain XML-coded markup like paragraphs, lists, tables
etc., but also specific grammatical metadata and links to the grammatical
dictionary.
l
For the classification process the hypertexts were lemmatized and
part-of-speech tagged using the “TreeTagger” [10] and a training file for
German.
l
The source for possible keywords is our ontology, that can be accessed
by functions such as „get hypernyms of a term x up to n levels“ or
„get synonyms of term x“ etc.
For the classification
process we stored the hypertexts as a lemmatized word list which also contains
the type of the paragraph the word is used in (title, subtitle or definition).
We omitted words that are used in examples and tables: Examples contain object
language that should not be used as a source for keyword candidates. Tables
also often list object language and contain word chunks or fragments, because
they are used for the presentation of inflection paradigms and the like. The
basic idea of our classification algorithm is the following:
1. We select for each text all the terms that are
also part of the ontology.
2. For each term, we assume broader terms one
level above as additional keyword.
3. For each term, a rank is calculated that
reflects its importance within the text. We use basically three factors to
calculate importance:
a. Frequency: More frequent terms are more
important than rare ones.
b. Position: If a term is mentioned in a title,
subtitle or a definition or is used as a link to the grammatical dictionary, then
it is supposed to be more important.
c. Statistical significance: The relative
frequency compared to the mean frequency of the term in all the other texts is
calculated using a log-likelihood test.
These three factors are
combined to an overall score. Frequency and position are calculated by counting
the occurrences of the term in question multiplied by a weight depending on its
position. In our standard procedure we used: titles = 6, subtitles = 4,
definitions and „Merksätze“ = 2, all other positions = 1. The statistical
significance is calculated using the log-likelihood test [9]:
LL = 2*((a*log (a/E1)) + (b*log (b/E2)))
In this formula, a and b
are the raw frequencies of the term in the text and the whole corpus
respectively. E1 and E2 are the expected frequencies in
the text and the whole corpus. The calculated value expresses the difference of
the relative frequency to the total corpus. The higher the value, the higher is
the significance of the term for the specific text. The following example demonstrates
the difference between raw frequencies and relative frequency: Table 1 shows
the frequencies and significance values of a hypertext node on valency
(“Valenz”).[4]
|
ID |
Type |
keyword
Candidate |
Frequency |
Weighted
frequency |
LLR |
Source
term |
|
2871 |
d |
Valenzträger |
8 |
17 |
70,93 |
Valenzträger |
|
2871 |
d |
syntagmatische Beziehung |
11 |
23 |
64,89 |
Valenz |
|
2871 |
d |
Valenz |
11 |
23 |
64,89 |
Valenz |
|
2871 |
d |
Komplement |
18 |
18 |
54,44 |
Komplement |
|
2871 |
d |
Leerstelle |
3 |
3 |
26,21 |
Leerstelle |
|
2871 |
d |
Wortart |
15 |
33 |
13,68 |
Verb |
|
2871 |
d |
Verb |
15 |
18 |
13,68 |
Verb |
|
2871 |
d |
Nominalgruppe |
2 |
2 |
13,33 |
Nominalgruppe |
|
2871 |
d |
Modifikator |
10 |
14 |
10,94 |
Adjektiv |
|
2871 |
d |
Adjektiv |
10 |
13 |
10,94 |
Adjektiv |
|
2871 |
d |
Satzadverbial |
10 |
13 |
10,94 |
Adjektiv |
|
2871 |
d |
Nomen |
10 |
13 |
9,85 |
Nomen |
|
2871 |
d |
Bedeutung |
6 |
6 |
9,66 |
Bedeutung |
|
2871 |
d |
Verbvalenz |
1 |
1 |
6,25 |
Verbvalenz |
|
2871 |
d |
Eigenschaft |
4 |
4 |
4,58 |
Eigenschaft |
|
2871 |
d |
Prädikat |
4 |
4 |
4,58 |
Eigenschaft |
|
2871 |
d |
Form |
6 |
6 |
3,51 |
Form |
|
2871 |
d |
Ergänzung |
1 |
1 |
3,34 |
Ergänzung |
|
2871 |
d |
Infinitivkonstruktion |
1 |
1 |
3,15 |
Infinitivkonstruktion |
|
2871 |
d |
Anhebung |
1 |
1 |
3,15 |
Infinitivkonstruktion |
|
2871 |
d |
Infinitkonstruktion |
1 |
1 |
3,15 |
Infinitivkonstruktion |
|
2871 |
d |
Nominalphrase |
4 |
4 |
2,65 |
Nominalphrase |
Table 1: Comparison of different measures for the frequency of terms in the valency
hypertext node
The keywords are ordered
by their significance for the text (column „LLR“). Column „frequency“ contains
the raw frequencies, and „weighted frequency“ stands for the frequencies
weighted by the position in the text. The list also contains terms that are not
mentioned literally, but are broader terms of a token („source term“). The most
frequent term is Valenzträger, but according to the raw and the
weighted frequency, Valenzträger would be on a lower rank. And vice
versa: A very often used term like Adjektiv is not significant enough
for a text on (verb) valency to rank in a top position ordered by significance.
The algorithm produces
two different rankings: One ranking reflects the combination of frequency of
the term and its position, the other ranking represents the significance of the
term. Both aspects influence the final ranking. We combined the two rankings in
the following way: The rank is transformed to a score by inverting the rank
position. We then sum up the two scores and get a final ranking. In addition,
we omit keywords with raw frequency 1 which tend to get very high LLR values
but are not important enough to be included into the keyword list. When
applying the algorithm to the valency hypertext node (see table 1 above for the
raw frequencies), we get the final ranking as shown in table 2.
|
ID |
Type |
Keyword Candidate |
Score |
|
2871 |
d |
Valenz |
17 |
|
2871 |
d |
syntagmatische Beziehung |
17 |
|
2871 |
d |
Wortart |
15 |
|
2871 |
d |
Valenzträger |
14 |
|
2871 |
d |
Komplement |
12 |
|
2871 |
d |
Verb |
10 |
|
2871 |
d |
Leerstelle |
6 |
|
2871 |
d |
Modifikator |
5 |
|
2871 |
d |
Nominalgruppe |
3 |
|
2871 |
d |
Satzadverbial |
2 |
|
2871 |
d |
Adjektiv |
2 |
Table 2: Final ranking of the terms in the text
„Verbvalenz“
The number of keyword
candidates depends on how congruent the two lists of the highest ranked terms
are. Table 2 is based upon the combination of two top 10 lists and both lists
contain more or less the same terms in different order. Therefore the merged
list contains only two terms more than the two source lists. Intuitively, table
2 satisfyingly reflects the text about valency, similar to other hypertext nodes
we evaluated manually. But, as described in the following section, we tried to further
evaluate the lists for better results.
Some of the hypertext
nodes are already classified by manually assigned keywords, using an
uncontrolled vocabulary. These keywords are a measure to evaluate our automated
classification and to experiment with different settings of the classification
algorithm. Currently, the algorithm cannot cope with multi-word units,
therefore we only analyze texts with one or more single-word keywords. Table 3
shows how the change of some parameters of the classification algorithm – e.g.,
weight of position (title, subtitle, etc.) – affects the matching of manually
and automatically assigned keywords. We differentiate three matching levels: level
1 counts documents, that at minimum have one correspondence of manually and
automatically assigned keywords. At level 2 at least 50%, and at level 3 all of
the manual keywords need to be matched by the automatic ones.
|
Version |
Parameters |
Level 1: Min. 1 KW |
Level 2: |
Level 3: |
|
1 |
Default version Weight of positions: titel = 10, subtitle = 4, definitions and „Merksätze“ = 2 Source lists: top 10 |
79,34% |
37,68% 312/828 |
22,4% 186/828 |
|
2 |
More keywords Equal to default
version, but: Source lists: top
20 |
83,69% 693/828 |
48,18% 399/828 |
29,71% 246/828 |
|
3 |
More keywords Equal to default
version, but: |
85,02% 704/828 |
52,29% 433/828 |
32,97% 273/828 |
|
4 |
More keywords Equal to default
version, but: |
85,02% 704/828 |
52,54% 435/828 |
33,33% 276/828 |
|
5 |
Titles version Equal to default version, but: titel = 30,
subtitle = 10 |
79,59% 659/828 |
38,53% 319/828 |
23,19% 192/828 |
|
6 |
Versions with more keywords
lead to the same results than versions 2-4 above |
|||
|
7 |
No hypernyms Equal to default version, but: Only literally used words are keyword candidates, no hypernyms. |
79,10% 655/828 |
37,07% 307/828 |
22,46% 186/828 |
|
8 |
More hypernyms Equal to default version, but: Hypernyms up to 2 levels above in the hierarchy |
78,02% 646/828 |
37,44% 310/828 |
21,98% 182/828 |
|
9 |
More keywords Equal to “more
hypernyms” version (8),
but: Source list: top
100 |
85,51% 708/828 |
53,5% 443/828 |
34,54% 286/828 |
Table 3: Evaluation of the automatic assigned keywords
The evaluation illustrates
two key issues for successful keyword detection:
·
Getting all possible keyword candidates out of the text (tested with
versions 1, 5, 7 and 8 in table 3).
·
Putting the keyword candidates into the right ranking order, so that the
top 10 ranking reflects the text content (tested with versions 2-4, 6 and 9 in
table 3).
The first evaluation results
are not too impressive: A 50% matching of automatically and manually assigned
keywords is only achieved at about 37% of the documents (table 3, 1). About 80%
of the documents have at minimum one correspondence. Crucial seems the number
of keywords that are included into the final list of keyword candidates. If
this number is being increased, the matching scores also get better (table 3,
2-4). But even if the source lists contain 100 keyword candidates, only 52% of
the documents have matches at a 50% level (85% at level 1). If other parameters
are changed, the score does not increase significantly: Neither accepting less
nor more hypernyms (table 3, 7-8) has a substantial impact on the matching
score. Only a higher weight of title positions (table 3, 5) slightly increases
the score.
These results interfere
with our first impression when we intuitively evaluated documents without any
manual keywords. Therefore, the manual classification process has to be
examined. In 262 (32%) of 828 documents, at minimum 80% of all manually chosen
keywords are not used at all within the hypertext nodes, even if the most narrow
terms are taken into consideration. The reasons for that are manifold:
·
Tagging issues influence the matching results: The TreeTagger does not
lemmatize some plural forms (e.g., Pronomina)
correctly. This leads to a mismatch in hypertext nodes where only the plural
form is used.
·
The fact that at the moment we cannot cope with multi-word units also
affects the evaluation of the manual classification process.
·
Our human classifiers tend to choose keywords that are neither mentioned
in the hypertext node nor are close hypernyms of text words.
The above mentioned hypertext node (“Relativ-Elemente”) also shows that
different keynote annotators could disagree on the best solution (bad
inter-rater reliability). Pronomen
and Wortart are the manually assigned
keywords, but another rater perhaps would also or instead set Relativsatz, Relativ-Element (as used in the title of the text) or something
else as a keyword. Table 4 shows the automatically assigned keywords to the
text.
|
ID |
Type |
Keyword Candidate |
Score |
|
368 |
d |
Phrase |
11 |
|
368 |
d |
grammatische
Kategorie |
10 |
|
368 |
d |
Relativsatz |
10 |
|
368 |
d |
eingeleiteter
Nebensatz |
9 |
|
368 |
d |
Einheitenkategorie |
8 |
|
368 |
d |
Relativ-Element |
8 |
|
368 |
d |
nicht-verbaler
Ausdruck |
7 |
|
368 |
d |
Pronominalphrase |
7 |
|
368 |
d |
Einbettung |
7 |
|
368 |
d |
Proposition |
6 |
|
368 |
d |
Verkettungsverfahren |
6 |
|
368 |
d |
restriktiv |
5 |
|
368 |
d |
Präpositionalphrase |
4 |
|
368 |
d |
semantische
Relation |
4 |
|
368 |
d |
phrasale
Kategorie |
2 |
|
368 |
d |
Nominalphrase |
1 |
Table 4: Final ranking of the terms in the text „Relativ-Elemente“
The discussion shows the
demand for a gold standard regarding the automatic detection of keywords for
specialist texts. But the establishing of such standards seems difficult due to
the fact that different (hypertext) publications even today mostly use
different microstructures. An orientation to existing guidelines like TEI would
possibly ease the determination of default settings for position parameters
like title, subtitle, paragraph types, etc. Beyond that, controlled vocabularies
for the manually assigned keywords – or, alternatively, the integration of
user-independent data like social bookmark tags or folksonomies – would surely
affect the congruity with machine-detected terms. Nevertheless, the first
results of our ontology-based approach encourage for further application in the
context of information retrieval and classification – and for methodological
comparisons with other approaches for automatic keyword extraction.
[1] Beißwenger, M. TermNet – ein
terminologisches Wortnetz im Stile des Princeton WordNet // TU Dortmund:
Institut für deutsche Sprache und Literatur, 2008.
[2] Berners-Lee, T. /
Hendler, J.A. / Lassila, O. The Semantic Web // Scientific American 284 (5),
2001. P. 34-43.
[3] Chiarcos, C. An
Ontology of Linguistic Annotations // LDV-Forum/Journal for Computational
Linguistics and Language Technology, 2008. 23(1). P. 1-6.
[4] Fensel, D. Ontologies:
Silver Bullet for Knowledge Management and Electronic Commerce // New York: Springer,
2001.
[5] Gottron, T. /
Schneider, R. A Hybrid Approach to Statistical and Semantical Analysis of Web
Documents // Merabti, M. (Ed.): Proceedings of The Fifth IASTED European
Conference on Internet and Multimedia Systems and Applications (EuroIMSA). Acta
Press, 2009. P. 115-120.
[6] Gruber, T.R.
Ontology of Folksonomy: A Mash-up of Apples and Oranges // International
Journal on Semantic Web & Information Systems, 2007. Vol. 3.2. P. 1-11.
[7] Gruber, T.R.
Ontology // Liu, L. / Tamer Özsu, M. (Eds.): Encyclopedia of Database
Systems. New York: Springer, 2009.
[8] Maedche, A. Ontology
Learning for the Semantic Web // Kluwer Academic Publishers, 2002.
[9] Rayson, P. / Garside, R. Comparing corpora using
frequency profiling // Proceedings of the workshop on Comparing Corpora, 38th
annual meeting of the Association for Computational Linguistics (ACL), Hong
Kong, 2000. P. 1-6.
[10] Schmid, H. Probabilistic Part-of-Speech Tagging
Using Decision Trees // Proceedings of the International Conference on New
Methods in Language Processing, Manchester, 1994. P. 44-49.
[11] Schneider, R. E-VALBU:
Advanced SQL/XML processing of dictionary data using an object-relational XML
database // SDV - Sprache und Datenverarbeitung/International Journal for
Language Data Processing, 2008. Vol. 32.1/2008. P. 33-44.
[12] Schneider, R. A
Database-driven Ontology for German Grammar // Rehm, G. / Witt, A. / Lemnitzer,
L. (Eds.): Data Structures for Linguistic Resources and Applications. Proceedings of
the Biennial GLDV Conference 2007. Narr, 2007. P. 305-314.
[13] Schneider, R. Benutzeradaptive Systeme im
Internet: Informieren und Lernen mit GRAMMIS und ProGr@mm// Mannheim: Institut
für Deutsche Sprache (IDS), 2004.
[14] Sejane, I. Database-Driven
Access to Heterogeneous XML-Contents Using Domain Ontology of German Grammar //
SDV - Sprache und Datenverarbeitung/International Journal for Language Data
Processing, 2008. Vol. 32.1/2008. P. 71-87.
[15] Simperl, E. Reusing
ontologies on the Semantic Web: A feasibility study // Data & Knowledge
Engineering, 2009. Vol. 68/10. P. 905-925.
[16] Staab, S. / Studer,
R. Handbook on Ontologies. International Handbooks on Information Systems // New
York: Springer, 2009..
[17] Studer, R. / Davies,
J. / Warren, P. Semantic Web Technologies – Trends and Research in
Ontology-Based Systems // John Wiley & Sons, 2006.
[18] Tran, T. / Cimiano,
P. / Rudolph, S. / Studer, R. Ontolgy-Based Interpretation of Keywords for Semantic
Search // Proceedings of the 6th International Semantic Web Conference, 2007. P.
523-536.
[1] Short for: Grammatical Information System (http://www.ids-mannheim.de/grammis/). The
authors of this paper are members of the Grammis project team.
[2] Short for: Propaedeutic Grammar (http://www.ids-mannheim.de/progr@mm/)
[3] See [12] for a description of the ontology building process; [14] describes
the ontology in greater detail.
[4] http://hypermedia.ids-mannheim.de/pls/public/sysgram.ansicht?v_typ=d&v_id=2871