Artificial Companions as a new kind of dialogue interface to the future Internet
Yorick Wilks (yorickwilks@googlemail.com)
University of Sheffield, UK
This paper attempts to link the future of the Internet with a new, and as yet
relatively undeveloped, technology for the computer realisation of language and
speech. The concept underlying this technology I call the Artificial Companion.
Before discussing what makes up a Companion, two technologies must be
mentioned, not only because they are important in their own right but also
because the goals and achievements of each have been misunderstood. Specifically,
these are:
Language and speech technologies
Agents and the Semantic Web
The first technology is bound up with Berners-Lee's vision [Berners-Lee et al., 2001] of the changes in store for
the Internet. It is for this new Internet that we intend the
Artificial Companion, an interface between human and machine. We believe that without
such a Companion, using the Internet will become harder, not easier. At the end of
the paper we return to the Semantic Web. The second notion is that of agents, which
will turn from transitory software tools, able for example to find a cheap webcam on the Internet, into
permanent elements of a social Companion, able to interact with the
user in dialogue over a long period, to learn the user's
needs and preferences, and to convey a large amount of vitally important
data in conversation with them.
Introduction
This is not a
paper in social science, but rather in speculative technology: however, the
underlying technologies exist already and I will briefly describe them, along
with some account of the current debates over their consequences. The crucial
move in the paper will be when, after describing Artificial Companions, real
and possible, I go on to argue that they can be seen as links to the Internet,
at least for vulnerable classes of people (the old, the young) but perhaps for
all of us when faced with the coming torrent of information on the Internet,
particularly information about ourselves.
Before moving to describe the integration that
constitutes the Companion, we must first mention two technologies, not only in
their own right but because, in each case, there have been misunderstandings
about their achievements and goals.
Language and speech technologies are, for our
purposes, two closely related methods for interfacing to the Internet: the
first by typing to it to ask a question or to ask it to do something, and the
second by speaking and listening, for the same purposes. The two are related,
in that speech technology normally decodes speech waves (i.e. what is said into
a microphone) into some form like written text inside a computer, which is
then analysed so as to be understood, with the effect that both spoken and
written input end up being analysed in similar ways by what we are calling
“language technology”, which we can think of, loosely, as going from text to
what it means.
The notion of a Companion
The paper introduces the
notion of an Artificial Companion as a socially important paradigm for language
and speech research in the next ten years: an intelligent and helpful cognitive
agent which appears to know its owner and their habits, chats to them and
diverts them, assists them with simple tasks but makes no technical demands on
them at all, and might be most suitable for vulnerable social groups like the
young and the old. The paper also discusses current aspects of the overall
speech and language research program that a Companion will need.
The technologies needed
for a Companion are very near to a real trial model; some people think that
Artificial Intelligence (AI) is a failed project after nearly fifty years, but
that is not true at all: it is simply everywhere. It is in the computers on
200-ton planes that land automatically in dark and fog and which we trust with our
lives; it is in chess programs like IBM's Deep Blue that have beaten the world
champion, and it is in the machine translation programs that offer to translate
for you any page of an Italian or Japanese newspaper on the web.
And AI is certainly
present in the computer technologies of speech and language: in those
machine translation programs, in the typewriters that type from your
dictation, and in the programs on the phone that recognise
where you want to buy a train ticket to, from among the four hundred or so
British station names. But this is not a paper about computer technology any
more than it is about robots, nor is it about philosophy.
Companions are not at
all about fooling us as to their true natures, as in the Turing test scenario, because
they will not pretend to be human at all: imagine the following scenario, which
will become the principal one, running through this paper. An old person sits
on a sofa, and beside them is a large furry handbag, which we shall call a
Senior Companion; it is easy to carry about, but much of the day it just sits
there and chats. Given the experience of Tamagotchi,
and the easily ascertained fact that old people with pets survive far better
than those without, we will expect the Companion to be an essential lifespan-
and health-improving object to own.
Other Companions are
just as plausible as the Senior one, in particular the
Junior Companion for children, which would probably take the form of a backpack,
a small and hard to remove backpack that always knew where the child was. But
the Senior Companion will remain our focus, not only because of its obvious social
relevance and benefit, possibly even at a low level of function that could be
easily built with what is now available in laboratories, but because of the
particular fit between what a Companion is and old people's needs.
Common sense tells us
that no matter what we read by way of official encouragement, a large
proportion of today's old people are effectively excluded from information
technology, the web, the internet and advanced mobile phones because "they
cannot learn to cope with the buttons". This can be because of their
generation or because of losses of skill with age: there are talking books in
abundance now but many, otherwise intelligent, old
people cannot manipulate a tape recorder, which has too many small controls for
them with unwanted functionalities. All this is obvious and well known and yet
there is little thought as to how our growing body of old people can have
access to at least some of the benefits of information technology without the
ability to operate a PC or even a mobile phone.
After all, the needs of
the elderly are real: not just to have someone to talk to, but to deal with
correspondence from public bodies, such as councils and utility companies
demanding payment; to set up times by phone to be visited by
nurses or relatives; to be sure they have taken their pills, when keeping any
kind of diary may have become difficult; and to decide what foods to
order, even when a delivery service is available via the net but difficult in
practice for them to make use of.
In all these situations,
one can see how a Companion that could talk and understand on the phone, and
also gain access to the web, as well as process written text in email, could
become an essential mental prosthesis for an old person, one that any
responsible society would have to support. But there are also aspects of this
which go beyond getting information, such as having the newspapers blown up on
the TV screen till the print was big enough to be read, and dealing with
affairs requiring some degree of reasoning, like paying bills from a bank
account.
We have talked of
Companions as specialised computer agents for tasks
as simple as using the web to find a supermarket’s home delivery service for
groceries. More interestingly, it may involve using the web to find out what
has happened to their old school friends and workmates, something millions
already use the web for. But we shall need some abstract notion of time lines
and the coherence of life events on the web to sort friends and schoolmates
from the thousands of other people with the same names.
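As a rough illustration of the time-line idea, the following Python sketch scores how well a web mention fits what is already known of a person's life. The record fields, the sample data and the scoring rule are all assumptions made for this illustration; they do not describe any existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LifeEvent:
    """One dated fact about a person: where they were during a span of years."""
    place: str
    start_year: int
    end_year: int


def overlaps(a: LifeEvent, b: LifeEvent) -> bool:
    """True when the two events share at least one year."""
    return a.start_year <= b.end_year and b.start_year <= a.end_year


def coherence(known: list[LifeEvent], mention: LifeEvent) -> int:
    """Score a web mention against a known time line (hypothetical rule).

    +1 for each known event in the same place at an overlapping time,
    -1 for each known event that puts the person elsewhere at that time.
    """
    score = 0
    for event in known:
        if overlaps(event, mention):
            score += 1 if event.place == mention.place else -1
    return score


# Hypothetical time line of the owner's old school friend.
friend = [
    LifeEvent("Sheffield", 1950, 1961),
    LifeEvent("London", 1962, 1990),
]

# Two hypothetical web pages mentioning people with the same name.
mentions = {
    "page_a": LifeEvent("London", 1975, 1980),
    "page_b": LifeEvent("Houston", 1975, 1980),
}

# Rank the pages from most to least coherent with the known time line.
ranked = sorted(mentions, key=lambda k: coherence(friend, mentions[k]), reverse=True)
```

Here `page_a` ranks above `page_b`, because it agrees with the known London years while `page_b` contradicts them. A real system would need far richer events and softer evidence than this toy rule.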
The reasoning
technologies we shall need to organise the life of a
Companion’s owner may turn out to be the very same technologies needed to locate
other individuals on the web and select them out from all the personal
information about the world’s population that fills up the WWW, given that the
web is now not just for describing the famous but covers potentially everyone. Two
of my friends and colleagues who are professors of computer science have some
difficulty distinguishing, and maintaining a difference, between themselves on
the web and, in one case, a famous pornography supplier in Dallas, and in
another case a reasonably well known disc-jockey in Houston, all of whom are
highly ranked by the Google algorithm [Page et al., 1998].
These problems, of
sorting out who exactly web information is about, will soon become not just
quirky but the norm for everyone, and what I shall want to argue later is that
the kind of computer agency we shall need in a Companion, one that deals with
the web for us if we are old or maybe just lazy, is in fact closely related to
the kind of agency we shall need to deal with the web in any case as it becomes
more complex. To put this very simply: the web will become unusable for
non-experts unless we have human-like agents to manage its complexity for us. The
Internet/web itself must develop more human-like characteristics at its
peripheries if it is to survive as a usable resource and technology: just
locating a particular individual on the web, when a majority of the EU and US
populations have a web presence, will become far more difficult and time
consuming than it is now. If this argument is right, Companions will be needed
by everyone, not simply the old, the young and the otherwise handicapped. It is
going to be impossible to use the web without its having some kind of a
human face.
The notion of a Companion
developed so far is anything but superhuman; it is vital to stress this because
some of the public rhetoric about what companionable computers will be like has
come from films such as 2001, whose computer HAL is superhuman in knowledge and
reasoning. He is a very dangerous Companion, and prepared to be deceptive to
get what he wants, which may be not at all what we want. Seymour Papert at MIT always argued that it was a total
misconception that AI would ever try to model the superhuman, and that its mission
was to model the normal. This was much the same as AI pioneer John McCarthy’s
emphasis on the importance of common sense reasoning, which was on capturing the
shorthand of reasoning, the tricks that people actually use to cope with
everyday life. Only then would we understand the machines we have built and
trained and avoid them becoming too clever or too dangerous. This same impetus
was very much behind Asimov’s Laws of Robotics, which set out high-level
principles that no robot should ever break if it is to bring no harm to humans.
The difficulty with such
principles is fairly obvious: if a machine were clever enough it would find a
way of justifying (to itself) an unpleasant outcome for someone, perfectly
consistently with acceptable overall principles. Doing that has been a
distinctively human characteristic throughout history: one thinks of all those
burned for the good of their own souls and all those sacrificed so that others
might live. In the latter case, we are probably grateful for those lost in what
were really medical experiments (such as the early heart transplants),
even though they were never called that.
It will not be possible
to ignore these questions when presenting Companions in more detail, and in
particular the issue of where responsibility and blame may lie when a Companion
acts as a person’s agent and something goes wrong. At the moment,
Anglo-American law has no real notion of any responsible entity except a human,
if we exclude Acts of God in insurance policies. The only possible exception
here is dogs, which occupy a special place in English law, at least, and seem
to have certain rights and attributions of character separate from their
owners. If one keeps a tiger, one is totally responsible for whatever damage it
does, because it is ferae naturae, a
wild beast. Dogs, however, seem to occupy a middle ground as responsible
agents, and an owner may not be responsible unless the dog is known to be of “bad
character”. We shall return to this later and argue that we may have here a narrow
window through which we may begin to introduce notions of responsible machine
agency, different from that of the owners and manufacturers of machines.
It is easy to see the
need for something like this: suppose a Companion told one’s grandmother that
it was warm outside and, when she went out into the freezing garden believing
this, she caught a chill and became ill. One might well want to blame someone
or something in these circumstances and would not be happy to be told that
Companions could not accept blame and that, if one read the small print on the
Companion’s box, one would see that the company had declined all responsibility
and had even got one to sign a document accepting this. All this may seem
fanciful and even acceptable if one’s grandmother recovered and the company
gave the Companion a small tweak so it never happened again.
This story makes no
sense at the moment, and indeed the Companion might point out with reason, when
the maintenance doctor came round, that it had read the outside temperature
electronically and could show that it was a moderate reading and the blame
should fall on the building maintenance staff, if anywhere. These issues will
return later but what is obvious already is that Companions must be prepared to
show exactly why they said the things they said and offered the advice they
did.
A Companion’s memory of
what it has said and done may be important but, one hopes, will be used only rarely, though it may be necessary for it to repeat its
advice at intervals with a recalcitrant user: “You still haven’t taken your
pills. Come on, take them now and I’ll tell you a joke you haven’t heard
before”. James Allen in Florida is already said to have modeled a talking
companionable pill for the elderly!
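The requirement that a Companion be able to show why it said what it said can be pictured as a log that stores each utterance together with the evidence behind it. The sketch below is only a minimal illustration in Python; the class names, the `outdoor_sensor_celsius` reading and the wording of the explanation are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Advice:
    """Something the Companion said, with the readings it relied on."""
    text: str
    evidence: dict


@dataclass
class AdviceLog:
    """A hypothetical memory of utterances and their supporting evidence."""
    entries: list = field(default_factory=list)

    def say(self, text: str, **evidence) -> str:
        """Record an utterance together with its supporting evidence."""
        self.entries.append(Advice(text, evidence))
        return text

    def explain(self, text: str) -> str:
        """Return the evidence behind the most recent matching utterance."""
        for advice in reversed(self.entries):
            if advice.text == text:
                reasons = ", ".join(f"{k}={v}" for k, v in advice.evidence.items())
                return f"I said '{text}' because: {reasons}"
        return f"I have no record of saying '{text}'"


log = AdviceLog()
log.say("It is warm outside", outdoor_sensor_celsius=18.5)
explanation = log.explain("It is warm outside")
```

In the grandmother scenario, such a record would let the Companion point to the sensor reading it relied on, which moves the question of blame to whoever is responsible for that reading.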
The state of language and speech technology
How does this rather airy vision connect
to the general state of R & D in speech recognition and natural language
processing at the moment? My own belief is that most of the components needed
for a minimally interesting Companion are already available; certainly the
Companion is not particularly vulnerable to one major current technical
weakness, namely the imperfect recognition rate of available Automatic Speech
Recognition (ASR) systems. This, of course, is because a Companion is by definition
dedicated to a user and so the issue of user-independent ASR does not initially
arise, except when the Companion needs to make its own phone calls and
understand what is said to it.
However, the Companion is not merely an
application wholly neutral between current disputes about how best to advance
speech and language systems, in part because it will surely need a great deal
of representation of human knowledge and belief and therefore the Companion’s
development would seem to need overall approaches and software architectures
that allow such representations and, ultimately, their derivation from data by
machine learning. This last clause is very important because there has been a
profound methodological shift in speech and language research in the last two
decades. Before that, it was generally assumed that the knowledge of the world
and of language that a machine intelligence required
could be programmed in directly, the content being provided by the researcher’s
intuition. In the case of language, this assumption followed directly from
Chomsky’s [1972] approach to linguistics: that intuitions
about the nature of language can be computed by rules written by experts who
have intuitive knowledge of their (native) language.
All this has now turned out to be false:
no effective systems have ever been built on such principles, nor (outside
machine translation, perhaps) are they ever likely to be. The revolution that
has replaced those doctrines holds that such knowledge, world or linguistic,
must be gained from data by defensible (i.e. non-intuitionistic)
procedures like machine learning.
In the late 1980s symbolic natural
language processing (NLP) was invaded by an empirical and statistical
methodology driven by recent successes in speech processing. The shock troops
of that invasion were the IBM team under Jelinek,
which developed a wholly novel statistical approach to machine translation
(MT), one that was not ultimately successful [see Wilks 1994 for a discussion]
but did better than anyone in conventional MT initially expected, and set in
train a revolution in methodology in NLP as a whole.
Although the IBM team began without any
attention to the symbolic content of linguistic MT, they were forced, by their
inability to beat conventional MT systems in DARPA competitions, to take on
board traditional linguistic notions such as lexicons, morphology and grammar.
But they imported them not from intuitions but in forms such that they could be
learned in their turn, and that fact was the ultimate triumph of their
revolution.
The present situation in dialogue
modeling, such as will be needed for a Companion, is in some ways a replay,
at a lower level, of that titanic struggle. The introduction into ASR of so
called “language models”, which are usually no more than corpus bi-gram
statistics to aid recognition of words by their likely neighbours, has
caused some, like Young [2002], to suggest that simple extensions to current
speech (ASR) methods could solve all the problems of language dialogue modeling.
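To make the phrase “corpus bi-gram statistics” concrete, here is a minimal Python sketch of how neighbour counts can help choose between words that sound alike. The three-sentence corpus and the function names are invented for illustration; real language models are trained on very large corpora and use smoothed probabilities rather than raw counts.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each other word in the corpus."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()  # <s> marks sentence start
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts


def pick(counts, prev, candidates):
    """Choose the candidate word seen most often after the previous word."""
    return max(candidates, key=lambda w: counts[prev][w])


# Toy corpus (an assumption for this example).
corpus = [
    "a ticket to Sheffield please",
    "a ticket to Leeds please",
    "I want two tickets",
]
counts = train_bigrams(corpus)

# The recogniser cannot tell "to", "two" and "too" apart by sound alone,
# so the bigram counts after "ticket" decide.
best = pick(counts, "ticket", ["to", "two", "too"])
```

With this corpus `best` is `"to"`, since only “to” has been seen after “ticket”.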
Young describes a
complete dialogue system seen as what he calls a Partially Observable Markov
process, of which subcomponents can be observed in turn with intermediate
variables and named (in order):
Speech understanding
Semantic decoding
Dialogue act detection
Dialogue management and control
Speech generation
Such titles
are close to conventional for an NLP researcher, e.g. when he intends the third
module as something that can also recognise what we may call the function of an utterance, such as that
it is a command to do something and not a pleasantry. Such terms have been the
basis of NLP dialogue pragmatics for some thirty years, and the interesting
issue here is whether Young’s Partially
Observable Markov Decision Processes are a good level at which to describe
such phenomena, implying as they do that the classic ASR machine learning
methodology can capture the full functionality of a dialogue system, when its
internal structures cannot be fully observed, even in the sense that the waves,
the phones and written English words can be. The analogy with Jelinek’s MT project holds only at its later, revised
stage, when (as we noted earlier) it was proposed to take over the classic
structures of NLP, but recapitulate them by statistical induction. This is, in
a sense, exactly Young’s proposal for the classic linguistic structures
associated with dialogue parsing and control, with the additional assumption,
not made earlier by Jelinek, that such modular
structures can be learned even when there are no distinctive and observable
input-output pairs for the module that would count as data by any classic
definition, since they cannot be word strings but symbolic formalisms like
those that classic dialogue managers manipulate.
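The five modules Young names can be pictured as a simple chain, each stage feeding the next. The Python sketch below is a deliberately hand-written, deterministic toy for a pizza-ordering domain, not a Partially Observable Markov Decision Process and not Young's system; every function body, slot name and canned reply is an assumption made for illustration. Its only purpose is to show the kind of intermediate objects (slots, a dialogue act, a system action) that pass between modules and that are not directly observable in the way words are.

```python
def speech_understanding(audio: str) -> str:
    """Stand-in for ASR: the 'audio' here is already a transcript."""
    return audio.lower().strip()


def semantic_decoding(text: str) -> dict:
    """Toy slot filler: look for a known size and topping among the words."""
    words = text.split()
    sizes = {"small", "medium", "large"}
    toppings = {"mushroom", "ham", "cheese"}
    return {
        "size": next((w for w in words if w in sizes), None),
        "topping": next((w for w in words if w in toppings), None),
    }


def dialogue_act_detection(text: str, slots: dict) -> str:
    """Label the function of the utterance: an order, a greeting or other."""
    if any(slots.values()):
        return "order"
    return "greeting" if "hello" in text.split() else "other"


def dialogue_management(act: str, slots: dict) -> str:
    """Choose the next system action from the act and the filled slots."""
    if act == "order":
        missing = [name for name, value in slots.items() if value is None]
        return f"ask:{missing[0]}" if missing else "confirm"
    return "greet" if act == "greeting" else "clarify"


def speech_generation(action: str, slots: dict) -> str:
    """Turn the chosen action into words (text stands in for speech)."""
    if action == "confirm":
        return f"One {slots['size']} {slots['topping']} pizza, is that right?"
    if action.startswith("ask:"):
        return f"Which {action.split(':')[1]} would you like?"
    if action == "greet":
        return "Hello, what can I get you?"
    return "Sorry, could you say that again?"


def respond(audio: str) -> str:
    """Run one user turn through the five modules in order."""
    text = speech_understanding(audio)
    slots = semantic_decoding(text)
    act = dialogue_act_detection(text, slots)
    action = dialogue_management(act, slots)
    return speech_generation(action, slots)
```

For example, `respond("I want a ham pizza")` asks which size is wanted, because the size slot is still empty. In a learned system each of these hand-written rules would be replaced by a model trained from data, which is exactly where the question of unobservable input-output pairs arises.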
The
intellectual question of whether the methodology of speech research, tried,
tested and successful as it is, can move in and take over the methodologies of
language research may seem to many a completely arcane issue, like ancient
trade union disputes in shipbuilding, say, as to who bored the holes and who
held the drills. But, as with those earlier labour struggles, such questions seem quite
important to the people involved in them and here, unlike shipbuilding, we have
a clash of expertise but no external common-sense referee to come in and give a
sensible decision.
Jelinek’s
original MT strategy was non/anti-linguistic with no intermediate
representations hypothesized between speech input and speech output, whereas
Young assumes roughly the same intermediate objects as linguists but in very simplified
forms. So, for example, he suggests methods for learning to attach Dialogue
Acts to utterances but by methods that make no reference to linguistic methods
for this [known since Samuel et al., 1998] and, paradoxically, Young’s
equations do not make the Dialogue Acts depend on the words in the utterance,
as all linguistic methods do. His overall aim is to obtain training data for
all of them so the whole process becomes a single throughput Markov model, and Young
concedes this model may only be for simple domains, such as, in his example, a
pizza ordering system.
All parties in this
dispute, if it is one, concede the key role of machine learning, and all are
equally aware that structures and formalisms designed at one level can
ultimately be represented in virtual machines of less power but more
efficiency. In that sense, the primal [Chomsky, 1959] dispute between Chomsky
and Skinner about the nature of the human language machine was quite pointless,
since Chomsky’s transformational grammars could be represented, in any concrete
and finite case, such as a human being, as a finite state machine, of the sort
espoused by Skinner.
All that being so,
researchers nonetheless have firm predilections as to
the kinds of design within which they believe functions and capacities can best
be represented, and, in the present case, it is hard to see how the natural clusterings of states that form a topic (such as, for
example, how to build a jet plane, piece by piece) can be represented in finite
state systems. It is equally difficult to see how the human ability to return
in conversation to a previously suspended topic can be represented plausibly in
such a way. But these are all matters that can be represented and processed
naturally in well understood virtual machines above the level of finite state
matrices [see Wilks et al. 2004].
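One familiar structure above the finite state level is a stack, which makes returning to a suspended topic trivial to express. The Python sketch below is offered only as an illustration of that general point; it is not claimed to be the mechanism used in the cited work, and the class, method and topic names are invented.

```python
class TopicStack:
    """Tracks open conversation topics; the most recent one is on top."""

    def __init__(self):
        self._stack = []

    def open(self, topic: str) -> None:
        """Start a new topic, suspending whatever was being discussed."""
        self._stack.append(topic)

    def close(self) -> str:
        """Finish the current topic and resume the one beneath it."""
        return self._stack.pop()

    @property
    def current(self):
        """The topic now under discussion, or None if nothing is open."""
        return self._stack[-1] if self._stack else None


dialogue = TopicStack()
dialogue.open("taking the pills")           # the Companion raises the pills
dialogue.open("grandchildren's visit")      # the user interrupts with another topic
interruption = dialogue.close()             # the digression ends
resumed = dialogue.current                  # back to the suspended topic
```

After the digression is closed, `resumed` is `"taking the pills"`. Because the stack can grow to any depth, it records nested interruptions without a separate state having to be enumerated in advance for every possible combination of suspended topics.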
There is no suggestion
that a proper or adequate discussion of Young’s views has been given here, only
a plea that machine learning must be possible over more linguistically adequate
structures than finite state matrices if we are to be able to represent, in a
perspicuous manner, the sorts of belief, intention and control structures that
complex dialogue modeling will need; it cannot be enough to always limit
ourselves to the simplest applications on the grounds, as Young puts it, that
“the typical system S will typically be intractably large and must be
approximated”. In the end, the case put here may be no more than that the
structures we use to represent our language, including to machines, must be
comprehensible to us as humans.
The Semantic Web
Mention has been made
earlier of the new form [Berners-Lee et al., 2001] of the WWW as envisaged by
Berners-Lee and colleagues to follow his original conception. This is a large
topic and suitable for a separate paper [e.g. Wilks, 2006] and can be seen in
two quite different ways: first, as the existing WWW but augmented by
annotations on the items of all the texts it contains, so as to give more
direct access to the meaning content of the texts.
On this view, the
Semantic Web (SW) is an outgrowth both of language technologies, as described
above, and of their notion of augmentations, which is partly inherited from
initiatives in the Humanities (e.g. the Text Encoding Initiative, [see TEI]). These
annotations could be seen as imposing a “point of view” on the SW, so that, for
example, it might be possible to use the annotations to prevent me seeing any
web pages incompatible with The Koran, and that might be an Internet-for-me
that I could choose to have. But there is no reason why such an annotated web
should necessarily, as some have argued [e.g. Nelson,
2005], impose a unique point of view. The technology of annotations is quite
able to record two quite separate sets of annotation data (as meta-data) for the same
texts, and no uniformity of point of view is either necessary or desirable.
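The point that one text can carry several independent sets of annotations is easy to see if the annotations are held apart from the text and refer to it only by character offsets. The Python sketch below illustrates this under stated assumptions: the sentence, the two layer names and the labels are all invented, and no particular annotation standard is being reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A label attached to the span text[start:end]; the text is untouched."""
    start: int
    end: int
    label: str


text = "Sheffield is north of London"

# Two separate annotation layers over the same text (both hypothetical),
# each reflecting a different point of view on what the text is about.
layers = {
    "geography": [Annotation(0, 9, "City"), Annotation(22, 28, "City")],
    "football": [Annotation(0, 9, "ClubHomeTown")],
}


def view(text: str, layers: dict, layer: str) -> list:
    """Return (span, label) pairs as seen through one annotation layer."""
    return [(text[a.start:a.end], a.label) for a in layers[layer]]


geo_view = view(text, layers, "geography")
football_view = view(text, layers, "football")
```

Each layer yields its own reading of the same unchanged sentence, and further layers can be added without disturbing those already present.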
The second view of the
SW, and one that Berners-Lee prefers, is that of an Internet whose content is
accessible to Agents, partly through annotations and partly through data-bases
whose semantics are well-known and understood. These agents [see e.g. Walton,
2006] operate on the Internet and provide services to customers, such as
updating their diaries, finding cheap gas supplies etc. Such agents are
therefore rather different from the concept of Companions, for they are transitory, and not designed for a permanent relationship
with an owner based on extensive knowledge about the owner. One should note
here, however, that contemporary work on the SW [e.g. Bontcheva
et al., 2003, Ciravegna
et al., 2003] has no need to choose
between these two sources and functionalities I have distinguished, but rather
seeks to combine both.
A third strand in the
genesis of the SW is that of traditional AI itself and its long and honourable tradition of modeling reasoning, planning and
knowledge representation. Some would argue the SW is no more than a weaker form
of AI which has sacrificed representational power to gain a system that works
on a large scale.
Companions will draw on
all these strands in the SW as well as that of the ECAs,
or Embodied Conversational Agents [see e.g. Ruttkay
and Pelachaud, 2004], although these have
conventionally been conceived of not in language terms but in graphical,
avatar, glance, expression and presence terms, i.e. with the emphasis on the
visual, whereas the Companion is fundamentally an agent that establishes a
relationship through talking, with all that entails in terms of politeness,
emotion, personality and how those slippery but real concepts can be modeled in
automata. But again, none of these borderlines is firm, whether between ECA and Companion or between SW
Agent and ECA, and most of the questions and technologies touched on in this paper
apply not only to possibly permanent Companions but to a whole range of
interactions with the Internet, from pseudo-boyfriends and –girlfriends, to
recent results on determining and simulating author personalities in weblog texts [see Oberlander and Nowson, 2006].
In the coming decade the European Commission is planning huge
investments in all these technologies under its Information Society
Technologies (IST) program, and the edges of this research and the barriers to
its advance should be much clearer during the coming Seventh Framework Programme [The COMPANIONS project will be supported 2006-2010
as the Integrated Project IST-34434: Intelligent, Persistent,
Personalised Multimodal Interfaces to the Internet].
References
Ballim, A., and Wilks, Y. (1991). Artificial Believers: the ascription of belief. Lawrence Erlbaum, Hillsdale NJ.
Berners-Lee, T., Hendler, J., and Lassila, O. (2001). The Semantic Web. Scientific American.
Bontcheva, K., and Cunningham, H. (2003). Information Extraction as a Semantic Web Technology: Requirements and Promises. Adaptive Text Extraction and Mining workshop.
Chomsky, N. (1959). Review of Skinner's Verbal Behavior. Language 35: 26-58.
Chomsky, N. (1972). Language and Mind. Harcourt Brace, New York.
Ciravegna, F. (2003). Designing adaptive information extraction for the Semantic Web in Amilcare. In S. Handschuh and S. Staab (eds.), Annotation for the Semantic Web, Frontiers in Artificial Intelligence and Applications. IOS Press.
Cole, R., Mariani, J., Uszkoreit, H., Varile, N., Zaenen, A., Zampolli, A., and Zue, V. (1998). Survey of the State of the Art in Human Language Technology. Cambridge University Press.
Ferguson, C. H. (2005). What’s Next for Google. MIT Technology Review: http://www.technologyreview.com/articles/05/01/issue/ferguson0105.asp?trk=nl
Flickr: http://www.flickr.com/
Memories for Life and Photocopains: http://www.memoriesforlife.org/
Oberlander, J., and Nowson, S. (2006). Whose thumb is it anyway? Classifying author personality from weblog text. http://www.hcrc.ed.ac.uk/~jon/papers/drafts/pc8.pdf
Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Technologies Project.
Ramchurn, S. D., Huynh, D., and Jennings, N. R. (2004). Trust in multiagent systems. The Knowledge Engineering Review 19(1).
Ruttkay, Z., and Pelachaud, C. (2004). Evaluating Embodied Conversational Agents. Kluwer, Berlin.
Samuel, K., Carberry, S., and Vijay-Shanker, K. (1998). Dialogue Act Tagging with Transformation-Based Learning. In Proc. COLING98, Montreal.
TEI: http://www.tei-c.org/
Walton, C. (2006). Agents and the Semantic Web. Oxford University Press, Oxford.
Wilks, Y. (1994). Stone Soup and the French Room: the empiricist-rationalist debate about machine translation. Reprinted in Zampolli, Calzolari and Palmer (eds.), Current Issues in Computational Linguistics: in honor of Don Walker. Kluwer, Berlin.
Wilks, Y., Webb, N., Setzer, A., Hepple, M., and Catizone, R. (2004). Machine Learning approaches to human dialogue modelling. In Kuppevelt and Smith (eds.), Current and New Directions in Discourse and Dialogue. Kluwer, Berlin.
Wilks, Y. (submitted 2006). The Semantic Web and the Apotheosis of annotation. Journal of Web Semantics.
Young, S. (2002). Talking to machines (statistically speaking). Proc. ICSLP02.