Research
Basic research at ILIT falls
largely in the realm of knowledge-based natural language
processing. We continue to develop the theory and applications
of ontological semantics. Within
this general framework, we work on a variety of microtheories,
including basic semantic
dependency extraction, methods for ambiguity resolution
(knowledge-based and corpus-based, syntactic and semantic,
static or dynamic), non-literal language, recovery from
unexpected input, discourse relations, reference
resolution, aspect, modality,
semantics of modifiers, lexical and compositional semantics
of closed-class lexical items, and many others.
- Ontological Semantics
Ontological semantics studies
the processes of extracting, representing and manipulating
meaning in natural language texts.
Computational processing in ontological semantics relies on a language-independent
ontology, an ontology-related lexicon (and onomasticon, or lexicon of proper
names) for each language involved and a fact repository consisting of instances
of ontological concepts as well as remembered text meaning representations.
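The static knowledge sources listed above can be pictured as a handful of linked data structures. The following is only a minimal sketch: the class names, field names, and the sample entries (`HUMAN`, `HUMAN-1`, and so on) are assumptions made for illustration and do not reflect the actual formats used in the systems named on this page.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """A node in the language-independent ontology (hypothetical layout)."""
    name: str
    parent: Optional[str] = None  # IS-A link; None for the root concept


@dataclass
class Fact:
    """A remembered instance of an ontological concept."""
    fact_id: str
    concept: str
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class KnowledgeBase:
    ontology: Dict[str, Concept] = field(default_factory=dict)
    # One lexicon and one onomasticon per language, keyed by language code.
    lexicons: Dict[str, Dict[str, str]] = field(default_factory=dict)      # word -> concept
    onomasticons: Dict[str, Dict[str, str]] = field(default_factory=dict)  # proper name -> fact id
    # The fact repository holds concept instances and remembered
    # text meaning representations (stored here as plain dicts).
    fact_repository: Dict[str, Fact] = field(default_factory=dict)
    remembered_tmrs: List[dict] = field(default_factory=list)

    def is_a(self, concept: str, ancestor: str) -> bool:
        """Walk IS-A links upward from `concept` looking for `ancestor`."""
        current: Optional[str] = concept
        while current is not None:
            if current == ancestor:
                return True
            current = self.ontology[current].parent
        return False


# Illustrative entries only.
kb = KnowledgeBase()
kb.ontology["OBJECT"] = Concept("OBJECT")
kb.ontology["HUMAN"] = Concept("HUMAN", parent="OBJECT")
kb.lexicons["en"] = {"person": "HUMAN"}
kb.onomasticons["en"] = {"Colin Powell": "HUMAN-1"}
kb.fact_repository["HUMAN-1"] = Fact("HUMAN-1", "HUMAN", {"name": "Colin Powell"})
```

Keeping the ontology separate from the per-language lexicons mirrors the division described above: adding a language means adding a lexicon and an onomasticon, not a new ontology.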
The major R&D projects associated
with this approach included KBMT-89, Pangloss, Mikrokosmos,
CREST and MOQA.
A recent slide
presentation can provide
a medium-depth introduction to ontological semantics.
A book-length introduction to
ontological semantics, Ontological Semantics,
by Sergei Nirenburg and Victor Raskin, will be published
in September 2004 by MIT Press.
- Treatment of Reference
It is said that during Napoleon’s
march from Elba to Paris at the beginning of his “One
Hundred Days” in 1815, the successive headlines
in Paris newspapers ran something like:
The Corsican Monster Lands at Toulon
The Usurper Marches North
Bonaparte Reaches Lyons
Ex-Emperor in Fontainebleau
Paris Welcomes His Imperial Majesty
It might be difficult to attempt
to build a computer system that would emulate the nimbleness,
inventiveness and political adroitness of the Parisian
editors. However, it is within our reach to build a system
that would determine that the text elements in boldface
all refer to the same person, and that this person is
Emperor Napoleon I Bonaparte.
We understand processing reference
in NLP as finding all referring expressions in a text
or a corpus and associating them with the representation
of real-world entities or events. This definition implies
that coreference—that is, the search for surface
antecedents—is only a means to a more fundamental
end. We propose a novel approach to the treatment of
reference in NLP that extends the current state of the
art in both breadth and depth. It covers a broader array
of phenomena and uses deeper – but available or
attainable – sources of knowledge to power its
heuristic algorithms than any extant approach. The proposed
algorithms for treatment of reference will be incorporated
into an existing natural language analysis system (Mahesh et
al. 1997, Beale et al. 2002) that includes
semantic analysis and produces meaning representations
of input texts with the help of a formal ontology.
Our procedure for reference treatment
addresses all the types of referring expressions and
consists of two components, detection and resolution.
Detection consists of the following three tasks: a) determining
which objects and events have referential function (not
all do, as in My son is a doctor);
b) categorizing the referential ones, of which there
are many subclasses (as shown in Figure 1); and c) detecting
elliptical references. Resolution then finds conceptual
references for the expressions, possibly using textual
antecedents as clues.

Figure 1. Types of expressions.
We will illustrate the types of
referring expressions with examples drawn from the following
text, taken at random from the CNN website (it is a typical
text and it demonstrates how important it is to be able
to treat reference adequately). In the text itself, for
purposes of illustration, we marked just two of the many
coreference chains in the text (the one referring to
the Afghan foreign minister, in bold; the one referring
to the Afghan people, in italics).
WASHINGTON (CNN) -- Afghanistan's
interim foreign minister expressed optimism
Saturday that his nation can rebuild
after more than two decades of conflict, provided
that the international community remains committed
to supplying support. “What we need is continued
engagement from the United States, first of all,
in the war against terror, which will help stability
in Afghanistan and the whole region ... and also
in the reconstruction efforts of our people,” Abdullah
Abdullah told CNN. “It is a major
challenge. We are aware of it.” “What
is going on in the political process is a transition
from war to peace. After 22 years of war, we have
won the war, virtually, and we have to
win the peace,” Abdullah said. “It
is rebuilding the state from scratch in all aspects
of it – political, economical, from the infrastructure
point of view, cultural, social. It is an enormous
task. But I'm sure the Afghans will do
it with the support of the international community,” he said. Abdullah is
in Washington to prepare for a visit by interim
Afghanistan chairman Hamid Karzai, who is scheduled
to meet with President Bush Monday, his first official
meeting with Bush since assuming control after
the fall of the Taliban regime. On Friday, Abdullah met
with Secretary of State Colin Powell and National
Security Advisor Condoleezza Rice. Powell, who
visited Kabul, the Afghan capital, this month,
vowed that the United States would stand by the
Afghan people. Abdullah also
gave the Council on Foreign Relations an outline
of Afghanistan's reconstruction plan to rebuild
the devastated country. He told
the group that the interim administration is developing
a constitution for Afghanistan and will make substantial
efforts to include women and the nation's various
ethnic groups in the government. Members of the
commission that will organize the tribal council
or Loya Jirga, whose task is to choose a transitional
government at mid-year, were announced Friday.
Women are included among the commission's members. “The
opportunity is there,” Abdullah said
Saturday. “We were optimistic even
before September 11 when there were no opportunities
and we were trying hard, struggling hard,
to create that opportunity,” he said. “We,
as Afghans, have to seize it, and have
to seize it quickly, and our friends should
support us. Together we can make
it.”
Direct reference
is referring to an object or an event by its basic name.
For people, this will typically be their full name (Abdullah
Abdullah, the Afghans), their full name expanded
by a description (interim Afghanistan chairman Hamid
Karzai, Secretary of State Colin Powell) or a canonical
abbreviation (President Bush, Bush, Powell, Abdullah).
For organizations and places, this will typically be
their full name (the United States, the Council on
Foreign Relations, Loya Jirga) or a known acronym or short form
(CNN, Washington [for Washington, DC]). For
events, this will typically be their full name (the
war against terror) or a known abbreviation (September
11). One can view these expressions as keys for
the database records for their referents. These referring
expressions can in some cases be ambiguous (e.g., if
the database contains more than one Abdullah Abdullah).
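The view of direct references as database keys, including the possibility of ambiguity, can be sketched as a lookup that returns every matching record rather than a single one. The class name `FactDatabase`, the identifiers `HUMAN-17` and `HUMAN-42`, and the case-insensitive matching are all assumptions made for this illustration.

```python
from collections import defaultdict
from typing import Dict, List


class FactDatabase:
    """Toy name index: each name or abbreviation maps to one or more record ids."""

    def __init__(self) -> None:
        self._by_name: Dict[str, List[str]] = defaultdict(list)

    def add(self, fact_id: str, *names: str) -> None:
        """Register a record under its full name and any canonical abbreviations."""
        for name in names:
            self._by_name[name.lower()].append(fact_id)

    def lookup(self, expression: str) -> List[str]:
        """Return all candidate records; more than one means the key is ambiguous."""
        return list(self._by_name.get(expression.lower(), []))


db = FactDatabase()
# Two hypothetical people who share the same full name.
db.add("HUMAN-17", "Abdullah Abdullah", "Abdullah")
db.add("HUMAN-42", "Abdullah Abdullah")

candidates = db.lookup("Abdullah Abdullah")
is_ambiguous = len(candidates) > 1
```

An empty result signals a referent not yet known to the database, and a result with several entries signals an ambiguity that later processing has to settle.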
All other referring expressions
are indirect. They subdivide into descriptions
and pointers. Descriptions denote their
referents by mentioning some of their non-key properties.
They can be definite (e.g., Afghanistan's interim
foreign minister, the international community, the Afghan
capital) or indefinite (a transition from war
to peace, continued engagement from the United States).
Unlike descriptions, pointers contain just
enough information to let hearers reconstruct which
referring expressions they point to. Pointers can
be further subdivided into textual pointers (those
that typically point to coreferents in the text itself,
like he) and deictic pointers (pointing
to objects in the “universe of discourse”,
that is, to some expected properties of facts – e.g.,
time, like at mid-year; space, like here;
identity of the speaker and hearer, like we;
etc.). People are adept at resolving references in well-constructed
texts. Our task is to build a computer program that emulates
that capability.
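The classification described in the prose above can be written down as a small enumeration. This sketch is reconstructed from the text only, so it may omit subclasses shown in Figure 1; the names `RefType` and `broad_class` are hypothetical, and the example labels are hand-assigned from the examples quoted above, not produced by a classifier.

```python
from enum import Enum


class RefType(Enum):
    DIRECT = "direct"
    DEFINITE_DESCRIPTION = "definite description"
    INDEFINITE_DESCRIPTION = "indefinite description"
    TEXTUAL_POINTER = "textual pointer"
    DEICTIC_POINTER = "deictic pointer"


DESCRIPTIONS = {RefType.DEFINITE_DESCRIPTION, RefType.INDEFINITE_DESCRIPTION}
POINTERS = {RefType.TEXTUAL_POINTER, RefType.DEICTIC_POINTER}


def is_indirect(ref_type: RefType) -> bool:
    """Everything that is not a direct reference is indirect."""
    return ref_type is not RefType.DIRECT


def broad_class(ref_type: RefType) -> str:
    """Collapse the fine-grained type into direct / description / pointer."""
    if ref_type in DESCRIPTIONS:
        return "description"
    if ref_type in POINTERS:
        return "pointer"
    return "direct"


# Hand-labeled examples taken from the sample text.
EXAMPLES = {
    "Abdullah Abdullah": RefType.DIRECT,
    "the Afghan capital": RefType.DEFINITE_DESCRIPTION,
    "a transition from war to peace": RefType.INDEFINITE_DESCRIPTION,
    "he": RefType.TEXTUAL_POINTER,
    "at mid-year": RefType.DEICTIC_POINTER,
    "we": RefType.DEICTIC_POINTER,
}
```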
Detecting ellipsis involves
locating syntactic gaps as well as semantically incomplete
structures. In English, syntactic gaps include such things
as elided verbs in gapping structures (Mary likes
politics but Bill Ø only sports), elided
VPs (Mary wants to watch CNN but Bill doesn’t Ø ),
and elided head nouns after modifiers (Mary watched
CNN for 40 minutes and Bill for only five Ø).
Semantically incomplete structures are found in phrases
like continued engagement from the United States,
where the full interpretation of the term engagement
requires the addition of modifiers like military
and peace-keeping.
The output of the detection step in reference treatment is, then, a list of
all referring expressions, marked by their type, that were either overtly present
in the text or were introduced in it through the detection of ellipsis.
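As an illustration of what such a list might look like, the sketch below records each expression with its type and its position in the text, using a missing position to mark material introduced through ellipsis. The record layout, the type labels, and the hand-built list are assumptions; the detection step itself is not implemented here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DetectedExpression:
    text: str                         # surface string, or reconstructed elided material
    ref_type: str                     # illustrative label, not a fixed inventory
    span: Optional[Tuple[int, int]]   # character offsets; None if introduced via ellipsis

    @property
    def elided(self) -> bool:
        return self.span is None


def overt(sentence: str, text: str, ref_type: str) -> DetectedExpression:
    """Build a record for an expression that is overtly present in the sentence."""
    start = sentence.index(text)
    return DetectedExpression(text, ref_type, (start, start + len(text)))


# Elided-VP example from the text; the list is written by hand.
sentence = "Mary wants to watch CNN but Bill doesn't"
detected: List[DetectedExpression] = [
    overt(sentence, "Mary", "direct"),
    overt(sentence, "CNN", "direct"),
    overt(sentence, "Bill", "direct"),
    DetectedExpression("watch CNN", "elided VP", None),
]
```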
Once all referring expressions
are detected and classified, they must be resolved.
We use the term ‘resolve’ in a broader sense
than is typical in the literature: for us, resolution
means that all referring expressions must ultimately
be associated with representations of objects or events,
not only put in a coreference relation with another text
element. In our approach, the representations are stored
in a fact database and the ontology (see below). Direct
referring expressions are resolved through a direct link
to the relevant database entry (or, if none exists yet,
to a newly created one), whereas indirect referring
expressions require specialized processing by type. Definite
and indefinite descriptions, e.g., Afghanistan's
interim foreign minister, must be linked to the
expression corresponding to a database key, e.g., Abdullah
Abdullah. All pointers and ellipses must be expanded
into their full referential form based upon the establishment
of a coreference chain within the text (he -> Abdullah
Abdullah) or extra-textual information (mid-year
-> the middle [perhaps May, June, July, August] of
the year 2002). Once expanded, they too must either be
linked to a database entry or initiate a new one.
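The link-or-create behavior described above can be sketched as follows. This is a simplified illustration under strong assumptions: the hard part, finding the full referential form of an indirect expression, is taken as given and passed in as `expansion`. The names `FactDB`, `resolve`, and the `ENTRY-n` identifiers are hypothetical.

```python
from typing import Dict, Optional


class FactDB:
    """Toy fact database keyed by the full referential form of each entity."""

    def __init__(self) -> None:
        self.entries: Dict[str, Dict[str, str]] = {}
        self.keys: Dict[str, str] = {}  # key expression -> entry id

    def link_or_create(self, key: str) -> str:
        """Return the entry for `key`, creating a new one if none exists yet."""
        if key not in self.keys:
            entry_id = f"ENTRY-{len(self.entries) + 1}"
            self.entries[entry_id] = {"name": key}
            self.keys[key] = entry_id
        return self.keys[key]


def resolve(expression: str, ref_type: str, db: FactDB,
            expansion: Optional[str] = None) -> str:
    """Associate a referring expression with a database entry.

    Direct references are used as keys themselves. Descriptions, pointers,
    and ellipses need `expansion`, the full form assumed to come from a
    coreference chain or from extra-textual information.
    """
    if ref_type == "direct":
        return db.link_or_create(expression)
    if expansion is None:
        raise ValueError(f"no full referential form available for {expression!r}")
    return db.link_or_create(expansion)


db = FactDB()
abdullah = resolve("Abdullah Abdullah", "direct", db)
minister = resolve("Afghanistan's interim foreign minister",
                   "definite description", db, expansion="Abdullah Abdullah")
he = resolve("he", "textual pointer", db, expansion="Abdullah Abdullah")
karzai = resolve("Hamid Karzai", "direct", db)
```

In this sketch all three expressions referring to the foreign minister end up at the same entry, while the first mention of Hamid Karzai initiates a new one.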