1. Team
2. Project Tasks
3. Deliverables
MOQA (Meaning-Oriented Question
Answering) is a project sponsored by ARDA under its AQUAINT
program.
1. Team
This project is carried out
by a team consisting of New Mexico State University’s Computing
Research Laboratory (CRL) of Las Cruces, NM, the
Institute for Language and Information Technologies,
University of Maryland Baltimore County (ILIT) and CoGenTex,
Inc. of Ithaca, NY. It is a comprehensive system-level
effort covering all three main technical areas of the
ARDA AQUAINT program: question understanding and interpretation,
determining the answer and presenting the answer. The
first complete version of the system concentrates on
answering questions about travel and meetings and uses
two kinds of data sources—open text (in English,
Arabic and Persian) and a structured database, called
a fact database (Fact DB) whose entries are instances
of concepts in the system’s ontology, or world
model. The Fact DB, the ontology and the lexicons for
the three languages, together with the working memory
that includes the intermediate results of the system
operation, are the major static knowledge sources in
the system under development. The top-level architecture
of the system is illustrated in Figure 1:

Figure 1. Architecture of the System
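The relationship between the Fact DB and the ontology described above can be illustrated with a minimal sketch. It assumes a toy ontology with a single-parent hierarchy; the class names (`Concept`, `Fact`), the concept names, the property names and the helper `instances_of` are all hypothetical and are not taken from the MOQA system itself.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A node in a toy ontology; `parent` is None for the root."""
    name: str
    parent: "Concept | None" = None

    def is_a(self, other: "Concept") -> bool:
        # Walk up the parent chain looking for `other`.
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


@dataclass
class Fact:
    """A Fact DB entry: an instance of a concept with filled properties."""
    fact_id: str
    concept: Concept
    properties: dict = field(default_factory=dict)


# Hypothetical concept names, chosen only for illustration.
EVENT = Concept("EVENT")
TRAVEL_EVENT = Concept("TRAVEL-EVENT", parent=EVENT)
MEETING = Concept("MEETING", parent=EVENT)

# Hypothetical facts; the property values are placeholders.
fact_db = [
    Fact("TRAVEL-EVENT-1", TRAVEL_EVENT, {"destination": "Cairo"}),
    Fact("MEETING-1", MEETING, {"location": "Cairo"}),
]


def instances_of(concept, facts):
    """Return the facts whose concept is `concept` or one of its descendants."""
    return [f for f in facts if f.concept.is_a(concept)]
```

In this sketch, asking for instances of a general concept also returns facts recorded under its more specific descendants, which is one reason to tie a fact store to an ontology rather than to flat tables.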
The general strategy for
the development of this system is rapid prototyping,
that is, developing a working system in a short period
of time, in order to be able to test and evaluate it
as a whole. We do not believe that evaluating component
routines and algorithms of a comprehensive application
such as QA is a cost-effective pursuit. Therefore, once
the basic system is assembled, we will work on adding
new domains, new knowledge and new types of processing
to it: our development approach facilitates rapid portability
to other domains. We will be evaluating the performance
of the complete system continuously by comparing the
time and effort it takes an analyst to perform a task
with and without the system. The development of an unconstrained
system of this kind will take much more time and resources
than can be made available through the AQUAINT program,
certainly more than in the initial two years.
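The whole-system comparison described above (analyst time with and without the system) could be summarized along the following lines. This is only a sketch under stated assumptions: the `TaskRun` record, the choice of relative time saving as the measure, and the numbers are invented placeholders, not measurements or the project's actual evaluation protocol.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """Time an analyst spent on one task, without and with the system."""
    task_id: str
    minutes_without: float
    minutes_with: float


def relative_saving(run):
    """Fraction of the unaided time saved when the system is used."""
    return (run.minutes_without - run.minutes_with) / run.minutes_without


def mean_saving(runs):
    """Average relative saving over a set of tasks."""
    return sum(relative_saving(r) for r in runs) / len(runs)


# Placeholder values for illustration only; not experimental results.
runs = [
    TaskRun("task-a", minutes_without=60.0, minutes_with=45.0),
    TaskRun("task-b", minutes_without=40.0, minutes_with=30.0),
]
```

A measure of effort (for example, the number of queries or documents an analyst has to read) could be tracked in the same record if it is collected.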
The CRL/ILIT/CoGenTex team has had a running start on this project, as collectively
we have in the past developed a significant percentage of the resources, processors,
formalisms and control architectures required for the MOQA system. This state
of affairs is summarized in Figure 2:

Figure 2. Development status of the components of the proposed system.
2. Project Tasks
The work on the project involves the following tasks.
Task 1: Design and Implementation of System Architecture. This task
involves integrating all the required system components available to our team
from the previous projects, developing a testing and debugging environment and
continuous integration and testing of new and expanded system modules.
Task 2: Knowledge Acquisition. This task
involves acquiring the goal component of the ontology (size
estimate: 25 concepts); extending its plan/script (“complex
event”) component to include both domain scripts and
workflow scripts whose instantiations in the Fact DB and
the extended text meaning representation (TMR) will, inter
alia, encode dialog history and context, user profiles, the
status of goal attainment, etc. (size estimate: 1,000 concepts);
acquiring semantic lexicons for Arabic and the third language
(Persian, Russian or Spanish—in the case of Spanish,
CRL already has a semantic lexicon); expanding the semantic
lexicon for English (the target size of each lexicon in Phase
I is set at 20,000 lexical units); adapting and further developing
a module (first developed in the TIDES CREST effort) for
ontology-based automatic acquisition of Fact DB elements;
and populating the Fact DB (size estimate, for the travel
and meetings domain in Phase I: 100,000 facts; size estimate,
for the workflow, user profile, user intention and QA context-related
facts in Phase I: 1,000 facts).
Task 3: Question Understanding. This task
includes improving the coverage and quality of the preprocessing
modules, especially, the tokenizers and the syntactic analyzers
for each language involved—a usable version of each
of the preprocessing modules exists at CRL for each of the
languages mentioned in the proposal (and for many others!);
making coverage and quality adjustments and enhancements to the
Mikrokosmos semantic analyzer, with special attention paid
to co-reference and treatment of unattested lexical items;
testing and evaluating semantic analysis throughput for texts
in all three languages (a reminder: while the analyst/system
dialog will be conducted in English, open text IE will be
carried out in each of the system’s languages, which
then necessitates translation of results to the TMR form).
Task 4: Question Interpretation. This task
uses knowledge (stored in the Fact DB and/or in the extended
TMR) about dialog context (current and past), about the user,
about the user intentions (goals) and the status of the tasks
to present a complete view of the state of affairs in the
process of task completion and dialog communication; the
decision about what action(s) the system must take at this
juncture in the dialog and task completion is also made at
this stage.
Task 5: Answer Determination. The decisions
made at the previous step will be carried out during answer
determination. The actions may involve looking for information
in either of two kinds of sources: open text (the TREC
TDT corpus will be used for this) and a structured Fact DB.
They will also, centrally, include the maintenance of relevant
dialog with the user: the system will carry out a running
commentary on its own actions; it will also ask clarification
questions, make judgments about task priorities and the order
in which they are attempted, etc. Work on open text answer
generation involves the task of generating queries in any
of the three languages from the extended TMR obtained through
question understanding and interpretation; this task also
involves testing the available IR and IE systems, integrated
in Task 1; and the translation of the results of IE (template
slot fillers) into the language of TMR.
Task 6: Answer Formulation. This task is
the main text generation task in the system. It involves
generating a hypertext response to the user query as well
as generating the running commentary of the system’s
operations, decisions, and inferences. The output language
will be English.
Task 7: Documentation; User, Tester and Evaluator
Training; Testing; and System Evaluation. This set
of tasks will be ongoing over the entire duration of the
project. Evaluations of a complete system will be prepared
and run at the end of months eight and sixteen and at the end of
the project. The complexity of the system and the limited
amount of resources that can be made available make the
formal evaluation of individual components of the system
cost-inefficient.
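The flow through Tasks 3 to 6 (understanding, interpretation, answer determination, answer formulation) can be pictured as a chain of stages that share one state object and append to a running commentary. The sketch below is a toy illustration under heavy assumptions: `DialogState`, the four stage functions, the single "lookup or clarify" decision rule and the dictionary standing in for the Fact DB are all hypothetical, and no semantic analysis, IR or IE is performed.

```python
from dataclasses import dataclass, field


@dataclass
class DialogState:
    """Toy stand-in for the extended TMR plus dialog context."""
    question: str
    history: list = field(default_factory=list)
    action: str = ""
    answer: str = ""
    commentary: list = field(default_factory=list)


def understand(question):
    # Task 3 stand-in: normalize case and whitespace instead of real analysis.
    return DialogState(question=" ".join(question.lower().split()))


def interpret(state, history):
    # Task 4 stand-in: record context and decide what to do next.
    state.history = list(history)
    state.action = "lookup" if state.question else "clarify"
    state.commentary.append(f"Decided on action: {state.action}")
    return state


def determine(state, fact_db):
    # Task 5 stand-in: carry out the decision; only the Fact DB is consulted.
    if state.action == "lookup":
        state.answer = fact_db.get(state.question, "")
        source = "Fact DB" if state.answer else "no source"
        state.commentary.append(f"Answer found in: {source}")
    return state


def formulate(state):
    # Task 6 stand-in: produce the English response.
    if state.action == "clarify":
        return "Could you rephrase the question?"
    return state.answer or "No answer found."


def answer_question(question, history, fact_db):
    state = understand(question)
    state = interpret(state, history)
    state = determine(state, fact_db)
    return formulate(state), state.commentary


# Hypothetical one-entry Fact DB keyed by normalized question text.
toy_fact_db = {"where is the meeting?": "The meeting is in Cairo."}
```

Calling `answer_question("Where is  the meeting?", [], toy_fact_db)` returns the stored answer together with the commentary lines produced along the way, while an empty question leads to a clarification request.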
3. Deliverables
At the end of the project, the CRL/ILIT/CoGenTex team plans to deliver the
following components:
• a comprehensive, self-aware, goal-and-plan-based, context-sensitive,
ontological-semantic QA system in the domain of travel and meetings, with a capability
to search for information in open texts in three languages and in a structured,
ontology-based Fact DB;
• an enhanced text analysis system for each of the languages;
• a question interpretation module that takes into account user goals and
the context of the dialog, as well as awareness of the quality of intermediate
and final results and the rate of progress toward a goal;
• an integrated IR/IE module working on open text in three languages, on
the basis of ontologically defined extraction templates;
• a decision-making module that determines the answer(s) and action(s)
that the system must produce at each step of the dialog/task processing;
• an ontological-semantic text generation module;
• an enhanced ontology of about 6,500 concepts;
• an enhanced Fact DB of about 100,000 facts;
• a system for automating the acquisition of the Fact DB;
• a semantic lexicon for each of the languages in the system, at about
20,000 entries each;
• a set of system evaluation results;
• a final technical report describing the system;
• a user manual for the system.
The CRL/ILIT/CoGenTex team expects that at least a subset of the above deliverables
will be included in the integrated testbed demonstrations and evaluations.