Projects
>> BOAS
Most natural language processing (NLP) applications seek to analyze every text element as a combination of lexical meaning and grammatical features, as applicable. Cross-linguistically, many types of entities (stems, inflectional affixes, derivational affixes, etc.) can singly or in combination form a text element, and any given language uses some subset of these. Creating inventories of such entities is more typical of descriptive, typological and, to a lesser degree, theoretical linguistics than of NLP: after all, most NLP systems are built to cover some specific language(s) to whatever extent is required by the given application. However, if one’s goal is eliciting knowledge about any natural language for use in an NLP application, creating a comprehensive cross-lingual inventory of types of text elements and their composite entities becomes an essential preliminary stage of work. Once an inventory of this kind is established, one must develop a practice-oriented approach to organizing linguistic reality, a methodology of knowledge elicitation, and a scheme for turning elicited knowledge into processing rules. All of these challenges were met in the development of Boas, a linguistic knowledge-elicitation (KE) system built by a research team led by the PIs at NMSU CRL in 1997-2001.
Our goal in Boas I was to build
a KE system to guide a linguistically naïve speaker
of any alphabetic language (L) through the process of providing
sufficient information about L to support the automatic
ramping up of an L-to-English machine translation (MT)
system. This KE system must elicit from the user information
about the ecology (writing system, orthographic conventions,
punctuation, etc.), morphology and syntax of L, as well
as a large bilingual lexicon. The entire elicitation environment,
training materials, and means of converting the elicited
information into operational static knowledge resources
for the MT system must be specified and developed from
the outset, with no language-specific adjustments or retrofitting.
In other words, all phenomena from all natural languages
must (to the extent feasible) be covered, the collected
information must be automatically convertible into processing
resources, and the elicitation process must be understandable
to an untrained informant.
Given an initially untrained user,
the methodological initiative and a large degree of the
responsibility for coverage must rest with the system itself.
As the technological solution to the above puzzle should
be practical, the informant's time must be used efficiently.
If time were not a factor and resources were truly unlimited,
one could resort to listing many things—like inflectional
and productive derivational forms of each word—rather
than generalizing by rules. However, in the real world
the informant’s time is a concern, so the listing
option is used judiciously in Boas. To enhance the utility
of the system in practical applications, the target KE
time was set at six months, which can be increased or decreased
as resources allow. The common working language of the
interface is English, which not only permits some degree
of English-orientation in KE (e.g., using English seed
lexicons to drive lexical acquisition and preparing resident
transfer rules), but also facilitates the preparation of
a vast apparatus of training and reference materials, which
amount to an on-line introduction to descriptive linguistics.
In order to lead the informant
through the process of supplying the necessary information
in a directly usable way, Boas must be supplied with resident
(meta)knowledge about language – not L, but language
in general – which is organized into a typologically
and cross-linguistically motivated inventory of parameters,
their potential value sets, and modes of realizing the
latter. The inventory takes into account phenomena observed
in a large number of languages. Particular languages would
typically feature only a subset of parameters, values and
means of realization. The parameter values employed by
a particular language, and the means of realizing them,
differentiate one language from another and can, in effect,
act as the formal “signature” of the language.
Examples of parameters, values and realizations that play
a role in the Boas knowledge-elicitation process are shown
in Table 1. The first block illustrates inflection, the
second, closed-class meanings, the third, ecology and the
fourth, syntax.
| Parameter | Values | Means of Realization |
|---|---|---|
| Case Relations | nominative, accusative, dative, instrumental, abessive, etc. | flective morphology, agglutinating morphology, isolating morphology, prepositions, postpositions, etc. |
| Number | singular, plural, dual, trial, paucal | flective morphology, agglutinating morphology, isolating morphology, particles, etc. |
| Tense | present, past, future, timeless | flective morphology, agglutinating morphology, isolating morphology, etc. |
| Possession | +/- | case-marking, closed-class affix, word or phrase, word order, etc. |
| Spatial Relations | above, below, through, etc. | word, phrase, preposition or postposition, case-marking |
| Expression of Numbers | integers, decimals, percentages, fractions, etc. | numerals in L, digits, punctuation marks (commas, periods, percent signs, etc.) or a lack thereof in various places |
| Sentence Boundary | declarative, interrogative, imperative, etc. | period, question mark(s), exclamation point(s), ellipsis, etc. |
| Grammatical Role | subjectness, direct-objectness, indirect-objectness, etc. | case-marking, word order, particles, etc. |
| Agreement (for pairs of elements) | +/- person, +/- number, +/- case, etc. | flective, agglutinating or isolating inflectional markers |

Table 1. Sample parameters, values and means of their realization.
In the elicitation process, the
parameters (left column) represent categories of phenomena
that need to be covered in the description of L, the values
(middle column) represent choices that orient what might
be included in the description of that phenomenon for L,
and the realization options (right column) suggest the
kinds of questions that must be asked to gather the relevant
information.
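The parameter/value/realization organization described above can be sketched as a small data structure. This is only an illustrative sketch, not the representation used in Boas: the names `Parameter`, `INVENTORY` and `language_signature` are hypothetical, the inventory holds just two rows copied from Table 1, and the informant's answers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parameter:
    """One entry of the cross-linguistic inventory (one row of Table 1)."""
    name: str
    values: tuple[str, ...]
    realizations: tuple[str, ...]


# A two-row excerpt of the inventory, copied from Table 1.
INVENTORY = {
    p.name: p
    for p in (
        Parameter(
            "Number",
            ("singular", "plural", "dual", "trial", "paucal"),
            ("flective morphology", "agglutinating morphology",
             "isolating morphology", "particles"),
        ),
        Parameter(
            "Tense",
            ("present", "past", "future", "timeless"),
            ("flective morphology", "agglutinating morphology",
             "isolating morphology"),
        ),
    )
}


def language_signature(choices):
    """Check an informant's choices against the inventory.

    `choices` maps a parameter name to a (values, realizations) pair.
    Returns a sorted, hashable "signature" of the language.
    """
    signature = []
    for name, (values, realizations) in choices.items():
        param = INVENTORY[name]
        unknown = (set(values) - set(param.values)) | (
            set(realizations) - set(param.realizations)
        )
        if unknown:
            raise ValueError(f"{name}: options not in inventory: {sorted(unknown)}")
        signature.append((name, tuple(sorted(values)), tuple(sorted(realizations))))
    return tuple(sorted(signature))


# Invented answers for a hypothetical language L.
example = language_signature({
    "Number": (["singular", "plural"], ["flective morphology"]),
    "Tense": (["past", "present"], ["flective morphology"]),
})
```

Because the signature is a plain sorted tuple, two languages can be compared directly, which is one simple way to read the idea that the chosen values and realizations differentiate one language from another.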
A summary of tasks in Boas is given
below. The listed tasks are themselves complex, each involving
large amounts of metaknowledge and, consequently, involved
sequences of elicitation screens:
Ecology
  - inventory of characters
  - inventory and use of punctuation marks
  - proper name conventions
  - transliteration
  - expression of dates and numbers
  - list of common abbreviations, geographical entities, etc.
Morphology
  - selecting language type: flective, agglutinating, mixed
  - paradigmatic inflectional morphology, if needed
  - non-paradigmatic inflectional morphology, if needed
  - derivational morphology
Syntax
  - structure of the noun phrases: NP components, word order, etc.
  - realization of grammatical functions: subject, direct object, etc.
  - realization of sentence types: declarative, interrogative, etc.
  - special syntactic structures: topic fronting, affix hopping, etc.
Closed-Class Lexical Acquisition
  - Provide L translations of some 150 closed-class meanings,
    which can be realized as words, phrases, affixes or features
    (e.g., Instrumental Case used to realize instrumental ‘with’,
    as in hit with a stick). Inflecting forms of any of the
    first three realizations must be provided as well, as
    applicable.
Open-Class Lexical Acquisition
  - Build an L-to-English lexicon by a) translating word and
    phrase meanings from an English seed lexicon, b) importing
    and then supplementing an on-line bilingual lexicon,
    c) composing lists of words and phrases in L and translating
    them into English, or d) any combination of the above.
    Grammatically important inherent features and irregular
    inflectional forms must be provided.
In Boas II we will carry out the following operations
leading to improvements and extensions of the capabilities
of the original system:
1. Incorporate inflectional form checking in the open-class lexicon.
When building the open-class lexicon, the user should have
the opportunity to check that the results of morphological
learning would correctly and fully analyze all extant forms
of any newly entered word. Under our original morphological
learning scheme (supported by the Xerox toolset), form
generation was not only possible but was crucial for the
“learning loop” methodology of paradigmatic KE. Provided
we can import a morphological learner with generation
capabilities, we will exploit the generation capabilities
in a new way. First, using elicitation pages already designed
for, but not incorporated into, Boas I, we will elicit
paradigm diagnostics for each paradigm (e.g., paradigm 1
might contain feminine nouns that end in -a or -e). The
diagnostics might overlap, permitting a given word to fall
into more than one paradigm, with the correct choice being
a matter of lexical specification. Then we will incorporate
into the open-class interface for flective languages a button
called “Show Paradigm,” which would tell the system to
generate inflectional forms using a paradigm guessing method
grounded in the abovementioned diagnostics. If multiple
paradigm membership is possible, the system will cycle
through the possibilities until 1) the correct one is found,
2) the user decides that a new productive paradigm must be
initiated, or 3) the user decides to list some forms as
irregular without initiating a new productive paradigm type.
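The “Show Paradigm” cycle described in item 1 might look roughly like the sketch below. It is a toy under stated assumptions, not the Boas II design: the two paradigms, their endings and the word "lampa" are invented, the diagnostics are reduced to citation-form endings, and the `accept` callback stands in for the informant's judgment of each generated paradigm.

```python
from dataclasses import dataclass


@dataclass
class Paradigm:
    name: str
    diagnostics: tuple  # citation-form endings that admit a word
    endings: dict       # feature label -> inflectional ending


# Hypothetical paradigms for an invented flective language.
# Their diagnostics overlap on purpose: words in -a fit both.
PARADIGMS = [
    Paradigm("P1", ("a", "e"), {"gen.sg": "y", "nom.pl": "y"}),
    Paradigm("P2", ("a",), {"gen.sg": "i", "nom.pl": "e"}),
]


def candidate_paradigms(word):
    """Return every paradigm whose diagnostics the citation form satisfies."""
    return [p for p in PARADIGMS if word.endswith(p.diagnostics)]


def generate_forms(word, paradigm):
    """Strip the matched diagnostic ending, then attach each inflectional ending."""
    matched = max((d for d in paradigm.diagnostics if word.endswith(d)), key=len)
    stem = word[: len(word) - len(matched)]
    forms = {"citation": word}
    for label, ending in paradigm.endings.items():
        forms[label] = stem + ending
    return forms


def show_paradigm(word, accept):
    """Cycle through candidate paradigms until the informant accepts one.

    Returns (paradigm name, forms). Returns None when no candidate is
    accepted; the user would then start a new productive paradigm or
    list the forms as irregular.
    """
    for paradigm in candidate_paradigms(word):
        forms = generate_forms(word, paradigm)
        if accept(forms):
            return paradigm.name, forms
    return None


# The informant knows the plural is "lampe", so P1 is rejected and P2 accepted.
choice = show_paradigm("lampa", lambda forms: forms["nom.pl"] == "lampe")
```

The accepted paradigm name is what would be stored with the lexical entry, matching the point that the correct choice among overlapping paradigms is a matter of lexical specification.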
2. Develop approaches to paradigm specification for paradigms
with multiple stems.
In Boas I, if an inflectional paradigm required different
stems for different subparadigms (e.g., Persian verbs, which
have different stems for different tenses), the user needed
to create the subparadigms separately and enter the citation
forms for the different stems separately in the open-class
lexicon, since the morphological learners all expected a
single citation form for a paradigm. In Boas II we will
streamline this process, shifting the organizational part
of the work to the system such that the user can type in
multi-stem paradigms as a whole, conveniently indicate which
stems are associated with which inflectional forms (using
an elicitation methodology similar to the one we currently
use for multi-word inflectional forms), and list multiple
citation forms for a given word in the open-class lexicon.
Initial specifications for this process were written for
Boas I but not fully developed.
3. Expand the repertoire of syntactic threads.
In Boas I, we treated the cross-linguistically most prevalent
syntactic phenomena. We also developed threads (extant as
web pages but not incorporated into the system because they
were never programmed) that cover syntactic features and
processes that are not found quite so universally, like
affix hopping, subject and object ellipsis, various
categories of verbal ellipsis, heavy-NP shift, approximation,
special types of agreement, etc. We decided to treat these
as “singletons” and to add to their inventory through
continued cross-linguistic research and testing. The existing
threads need to be incorporated, and many more such threads
should be developed and added. The challenge lies not only
in eliciting the information, of course, but in deciding
how it can be turned into rules.
4. Boundary alternations in agglutinating languages and
flective languages with derivational morphology.
Boas I treats those boundary alternations that occur in
flective paradigms, but the many productive types that occur
in other word-formation processes are not elicited. The
first step in eliciting them, especially with the most
inexperienced users, will be to teach users to find and
recognize such alternations on the basis of examples.
Candidate examples can be gathered from a corpus and
presented to the user based on the information provided in
the agglutinating- and derivational-affix portions of KE.
Formalizing the associated rules might be a good place for
an expert/novice bifurcation in Boas II (we have a few such
in Boas I). Experts can be permitted to write patterns of
boundary alternations and, to some degree of precision,
indicate the contexts (e.g., prefix + verb). Novices might
be asked just to select relevant examples and highlight the
boundary alternation, after which a program (necessarily
more open to error than a human analyst) would guess the
alternations and/or learn to use fuzzy matching in analysis.
5. Morphotactics of agglutinating languages.
This topic is linked to #4. The morphological analyzer should
benefit from knowledge not only of potential boundary
alternations during affixation, but also of the typical order
of affixes, to the extent that this can be formalized. Again,
developers can design example-based methods anticipating and
supported by the linguistic knowledge the user will already
have provided in Boas.
6. Derivational and inflectional reduplication.
Boas I does not have any productive treatment of
reduplication, since it requires machine-learning approaches
different from those used for typical inflectional
morphology. If we anticipate machine-learning methods that
can cover reduplication, we can orient elicitation in that
direction. If not, we will develop more convenient ways of
prompting the user to list the relevant reduplicative forms
for every open-class item.
7. Improve the English list of open-class items.
The list is currently too long and not organized well enough
in terms of word frequency for the types of texts expected
to be of interest.
8. Implement an interface for writing phrasals with variables.
Boas I accepts phrasals with inflecting elements, but does
not permit the indication of variables such as:
[np] avoir [#] [an(s)] --> [np] am [#] [year(s)] old.
We have specifications for this facility but they have not
been implemented yet.
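To make the idea of phrasals with variables concrete, here is a minimal sketch that applies the rule quoted in item 8 to a sentence. It is not the unimplemented Boas specification: the functions `compile_phrasal` and `apply_phrasal` are hypothetical, each variable may occur only once per pattern, the inflected forms of avoir are supplied by hand, the choice between "year" and "years" by number is an assumption, and the target-side inflecting element ("am") is left exactly as written because agreement is outside the sketch.

```python
import re

# Regex fragments for the two variables used in the example rule.
VARIABLES = {
    "np": r"(?P<np>\w+(?: \w+)*?)",  # one or more words, as few as possible
    "#": r"(?P<num>\d+)",            # a number written in digits
}


def compile_phrasal(pattern, inflections):
    """Turn a source-side phrasal such as '[np] avoir [#] [an(s)]' into a regex."""
    parts = []
    for token in pattern.split():
        if token.startswith("[") and token.endswith("]"):
            inner = token[1:-1]
            if inner in VARIABLES:
                parts.append(VARIABLES[inner])
                continue
            # Bracketed literal with an optional suffix, e.g. an(s).
            base, _, rest = inner.partition("(")
            suffix = rest.rstrip(")")
            optional = f"(?:{re.escape(suffix)})?" if suffix else ""
            parts.append(re.escape(base) + optional)
        else:
            # Inflecting element: accept any of its listed forms.
            forms = inflections.get(token, [token])
            parts.append("(?:" + "|".join(re.escape(f) for f in forms) + ")")
    return re.compile(r"\s+".join(parts))


def apply_phrasal(regex, target, sentence):
    """Fill the target-side template from a matching sentence, else return None."""
    match = regex.fullmatch(sentence)
    if match is None:
        return None
    number = int(match.group("num"))
    out = []
    for token in target.split():
        if token == "[np]":
            out.append(match.group("np"))
        elif token == "[#]":
            out.append(match.group("num"))
        elif token.startswith("[") and token.endswith("]"):
            base, _, rest = token[1:-1].partition("(")
            out.append(base if number == 1 else base + rest.rstrip(")"))
        else:
            out.append(token)
    return " ".join(out)


AVOIR = {"avoir": ["ai", "as", "a", "avons", "avez", "ont"]}
rule = compile_phrasal("[np] avoir [#] [an(s)]", AVOIR)
result = apply_phrasal(rule, "[np] am [#] [year(s)] old", "Marie a 5 ans")
```

In a fuller treatment the inflected forms would come from the elicited paradigms rather than a hand-written list, and the target verb would be inflected to agree with the noun phrase.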