BOAS

Most natural language processing (NLP) applications seek to analyze every text element as a combination of lexical meaning and grammatical features, as applicable. Cross-linguistically, many types of entities (stems, inflectional affixes, derivational affixes, etc.) can singly or in combination form a text element, and any given language uses some subset of these. Creating inventories of such entities is more typical of descriptive, typological and, to a lesser degree, theoretical linguistics than of NLP: after all, most NLP systems are built to cover some specific language(s) to whatever extent is required by the given application. However, if one’s goal is eliciting knowledge about any natural language for use in an NLP application, creating a comprehensive cross-lingual inventory of types of text elements and their component entities becomes an essential preliminary stage of work. Once an inventory of this kind is established, one must develop a practice-oriented approach to organizing linguistic reality, a methodology of knowledge elicitation, and a scheme for turning elicited knowledge into processing rules. All of these challenges were addressed in the linguistic knowledge-elicitation (KE) system called Boas, which was developed by a research team led by the PIs at NMSU CRL in 1997-2001.

Our goal in Boas I was to build a KE system to guide a linguistically naïve speaker of any alphabetic language (L) through the process of providing sufficient information about L to support the automatic ramping up of an L-to-English machine translation (MT) system. This KE system must elicit from the user information about the ecology (writing system, orthographic conventions, punctuation, etc.), morphology and syntax of L, as well as a large bilingual lexicon. The entire elicitation environment, training materials, and means of converting the elicited information into operational static knowledge resources for the MT system must be specified and developed from the outset, with no language-specific adjustments or retrofitting. In other words, all phenomena from all natural languages must (to the extent feasible) be covered, the collected information must be automatically convertible into processing resources, and the elicitation process must be understandable to an untrained informant.

Given an initially untrained user, the methodological initiative and a large degree of the responsibility for coverage must rest with the system itself. As the technological solution to the above puzzle should be practical, the informant's time must be used efficiently. If time were not a factor and resources were truly unlimited, one could resort to listing many things—like inflectional and productive derivational forms of each word—rather than generalizing by rules. However, in the real world the informant’s time is a concern, so the listing option is used judiciously in Boas. To enhance the utility of the system in practical applications, the target KE time was set at six months, which can be increased or decreased as resources allow. The common working language of the interface is English, which not only permits some degree of English-orientation in KE (e.g., using English seed lexicons to drive lexical acquisition and preparing resident transfer rules), but also facilitates the preparation of a vast apparatus of training and reference materials, which amount to an on-line introduction to descriptive linguistics.

In order to lead the informant through the process of supplying the necessary information in a directly usable way, Boas must be supplied with resident (meta)knowledge about language – not L, but language in general – which is organized into a typologically and cross-linguistically motivated inventory of parameters, their potential value sets, and modes of realizing the latter. The inventory takes into account phenomena observed in a large number of languages. Particular languages would typically feature only a subset of parameters, values and means of realization. The parameter values employed by a particular language, and the means of realizing them, differentiate one language from another and can, in effect, act as the formal “signature” of the language. Examples of parameters, values and realizations that play a role in the Boas knowledge-elicitation process are shown in Table 1. The first block illustrates inflection, the second, closed-class meanings, the third, ecology and the fourth, syntax.

Parameter | Values | Means of Realization
Case Relations | nominative, accusative, dative, instrumental, abessive, etc. | flective morphology, agglutinating morphology, isolating morphology, prepositions, postpositions, etc.
Number | singular, plural, dual, trial, paucal | flective morphology, agglutinating morphology, isolating morphology, particles, etc.
Tense | present, past, future, timeless | flective morphology, agglutinating morphology, isolating morphology, etc.

Possession | +/- | case-marking, closed-class affix, word or phrase, word order, etc.
Spatial Relations | above, below, through, etc. | word, phrase, preposition or postposition, case-marking

Expression of Numbers | integers, decimals, percentages, fractions, etc. | numerals in L, digits, punctuation marks (commas, periods, percent signs, etc.) or a lack thereof in various places
Sentence Boundary | declarative, interrogative, imperative, etc. | period, question mark(s), exclamation point(s), ellipsis, etc.

Grammatical Role | subjectness, direct-objectness, indirect-objectness, etc. | case-marking, word order, particles, etc.
Agreement (for pairs of elements) | +/- person, +/- number, +/- case, etc. | flective, agglutinating or isolating inflectional markers

Table 1. Sample parameters, values and means of their realization.

In the elicitation process, the parameters (left column) represent categories of phenomena that need to be covered in the description of L, the values (middle column) represent choices that orient what might be included in the description of that phenomenon for L, and the realization options (right column) suggest the kinds of questions that must be asked to gather the relevant information.
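The parameter/value/realization organization can be pictured as a small data structure. The sketch below is only an illustration in Python, not the actual Boas representation: the class name `Parameter`, the `INVENTORY` dictionary and the `signature` function are all assumptions, and the two entries are copied from Table 1 without the open-ended "etc." items.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parameter:
    """One row of the inventory: a parameter, its values, its realizations."""
    name: str
    values: frozenset
    realizations: frozenset


# Hypothetical inventory holding two rows of Table 1.
INVENTORY = {
    "Number": Parameter(
        "Number",
        frozenset({"singular", "plural", "dual", "trial", "paucal"}),
        frozenset({"flective morphology", "agglutinating morphology",
                   "isolating morphology", "particles"}),
    ),
    "Tense": Parameter(
        "Tense",
        frozenset({"present", "past", "future", "timeless"}),
        frozenset({"flective morphology", "agglutinating morphology",
                   "isolating morphology"}),
    ),
}


def signature(choices):
    """Validate a language's choices and return them as a sorted tuple.

    `choices` maps a parameter name to a pair (values, realizations),
    i.e. the subset of the inventory that the language employs.
    """
    rows = []
    for name, (values, realizations) in choices.items():
        param = INVENTORY[name]
        unknown = (set(values) - param.values) | (set(realizations) - param.realizations)
        if unknown:
            raise ValueError(f"{name}: not in inventory: {sorted(unknown)}")
        rows.append((name, tuple(sorted(values)), tuple(sorted(realizations))))
    return tuple(sorted(rows))
```

Under these assumptions, two languages that select the same values and realizations for every parameter receive the same signature, and any selection outside the inventory is rejected rather than silently stored.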

A summary of tasks in Boas is given below. The listed tasks are themselves complex, each involving large amounts of metaknowledge and, consequently, involved sequences of elicitation screens:

Ecology
inventory of characters
inventory and use of punctuation marks
proper name conventions
transliteration
expression of dates and numbers
list of common abbreviations, geographical entities, etc.

Morphology
selecting language type: flective, agglutinating, mixed
paradigmatic inflectional morphology, if needed
non-paradigmatic inflectional morphology, if needed
derivational morphology

Syntax
structure of the noun phrases: NP components, word order, etc.
realization of grammatical functions: subject, direct object, etc.
realization of sentence types: declarative, interrogative, etc.
special syntactic structures: topic fronting, affix hopping, etc.

Closed-Class Lexical Acquisition
Provide L translations of some 150 closed-class meanings, which can be realized as words, phrases, affixes or features (e.g., Instrumental Case used to realize instrumental ‘with’, as in hit with a stick). Inflecting forms of any of the first three realizations must be provided as well, as applicable.

Open-Class Lexical Acquisition
Build an L-to-English lexicon by a) translating word and phrase meanings from an English seed lexicon, b) importing and then supplementing an on-line bilingual lexicon, c) composing lists of words and phrases in L and translating them into English, or d) any combination of the above. Grammatically important inherent features and irregular inflectional forms must be provided.

In Boas II we will carry out the following operations leading to improvements and extensions of the capabilities of the original system:

1. Incorporate inflectional form checking in the open-class lexicon.

When building the open-class lexicon, the user should have the opportunity to check that the results of morphological learning will correctly and fully analyze all extant forms of any newly entered word. Under our original morphological learning scheme (supported by the Xerox toolset), form generation was not only possible but crucial for the “learning loop” methodology of paradigmatic KE. Provided we can import a morphological learner with generation capabilities, we will exploit those capabilities in a new way. First, using elicitation pages already designed for, but not incorporated into, Boas I, we will elicit paradigm diagnostics for each paradigm (e.g., paradigm 1 might contain feminine nouns that end in -a or -e). The diagnostics might overlap, permitting a given word to fall into more than one paradigm, with the correct choice being a matter of lexical specification. Then we will incorporate into the open-class interface for flective languages a button called “Show Paradigm,” which will tell the system to generate inflectional forms using a paradigm-guessing method grounded in these diagnostics. If multiple paradigm membership is possible, the system will cycle through the possibilities until 1) the correct one is found, 2) the user decides that a new productive paradigm must be initiated, or 3) the user decides to list some forms as irregular without initiating a new productive paradigm type.
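The “Show Paradigm” cycle might be sketched as follows. This is a minimal Python illustration under stated assumptions, not the planned implementation: diagnostics are reduced to citation-form endings, the two paradigms and the word "lipa" are invented toy data, and the `accept` callback stands in for the user's judgment on each generated paradigm.

```python
from dataclasses import dataclass


@dataclass
class Paradigm:
    name: str
    diagnostic_endings: tuple  # citation-form endings that make a word a candidate
    strip: int                 # characters removed from the citation form to get the stem
    endings: dict              # feature label -> inflectional ending


# Invented toy paradigms whose diagnostics overlap on words ending in -a.
PARADIGMS = [
    Paradigm("fem-1", ("a",), 1, {"acc.sg": "u", "nom.pl": "y"}),
    Paradigm("fem-2", ("a", "e"), 1, {"acc.sg": "ju", "nom.pl": "i"}),
]


def candidate_paradigms(word, paradigms):
    """Return every paradigm whose diagnostics the word satisfies."""
    return [p for p in paradigms if word.endswith(p.diagnostic_endings)]


def generate(word, paradigm):
    """Generate the inflectional forms of `word` under one paradigm."""
    stem = word[:len(word) - paradigm.strip]
    return {feature: stem + ending for feature, ending in paradigm.endings.items()}


def show_paradigm(word, paradigms, accept):
    """Cycle through candidate paradigms until the user accepts one.

    Returns the accepted paradigm's name, or None if every candidate was
    rejected; the caller would then offer a new paradigm or irregular listing.
    """
    for paradigm in candidate_paradigms(word, paradigms):
        if accept(generate(word, paradigm)):
            return paradigm.name
    return None
```

A call such as `show_paradigm("lipa", PARADIGMS, accept)` presents the "fem-1" forms first and moves on to "fem-2" only if the first set is rejected. A `None` result corresponds to outcomes 2 and 3 in the text.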


2. Develop approaches to paradigm specification for paradigms with multiple stems.

In Boas I, if an inflectional paradigm required different stems for different subparadigms (e.g., Persian verbs, which have different stems for different tenses), the user needed to create the subparadigms separately and enter the citation forms for the different stems separately in the open-class lexicon, since the morphological learners all expected a single citation form for a paradigm. In Boas II we will streamline this process, shifting the organizational part of the work to the system so that the user can type in multi-stem paradigms as a whole, conveniently indicate which stems are associated with which inflectional forms (using an elicitation methodology similar to the one we currently use for multi-word inflectional forms), and list multiple citation forms for a given word in the open-class lexicon. Initial specifications for this process were written for Boas I but not fully developed.
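One possible way to record which stem each inflectional form is built on is sketched below in Python. The names `Slot`, `VERB_PARADIGM` and `inflect` are hypothetical, and the Persian forms are transliterated and heavily simplified (two persons, two tenses) purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    stem: str    # label of the stem this form is built on
    prefix: str
    suffix: str


# Simplified, transliterated fragment of a Persian verb paradigm.
VERB_PARADIGM = {
    "pres.1sg": Slot("present", "mi", "am"),
    "pres.2sg": Slot("present", "mi", "i"),
    "past.1sg": Slot("past", "", "am"),
    "past.2sg": Slot("past", "", "i"),
}


def inflect(stems, paradigm):
    """Build every form from the stem its slot names.

    `stems` maps stem labels to the citation stems listed in the lexicon.
    """
    missing = {slot.stem for slot in paradigm.values()} - set(stems)
    if missing:
        raise KeyError(f"lexicon entry lacks stems: {sorted(missing)}")
    return {
        feature: slot.prefix + stems[slot.stem] + slot.suffix
        for feature, slot in paradigm.items()
    }
```

With the lexicon entry `{"present": "rav", "past": "raft"}` for 'go', the paradigm is entered once as a whole and the system, not the user, keeps track of which stem feeds which form.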

3. Expand the repertoire of syntactic threads.

In Boas I, we treated the cross-linguistically most prevalent syntactic phenomena. We also developed threads (extant as web pages but not incorporated because they were never programmed) that cover syntactic features and processes that are not found quite so universally, like affix hopping, subject and object ellipsis, various categories of verbal ellipsis, heavy-NP shift, approximation, special types of agreement, etc. We decided to treat these as “singletons” and to add to their inventory through continued cross-linguistic research and testing. The existing threads need to be incorporated, and many more such threads should be developed. The challenge lies not only in eliciting the information, of course, but in deciding how it can be turned into rules.

4. Boundary alternations in agglutinating languages and flective languages with derivational morphology.

Boas I treats those boundary alternations that occur in flective paradigms, but the many productive types that occur in other word-formation processes are not elicited. The first step in eliciting them, especially with the most inexperienced users, will be to teach users to find and recognize such alternations on the basis of examples. Candidate examples can be gathered from a corpus and presented to the user based on the information provided in the agglutinating- and derivational-affix portions of KE. Formalizing the associated rules might be a good place for an expert/novice bifurcation in Boas II (we have a few such in Boas I). Experts can be permitted to write patterns of boundary alternations and, to some degree of precision, indicate the contexts (e.g., prefix + verb). Novices might be asked just to select relevant examples and highlight the boundary alternation, after which a program (necessarily more open to error than a human analyst) would guess the alternations and/or learn to use fuzzy matching in analysis.
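As a rough illustration of how a program might guess an alternation from a single example selected by a novice, the Python sketch below compares the naive concatenation of stem and affix with the attested form and reports the part that differs. The function name `guess_alternation` and the English examples are assumptions chosen for readability; this is not a method specified for Boas, and it is exactly the kind of error-prone guess the text anticipates.

```python
def guess_alternation(stem, affix, attested):
    """Return (expected, actual) material that differs at the boundary.

    The naive concatenation stem + affix is aligned with the attested form
    by stripping their longest common prefix and then their longest common
    suffix; what remains is the guessed alternation.
    """
    expected = stem + affix
    shortest = min(len(expected), len(attested))

    # Longest common prefix.
    i = 0
    while i < shortest and expected[i] == attested[i]:
        i += 1

    # Longest common suffix that does not overlap the prefix.
    j = 0
    while j < shortest - i and expected[-1 - j] == attested[-1 - j]:
        j += 1

    return expected[i:len(expected) - j], attested[i:len(attested) - j]
```

For instance, `guess_alternation("happy", "ness", "happiness")` yields the pair `("y", "i")`, and an example with no alternation yields two empty strings. Generalizing over many such pairs, and attaching contexts such as prefix + verb, would remain separate steps.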

5. Morphotactics of agglutinating languages.

This topic is linked to item 4. The morphological analyzer should benefit from knowledge not only of potential boundary alternations during affixation, but also of the typical order of affixes, to the extent that this can be formalized. Again, developers can design example-based methods that anticipate, and are supported by, the linguistic knowledge the user will already have provided in Boas.
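An example-based approach to affix order could start from something as simple as the Python sketch below, which collects precedence pairs from segmented examples and flags any new sequence that reverses an observed order. The function names and the affix-class labels (`PL`, `POSS`, `LOC`) are hypothetical, and the sketch assumes the examples have already been segmented into affix classes.

```python
def learn_precedence(examples):
    """Collect every observed (earlier, later) pair of affix classes."""
    pairs = set()
    for sequence in examples:
        for i, earlier in enumerate(sequence):
            for later in sequence[i + 1:]:
                pairs.add((earlier, later))
    return pairs


def plausible(sequence, pairs):
    """A sequence is plausible unless it reverses an observed order."""
    return not any(
        (later, earlier) in pairs
        for i, earlier in enumerate(sequence)
        for later in sequence[i + 1:]
    )


# Segmented examples as the user might supply them (labels are illustrative).
EXAMPLES = [["PL", "POSS"], ["POSS", "LOC"], ["PL", "LOC"]]
ORDER = learn_precedence(EXAMPLES)
```

Here `plausible(["PL", "POSS", "LOC"], ORDER)` holds even though that full sequence was never observed, while `plausible(["LOC", "PL"], ORDER)` does not. The check is deliberately permissive: unseen pairs are accepted, which suits a setting where the informant provides only a handful of examples.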

6. Derivational and inflectional reduplication.

Boas I does not have any productive treatment of reduplication, since it requires machine-learning approaches different from those used for typical inflectional morphology. If we anticipate machine-learning methods that can cover reduplication, we can orient elicitation in that direction. If not, we will develop more convenient ways of prompting the user to list the relevant reduplicative forms for every open-class item.

7. Improve the English list of open-class items.

The list is currently too long and not organized well enough in terms of word frequency for the types of texts expected to be of interest.

8. Implement an interface for writing phrasals with variables.

Boas I accepts phrasals with inflecting elements, but does not permit the indication of variables such as:

[np] avoir [#] [an(s)] --> [np] am [#] [year(s)] old

We have specifications for this facility, but they have not been implemented yet.
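To make the idea of phrasals with variables concrete, here is a small Python sketch that binds `[np]` and `[#]` in a source pattern and substitutes them into the target. It is an illustration only and does not follow the unimplemented Boas specifications: the verb is fixed to a single inflected form ('a' / 'is') because inflection of 'avoir' and 'be' is outside the sketch, the noun phrase is copied untranslated, and all function names are assumptions.

```python
import re

SOURCE = "[np] a [#] [an(s)]"          # 'avoir' fixed to one form for simplicity
TARGET = "[np] is [#] [year(s)] old"

# A bracketed word with an optional ending, e.g. [an(s)] or [year(s)].
OPTIONAL = re.compile(r"\[(\w+)\((\w+)\)\]")


def compile_source(pattern):
    """Turn a source phrasal into a regular expression with named groups."""
    parts = []
    for token in pattern.split():
        optional = OPTIONAL.fullmatch(token)
        if token == "[np]":
            parts.append(r"(?P<np>.+?)")
        elif token == "[#]":
            parts.append(r"(?P<num>\d+)")
        elif optional:
            parts.append(f"{optional.group(1)}(?:{optional.group(2)})?")
        else:
            parts.append(re.escape(token))
    return re.compile(r"\s+".join(parts))


def render_target(pattern, np, num):
    """Fill the target phrasal, choosing the plural ending by the number."""
    out = []
    for token in pattern.split():
        optional = OPTIONAL.fullmatch(token)
        if token == "[np]":
            out.append(np)
        elif token == "[#]":
            out.append(num)
        elif optional:
            ending = "" if int(num) == 1 else optional.group(2)
            out.append(optional.group(1) + ending)
        else:
            out.append(token)
    return " ".join(out)


def translate(sentence):
    """Apply the phrasal to a sentence, or return None if it does not match."""
    match = compile_source(SOURCE).fullmatch(sentence)
    if match is None:
        return None
    return render_target(TARGET, match.group("np"), match.group("num"))
```

With these definitions, `translate("Marie a 5 ans")` gives "Marie is 5 years old" and `translate("Marie a 1 an")` gives "Marie is 1 year old". A fuller facility would also need agreement between the bound noun phrase and the verb on both sides.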


ILIT University of Maryland Baltimore County ECS 202 1000 Hilltop Circle Baltimore, MD 21250
Phone: 410-455-8480 Fax: 410-455-8488 E-mail: ILIT@UMBC.EDU