An Expert Natural Language Interface for Statistical Packages

Richard Lyczak
Sylvia Weber-Russell
University of New Hampshire

SUMMARY

A natural language interface has been developed to facilitate the use of statistical packages. Queries are parsed into "case frames" based on statistical primitives. A rule-based expert system uses the case frame to choose a statistical test and generate a batch file which, when executed, answers the query.

CONTENT INDICATORS

General terms: DESIGN

Categories:

H.2.4 Information Systems / Database Management / Systems
    Subject Descriptor: QUERY PROCESSING
H.3.m Information Systems / Information Storage & Retrieval / Misc.
    Additional Keywords: STATISTICS, STATISTICAL PACKAGES
I.2.1 Computing Methodologies / AI / Applications and Expert Systems
    Subject Descriptor: NATURAL LANGUAGE INTERFACES
I.2.7 Computing Methodologies / AI / Natural Language Processing
    Subject Descriptor: LANGUAGE PARSING AND UNDERSTANDING

Statistical packages are used to analyze data in a wide variety of academic, business, and research settings. Users are typically experts in some field or discipline which requires the analysis of data but have limited backgrounds in statistics and computing. The purpose of this project was to facilitate the use of statistical packages by developing an interface which allows the user to ask research questions in natural English without having to specify the statistical test to be used.

The most extensive use of natural language interfaces (NLI) has been in the area of database query systems. While early systems were based on augmented transition networks (Woods et al., 1972) or semantic grammars (Waltz, 1978; Hendrix et al., 1978; Thompson & Thompson, 1975), more recent systems have employed an intermediate representation language (IRL) (Warren & Pereira, 1982; Bates & Bobrow, 1983; Grosz et al., 1987) which is then translated into the database query language.
Although IRL systems typically construct the IRL representation of a query from a syntactic parse tree, methods do exist for going directly from natural language to the IRL representation. One such system is the "conceptual analyzer" developed by Riesbeck (1975) and refined by Birnbaum & Selfridge (1981). The conceptual analyzer is based on Schank's (1975) conceptual dependency theory, which uses a small number of "conceptual primitives" to represent actions or states. Although this approach is well suited to applications which have a small number of primitive operations, such as database management, it has not been applied to the development of database query systems.

Like database management systems, statistical packages use a query language to obtain information from a highly structured data file. While a database query typically specifies subsets of data to be retrieved, a statistical package query typically specifies subsets of data to be described or compared. Writing a statistical query requires considerable statistical expertise since the user must specify not only the data to be analyzed but also the statistical test to be used in the analysis.

Because of the complexity involved in choosing an appropriate statistical test, numerous expert systems have been developed to guide the selection process (Blum, 1982; O'Keefe, 1982; Smith et al., 1983; HaKong & Hickman, 1985; Jamison & Metzler, 1985; Gale, 1986; Marion, 1987). While most statistical expert systems employ menus or question the user about features of the research design, Bucci et al. (1985) appear to have incorporated some level of statistical expertise into an NLI which queries a demographic database maintained by the Italian government. However, this expert NLI is dedicated to a specific database and offers little guidance for the development of more general systems.
While the work of Bucci et al. was limited in scope, we believe that the approach of building statistical expertise into an NLI for statistical packages holds promise. An ideal interface would accept a natural language query, choose an appropriate statistical test, then generate and execute the code needed to answer the query. We have therefore developed a system called "EXPERSTAT" which includes a natural language interpreter based on Riesbeck's (1975) conceptual analyzer and an expert system based on the principles employed by Marion (1985). In its current form, this interface generates and executes code for SPSS, a statistical package widely used in the social sciences. However, since the NLI, expert system, and code generation modules are all independent, the system could readily be adapted to work with any statistical package. Unlike the system of Bucci et al., it can also be used with any predefined data set.

OVERVIEW

Processing takes place in four phases. First, a dictionary is constructed which contains labels from the data file, key statistical terms, and various connectives such as prepositions, relative pronouns, and interrogatives. Second, a statistical query is obtained from the user. Queries must be entered as questions which make reference to variables in the data file. They must end with a question mark and contain no other punctuation. Queries may be entered interactively or in a file. Parsing the query involves looking up each word in the dictionary and applying functions associated with words in the dictionary to construct a case frame which represents the meaning of the query. Third, the completed case frame is examined by the expert system to select an appropriate statistical test. Finally, the expert system passes the case frame to the code generating routine for the statistical test chosen. Code is generated from the contents of the case frame and executed. The output provides an answer to the query.
Once a query has been answered, phases two through four are repeated until there are no more queries. Let us consider each of these four phases in detail.

DICTIONARY CONSTRUCTION

Like most statistical packages, SPSS requires the user to define the data set by assigning labels to each variable. Labels may also be assigned to individual values within a variable (e.g., within the variable SEX, 1 = male, 2 = female). EXPERSTAT begins by reading these labels into a hash table which serves as its dictionary. In addition to "variable labels" and "value labels" the dictionary contains "key words" (e.g., mean, difference, relationship) which determine the type of statistical operation being requested and certain "connectives" (e.g., prepositions, relative pronouns, interrogatives, conjunctions, and certain helping verbs) which define the relationship between labels and key words.

In keeping with Riesbeck's approach, each word in the dictionary is associated with one or more "requests" which are actually executable routines designed to set up and fill the slots in a case frame. These requests are similar to production rules in that they contain a test and an action to be carried out if the test succeeds (e.g., "if the query contains this word, set up the case frame for descriptive statistics"). When a word is found in the dictionary, a list of its requests is returned.

It is important to note that only the user-defined labels are added to the dictionary at run time. All of the remaining words and their requests are built into the system.

PARSING

The parser starts at the beginning of the query and looks up each word in the dictionary. If the word is found, the requests returned are added to a list of requests called the RLIST. If the word is not found, the parser tries a variation of the word by adding or removing suffixes, such as "s" or "es", and substituting characters, such as "man" for "men".
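The lookup-with-variations step can be illustrated with a short Python sketch. The paper does not list the actual variation rules, so the suffixes and the "men"/"man" substitution below are assumptions taken only from the examples in the text; the names `variations`, `lookup`, and the toy `dictionary` are hypothetical.

```python
def variations(word):
    """Yield candidate spellings: the word itself, then suffix and character variants."""
    yield word
    # Remove a plural suffix ("es" is tried before "s").
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix):
            yield word[: -len(suffix)]
    # Add a plural suffix.
    for suffix in ("s", "es"):
        yield word + suffix
    # Substitute characters, e.g. "freshmen" -> "freshman" and the reverse.
    for old, new in (("men", "man"), ("man", "men")):
        if word.endswith(old):
            yield word[: -len(old)] + new


def lookup(word, dictionary):
    """Return (matched_entry, requests) for the first variation found, else None."""
    for candidate in variations(word.lower()):
        if candidate in dictionary:
            return candidate, dictionary[candidate]
    return None


# Toy dictionary (assumed contents): word -> list of request names.
# In the real system the values would be executable requests.
dictionary = {
    "male": ["value-label:SEX=1"],
    "freshman": ["value-label:CLASS=1"],
    "gpa": ["variable-label:GPA"],
    "mean": ["setup-DESCRIBE"],
}
```

With this toy dictionary, `lookup("males", dictionary)` matches the entry "male" and `lookup("freshmen", dictionary)` matches "freshman", while `lookup("the", dictionary)` returns `None`, so the word would be skipped.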
If neither the word nor a variation of the word can be found, the parser simply moves to the next word in the query. In that way, only "significant" words are used to construct a representation of the query's meaning. This focus on significant words is made possible by the very restricted context (i.e., analysis of a specific data file) in which the queries are interpreted.

After each addition to the RLIST, the parser executes all of the requests on the list. Most requests contain a conditional clause. If the conditions of the request are met and some action is performed related to setting up or filling a case frame, the request is removed from the RLIST. Otherwise, it remains on the list until its conditions are met.

Ultimately, all requests are aimed at representing the query's meaning with a completed case frame. However, since the parser processes a query sequentially, it is often necessary to store concepts temporarily until it is known what type of case frame is needed to represent the query. This list of concepts, which have been encountered but not used, is called the CLIST. The conditional clauses of requests usually refer to the relative positions of items on the CLIST. For instance, "if the CLIST contains an item suggesting that descriptive statistics are required and that item is preceded by the name of a variable, set up a DESCRIBE case frame and insert the variable in its OBJECT slot." Once an item on the CLIST has been used, it is removed from the list.

All queries are interpreted in terms of five "primitive" statistical operations: describe, tabulate, compare, regress, and relate. Each primitive operation is represented by a different case frame. Slots in these case frames are designed to collect the information needed to carry out that type of analysis. The case frames constructed by our parser are analogous in concept to Schank's "conceptual case frames."
That is, our parser, consistent with Schank's theory as implemented by Riesbeck, links concepts in a sentence to a governing primitive in a conceptual or semantic frame. In Schank's frames, this primitive is a conceptual act or state. In our higher level frames, however, the governing primitive is the desired statistical operation, regardless of the conceptual acts or states underlying the actual query.

The operation called for by a specific query is determined by the presence of certain "key words" in the query. Examples of key words for each operation are:

    DESCRIBE: average, descriptive, mean, deviation
    TABULATE: fraction, frequency, many, percent, portion
    COMPARE:  compare, differ, comparative forms of adjectives
    REGRESS:  affect, depend, determine, effect, impact
    RELATE:   correlation, relate, relationship

The first request associated with each of these key words in the dictionary is to set up the case frame for the appropriate statistical operation. Additional requests are aimed at filling the slots in the case frame and vary from one key word to another depending on where (and in what form) information is likely to be found on the CLIST. For instance, the following queries are both asking for the same comparison, but the requests needed to locate the groups to be compared would be quite different.

    Are the GPAs of males higher than the GPAs of females?
    Do the GPAs of the two sexes differ?

Even so, many keywords do share the same set of requests. When this occurs, the dictionary entry for those words is simply a reference to a request list stored elsewhere in the dictionary. Having several words referring to the same generic request list considerably reduces the size of the dictionary.

All case frames contain slots for the variable or variables to be analyzed (e.g., the dependent variable), a definition of the sample to be examined, and any subsamples which need to be analyzed separately.
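The request mechanism described in the PARSING section (requests with a test and an action, the RLIST of pending requests, and the CLIST of unused concepts) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all class and function names are hypothetical, each dictionary entry is assumed to carry one concept plus a request list, and only the DESCRIBE request is modeled. The paper's example request looks for a variable that precedes the key word; the sketch also accepts one that follows it, so that a deferred request can be shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class CaseFrame:
    operation: str
    slots: dict = field(default_factory=dict)


@dataclass
class State:
    clist: list = field(default_factory=list)  # concepts encountered but not yet used
    rlist: list = field(default_factory=list)  # requests whose conditions are not yet met
    frame: Optional[CaseFrame] = None


@dataclass
class Request:
    test: Callable[[State], bool]
    action: Callable[[State], None]


def find_describe_pair(clist):
    """Return (keyword_index, variable_index) if a DESCRIBE key word is next to a variable."""
    for i, (kind, value) in enumerate(clist):
        if (kind, value) == ("keyword", "DESCRIBE"):
            for j in (i - 1, i + 1):
                if 0 <= j < len(clist) and clist[j][0] == "variable":
                    return i, j
    return None


def describe_action(state):
    i, j = find_describe_pair(state.clist)
    state.frame = CaseFrame("DESCRIBE", {"OBJECT": state.clist[j][1]})
    # Concepts that have been used are removed from the CLIST.
    for k in sorted((i, j), reverse=True):
        del state.clist[k]


def run_requests(state):
    """Execute every pending request; a request that fires is dropped from the RLIST."""
    remaining = []
    for request in state.rlist:
        if request.test(state):
            request.action(state)
        else:
            remaining.append(request)
    state.rlist = remaining


def parse(query, dictionary):
    state = State()
    for word in query.lower().rstrip("?").split():
        entry = dictionary.get(word)
        if entry is None:
            continue  # words not in the dictionary are ignored
        concept, requests = entry
        state.clist.append(concept)
        state.rlist.extend(requests)
        run_requests(state)  # requests run after each addition to the RLIST
    return state


DESCRIBE_REQUESTS = [
    Request(test=lambda s: find_describe_pair(s.clist) is not None,
            action=describe_action)
]

# Several key words refer to the same shared request list.
DICTIONARY = {
    "gpa": (("variable", "GPA"), []),
    "average": (("keyword", "DESCRIBE"), DESCRIBE_REQUESTS),
    "mean": (("keyword", "DESCRIBE"), DESCRIBE_REQUESTS),
}
```

For the query "What is the average GPA?", the request added by "average" stays on the RLIST until "GPA" arrives; it then builds a DESCRIBE frame whose OBJECT slot holds GPA and leaves both lists empty.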
Where cause-effect relationships are being explored, there may also be slots for an independent variable and the names of groups within the independent variable which are to be included in the analysis.

EXPERT SYSTEM

When parsing is finished, the completed case frame is examined by a rule-based expert system which determines the appropriate statistic needed to answer the query. Decisions are based on four factors: (1) the type of primitive operation, (2) the number of variables involved, (3) each variable's level of measurement (nominal, ordinal, interval/ratio), and (4) the number of groups involved in the analysis. Thus, a typical rule might state:

    IF   the operation is a comparison
    AND  there is a single variable in the INDEPENDENT VARIABLE slot of the case frame
    AND  the level of measurement for that single variable is nominal
    AND  there is a single variable in the DEPENDENT VARIABLE slot of the case frame
    AND  the level of measurement for that single variable is interval
    AND  the number of groups in the GROUPS slot is 2
    THEN the statistic needed is a "t-test".

In this case, the case frame would be passed to the routine in the code generator which generates code for t-tests.

In addition to choosing a statistical test, the expert system module also traps errors. If no case frame has been created, or key slots in a case frame are empty, the expert system calls error message routines which provide feedback to the user. For instance, if a COMPARE case frame arrived with the INDEPENDENT VARIABLE slot completed but the DEPENDENT VARIABLE slot empty, the user might receive the following message.

    The word HIGHER implies that you wish to compare two or more groups with respect to some dependent variable. It appears that the groups are: MALE FEMALE. However, you have not mentioned the dependent variable. Please rephrase your query so that it identifies both the groups and the dependent variable using the names of variables and values from your data file.
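The t-test rule and the empty-slot check described above can be written down as a small Python sketch. The representation is an assumption made for illustration: the class names, the field names, and the use of `ValueError` for error trapping are hypothetical, and only the single rule quoted in the text is encoded.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    name: str
    level: str  # "nominal", "ordinal", or "interval"


@dataclass
class CaseFrame:
    operation: str
    independent: list = field(default_factory=list)  # INDEPENDENT VARIABLE slot
    dependent: list = field(default_factory=list)    # DEPENDENT VARIABLE slot
    groups: list = field(default_factory=list)       # GROUPS slot


def is_t_test(frame):
    """The rule quoted in the text: nominal IV with 2 groups, one interval DV."""
    return (
        frame.operation == "compare"
        and len(frame.independent) == 1
        and frame.independent[0].level == "nominal"
        and len(frame.dependent) == 1
        and frame.dependent[0].level == "interval"
        and len(frame.groups) == 2
    )


# Each rule pairs a condition on the case frame with the statistic it selects.
RULES = [(is_t_test, "t-test")]


def choose_statistic(frame):
    """Return the selected statistic, None if no rule matches, or raise on a bad frame."""
    if frame is None:
        raise ValueError("no case frame was created")
    if frame.operation == "compare" and not frame.dependent:
        raise ValueError("the dependent variable has not been mentioned")
    for condition, statistic in RULES:
        if condition(frame):
            return statistic
    return None
```

Keeping the rules in a list of (condition, statistic) pairs means further tests could be added without touching `choose_statistic`; whether EXPERSTAT organizes its rules this way is not stated in the paper.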
In the event that the user cannot remember the variable labels and value labels used in the data file, EXPERSTAT includes a utility which will list them on the screen.

CODE GENERATOR

As noted earlier, the code generator currently in use produces executable code for SPSS. The code generation module consists of a separate code generation routine for each of the statistical tests which can be "recommended" by the expert system. These routines simply construct a sequence of SPSS commands from the information contained in the case frame slots. For instance, the query:

    Among freshmen who work more than 20 hours per week are the GPAs of males and females significantly different?

would be represented by the following case frame.

    OPERATION: compare
    INDEPENDENT VARIABLE(S): sex
    GROUPS: male (1) female (2)
    DEPENDENT VARIABLE: gpa
    SAMPLE: class eq 1, work gt 20
    SUBSAMPLES: none

From this case frame the code generator would produce the following SPSS code.

    GET FILE = STUDENTS
    SELECT IF CLASS EQ 1 AND WORK GT 20
    T-TEST GROUPS = SEX (1 2) /VARIABLES = GPA

This assumes, of course, that there is an SPSS system file called STUDENTS which contains the variables "class", "work", "sex", and "gpa". It also assumes that within "class" the value 1 has been labeled "freshman" and within "sex" the values 1 and 2 have been labeled "male" and "female."

SPECIAL PARSING PROBLEMS

Noun Groups

As pointed out by Birnbaum and Selfridge (1981), special provisions must be made for parsing noun groups in order to prevent "premature" decisions about the role of each noun. In their example, encountering the word "stairway" in the sentence "George sat on the stairway handrail" could result in the premature filling of the case frame slot containing George's location. Their solution to this problem was to process only the requests associated with individual nouns until the end of the noun group was reached, at which point all requests on the RLIST were processed.
We extended this approach to cover entire phrases. Consider the queries:

    Are the mean GPAs of senior males and senior females different?
    Are the mean GPAs of senior males and junior males different?

Deferring action on the word "senior" until the end of the noun group "senior males" is reached would not guarantee a correct interpretation. The true role of "senior" is not known until the end of the entire prepositional phrase. In the first query "senior" defines the sample to be analyzed in a comparison of males and females; in the second, it defines one of the two groups being compared in a sample of males. We therefore collect all nouns (which are always variable labels or value labels) on a temporary list and determine which slots they should fill in the case frame when the end of the phrase is encountered.

Premature Case Frame Selection

Premature decisions cannot always be avoided. In the above queries, for instance, the word "mean" is a key word which triggers construction of the DESCRIBE case frame. When the end of the prepositional phrase is encountered, the OBJECT and SAMPLE slots of that case frame are filled. However, the next word, "different", is a keyword which triggers construction of the COMPARE case frame. Since comparisons often involve means, but descriptive statistics do not involve differences, it is clear that this case of "conflicting case frames" can be resolved by moving the contents of the DESCRIBE frame into appropriate slots of the COMPARE frame and destroying the former. Fortunately, the DESCRIBE/COMPARE conflict appears to be an exception because queries typically contain just a single keyword.

Logical Operators

Statistical queries tend to be relatively free of ambiguity when compared to everyday speech. Research questions, by their very nature, call for a precise use of language. Problems can arise, however, when the rules of formal logic conflict with the way we normally interpret human speech. Consider the following query.
    What is the average GPA for students who are not males or seniors?

The rules of formal logic give highest precedence to "not", followed by "and", followed by "or". Applying these rules, we would interpret this query to be requesting the average for students "who are not males or who are seniors". The typical human interpretation, on the other hand, is likely to be that an average is being requested for students "who are not males and who are not seniors".

The dilemma here was whether to interpret all queries in terms of the well established rules governing the precedence of logical operators or generate a whole new set of precedence rules based on common usage. The former would likely satisfy the professional researcher who is familiar with the rules of logic but would confound the casual user, while the latter would likely satisfy the casual user but would confound someone schooled in formal logic.

Our solution was a compromise. Logical precedence rules remain in effect, but one further rule has been added. "Or" is given higher precedence than "not", thereby creating a complete cycle of precedences: "not" precedes "and" precedes "or" precedes "not." This scheme produces interpretations which would be expected by users familiar with logical operations, but makes an exception for queries of the type noted above in order to conform to common usage.

One additional exception has been made to a strictly logic-based interpretation of queries. As noted by Templeton and Burger (1986) in their discussion of EUFID (End-User Friendly Interface to Data Management), the natural language use of "and" and "or" in database (and statistical) queries does not always correspond to their logical meaning. For instance, the query:

    What is the median GPA for students who are juniors AND seniors?

actually means:

    What is the median GPA for students who are juniors OR seniors?

On the other hand, the query:

    What is the median GPA for students who are juniors and female?
means exactly what it says. Clearly, "juniors" and "seniors" are mutually exclusive categories, while "juniors" and "females" are not. EXPERSTAT detects mutual exclusivity by checking whether two value labels come from the same variable. If they do, it converts the "and" operator to "or".

EVALUATION

EXPERSTAT has been tested with over 500 queries generated by the authors on three different data files. One file contained data on individual students, another contained data about colleges and universities, and a third contained health-related data on each of the fifty states. Table 1 contains a sample query and resulting SPSS code involving each of the five primitive operations in each of the three sample data files. Notice in reading this table that the use of user-defined labels sometimes requires departure from a truly natural language format. For example, since variable labels in SPSS cannot exceed eight characters, the label for class rank must be spelled CLASRANK. Also, since value labels cannot contain spaces, words within the label must sometimes be run together (OVER65) or separated by underscores (HAVE_NO_SEATBELT_LAW).

The system does not currently handle analyses involving covariates, repeated measures, or multiple dependent variables. Apart from these exceptions, it has been able to interpret and generate SPSS code for a very wide range of complex statistical queries.

CONCLUSIONS

EXPERSTAT has demonstrated the feasibility of integrating an expert system and natural language processor to construct an intelligent interface for statistical packages. In addition to being a new application of natural language processing techniques, the project has employed several innovations in the way that it focuses on keywords and labels, the way it handles noun groups, and particularly in the way it has resolved conflicts between formal logic and common language usage. Also noteworthy is its flexibility.
EXPERSTAT can be applied to any predefined data set and can be readily adapted to any statistical package. Only the modules which generate code and which read labels from the user's data file need to be rewritten when the interface is transported to a new package.

REFERENCES

Bates, M., & Bobrow, R. J. (1983). A transportable natural language interface. Proceedings of the Sixth Annual International SIGIR Conference on Research and Development in Information Retrieval, ACM.
Birnbaum, L., & Selfridge, M. (1981). Conceptual Analysis of Natural Language. In R. Schank, & C. Riesbeck (Eds.), Inside Computer Understanding. Hillsdale, NJ: Lawrence Erlbaum Associates.
Blum, R. L. (1982). Discovery and representation of causal relationships from a large time-oriented clinical database: the RX project. Lecture Notes in Medical Informatics. New York: Springer-Verlag.
Bucci, P., Lella, G., & Pavan, S. (1985). NLI-ESD: An expert natural language interface to a statistical data bank. Expert Systems and Their Applications, 2, 667-671.
Gale, W. (1986). Artificial Intelligence and Statistics. Reading, MA: Addison-Wesley.
Grosz, B. J., Appelt, D. E., Martin, P., & Pereira, F. C. N. (1987). TEAM: An experiment in the design of transportable natural language interfaces. Artificial Intelligence, 32, 173-243.
HaKong, L., & Hickman, F. R. (1985). Expert systems techniques: an application in statistics. Proceedings of the Fifth Technical Conference of the British Computer Society. Cambridge: Cambridge University Press.
Hendrix, G., Sacerdoti, E., Sagalowicz, D., & Slocum, J. (1978). Developing a natural language interface to complex data. ACM Transactions on Database Systems, 3(2), 105-147.
Jamison, W., & Metzler, D. (1985). An expert system for statistical consulting. Proceedings of the Forty-eighth ASIS Annual Meeting. White Plains, NY: Knowledge Industry Publications.
Marion, R. (1983). An expert system for selecting the correct biomedical statistical procedure. Collegiate Microcomputer, 5, 230-236.
O'Keefe, R. O. (1982). An expert system for statistics. Paper presented at the Technical Conference on Theory and Practice of Knowledge Based Systems, Brunel University.
Riesbeck, C. K. (1975). Conceptual Analysis. In R. Schank (Ed.), Conceptual Information Processing. New York: American Elsevier.
Schank, R. C. (1975). Conceptual Information Processing. New York: American Elsevier.
Smith, A. M. R., Lee, L. S., & Hand, D. J. (1983). Interactive user-friendly interfaces to statistical software. The Computer Journal, 26, 199-204.
Templeton, M., & Burger, J. (1986). Considerations for the Development of Natural-Language Interfaces to Database Management Systems. In L. Bolc, & M. Jarke (Eds.), Cooperative Interfaces to Information Systems. New York: Springer-Verlag.
Thompson, F. B., & Thompson, B. H. (1975). Practical natural language processing: the REL system prototype. In M. Rubinoff, & M. Yovits (Eds.), Advances in Computers. New York: Academic Press.
Waltz, D. L. (1978). An English language question answering system for a large relational database. Communications of the Association for Computing Machinery, 21(7), 526-539.
Warren, D. H. D., & Pereira, F. C. N. (1982). An efficient easily adaptable system for interpreting natural language queries. American Journal of Computational Linguistics, 8(3-4), 110-122.
Woods, W. A., Kaplan, R. M., & Nash-Webber, B. L. (1972). The Lunar Sciences Natural Language Information System: Final Report (BBN Report No. 2378). Cambridge, MA: Bolt Beranek & Newman.

Table 1
Queries from sample data files and their resulting SPSS code
----------------------------------------------------------------
WITHIN THE SENIOR CLASS WHAT IS THE RANGE OF GPAS FOR EACH SEX?
    GET FILE=STUDENTS
    SELECT IF ( CLASS EQ 4 )
    TEMPORARY
    SELECT IF SEX EQ 1
    DESCRIPTIVES VARIABLES= GPA
    TEMPORARY
    SELECT IF SEX EQ 2
    DESCRIPTIVES VARIABLES= GPA
----------------------------------------------------------------
WHAT PROPORTION OF MALE STUDENTS WHO WORK OVER 20 HOURS PER WEEK ARE SENIORS?
    GET FILE=STUDENTS
    SELECT IF WORK GT 20
    CROSSTABS SEX (1 2) BY CLASS (1 4)
----------------------------------------------------------------
AMONG STUDENTS OVER 21 YEARS OF AGE WHAT IS THE CORRELATION BETWEEN GPA AND THE NUMBER OF HOURS HE/SHE WORKS EACH WEEK?
    GET FILE=STUDENTS
    SELECT IF ( AGE GT 21 )
    CORRELATIONS VARIABLES= GPA WORK
----------------------------------------------------------------
DO MALES HAVE A HIGHER CLASRANK THAN FEMALES?
    GET FILE=STUDENTS
    NPAR TESTS M-W = CLASRANK BY SEX (1 2)
----------------------------------------------------------------
TO WHAT EXTENT IS GPA AFFECTED BY AGE AMONG STUDENTS WHO ARE NOT FRESHMEN OR SOPHOMORES?
    GET FILE=STUDENTS
    SELECT IF ( CLASS NE 1 ) AND ( CLASS NE 2 )
    REGRESSION VARIABLES = AGE /DEPENDENT = GPA
----------------------------------------------------------------
WHAT IS THE MEDIAN TUITION CHARGED BY VERMONT COLLEGES WITH ENROLMNTS UNDER 2000 STUDENTS?
    GET FILE=ACADEMIC
    SELECT IF ENROLMNT LT 2000 AND ( REGION EQ 5 )
    DESCRIPTIVES VARIABLES= TUITION
----------------------------------------------------------------
IN MAINE HOW MANY PUBLIC AND PRIVATE COLLEGES ARE THERE WITH TUITIONS EQUAL TO OR GREATER THAN 10000 DOLLARS PER YEAR?
    GET FILE=ACADEMIC
    SELECT IF TUITION GE 10000
    CROSSTABS REGION (1 6) BY OWNER (1 2)
----------------------------------------------------------------
HOW STRONG IS THE RELATIONSHIP BETWEEN MATH AND VERBAL SAT SCORES AMONG COLLEGES IN NEW-HAMPSHIRE AND VERMONT?
    GET FILE=ACADEMIC
    SELECT IF ( REGION EQ 3 ) OR ( REGION EQ 5 )
    CORRELATIONS VARIABLES= MATH VERBAL
----------------------------------------------------------------
IS A COLLEGE'S TUITION DETERMINED BY REGION AND WHETHER IT IS PUBLIC OR PRIVATE?
    GET FILE=ACADEMIC
    ANOVA TUITION BY OWNER (1 2) BY REGION (1 6)
----------------------------------------------------------------
IS THE SIZE OF A COLLEGE'S LIBRARY DETERMINED BY ITS TUITION AND ENROLMNT?
    GET FILE=ACADEMIC
    REGRESSION VARIABLES = ENROLMNT TUITION /DEPENDENT = LIBRARY
----------------------------------------------------------------
WHAT IS THE AVERAGE DEATH RATE FROM CANCER IN NEW-ENGLAND STATES WHICH HAVE_NUKES?
    GET FILE=HEALTH
    SELECT IF ( REGION EQ 1 AND NUKE EQ 1 )
    DESCRIPTIVES VARIABLES= CANCER
----------------------------------------------------------------
HOW MANY STATES IN WHICH 20 % OR MORE OF THE POPULATION IS OVER65 HAVE_NO_SEATBELT_LAW?
    GET FILE=HEALTH
    SELECT IF OVER65 GE 20
    FREQUENCIES VARIABLES=SEATBELT
----------------------------------------------------------------
ARE THE NUMBER OF DOCTORS IN A STATE AND THE STATE'S INCOME LEVEL CORRELATED?
    GET FILE=HEALTH
    CORRELATIONS VARIABLES= DOCTORS INCOME
----------------------------------------------------------------
IN NEW-ENGLAND AND THE MID-ATLANTIC STATES WHAT IS THE EFFECT OF SEATBELT LAWS ON THE NUMBER OF PEOPLE ADMITTED TO HOSPITALS?
    GET FILE=HEALTH
    SELECT IF ( REGION EQ 1 OR REGION EQ 2 )
    T-TEST GROUPS = SEATBELT (1 0) /VARIABLES = ADMITTED
----------------------------------------------------------------
WHAT EFFECT DOES THE NUMBER OF DOCTORS IN A MID-ATLANTIC STATE HAVE ON ITS DEATH RATE FROM HEART DISEASE?
    GET FILE=HEALTH
    SELECT IF ( REGION EQ 2 )
    REGRESSION VARIABLES = DOCTORS /DEPENDENT = HEART
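
The commands in Table 1 follow a regular pattern: a GET FILE line, an optional SELECT IF line built from the SAMPLE slot, and a procedure line built from the remaining slots. As an illustration only, the t-test routine described in the CODE GENERATOR section might be sketched in Python as below. The dictionary layout of the case frame and the function name `generate_t_test` are assumptions; the slot contents are those of the worked example in that section, and the SUBSAMPLES slot is not handled.

```python
def generate_t_test(frame, data_file):
    """Build SPSS commands for a two-group comparison from a case frame (a dict here)."""
    lines = [f"GET FILE = {data_file}"]
    # SAMPLE conditions are joined with AND, as in the paper's worked example.
    if frame["sample"]:
        lines.append("SELECT IF " + " AND ".join(frame["sample"]))
    # GROUPS holds (value label, numeric code) pairs; SPSS needs only the codes.
    codes = " ".join(str(code) for _label, code in frame["groups"])
    lines.append(
        f"T-TEST GROUPS = {frame['independent']} ({codes}) "
        f"/VARIABLES = {frame['dependent']}"
    )
    return "\n".join(lines)


# Case frame for: "Among freshmen who work more than 20 hours per week
# are the GPAs of males and females significantly different?"
frame = {
    "operation": "compare",
    "independent": "SEX",
    "groups": [("male", 1), ("female", 2)],
    "dependent": "GPA",
    "sample": ["CLASS EQ 1", "WORK GT 20"],
    "subsamples": [],
}

spss_code = generate_t_test(frame, "STUDENTS")
```

For this case frame, `spss_code` contains the same three commands that the CODE GENERATOR section gives for the query.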