               An Expert Natural Language Interface
                    for Statistical Packages


                         Richard Lyczak
                      Sylvia Weber-Russell
                   University of New Hampshire




                             SUMMARY
A natural language interface has been developed to facilitate the
use of statistical packages.  Queries are parsed into "case frames"
based on statistical primitives.  A rule-based expert system uses
the case frame to choose a statistical test and generate a batch
file which, when executed, answers the query.

                      CONTENT INDICATORS

General terms:  DESIGN

Categories:
H.2.4  Information Systems / Database Management / Systems
      Subject Descriptor: QUERY PROCESSING
H.3.m  Information Systems / Information Storage & Retrieval / Misc.
      Additional Keywords: STATISTICS, STATISTICAL PACKAGES
I.2.1  Computing Methodologies / AI / Applications and Expert Systems
      Subject Descriptor: NATURAL LANGUAGE INTERFACES
I.2.7  Computing Methodologies / AI / Natural Language Processing
      Subject Descriptor: LANGUAGE PARSING AND UNDERSTANDING


Statistical packages are used to analyze data in a wide variety of academic,
business, and research settings.  Users are typically experts in some field or
discipline which requires the analysis of data but have limited backgrounds in
statistics and computing. The purpose of this project was to facilitate the
use of statistical packages by developing an interface which allows the user
to ask research questions in natural English, without having to specify the
statistical test to be used.

The most extensive use of natural language interfaces (NLI) has been in the
area of database query systems.  While early systems were based on augmented
transition networks (Woods et al., 1972) or semantic grammars (Waltz, 1978;
Hendrix et al., 1978; Thompson & Thompson, 1975), more recent systems have
employed an intermediate representation language (IRL) (Warren & Pereira,
1982; Bates & Bobrow, 1983; Grosz et al., 1987) which is then translated into
the database query language.  Although IRL systems typically construct the IRL
representation of a query from a syntactic parse tree, methods do exist for
going directly from natural language to the IRL representation.  One such
system is the "conceptual analyzer" developed by Riesbeck (1975) and refined
by Birnbaum & Selfridge (1981).  The conceptual analyzer is based on Schank's
(1975) conceptual dependency theory which uses a small number of "conceptual
primitives" to represent actions or states.  Although this approach is well
suited for applications which have a small number of primitive operations,
such as database management, it has not been applied to the development of
database query systems.

Like database management systems, statistical packages use a query language to
obtain information from a highly structured data file.  While a database query
typically specifies subsets of data to be retrieved, a statistical package
query typically specifies subsets of data to be described or compared. 
Writing a statistical query requires considerable statistical expertise since
the user must specify not only the data to be analyzed but also the
statistical test to be used in the analysis.  Because of the complexity
involved in
choosing an appropriate statistical test, numerous expert systems have been
developed to guide the selection process (Blum, 1982; O'Keefe, 1982; Smith et
al., 1983; HaKong & Hickman, 1985; Jamison & Metzler, 1985; Gale, 1986;
Marion, 1987).  While most statistical expert systems employ menus or question
the user about features of the research design, Bucci et al. (1985) appear
to have incorporated some level of statistical expertise into an NLI which
queries a demographic database maintained by the Italian government.  However,
this expert NLI is dedicated to a specific database and offers little guidance
for the development of more general systems.

While the work of Bucci et al. was limited in scope, we believe that the
approach of building statistical expertise into an NLI for statistical packages
holds promise.  An ideal interface would accept a natural language query,
choose an appropriate statistical test, then generate and execute the code
needed to answer the query.  We have therefore developed a system called
"EXPERSTAT" which includes a natural language interpreter based on Riesbeck's
(1975) conceptual analyzer and an expert system based on the principles
employed by Marion (1987).  In its current form, this interface generates and
executes code for SPSS, a statistical package widely used in the social
sciences.  However, since the NLI, expert system, and code generation modules
are all independent, the system could readily be adapted to work with any
statistical package.  Unlike the system of Bucci et al., it can also be used
with any predefined data set.

                            OVERVIEW

Processing takes place in four phases.  First, a dictionary is constructed
which contains labels from the data file, key statistical terms, and various
connectives such as prepositions, relative pronouns, and interrogatives. 
Second, a statistical query is obtained from the user and parsed.  Queries
must be entered as questions which make reference to variables in the data
file.  They must end with a question mark and contain no other punctuation.
Queries may be entered interactively or in a file.  Parsing the query involves
looking up each word in the dictionary and applying functions associated with
words in the dictionary to construct a case frame which represents the meaning
of the query.  Third, the completed case frame is examined by the expert
system to select an appropriate statistical test.  Finally, the expert system
passes the case frame to the code generating routine for the statistical test
chosen.  Code is generated from the contents of the case frame and executed. 
The output provides an answer to the query. Once a query has been answered,
phases two through four are repeated until there are no more queries.  Let us
consider each of these four phases in detail.

                    DICTIONARY CONSTRUCTION

Like most statistical packages, SPSS requires the user to define the data set
by assigning labels to each variable.  Labels may also be assigned to
individual values within a variable (e.g., within the variable SEX, 1 = male,
2 = female).  EXPERSTAT begins by reading these labels into a hash table which
serves as its dictionary.  In addition to "variable labels" and "value labels"
the dictionary contains "key words" (e.g., mean, difference, relationship)
which determine the type of statistical operation being requested and certain
"connectives" (e.g., prepositions, relative pronouns, interrogatives,
conjunctions, and certain helping verbs) which define the relationship between
labels and key words.
  
In keeping with Riesbeck's approach, each word in the dictionary is associated
with one or more "requests" which are actually executable routines designed to
set up and fill the slots in a case frame. These requests are similar to 
production rules in that they contain a test and an action to be carried out
if the test succeeds (e.g., "if the query contains this word, set up the case
frame for descriptive statistics").  When a word is found in the dictionary, a
list of its requests is returned.

It is important to note that only the user-defined labels are added to the
dictionary at run time.  All of the remaining words and their requests are
built into the system.  
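
As an illustration only, the dictionary described above might be sketched in
Python as follows.  The paper does not give EXPERSTAT's implementation language
or entry layout, so the function name build_dictionary, the entry fields, and
the three built-in words shown are assumptions made for this sketch.

```python
def build_dictionary(variable_labels, value_labels):
    """Return a word -> entry mapping (a Python dict is a hash table)."""
    # Built-in entries: key words and connectives ship with the system.
    # Only three hypothetical entries are shown here.
    dictionary = {
        "mean": {"kind": "key word", "operation": "describe"},
        "relationship": {"kind": "key word", "operation": "relate"},
        "of": {"kind": "connective"},
    }
    # Run-time entries: labels read from the user's data file.
    for name in variable_labels:
        dictionary[name.lower()] = {"kind": "variable", "variable": name}
    for variable, labels in value_labels.items():
        for value, label in labels.items():
            dictionary[label.lower()] = {
                "kind": "value",
                "variable": variable,
                "value": value,
            }
    return dictionary


# Labels taken from the SEX example in the text.
d = build_dictionary(["SEX", "GPA"], {"SEX": {1: "male", 2: "female"}})
print(d["female"])
```

In this sketch an entry records only what kind of word it is; in the system
described here each entry would also carry its list of requests.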

                             PARSING

The parser starts at the beginning of the query and looks up each word in the
dictionary.  If the word is found, the requests returned are added to a list
of requests called the RLIST.  If the word is not found, the parser tries
variations of the word, formed by adding or removing suffixes such as "s" or
"es" or by substituting characters, such as "man" for "men".  If neither the
word nor a
variation of the word can be found, the parser simply moves to the next word
in the query.  In that way, only "significant" words are used to construct a
representation of the query's meaning.  This focus on significant words is
made possible by the very restricted context (i.e., analysis of a specific
data file) in which the queries are interpreted. 
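
The lookup with spelling variations might be sketched as below.  The exact set
and order of variations tried by EXPERSTAT is not specified, so the helper
names variations and lookup and the particular rules are assumptions.

```python
def variations(word):
    """Yield hypothetical variants: suffix changes, then man/men swaps."""
    if word.endswith("es"):
        yield word[:-2]
    if word.endswith("s"):
        yield word[:-1]
    yield word + "s"
    yield word + "es"
    if "men" in word:
        yield word.replace("men", "man")
    if "man" in word:
        yield word.replace("man", "men")


def lookup(word, dictionary):
    """Return the dictionary entry for a word or one of its variants."""
    word = word.lower()
    if word in dictionary:
        return dictionary[word]
    for candidate in variations(word):
        if candidate in dictionary:
            return dictionary[candidate]
    return None  # the caller simply moves on to the next word


# A toy dictionary; the entries are placeholders.
toy = {"male": "SEX=1", "freshman": "CLASS=1", "gpa": "GPA"}
print(lookup("males", toy), lookup("freshmen", toy), lookup("the", toy))
```

Returning None for an unknown word corresponds to the parser skipping words
that are not significant in the context of the data file.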

After each addition to the RLIST, the parser executes all of the requests on
the list.  Most requests contain a conditional clause.  If the conditions of
the request are met and some action is performed related to setting up or
filling a case frame, the request is removed from the RLIST.  Otherwise, it
remains on the list until its conditions are met.

Ultimately, all requests are aimed at representing the query's meaning with a
completed case frame.  However, since the parser processes a query
sequentially, it is often necessary to store concepts temporarily until it is
known
what type of case frame is needed to represent the query.  This list of
concepts, which have been encountered but not used, is called the CLIST.  The
conditional clauses of requests usually refer to the relative positions of
items on the CLIST.  For instance, "if the CLIST contains an item suggesting
that descriptive statistics are required and that item is preceded by the name
of a variable, set up a DESCRIBE case frame and insert the variable in its
OBJECT slot."  Once an item on the CLIST has been used, it is removed from the
list.
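
The interplay of the RLIST and the CLIST can be sketched as follows.  This is
a minimal sketch, not the authors' code: requests are modeled as Python
functions which return True when they fire, and the names push,
describe_request, DICTIONARY, and parse are all hypothetical.  The single
conditional request shown implements the DESCRIBE example quoted above.

```python
def push(concept):
    """Build a request which always fires: store a concept on the CLIST."""
    def request(clist, frame):
        clist.append(concept)
        return True
    return request


def describe_request(clist, frame):
    """Fire when a descriptive marker is directly preceded by a variable."""
    for i in range(1, len(clist)):
        prev, item = clist[i - 1], clist[i]
        if item == ("stat", "describe") and prev[0] == "variable":
            frame["OPERATION"] = "describe"
            frame["OBJECT"] = prev[1]
            del clist[i - 1:i + 1]  # used items leave the CLIST
            return True
    return False  # conditions not met: stay on the RLIST


DICTIONARY = {
    "gpa": [push(("variable", "GPA"))],
    "average": [push(("stat", "describe")), describe_request],
}


def parse(query):
    rlist, clist, frame = [], [], {}
    for word in query.lower().rstrip("?").split():
        requests = DICTIONARY.get(word)
        if requests is None:
            continue  # word not in the dictionary: skip it
        rlist.extend(requests)
        # After each addition, run every pending request once and
        # keep only those which did not fire.
        rlist = [r for r in rlist if not r(clist, frame)]
    return frame, clist, rlist


frame, clist, rlist = parse("What is the GPA average?")
print(frame)
```

A query such as "What is the average?" leaves describe_request pending on the
RLIST, which is the situation the error routines of the expert system are
meant to catch.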

All queries are interpreted in terms of five "primitive" statistical
operations: describe, tabulate, compare, regress, and relate.  Each primitive
operation is represented by a different case frame.  Slots in these case
frames are designed to collect the information needed to carry out that type
of analysis.  The case frames constructed by our parser are analogous in
concept to Schank's "conceptual case frames."  That is, our parser,
consistent with Schank's theory as implemented by Riesbeck, links concepts in a
sentence to a governing primitive in a conceptual or semantic frame.  In
Schank's frames, this primitive is a conceptual act or state.  In our higher
level frames, however, the governing primitive is the desired statistical
operation, regardless of the conceptual acts or states underlying the actual
query.

The operation called for by a specific query is determined by the presence of
certain "key words" in the query.  Examples of key words for each operation
are:
   DESCRIBE: average, descriptive, mean, deviation
   TABULATE: fraction, frequency, many, percent, portion
   COMPARE: compare, differ, comparative forms of adjectives
   REGRESS: affect, depend, determine, effect, impact
   RELATE: correlation, relate, relationship

The first request associated with each of these key words in the dictionary is
to set up the case frame for the appropriate statistical operation.
Additional requests are aimed at filling the slots in the case frame and vary
from
one key word to another depending on where (and in what form) information is
likely to be found on the CLIST.  For instance, the following queries are both
asking for the same comparison, but the requests needed to locate the groups
to be compared would be quite different.

   Are the GPAs of males higher than the GPAs of females?
   Do the GPAs of the two sexes differ?

Even so, many key words do share the same set of requests.  When this occurs,
the dictionary entry for those words is simply a reference to a request list
stored elsewhere in the dictionary.  Having several words referring to the
same generic request list considerably reduces the size of the dictionary.

All case frames contain slots for the variable or variables to be analyzed
(e.g., the dependent variable), a definition of the sample to be examined, and
any subsamples which need to be analyzed separately.  Where cause-effect
relationships are being explored, there may also be slots for an independent
variable and the names of groups within the independent variable which are to
be included in the analysis.
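
One possible representation of the key word table and the case frame slots is
sketched below.  The key words are those listed earlier (comparative
adjectives are omitted because they are a word class rather than a word); the
class name CaseFrame and its field names are assumptions, chosen to mirror the
slot names used in this paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

KEY_WORDS = {
    "describe": ["average", "descriptive", "mean", "deviation"],
    "tabulate": ["fraction", "frequency", "many", "percent", "portion"],
    "compare": ["compare", "differ"],
    "regress": ["affect", "depend", "determine", "effect", "impact"],
    "relate": ["correlation", "relate", "relationship"],
}
# Invert the table so that a key word selects its primitive operation.
OPERATION_FOR = {word: op for op, words in KEY_WORDS.items() for word in words}


@dataclass
class CaseFrame:
    operation: str
    dependent: List[str] = field(default_factory=list)
    independent: List[str] = field(default_factory=list)
    groups: List[Tuple[str, int]] = field(default_factory=list)
    sample: List[str] = field(default_factory=list)
    subsamples: List[str] = field(default_factory=list)


frame = CaseFrame(
    operation=OPERATION_FOR["differ"],
    dependent=["gpa"],
    independent=["sex"],
    groups=[("male", 1), ("female", 2)],
    sample=["class eq 1", "work gt 20"],
)
print(frame)
```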

                         EXPERT SYSTEM

When parsing is finished, the completed case frame is examined by a rule-
based expert system which determines the appropriate statistic needed to
answer the query.  Decisions are based on four factors:
     (1) the type of primitive operation,
     (2) the number of variables involved,
     (3) each variable's level of measurement (nominal, ordinal,
         interval/ratio),
     (4) the number of groups involved in the analysis.
Thus, a typical rule might state:
IF
  the operation is a comparison
AND
  there is a single variable in the INDEPENDENT VARIABLE slot of the
  case frame
AND
  the level of measurement for that single variable is nominal
AND
  there is a single variable in the DEPENDENT VARIABLE slot of the case
  frame
AND
  the level of measurement for that single variable is interval
AND
  the number of groups in the GROUPS slot is 2

THEN
  the statistic needed is a "t-test".

In this case, the case frame would be passed to the routine in the code
generator which generates code for t-tests.
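
The rule above translates almost directly into code.  The following sketch
encodes only that one rule; the function name choose_test, the dictionary
layout of the case frame, and the levels table are assumptions.

```python
def choose_test(frame, levels):
    """Return the name of a statistic, or None if no rule applies.

    frame  -- case frame as a dict of slot name -> contents
    levels -- variable name -> level of measurement
    """
    independent = frame["independent"]
    dependent = frame["dependent"]
    if (
        frame["operation"] == "compare"
        and len(independent) == 1
        and levels[independent[0]] == "nominal"
        and len(dependent) == 1
        and levels[dependent[0]] == "interval"
        and len(frame["groups"]) == 2
    ):
        return "t-test"
    return None  # a full rule base would try further rules here


levels = {"sex": "nominal", "gpa": "interval"}
frame = {
    "operation": "compare",
    "independent": ["sex"],
    "dependent": ["gpa"],
    "groups": [("male", 1), ("female", 2)],
}
print(choose_test(frame, levels))
```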

In addition to choosing a statistical test, the expert system module also
traps errors.  If no case frame has been created, or key slots in a case frame
are empty, the expert system calls error message routines which provide
feedback to the user.  For instance, if a COMPARE case frame arrived with the
INDEPENDENT VARIABLE slot completed but the DEPENDENT VARIABLE slot empty, the
user might receive the following message.

The word HIGHER implies that you wish to compare two or more groups with
respect to some dependent variable.

It appears that the groups are: MALE FEMALE.  However, you have not
mentioned the dependent variable.

Please rephrase your query so that it identifies both the groups and the
dependent variable using the names of variables and values from your
data file. 

In the event that the user cannot remember the variable labels and value
labels used in the data file, EXPERSTAT includes a utility which will list
them on the screen.

                         CODE GENERATOR

As noted earlier, the code generator currently in use produces executable code
for SPSS.  The code generation module consists of a separate code generation
routine for each of the statistical tests which can be "recommended" by the
expert system.  These routines simply construct a sequence of SPSS commands
from the information contained in the case frame slots.  For instance, the
query:

Among freshmen who work more than 20 hours per week are the GPAs of
males and females significantly different?

would be represented by the following case frame.

     OPERATION: compare
     INDEPENDENT VARIABLE(S): sex
     GROUPS: male (1) female (2)
     DEPENDENT VARIABLE: gpa
     SAMPLE: class eq 1, work gt 20
     SUBSAMPLES: none

From this case frame the code generator would produce the following SPSS code.

     GET FILE = STUDENTS
     SELECT IF CLASS EQ 1 AND WORK GT 20
     T-TEST GROUPS = SEX (1 2)
         /VARIABLES = GPA

This assumes, of course, that there is an SPSS system file called STUDENTS
which contains the variables "class", "work", "sex", and "gpa".  It also
assumes that within "class" the value 1 has been labeled "freshman" and within
"sex" the values 1 and 2 have been labeled "male" and "female." 

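A code generation routine of this kind can be sketched in a few lines.  The
sketch below reproduces the SPSS commands shown above from the example case
frame; the function name generate_t_test and the dictionary layout of the
case frame are assumptions, not the authors' implementation.

```python
def generate_t_test(frame, system_file):
    """Build SPSS commands for a two-group t-test from a case frame."""
    lines = [f"GET FILE = {system_file}"]
    if frame["sample"]:
        conditions = " AND ".join(c.upper() for c in frame["sample"])
        lines.append(f"SELECT IF {conditions}")
    codes = " ".join(str(code) for _, code in frame["groups"])
    variable = frame["independent"][0].upper()
    lines.append(f"T-TEST GROUPS = {variable} ({codes})")
    lines.append(f"    /VARIABLES = {frame['dependent'][0].upper()}")
    return "\n".join(lines)


frame = {
    "operation": "compare",
    "independent": ["sex"],
    "groups": [("male", 1), ("female", 2)],
    "dependent": ["gpa"],
    "sample": ["class eq 1", "work gt 20"],
    "subsamples": [],
}
print(generate_t_test(frame, "STUDENTS"))
```
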
                    SPECIAL PARSING PROBLEMS

Noun Groups
As pointed out by Birnbaum and Selfridge (1981), special provisions must be
made for parsing noun groups in order to prevent "premature" decisions about
the role of each noun.  In their example, encountering the word "stairway" in
the sentence "George sat on the stairway handrail" could result in the
premature filling of the case frame slot containing George's location.  Their
solution to this problem was to process only the requests associated with
individual nouns until the end of the noun group was reached, at which point
all requests on the RLIST were processed.  We extended this approach to cover
entire phrases.  Consider the queries:

Are the mean GPAs of senior males and senior females different?

Are the mean GPAs of senior males and junior males different?

Deferring action on the word "senior" until the end of the noun group "senior
males" is reached would not guarantee a correct interpretation. The true role
of "senior" is not known until the end of the entire prepositional phrase.  In
the first query "senior" defines the sample to be analyzed in a comparison of
males and females; in the second, it defines one of the two groups being
compared in a sample of males.  We therefore collect all nouns (which are
always variable labels or value labels) on a temporary list and determine
which slots they should fill in the case frame when the end of the phrase is
encountered.
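
The paper does not state the rule used to assign roles at the end of the
phrase.  One heuristic consistent with the two example queries is sketched
below: a variable represented by two or more distinct value labels supplies
the groups, while a variable represented by a single value label restricts
the sample.  The function name assign_roles and the heuristic itself are
assumptions.

```python
def assign_roles(labels):
    """Split collected (variable, value label) pairs into groups and sample."""
    values = {}
    for variable, label in labels:
        seen = values.setdefault(variable, [])
        if label not in seen:
            seen.append(label)
    groups = {v: ls for v, ls in values.items() if len(ls) > 1}
    sample = {v: ls[0] for v, ls in values.items() if len(ls) == 1}
    return groups, sample


# "senior males and senior females"
first = [("class", "senior"), ("sex", "male"),
         ("class", "senior"), ("sex", "female")]
# "senior males and junior males"
second = [("class", "senior"), ("sex", "male"),
          ("class", "junior"), ("sex", "male")]
print(assign_roles(first))
print(assign_roles(second))
```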

Premature Case Frame Selection
Premature decisions cannot always be avoided.  In the above queries, for
instance, the word "mean" is a key word which triggers construction of the
DESCRIBE case frame.  When the end of the prepositional phrase is encountered,
the OBJECT and SAMPLE slots of that case frame are filled.  However, the next
word, "different", is a key word which triggers construction of the
COMPARE case frame.  Since comparisons often involve means, but descriptive
statistics do not involve differences, it is clear that this case of
"conflicting case frames" can be resolved by moving the contents of the
DESCRIBE frame into appropriate slots of the COMPARE frame and destroying the
former.
Fortunately, the DESCRIBE/COMPARE conflict appears to be an exception because
queries typically contain just a single key word.
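
The resolution of a DESCRIBE/COMPARE conflict might look like the sketch
below.  Which DESCRIBE slot maps to which COMPARE slot is not spelled out in
the text; mapping OBJECT to DEPENDENT VARIABLE is an assumption, as is the
function name resolve_conflict.

```python
def resolve_conflict(describe_frame, compare_frame):
    """Move DESCRIBE contents into COMPARE; the DESCRIBE frame is dropped."""
    merged = dict(compare_frame)
    if not merged.get("DEPENDENT VARIABLE"):
        merged["DEPENDENT VARIABLE"] = describe_frame.get("OBJECT")
    if not merged.get("SAMPLE"):
        merged["SAMPLE"] = describe_frame.get("SAMPLE")
    return merged


describe = {"OPERATION": "describe", "OBJECT": "gpa", "SAMPLE": "class eq 4"}
compare = {"OPERATION": "compare"}
print(resolve_conflict(describe, compare))
```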

Logical Operators
Statistical queries tend to be relatively free of ambiguity when compared to
everyday speech.  Research questions, by their very nature, call for a precise
use of language.  Problems can arise, however, when the rules of formal logic
conflict with the way we normally interpret human speech.  Consider the
following query.

What is the average GPA for students who are not males or
seniors?

The rules of formal logic give highest precedence to "not", followed by "and",
followed by "or".  Applying these rules, we would interpret this query to be
requesting the average for students "who are not males or who are seniors". 
The typical human interpretation, on the other hand, is likely to be that an
average is being requested for students "who are not males and who are not
seniors".

The dilemma here was whether to interpret all queries in terms of the well
established rules governing the precedence of logical operators or to generate a
whole new set of precedence rules based on common usage.  The former would
likely satisfy the professional researcher who is familiar with the rules of
logic but would confound the casual user, while the latter would likely
satisfy the casual user but would confound someone schooled in formal logic. 
Our solution was a compromise.  Logical precedence rules remain in effect, but
one further rule has been added. "Or" is given higher precedence than "not",
thereby creating a complete cycle of precedences: "not" precedes "and"
precedes "or" precedes "not."  This scheme produces interpretations which
would be expected by users familiar with logical operations, but makes an
exception for queries of the type noted above in order to conform to common
usage. 
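
A cyclic precedence cannot be expressed as a simple ranking of operators, but
it can be expressed as a grammar.  The sketch below is one way to realize the
scheme and is not taken from EXPERSTAT: immediately after "not", a chain of
"or" terms is gathered first; everywhere else the conventional precedence
applies.  The function name parse_condition and the tuple representation are
assumptions.

```python
def parse_condition(tokens):
    """Parse a token list into nested tuples using the cyclic precedence."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def not_expr():
        if peek() == "not":
            take()
            atoms = [take()]
            while peek() == "or":  # "or" binds tighter than "not"
                take()
                atoms.append(take())
            inner = atoms[0] if len(atoms) == 1 else ("or", *atoms)
            return ("not", inner)
        return take()

    def and_expr():
        parts = [not_expr()]
        while peek() == "and":  # "not" binds tighter than "and"
            take()
            parts.append(not_expr())
        return parts[0] if len(parts) == 1 else ("and", *parts)

    def or_expr():
        parts = [and_expr()]
        while peek() == "or":  # "and" binds tighter than "or"
            take()
            parts.append(and_expr())
        return parts[0] if len(parts) == 1 else ("or", *parts)

    return or_expr()


print(parse_condition(["not", "male", "or", "senior"]))
```

Here "not male or senior" becomes a negation of the whole disjunction, which
by De Morgan's law is equivalent to "not male and not senior", the
interpretation a casual user would expect.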

One additional exception has been made to a strictly logic-based
interpretation of queries.  As noted by Templeton and Burger (1986) in their
discussion of EUFID (End-User Friendly Interface to Data Management), the
natural
language use of "and" and "or" in database (and statistical) queries does not
always correspond to their logical meaning.  For instance, the query:

What is the median GPA for students who are juniors AND seniors?

actually means:

What is the median GPA for students who are juniors OR seniors?

On the other hand, the query:

What is the median GPA for students who are juniors and female?

means exactly what it says.  Clearly, "juniors" and "seniors" are mutually
exclusive categories, while "juniors" and "female" are not.  EXPERSTAT detects
mutual exclusivity by checking whether two value labels come from the same
variable.  If they do, it converts the "and" operator to "or".
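
The conversion can be sketched as a check on the variables behind the two
value labels.  The function name combine is hypothetical, and the value codes
used in the example (junior = 3, senior = 4, female = 2) are assumptions
inferred from the other examples in this paper.

```python
def combine(first, second, operator="and"):
    """Join two (variable, value) conditions into an SPSS-style expression.

    Two different values of the same variable are mutually exclusive,
    so a natural language "and" between them is converted to "or".
    """
    same_variable = first[0] == second[0]
    if operator == "and" and same_variable and first[1] != second[1]:
        operator = "or"
    left = f"( {first[0]} EQ {first[1]} )"
    right = f"( {second[0]} EQ {second[1]} )"
    return f"{left} {operator.upper()} {right}"


print(combine(("CLASS", 3), ("CLASS", 4)))  # juniors and seniors
print(combine(("CLASS", 3), ("SEX", 2)))    # juniors and female
```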

                           EVALUATION

EXPERSTAT has been tested with over 500 queries generated by the authors on
three different data files.   One file contained data on individual students,
another contained data about colleges and universities, and a third contained
health-related data on each of the fifty states.  Table 1 contains a sample
query and resulting SPSS code involving each of the five primitive operations
in each of the three sample data files.  Notice in reading this table that the
use of user-defined labels sometimes requires departure from a truly natural
language format.  For example, since variable labels in SPSS cannot exceed
eight characters, the label for class rank must be spelled CLASRANK.  Also,
since value labels cannot contain spaces, words within the label must
sometimes be run together (OVER65) or separated by underscores
(HAVE_NO_SEATBELT_LAW).

The system does not currently handle analyses involving covariates, repeated
measures, or multiple dependent variables.  Apart from these exceptions, it
has been able to interpret and generate SPSS code for a very wide range of
complex statistical queries.

                          CONCLUSIONS

EXPERSTAT has demonstrated the feasibility of integrating an expert system and
a natural language processor to construct an intelligent interface for
statistical packages.  In addition to being a new application of natural
language processing techniques, the project has employed several innovations
in the way that it focuses on key words and labels, the way it handles noun
groups, and particularly in the way it has resolved conflicts between formal
logic and common language usage.  Also noteworthy is its flexibility.
EXPERSTAT can be applied to any predefined data set and can be readily adapted
to any statistical package.  Only the modules which generate code and which
read labels from
the user's data file need to be rewritten when the interface is transported to
a new package.

                           REFERENCES

Bates, M., & Bobrow, R. J. (1983).  A transportable natural language
     interface.  Proceedings of the Sixth Annual International SIGIR
     Conference on Research and Development in Information Retrieval,
     ACM.
Birnbaum, L., & Selfridge, M. (1981). Conceptual Analysis of Natural Language. 
     In R. Schank, & C. Riesbeck (Eds.), Inside Computer Understanding. 
     Hillsdale, NJ: Lawrence Erlbaum Associates.
Blum, R. L. (1982).  Discovery and representation of causal relationships from
     a large time-oriented clinical database: the RX project.  Lecture Notes
     in Medical Informatics.  New York: Springer-Verlag.
Bucci, P., Lella, G., & Pavan, S. (1985).  NLI-ESD: An expert natural language
     interface to a statistical data bank.  Expert Systems and Their
     Applications, 2, 667-671.
Gale, W. (1986).  Artificial Intelligence and Statistics.  Reading, MA:
     Addison-Wesley.
Grosz, B. J., Appelt, D. E., Martin, P., & Pereira, F. C. N. (1987).  TEAM: An
     experiment in the design of transportable natural language interfaces. 
     Artificial Intelligence, 32, 173-243.
HaKong, L., & Hickman, F. R. (1985).  Expert systems techniques: an
     application in statistics.  Proceedings of the Fifth Technical
     Conference of the British Computer Society.  Cambridge:
     Cambridge University Press.
Hendrix, G., Sacerdoti, E., Sagalowicz, D., & Slocum, J. (1978).  Developing a
     natural language interface to complex data.  ACM Transactions on Database
     Systems, 3(2), 105-147.
Jamison, W., & Metzler, D. (1985).  An expert system for statistical   
     consulting.  Proceedings of the Forty-eighth ASIS Annual Meeting.  White
     Plains, NY: Knowledge Industry Publications.
Marion, R. (1987).  An expert system for selecting the correct biomedical
     statistical procedure.  Collegiate Microcomputer, 5, 230-236.
O'Keefe, R. O. (1982).  An expert system for statistics.  Paper presented at
     the Technical Conference on Theory and Practice of Knowledge Based
     Systems, Brunel University.
Riesbeck, C. K. (1975).  Conceptual Analysis.  In R. Schank (Ed.), Conceptual
     Information Processing.  New York: American Elsevier.
Schank, R. C. (1975).  Conceptual Information Processing.  New York: American
     Elsevier.
Smith, A. M. R., Lee, L. S., & Hand, D. J. (1983).  Interactive user-friendly
     interfaces to statistical software.  The Computer Journal, 26,
     199-204.
Templeton, M., & Burger, J. (1986).  Considerations for the Development of
     Natural-Language Interfaces to Database Management Systems.  In L.
     Bolc, & M. Jarke (Eds.), Cooperative Interfaces to Information
     Systems.  New York: Springer-Verlag.
Thompson, F. B., & Thompson, B. H. (1975).  Practical natural language
     processing: the REL system prototype.  In M. Rubinoff, & M. Yovits
     (Eds.), Advances in Computers.  New York: Academic Press.
Waltz, D. L.  (1978).  An English language question answering system for a
     large relational database.  Communications of the Association for
     Computing Machinery, 21(7), 526-539.
Warren, D. H. D., & Pereira, F. C. N. (1982).  An efficient easily adaptable
     system for interpreting natural language queries.  American
     Journal of Computational Linguistics, 8(3-4), 110-122.
Woods, W. A., Kaplan, R. M., & Nash-Webber, B. L. (1972).  The Lunar Sciences
     Natural Language Information System: Final Report (BBN Report No. 2378).
     Cambridge, MA: Bolt Beranek & Newman.

                              Table 1
   Queries from sample data files and their resulting SPSS code

----------------------------------------------------------------
WITHIN THE SENIOR CLASS WHAT IS THE RANGE OF GPAS FOR EACH SEX?

GET FILE=STUDENTS
SELECT IF (  CLASS EQ 4 )
TEMPORARY 
SELECT IF SEX EQ 1
DESCRIPTIVES VARIABLES= GPA
TEMPORARY 
SELECT IF SEX EQ 2
DESCRIPTIVES VARIABLES= GPA
----------------------------------------------------------------
WHAT PROPORTION OF MALE STUDENTS WHO WORK OVER 20 HOURS PER WEEK
ARE SENIORS?

GET FILE=STUDENTS
SELECT IF   WORK GT 20
CROSSTABS SEX (1 2) BY CLASS (1 4) 
----------------------------------------------------------------
AMONG STUDENTS OVER 21 YEARS OF AGE WHAT IS THE CORRELATION
BETWEEN GPA AND THE NUMBER OF HOURS HE/SHE WORKS EACH WEEK?

GET FILE=STUDENTS
SELECT IF (  AGE GT 21 )
CORRELATIONS VARIABLES= GPA WORK 
----------------------------------------------------------------
DO MALES HAVE A HIGHER CLASRANK THAN FEMALES?

GET FILE=STUDENTS
NPAR TESTS M-W = CLASRANK BY SEX (1 2)
----------------------------------------------------------------
TO WHAT EXTENT IS GPA AFFECTED BY AGE AMONG STUDENTS WHO ARE NOT
FRESHMEN OR SOPHOMORES?

GET FILE=STUDENTS
SELECT IF (  CLASS NE 1 )  AND (  CLASS NE 2 )
REGRESSION VARIABLES = AGE 
   /DEPENDENT = GPA 
----------------------------------------------------------------
WHAT IS THE MEDIAN TUITION CHARGED BY VERMONT COLLEGES WITH
ENROLMNTS UNDER 2000 STUDENTS?

GET FILE=ACADEMIC
SELECT IF   ENROLMNT LT 2000 AND (  REGION EQ 5 )
DESCRIPTIVES VARIABLES= TUITION
----------------------------------------------------------------
IN MAINE HOW MANY PUBLIC AND PRIVATE COLLEGES ARE THERE WITH
TUITIONS EQUAL TO OR GREATER THAN 10000 DOLLARS PER YEAR?

GET FILE=ACADEMIC
SELECT IF   TUITION GE 10000
CROSSTABS REGION (1 6) BY OWNER (1 2) 
----------------------------------------------------------------
HOW STRONG IS THE RELATIONSHIP BETWEEN MATH AND VERBAL SAT
SCORES AMONG COLLEGES IN NEW-HAMPSHIRE AND VERMONT?

GET FILE=ACADEMIC
SELECT IF (  REGION EQ 3 )  OR (  REGION EQ 5 )
CORRELATIONS VARIABLES= MATH VERBAL 
----------------------------------------------------------------
IS A COLLEGE'S TUITION DETERMINED BY REGION AND WHETHER IT IS
PUBLIC OR PRIVATE?

GET FILE=ACADEMIC
ANOVA TUITION BY OWNER (1 2) BY REGION (1 6) 
----------------------------------------------------------------
IS THE SIZE OF A COLLEGE'S LIBRARY DETERMINED BY ITS TUITION AND
ENROLMNT?

GET FILE=ACADEMIC
REGRESSION VARIABLES = ENROLMNT TUITION 
   /DEPENDENT = LIBRARY 
----------------------------------------------------------------
WHAT IS THE AVERAGE DEATH RATE FROM CANCER IN NEW-ENGLAND STATES
WHICH HAVE_NUKES?

GET FILE=HEALTH
SELECT IF (  REGION EQ 1 AND   NUKE EQ 1 )
DESCRIPTIVES VARIABLES= CANCER
----------------------------------------------------------------
HOW MANY STATES IN WHICH 20 % OR MORE OF THE POPULATION IS
OVER65 HAVE_NO_SEATBELT_LAW?

GET FILE=HEALTH
SELECT IF   OVER65 GE 20
FREQUENCIES VARIABLES=SEATBELT
----------------------------------------------------------------
ARE THE NUMBER OF DOCTORS IN A STATE AND THE STATE'S INCOME
LEVEL CORRELATED?

GET FILE=HEALTH
CORRELATIONS VARIABLES= DOCTORS INCOME 
----------------------------------------------------------------
IN NEW-ENGLAND AND THE MID-ATLANTIC STATES WHAT IS THE EFFECT
OF SEATBELT LAWS ON THE NUMBER OF PEOPLE ADMITTED TO HOSPITALS?

GET FILE=HEALTH
SELECT IF (  REGION EQ 1 OR   REGION EQ 2 )
T-TEST GROUPS = SEATBELT (1 0)
   /VARIABLES = ADMITTED 
----------------------------------------------------------------
WHAT EFFECT DOES THE NUMBER OF DOCTORS IN A MID-ATLANTIC STATE
HAVE ON ITS DEATH RATE FROM HEART DISEASE?

GET FILE=HEALTH
SELECT IF (  REGION EQ 2 )
REGRESSION VARIABLES = DOCTORS 
   /DEPENDENT = HEART