The
last two decades have widened the interest in and development of computer systems
for understanding and generating natural languages. What is important to observe
is that the man-machine interaction in the field of natural language processing
has motivated scholars to explore the application of computer systems with two
distinct orientations - one with the research objective of 'language engineering'
and the other confined to the research goal of 'language theory testing'.
While the former field of research activities focusses on the application of linguistics
to the various areas of natural language processing (NLP) with a pragmatic goal
in mind, the latter area of research activities concentrates on different possible
applications of NLP to linguistics for developing formal linguistic theories and
testing proposed linguistic models (Grishman, 1986). Both areas of research, on
the one hand, imply a sound body of linguistic knowledge and, on the other hand,
have direct implications for scientific formalization of linguistic theories.
The language engineering aspect of computer systems has the following classes of
application.
1. Speech Synthesis
Speech synthesis is the production of artificial speech through man-made devices
which generate speech-like sound waves. It is interesting to note that the mechanical
'artificial talker', and the electronically equipped 'Voder' and 'Vocoder' which served
as 'talking machines', are slowly being replaced by computerized devices for generating
synthetic speech. High speed digital computers and electric circuits serve as
effective means for generating speech-like sounds. Computer synthesizers are now
considered to have more promise than the OVE II, an electronic synthesizer built
at the Royal Institute of Technology in Stockholm, Sweden, the Pattern Playback
Synthesizer built by the Haskins Laboratories of New York, and the Vocoder system
produced by the Bell Telephone Laboratory. For example, a computer controlled
Line Analog Speech Synthesizer (LASS) was completed in the phonetic laboratory
of UCLA, USA even two decades earlier (Ladefoged, 1964; Harshman, et al., 1967;
Hiki, et al., 1968).
2. Machine Translation
Ever since the computer was invented, automatic translation of natural languages
became a dream of programmers. It became a viable area of language engineering
from within the embryonic field of artificial intelligence. The early researchers
viewed machine translation (MT) as basically an engineering endeavour. With repeated
failures in achieving their stated goal, scholars engaged in this area soon realized
that the field of MT belongs to both linguistics and Computer Science. 'It was
then demonstrated that fully-automated high quality machine translation is possible
only when the meaning of the input text is taken into account, in addition to
its syntax and a version of a bilingual dictionary at the word or even phrase
level' (Nirenburg, 1987 : xv). With all the advancement in the field, it is not
being considered a fully automated process, simply because current software
can neither absorb 'encyclopaedic knowledge' of the world to apply to the
translation task nor use meaning at the pragmatic level in the input-output
transactions of the translation process. In fact, there are three American companies
which are at present engaged in producing commercial translation software - (a) Automated
Language Processing Systems (ALPS, Provo, Utah), (b) Weidner Communications (Northbrook,
Illinois), and (c) Logos Computer Systems (Wellesley, Mass.). Their software products
also reflect the state of the art in machine translation. The ALPS Transactive software
is designed as a 'tool to aid translators' and is engineered to operate in an interactive
manner in which a human translator guides the software. Contrary to this, the software
products of Weidner and Logos are designed to produce the translation as a first
draft, with a human translator editing the output. In their perspective, the machine
serves as a 'junior translator' instructed to report to a human 'senior
translator'.
On the theoretical level, MT has evolved three competing strategies
during its development: (a) the 'direct' translation strategy, (b) the 'transfer' strategy,
and (c) the 'interlingua' strategy (Tucker, 1987 : 22). The direct translation strategy
starts from the text of the source language and, through a series of successive
operational stages, produces the output text in the target language. It involves
neither 'parsers' nor an 'intermediary' language. It depends primarily on classified
dictionary information, detailed morphological analysis, sentence patterns,
and text processing software in which the output of one stage serves as the
input of the next stage. The Georgetown MT system is the first of its kind (Zarechnak,
1979) and can be considered as a classic example of this strategy. The transfer
strategy involves three processes and their corresponding stages - (i) analysis
of a sentence of source language as an abstract labelled structure, (ii) transference
of this abstracted structure and lexicon of the source language into the structure
and lexicon of the target language, and (iii) restructuring of the sentence of
the target language as the final output. This strategy has been adopted by MT
groups such as GETA in Grenoble (Boitet, et al., 1985). The interlingua strategy makes
MT possible through a universal language. The process is drawn primarily from
the area of artificial intelligence, which motivates the scholar to equate utterances
in an interlingua (i.e., a universal language) with formulae of a knowledge representation
scheme that involves high-level structures, such as scripts, plans, etc. It also
involves three stages - (a) abstraction through analysis, i.e., it analyses the
text of the source language and represents it in the form of a language-free conceptual
representation, (b) augmentation of information through inference, i.e., the conceptual
representation is provided at this stage with the contextual/world knowledge
implicit in the text through an inference mechanism, and (c) reproduction in the
target language, that is, the language-free representation is mapped onto the
full expression of the target language through a natural language generator which takes
into account the appropriateness of the interaction between inferential information
and the abstracted structure (Carbonell, et al., 1981).
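To make the transfer strategy concrete, its three stages can be sketched in a few lines of Python. The toy English-Hindi lexicon, the single sentence pattern and all function names below are hypothetical stand-ins for the large dictionaries and rule sets a real MT system would require.

# Illustrative sketch of the three-stage 'transfer' MT strategy.
# The toy English-Hindi lexicon and the single pattern rule below are
# hypothetical; a real system would use full dictionaries and grammars.

LEXICON = {"mohan": "mohan", "sleeps": "sotaa hai"}
PATTERN_TRANSFER = {"SV": "SV"}  # English S-V maps onto Hindi S-V in this toy case

def analyse(sentence):
    # Stage (i): abstract labelled structure for the source sentence.
    subject, verb = sentence.lower().split()
    return {"subject": subject, "verb": verb, "pattern": "SV"}

def transfer(structure):
    # Stage (ii): map source lexicon and structure into the target language.
    return {
        "subject": LEXICON[structure["subject"]],
        "verb": LEXICON[structure["verb"]],
        "pattern": PATTERN_TRANSFER[structure["pattern"]],
    }

def restructure(structure):
    # Stage (iii): linearize the target-language structure as output text.
    return f"{structure['subject']} {structure['verb']}"

print(restructure(transfer(analyse("Mohan sleeps"))))  # mohan sotaa hai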
Undoubtedly,
with the availability of microprocessors, advancement in the field of artificial
intelligence, and pragmatically oriented linguistic theories, the field of MT has received
a new vitality and optimism.
3. Man-machine Interface
This area falls within the field of artificial intelligence. The primary aim of
artificial intelligence (AI) is to make computers behave as intelligently as possible.
It has been defined as "the science of making machines do things that would
require intelligence" (Minsky, 1968). The field ranges over different types
of topics such as theorem-proving, game-playing, pattern recognition, expert systems,
knowledge engineering, use of natural language (Andrews, 1983 : 12). It has been
argued by scholars that natural language is the most convenient man-machine interface
device for communication, particularly for people other than computer scientists.
It is true that the computer system is being used in a big way for automatic information
retrieval. Nevertheless, still more concerted research is being carried out to
make the computer function as a conversational partner. Scholars are of the opinion
that 'those concerned with the design of information systems should now be concentrating
on functional requirements for the user-oriented, natural language systems of
the future' (Lancaster, 1977 : 39). The proliferating use of computer systems
and genuine interest of the different sections of a society in the use of computers
in the daily information processing function also creates a need for natural language
processing (Sager, 1981).
Here we must differentiate between 'formal'
language, 'natural' language and 'real' language. We must realize that a computer
program is the embodiment of a formal system. As all operations that take place
inside the computer are based on the binary principle, all instructions as well
as data have to be ultimately in binary form. The computer's internal language that
operates in this binary world is known as machine language. Machine language is a computer-specific
language; thus, each model of computer has its own unique machine language,
which is the reason why a machine language programme written for one computer cannot normally
be fed into another model of computer. Programme writing became easier only with
the introduction and subsequent development of high level languages like FORTRAN,
COBOL and BASIC. Such programming languages are excluded from the category of
'natural languages' simply because they do not correspond to aspects of real languages
like English, Russian, Hindi, etc. For example, a compiled programming language
such as ALGOL models aspects of the language of mathematics. It is to be noted
that all non-natural compiled programming languages are meant to be used by computer
specialists and are based on the model of mathematics, logic, etc. Contrary to
this, 'natural' languages also have the capability to respond meaningfully to
the input requirements of a computer system, but are oriented basically to model
the features of real language directly. As pointed out by Benson (1979), 'natural'
languages are instances of formal language since they are also designed languages
meant to be used as input by a computer programme. But they are called 'natural'
because they accept a reasonable fragment of any real language as the command
language of the programme.
Research in the Natural Language Processing
(NLP) brought scholars from many disciplines into a single fold. For the first
time a highly successful interdisciplinary workshop called TINLAP (Theoretical
Issues in NLP) was held at MIT in 1975 with the purpose of "bringing together
researchers and students from computational linguistics, psychology, linguistics
and artificial intelligence to provide a forum at which people with different
interests in, and consequently different emphasis on, the problem of language
understanding could learn of the models developed and difficult issues faced by
people working on other aspects of understanding". TINLAP-2 was organized
in 1978 at the University of Illinois with six wide-ranging topics: (1)
Language representation and psychology, (2) Language representation and reference,
(3) Discourse: Speech act and dialogue, (4) Language and perception, (5) Language
mechanism in natural language, and (6) Computational model as a vehicle for theoretical
linguistics.
In fact, researchers engaged in NLP made the language theory
testing aspect of computer systems a promising field of enquiry. The earlier research
was primarily confined to the area of testing of grammars proposed by theoretical
linguistics, such as Friedman's Transformational Grammar Tester (Friedman, 1971).
The NLP research field motivated scholars to develop complete understanding systems
by taking into account those areas of linguistic enquiry which have so far been
inadequately explored by linguistics. One can identify at least two such fields:
(a) Representation of knowledge and (b) Development of 'parsers'.
(a) Representation of Knowledge
One finds a number of suggestions for structuring information such as 'frames'
(Minsky, 1975), 'scripts' (Schank and Abelson, 1977), 'information formats' (Sager,
1975), etc. Frames identify new information in terms of known patterns central
to the analysis of texts. Scripts aim at capturing one's knowledge about stereotyped
sequences of events. Predicate-argument relations in the context of any particular
verb are the key concepts for 'information formats'.
Attempts have
also been made to test the operational efficiency and psychological reality of
representational models. For example, the LNR research group of the University
of California at San Diego has developed a representational format for meaning
and a system for its testing. As reported by Gentner (1978), verb meaning is accepted
as a starting point for two reasons: (a) verbs provide the central organizing
semantic structure in sentence meaning, and (b) verbs are tractable. Meanings
of verbs in this format are represented in terms of inter-related sets of sub-predicates
such as CAUSE or CHANGE. The following basic assumptions
underlying this representation can be tested as hypotheses:
(a) a verb's representation
captures the set of immediate inferences that people normally make when they hear
or read a sentence containing the verb,
(b) in general, one verb leads to many
inferences,
(c) these networks of meaning components are accessible during
comprehension, by an immediate and largely automatic process,
(d) the set of
components associated with a given word is reasonably stable across tasks and
contexts,
(e) surface memory for exact words fades quite rapidly, so that after
a short time, only the representational network remains (Gentner, 1978 : 3).
In this representational model the nodes and arrows correspond to the concepts
and their relationships. It is also suggested that more paths in the representation
mean more conceptual paths in memory. For example, let us take the following
three sentences of Hindi.
(1) mohan ke paas tasviir thii
'Mohan had the picture'
(2) mohan ne shiva ko tasviir dii
'Mohan gave Shiva the picture'
(3) mohan ne shiva ko tasviir bechi
'Mohan sold Shiva the picture'
The meanings of the above-mentioned three verbs - honaa, denaa
and bechnaa - as interconnected sub-predicates, and their relative complexities,
can be shown by the following graphic representations.
[Images 1 and 2: graphic representations of the verbs as networks of sub-predicates]
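The intuition behind these graphic representations can also be expressed programmatically. The following Python sketch is purely illustrative: the decompositions are rough approximations of the LNR-style idea that denaa and bechnaa involve progressively richer networks of sub-predicates than honaa, not Gentner's actual representations.

# Rough, illustrative decompositions of the three verbs into sub-predicates.
# More sub-predicate links imply more conceptual paths in memory.
REPRESENTATIONS = {
    "honaa (have)": [
        ("POSSESS", "mohan", "tasviir"),
    ],
    "denaa (give)": [
        ("CAUSE", "mohan", "change-1"),
        ("CHANGE-POSSESSION", "tasviir", "mohan -> shiva"),
    ],
    "bechnaa (sell)": [
        ("CAUSE", "mohan", "change-1"),
        ("CHANGE-POSSESSION", "tasviir", "mohan -> shiva"),
        ("CHANGE-POSSESSION", "money", "shiva -> mohan"),
        ("EXCHANGE", "change-1", "change-2"),
    ],
}

for verb, links in REPRESENTATIONS.items():
    print(f"{verb}: {len(links)} sub-predicate links")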
(b) Development of 'Parsers'
Central to the system for natural language information processing is a parsing
programme that produces syntactic analysis of input sentences utilizing a programming
language (specially designed for writing natural language grammars), a word dictionary,
and procedures for transforming string parse trees into informationally equivalent
structures. String analysis or computation provides the structure of a sentence
through the string of its constituent units. These units serve as items to which
syntactic and semantic constraints apply. These units are also accepted as information
carriers of the sentence.
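A toy version of such string analysis may be sketched as follows; the miniature word dictionary and the flat NP-VP segmentation are hypothetical simplifications of what an actual parsing programme does.

# Minimal sketch of string analysis: look each word up in a dictionary
# and segment the string into constituent units to which syntactic and
# semantic constraints could later be applied. Purely illustrative.
DICTIONARY = {"mohan": "N", "lila": "N", "picture": "N",
              "gave": "V", "the": "Det"}

def string_parse(sentence):
    tagged = [(word, DICTIONARY.get(word, "?")) for word in sentence.lower().split()]
    # the first unit is taken as the subject NP, the rest as the VP string
    return ("S", ("NP", [tagged[0]]), ("VP", tagged[1:]))

print(string_parse("Mohan gave the picture"))
# ('S', ('NP', [('mohan', 'N')]), ('VP', [('gave', 'V'), ('the', 'Det'), ('picture', 'N')]))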
There are many kinds of 'parsers' which have
come into use, for example, PARSIFAL (Marcus, 1980), the ATN (Augmented Transition
Network) Parser, the ELI-Processor (Riesbeck and Schank, 1976), Wilks' Parser (Wilks,
1975), and the MOPTRANS Parser (Lytinen, 1984). Marcus has argued in favour of a deterministic
model of the parser. The grammar interpreter called PARSIFAL therefore has
a structure based upon the hypothesis that a natural language parser need not
simulate a non-deterministic machine. His 'Determinism Hypothesis' claims that
'natural language' can be parsed by a computationally simple mechanism that uses
neither backtracking nor pseudoparallelism and in which all grammatical structures
created by the parser are 'indelible' in that they must all be output as part
of the structural analysis of the parser's input' (Marcus, 1978 : 236). Marcus
has discussed in detail the two specific universal properties of human languages
pointed out by Chomsky (1973, 1975 and 1976), i.e., the Subjacency Principle and
the Specified Subject Constraint. He has demonstrated that these two constraints
fall out naturally from the structure of a grammar interpreter called PARSIFAL.
The result thus provides indirect evidence for the Determinism Hypothesis (Marcus,
1978). In fact, most of natural language parsers are developed in the direction
of processing somewhat constrained input sentences (Carbonell and Hayes, 1983).
Attempts have also been made to develop a parser which can also parse semi-grammatical
sentences by application of a formalism called Tree Adjoining Grammars (TAGs).
Formal properties of TAGs have been explicated by Joshi and Levy (1982) and Shankar
and Joshi (1985). According to proponents of this type of parser, TAGs first
define a set of elementary trees and an adjunction operation that produces complex
trees through a set of rules that combine simple trees. (A tree is simply the
structural description of a sentence of a given language.) First, basic sentence
structures are studied and abstracted, and are stored in a set of tree banks.
The adjunction is then made operative as a derivative process.
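The adjunction operation can be illustrated with a small Python sketch over toy trees encoded as nested tuples; the encoding, the foot-node marker and the example trees are hypothetical simplifications of the TAG formalism.

# Toy illustration of TAG adjunction: an auxiliary tree is spliced into
# an elementary tree at a matching node, with the original subtree
# re-attached at the auxiliary tree's foot node. Encodings are ad hoc.
initial = ("S", ("NP", "Mohan"), ("VP", ("V", "sleeps")))   # elementary tree
auxiliary = ("VP", "*FOOT*", ("Adv", "soundly"))            # auxiliary tree

def adjoin(tree, aux, label):
    if not isinstance(tree, tuple):
        return tree
    if tree[0] == label:
        # splice in the auxiliary tree, putting the old subtree at the foot
        return tuple(tree if part == "*FOOT*" else part for part in aux)
    return (tree[0],) + tuple(adjoin(child, aux, label) for child in tree[1:])

print(adjoin(initial, auxiliary, "VP"))
# ('S', ('NP', 'Mohan'), ('VP', ('VP', ('V', 'sleeps')), ('Adv', 'soundly')))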
In recent
years, scholars have shown interest in making parsers semantically informed.
Some of the parsers have absorbed merely certain semantic features into their
syntactic analysis, as is the case with METEO-TAUM parser (Chandioux, 1976), while
some of them incorporate semantics as the basic premise for the analysis of the
input text, as is the case with Wilks' parser (1975). An argument for an integrated
approach to processing was put forward in which it was stated that while syntax
needs semantics, semantics also needs syntax. It was suggested that a parser should
take care of both syntactic and semantic constraints. Lytinen (1987) thus argues
that his MOPTRANS parser is an integrated parser, in the sense that syntactic
and semantic processing take place in tandem.
Linguistic Perspective and Implications
We have so far discussed the relationship between language or linguistic studies
and computer or natural language processing. It is, however, important to know
what valuable input theoretical linguistics can provide to computer processing.
While discussing this topic linguists and computer scientists both should realize
that computer processing and translation machines are constrained by the state
of the art in linguistics. This puts on linguists and the field of linguistics
an added responsibility.
Scholars have discussed the role of computers
in language research (Srivastava, 1983; Sedelov and Sedelov, 1979) and microcomputers
in primary language learning (Keith and Glover, 1987). It is interesting to see
what the language-sensitive 'problem-areas' are for natural language processing
and how linguists and the field of linguistics can help resolve those problems.
Similarly, for linguists it has become necessary to reflect upon the question:
what can theoretical linguistics learn from the experience gained in the field
of natural language processing and computational models? The following are directions
in which linguistic theory needs to develop to make itself relevant to computer
processing.
(1) Making Linguistics Formal
Since a computer programme is the embodiment of a formal system, it requires linguistic
knowledge to be stated in the formalized terms of logic and mathematics. The formal
characteristics of linguistic theory have been stressed in generative grammar.
This has been achieved in this model in two ways. Firstly, it makes the notion
of 'grammar', rather than language, fundamental to its theory; language is considered
by Chomsky to be epiphenomenal. Secondly, it makes the grammar syntax-oriented. It
should be emphasized that the notion of syntax in this model has now been extended
to include the fields of phonology and semantics and is defined as 'the computational
component of the language faculty'. In the words of Chomsky, "so let us put
forth a thesis, which will ultimately be the thesis of autonomy of syntax. It
says that there exists a faculty of the mind which corresponds to the computational
aspects of language, meaning the system of rules that give certain representations
and derivations. What would be the parts of that system? Presumably the base rules,
the transformational rules, the rules that map S-structures onto phonological
representation, and onto logical forms, all that is syntax' (Huybregts and Riemsdijk,
1982 : 114).
(2) Making Computational Power Restrictive
It is generally agreed that the computational
apparatus of a grammar cannot be let loose to become too powerful, so that it
allows even those systems that are not possible in human languages. The apparatus
should be just powerful enough to allow all the facts of natural languages, but
not so powerful that it generates even those structures that never occur in
natural languages. A linguistic theory has therefore to specify the possible form
of a human grammar and the constraints upon such a grammar. In generative model
the term 'constraint' refers to a condition which restricts the application of
a rule. The constraints which are universal to all languages qualify as universal
properties of language, while those restricted to a given language are said to
be language-specific properties. For example, consonant clusters of English
have the sequence constraint rule that in initial position they are restricted in
number to three (e.g., split, string, scream); in a sequence +C1 C2 C3 V, C1 can
only be [s], C2 [p, t, k] and C3 [r] or [l]. As pointed out earlier, some of the
constraints claimed by Chomsky, such as the Subjacency Principle and the Specified Subject
Constraint, fall out naturally from the grammar interpreter
which Marcus calls PARSIFAL.
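The stated onset constraint is simple enough to be checked mechanically. The following Python fragment is only illustrative and works over phonemic onsets, so that /k/ is written 'k' even where English spells it 'c'.

import re

# Word-initial three-consonant onsets must match s + [p,t,k] + [r,l],
# as stated above; onsets are given phonemically, not orthographically.
ONSET = re.compile(r"^s[ptk][rl]")

for onset in ["spl", "str", "skr", "spr", "tpr", "fkl"]:
    print(onset, "allowed" if ONSET.match(onset) else "disallowed")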
(3) Making Theory More Integrative
Man and machine are both excellent symbol manipulators. Viewed from the point
of view of semiotics, the science of signs, language as a system of symbols has to be analyzed
on three distinct levels -(a) syntax, which studies the relationship between symbol
and symbol, (b) semantics, which studies the relationship between symbol and object
and (c) pragmatics, which studies the relationship between symbol and its users.
Parsers are to be developed in order to interpret the sentence on all these levels
in an integrative manner.
Look at the following three sentences:
(a) Mohan frightens Lila.
(b) Mohan observed the ball.
(c) Mohan did not kiss Lila.
All these sentences give more than one reading, but the causes
of their ambiguities are different. In the case of (a) it is the relationship
between the two symbols, Mohan and frighten, which gives two possible interpretations
- (1) A - Vt, i.e., Mohan in an agent relationship with the verb, which is transitive;
the Hindi equivalent will be 'mohan lila ko daraataa hai', and (2) I - Vi, i.e.,
Mohan as an instrument and the verb as intransitive; the Hindi equivalent of this
reading will be: 'lila mohan se dartii hai'.
In the case of (b), the ambiguity
arises because of the referent of the word 'ball' which can be interpreted either
as 'gala affair' as in the sentence 'I attended the ball' or as a 'spherical object'
as in the sentence 'I kicked the ball'.
The sentence (c) is ambiguous
because a reader can negate any major constituent of the structure:
[Tree diagram: S dominating NP and VP; VP dominating V and NP]
The three possible interpretations are thus:
4. Someone may have kissed Lila, but not Mohan.
5. Mohan may have done something to Lila, but did not kiss her.
6. Mohan may have kissed someone, but not Lila.
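A parser aware of constituent structure could enumerate such readings mechanically. The following Python sketch is purely illustrative: it simply attaches negation to each major constituent in turn and prints the corresponding paraphrase.

# Illustrative enumeration of readings for "Mohan did not kiss Lila"
# by letting negation scope over each major constituent.
READINGS = {
    "NP (subject)": "someone may have kissed Lila, but not Mohan",
    "V (verb)":     "Mohan may have done something to Lila, but did not kiss her",
    "NP (object)":  "Mohan may have kissed someone, but not Lila",
}

for constituent, paraphrase in READINGS.items():
    print(f"negation over {constituent}: {paraphrase}")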
Similarly, one can find out the reason for the anomalous
nature of the second sentence of the following pairs.
(a1) mohan so rahaa hai
'Mohan is sleeping'
*(a2) per so rahaa hai
'The tree is sleeping'
(b1) kal jo aadmii maaraa jaaega, vo so rahaa hai
'The man who will be killed tomorrow is sleeping'
*(b2) kal jo aadmii maaraa gayaa, vo so rahaa hai
'The man who was killed yesterday is sleeping'
(c1) abe shyamu! tu kyaa so rahaa hai
'O Shyamu! are you sleeping'
*(c2) o pitaa jii, tu kyaa so rahaa hai
'O father, are you sleeping'
(4) Making Interpretation Context Sensitive
Information processing requires knowledge of the world and the context of situation
of utterances. It also requires information sharing between sentences. It is true
that the sentence 'My typewriter has bad intentions' is anomalous because it
violates a selection restriction rule, but if we replace 'typewriter' by 'dog',
'snake' or 'microbe', whether the resulting sentence is judged to be anomalous
can be determined, as pointed out by Palmer (1976 : 46), only by what we know
about the intelligence of dogs, snakes and microbes, i.e., by our knowledge of the world. Similarly,
observe the following two sentences:
(a) Mohan was looking for the glasses. ... looking for the glasses for him to drink.
(b) Mohan was looking for the glasses. ... looking for the glasses for him to read.
The
first sentences in (a) and (b) are identical, but we associate different meanings
with the word 'glasses' in each case because of the information in the following
sentences. It is obvious that there is no way to understand or translate the first
sentence in (a) and (b) by reading these sentences in isolation. Without building
a 'context' by looking across sentences and making inferences across statements,
a viable theory of information processing cannot be developed. It is to be stressed
that for natural language processing we have to make our language analysis at
all three levels: grammatical analysis, information processing across sentences,
and encyclopaedic knowledge.
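A crude way to operationalize this cross-sentence inference is to score candidate senses of a word against the vocabulary of the neighbouring sentence. In the following Python sketch the sense inventory and trigger words are hypothetical.

import re

# Toy word-sense disambiguation for 'glasses' using the following
# sentence as context; senses and trigger words are illustrative only.
SENSES = {
    "drinking vessels": {"drink", "water", "wine", "fill"},
    "spectacles":       {"read", "see", "print", "wear"},
}

def disambiguate(context_sentence):
    words = set(re.findall(r"[a-z]+", context_sentence.lower()))
    return max(SENSES, key=lambda sense: len(SENSES[sense] & words))

print(disambiguate("looking for the glasses for him to drink"))  # drinking vessels
print(disambiguate("looking for the glasses for him to read"))   # spectacles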
Scholars
actively engaged in the field of NLP are well aware of the fact that along with
the description of the phonology, morphology, syntax and semantics of a natural
language, they have to face at a certain stage problems of anomalous sentence
constructions, contradictory statements and ambiguous linguistic structures. On
many occasions they have to deal with the problem of meaning at the pragmatic
level where the same sentence may play different roles in different discourse
settings. They also face the problem of organizing sentences logically so that
a coherent structure of dialogue gets generated. For resolving 'problems'
at all these levels, an NLP project needs a qualified linguist.
Linguists
associated with an NLP project, however, must be aware that linguistic
problems may or may not be the real problems of NLP, and for no reason whatsoever
should their problems be substituted for real NLP problems. It should be emphasized that
core linguistics centres around the question 'what language is' and extends
its domain at the periphery to resolve problems related to the question 'what language
does'. While it involves linguistic questions, the field of NLP is basically concerned
with problems that are related to the question 'how language works', and that
too in the context of artificial intelligence. Secondly, most linguistic information
comes packaged to a linguist as a part of formal grammar, but as pointed out by
Raskin (1987 : 49), 'the linguist should be smart enough to know that packages
are not ready for use in NLP'. Being an applied linguist, he has to make these
rules applicable in different situations. Lastly, as stated above, linguistic implications
for NLP are many and multifaceted. As a scientific field of enquiry, linguistics
has developed in many areas to an extent that it can provide easy solutions
to many problems of NLP, but there are still many areas, such as pragmatics, conceptual
processing, information sharing between sentences, discourse implicatures, meaning
in relation to world knowledge, etc., in which linguistics has yet to become ripe
for a paradigm shift. It is in this context that we said earlier that the development
of the field of NLP is also constrained by the state of the art in linguistics.