Abstract
Latent Semantic Analysis (LSA) is
a theory and method for extracting and representing the contextual-usage meaning of words
by statistical computations applied to a large corpus of text (Landauer and Dumais, 1997).
The underlying idea is that the aggregate of all the word contexts in which a given word
does and does not appear provides a set of mutual constraints that largely
determines the similarity of meaning of words and sets of words to each other. The adequacy of LSA’s
reflection of human knowledge has been established in a variety of ways. For example,
its scores overlap those of humans on standard vocabulary and subject matter tests; it
mimics human word sorting and category judgments; it simulates word–word and passage–word
lexical priming data; and, as reported in three following articles in this issue, it accurately
estimates passage coherence, learnability of passages by individual students, and the
quality and quantity of knowledge contained in an essay.
An Introduction to Latent Semantic Analysis
Research reported in the three
articles that follow—Foltz, Kintsch & Landauer (1998/this issue), Rehder et al. (1998/this issue), and Wolfe et al. (1998/this issue)—exploits a new
theory of knowledge induction and
representation (Landauer and Dumais, 1996, 1997) that
provides a method for determining
the similarity of meaning of words and passages by
analysis of large text corpora.
After processing a large sample of machine-readable
language, Latent Semantic Analysis
(LSA) represents the words used in it, and any set of
these words—such as a sentence,
paragraph, or essay—either taken from the original
corpus or new, as points in a
very high (e.g., 50-1,500) dimensional “semantic space”.
LSA is closely related to neural
net models, but is based on singular value decomposition, a
mathematical matrix decomposition
technique closely akin to factor analysis that is
applicable to text corpora
approaching the volume of relevant language experienced by
people.
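To make the machinery concrete, the following sketch (in Python with NumPy; the miniature corpus, the treatment of each line as one context, and the choice of two dimensions are assumptions made for brevity, not values from the work reported here) builds a word-by-context count matrix and reduces it by singular value decomposition:

    import numpy as np

    # A miniature corpus: each passage (here, one line) is one context.
    passages = [
        "human interface computer",
        "user interface system",
        "system human system computer",
        "tree bark leaf",
        "leaf tree forest",
    ]

    # Word-by-context count matrix X: rows are words, columns are passages.
    vocab = sorted({w for p in passages for w in p.split()})
    row = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(passages)))
    for j, p in enumerate(passages):
        for w in p.split():
            X[row[w], j] += 1

    # Singular value decomposition: X = U diag(s) Vt.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # Keep only the k largest singular values: the dimension-reduction step.
    k = 2
    words_k = U[:, :k] * s[:k]        # each word as a point in k dimensions
    passages_k = Vt[:k, :].T * s[:k]  # each passage as a point in the same space

In a realistic application the matrix has tens of thousands of rows and columns, and k is on the order of a few hundred.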
Word and passage meaning
representations derived by LSA have been found capable of simulating a variety
of human cognitive phenomena, ranging from developmental acquisition of
recognition vocabulary to word categorization, sentence-word semantic priming, discourse comprehension,
and judgments of essay quality. Several of these simulation results will be
summarized briefly below, and additional applications will be reported in detail in
following articles by Peter Foltz, Walter Kintsch, Thomas Landauer, and their colleagues.
We will explain here what LSA is and describe what it does.
LSA can be construed in two ways:
(1) simply as a practical expedient for obtaining approximate estimates of the
contextual usage substitutability of words in larger text segments, and of the kinds of—as
yet incompletely specified—meaning similarities among words and text segments that such
relations may reflect, or (2) as a model of the computational processes and
representations underlying substantial portions of the
acquisition and utilization of
knowledge. We next sketch both views. As a practical method for the
characterization of word meaning, we know that LSA produces measures of word-word,
word-passage, and passage-passage relations that are well correlated with several
human cognitive phenomena involving association or semantic similarity. Empirical evidence of
this will be reviewed shortly. The correlations
figure prominently in research on discourse processing.
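Concretely, these relations are in practice measured as the cosine between the corresponding points in the semantic space. The sketch below continues the earlier one (words_k, passages_k, and row are assumed from it); the fold_in helper is a hypothetical simplification that places a new passage at the centroid of its word points, not the exact fold-in procedure of the LSA literature:

    import numpy as np

    def cosine(a, b):
        # The usual LSA similarity measure: the cosine of the angle
        # between two points in the semantic space.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def fold_in(text, words_k, row):
        # Hypothetical simplification: place a new passage at the centroid
        # of its word points. (True fold-in also rescales the dimensions by
        # the singular values.)
        idx = [row[w] for w in text.split() if w in row]
        return words_k[idx].mean(axis=0)

    # All three kinds of relation are the same computation in one space:
    # word-word:       cosine(words_k[row["human"]], words_k[row["user"]])
    # word-passage:    cosine(words_k[row["system"]], passages_k[1])
    # passage-passage: cosine(passages_k[0],
    #                         fold_in("human computer system", words_k, row))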
It is important to note from the
start that the similarity estimates derived by LSA are not simple contiguity
frequencies, co-occurrence counts, or correlations in usage, but depend on a powerful mathematical
analysis that is capable of correctly inferring much deeper relations (thus the phrase
“Latent Semantic”), and as a consequence are often much better predictors of human
meaning-based judgments and performance than are the surface level contingencies that have
long been rejected (or, as Burgess and Lund (1996, and this volume) show, unfairly maligned) by linguists as the basis of language phenomena. LSA, as currently practiced,
induces its representations of the meaning of words and passages from analysis of
text alone. None of its knowledge comes directly from perceptual information about the physical
world, from instinct, or from experiential intercourse with bodily
functions, feelings and intentions. Thus its representation of reality
is bound to be somewhat sterile
and bloodless. However, it does take in descriptions and verbal outcomes of all these
juicy processes, and in so far as writers have put such things into words, or their words have reflected such matters unintentionally, LSA has at least potential access to knowledge
about them. The representations of passages that LSA forms can be interpreted as
abstractions of “episodes”, sometimes of episodes of purely verbal content such as philosophical
arguments, and sometimes episodes from real or imagined life coded into verbal
descriptions. Its representation of words, in turn, is intertwined with and mutually interdependent with
its knowledge of episodes. Thus while LSA’s potential knowledge is surely imperfect, we
believe it can offer a close enough approximation to people’s knowledge to underwrite
theories and tests of theories of cognition. (One might consider LSA’s maximal knowledge
of the world to be analogous to a well-read nun’s knowledge of sex, a level of
knowledge often deemed a sufficient basis for advising the young.)
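The sense in which LSA’s relations are latent rather than directly observed can be made concrete with a toy computation (the corpus and the choice of two dimensions below are invented for illustration): “doctor” and “physician” never occur in the same passage, so every surface co-occurrence statistic for the pair is zero, yet after rank reduction they occupy essentially the same point in the space.

    import numpy as np

    # "doctor" and "physician" never share a passage, but occur in
    # closely parallel contexts.
    passages = [
        "doctor treats patient",
        "physician treats patient",
        "doctor hospital staff",
        "physician hospital staff",
        "apple banana fruit",
        "banana fruit market",
    ]
    vocab = sorted({w for p in passages for w in p.split()})
    row = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(passages)))
    for j, p in enumerate(passages):
        for w in p.split():
            X[row[w], j] += 1

    # Surface statistic: the pair co-occurs in zero passages.
    co = X[row["doctor"]] @ X[row["physician"]]

    # Latent statistic: cosine between their points in a rank-2 space.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    W = U[:, :2] * s[:2]
    a, b = W[row["doctor"]], W[row["physician"]]
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    print(co, cos)   # -> 0.0 and a cosine close to 1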
However, LSA as currently
practiced has some additional limitations. It makes no use of word order, and thus none of syntactic relations or logic, nor of morphology. Remarkably, it manages to extract correct reflections of passage and word meanings quite well without these aids, but it must still be suspected of incompleteness or occasional error.
LSA differs from some statistical
approaches discussed in other articles in this issue and elsewhere in two significant
respects. First, the input data “associations” from which LSA induces representations are
between unitary expressions of meaning—words and complete meaningful utterances in
which they occur—rather than between successive words. That is, LSA uses as its
initial data not just the summed contiguous pairwise (or tuple-wise) co-occurrences of
words but the detailed patterns of occurrences of very many words over very large numbers of
local meaning-bearing contexts, such as sentences or paragraphs, treated as unitary
wholes. Thus it skips over how the order of words produces the meaning of a sentence to
capture only how differences in word choice and differences in passage meanings are related.
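As a minimal illustration of this order-blindness (the sentences below are invented examples): because each context enters the analysis as an unordered whole, passages containing the same words in different orders contribute identical columns to the input matrix.

    import numpy as np

    def column(passage, row):
        # The column a passage contributes to the input matrix:
        # word counts only, with no record of word order.
        x = np.zeros(len(row))
        for w in passage.split():
            x[row[w]] += 1
        return x

    row = {w: i for i, w in enumerate(["dog", "bites", "man"])}
    print(np.array_equal(column("dog bites man", row),
                         column("man bites dog", row)))   # -> True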
Another way to think of this is
that LSA represents the meaning of a word as a kind of average of the meaning of all
the passages in which it appears, and the meaning of a passage as a kind of average of
the meaning of all the words it contains. LSA’s ability to simultaneously—conjointly—derive
representations of these two interrelated kinds of meaning depends on an aspect of
its mathematical machinery that is its second important property. LSA assumes that the
choice of dimensionality in which all of the local word-context relations are simultaneously
represented can be of great importance, and that reducing the dimensionality (the
number of parameters by which a word or passage is described) of the observed data
from the number of initial contexts to a much smaller—but still large—number will often
produce much better approximations to human cognitive relations. It is this
dimensionality reduction step, the combining of surface information into a deeper abstraction, that
captures the mutual implications of words and passages. Thus, an
important component of applying
the technique is finding the optimal dimensionality for the final representation. A possible
interpretation of this step, in terms more familiar to researchers in psycholinguistics,
is that the resulting dimensions of description are analogous to the semantic
features often postulated as the basis of word meaning, although establishing concrete relations
to mentalistically interpretable features poses daunting technical and conceptual problems
and has not yet been much attempted. Finally, LSA, unlike many other
methods, employs a preprocessing step in which the overall distribution of a
word over its usage contexts, independent of its correlations with other words, is first taken
into account; pragmatically, this step improves LSA’s results considerably.
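One common form of this preprocessing (the specific log-entropy transform below follows widespread LSA practice, e.g. Landauer & Dumais, 1997, though details vary across implementations) replaces each raw count with its log and down-weights words that are spread evenly over many contexts:

    import numpy as np

    def log_entropy(X):
        # Cell transform: log(count + 1), weighted per word (row) by
        # g = 1 + sum_j p_ij * log(p_ij) / log(n), where p_ij is the word's
        # distribution over the n contexts. A word concentrated in few
        # contexts keeps a weight near 1; a word spread evenly over all
        # contexts is driven toward 0. (Assumes n > 1.)
        X = np.asarray(X, dtype=float)
        n = X.shape[1]
        totals = X.sum(axis=1, keepdims=True)
        p = np.divide(X, totals, out=np.zeros_like(X), where=totals > 0)
        with np.errstate(divide="ignore", invalid="ignore"):
            plogp = np.where(p > 0, p * np.log(p), 0.0)
        g = 1.0 + plogp.sum(axis=1) / np.log(n)
        return np.log(X + 1.0) * g[:, None]

The SVD and dimension reduction of the earlier sketch are then applied to the weighted matrix rather than to the raw counts, and the remaining free parameter, the number of retained dimensions, is in practice tuned empirically against an external criterion.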