A NEW METHOD OF RECORDING AND SEARCHING INFORMATION

H. P. LUHN*

*International Business Machines Corporation, Engineering Laboratory, Poughkeepsie, New York.
This method applies to the procedures required to record a legend concerning a document and to enable an inquirer to locate this document by means of the legend, if it is related to a specified subject.
The conventional methods of indexing and classifying attempt to evaluate the relative importance of a plurality of aspects contained in a document and make the most important one the key for locating the document within an orderly scale of a certain dimension. Subordinated aspects are covered by way of reference in appropriate other locations of the scale.
One of the disadvantages of the conventional system is that the standard of value on which the indexer bases his decision may change, and what suddenly is considered an aspect of major significance may not have been included in the classification or index at the time, even though it was contained in a document.
Another drawback is that it becomes difficult for an inquirer to reverse the process of classification or indexing and pose his query in a form matching to a reasonable degree the values of a potential reference.
The new method uses the principle of characterizing a topic by a set of identifying elements or criteria. These elements may be of any dimension and as many may be recorded as is desirable. Also, they are not weighted and no significance need be implied by the order in which they are given.
One of the main functions of the new method is that of producing a response to an inquiry in all cases, even if the reference appears to be remote, it being the understanding that it is the closest available.
The elements enumerated by recorders to identify a topic will necessarily vary, as no two recorders will view a topic in identical fashion. Similarly, no two inquirers, when referring to the same subject, will state their query in identical fashion.
It is therefore important that a system recognizes that these variations arise and that they cannot be controlled. It must then become the function of the system to overcome these variations to a reasonable degree.
When identifying a topic by a set of criteria or identifying terms, the more terms are stated the more specifically the topic is delineated. Each term in turn may be a concept which in itself may vary as to specificity. If we consider a concept as being a field in a multi-dimensional array, we may then visualize a topic as being located in that space which is common to all the concept fields stated. It may further be visualized that related topics are located more or less adjacent to each other depending on the degree of similarity, and that this is so because they agree in some of the identifying terms and therefore share some of the concept fields.
Figure 1 is a diagrammatic illustration.

FIGURE 1. [Diagram: the topic identified, with other topics adjacent.]
In order to understand the nature of the arrangement, let us assume a vocabulary of 100 concepts and let us identify a topic by five conceptual terms. By using all possible combinations of five terms, a total of 75 million patterns of criteria result, each of these patterns having a fixed location within the system. If then a topic is identified by five terms of the vocabulary, it is thereby assigned to a definite one of these fixed locations.
While assuming that there is an ideal and true location where a topic belongs, it is un- […] were to do the same job. There will result a deviation from the true location proportionate to the degree of disagreement of either.
For instance, one recorder may diverge to the extent of matching only 3 of the 5 criteria while the other matches 4. The resultant displacement is shown in diagram Figure 2.
FIGURE 2. [Diagram: 3/5 and 4/5 matches displaced from the ideal (5/5) location.]
Such disagreements will be the more pronounced the more specific the conceptual terms are, and it is a further function of the new method to minimize variations by broadening the concept used in the terms and by using as large a number of broadened criteria as possible, even to the extent of redundancy. This approach is based on the realization that an inquirer could not match excessive specificity when stating his query and that his position is similar to that of the recorder.
The process of broadening the concept involves the compilation of a dictionary wherein key terms of desired broadness may be found to replace unduly specific terms, the latter being treated as synonyms of a higher order than ordinarily considered. Translating criteria into these key terms is a process of normalization which will eliminate many disagreements in the choice of specific terms amongst recorders, amongst inquirers, and amongst the two groups, by merging the terms at issue into a single key term. However, the dictionary does not classify or index but maintains the idea of terms being fields and applies the identification principle to the terms in the manner it is applied to the topics, even though to a lesser degree.
A specific term may appear under the heading of several key terms and, if according to its application an overlapping of concepts exists, then the term is represented by the several key terms involved, as shown diagrammatically in Figure 3 for 'b'.
The manner in which an inquirer approaches the process of searching for desired information becomes one similar to that performed by the recorder. He first states his query in as many and as specific terms as he desires. Then with the aid of the special dictionary he normalizes the conceptual terms of identification to arrive at a statement adjusted to the requirements of the system.
The actual process of searching involves the comparing of his statements with all the statements contained in the collection of records prepared by the recorder. This task, being beyond human capability, may be performed automatically by a scanning machine which is capable of not only matching similar portions of information but of doing this in accordance with any conceivable pattern of conditions.
As indicated earlier, the intended purpose of a search is to produce a response to a query. Because it is not usually known how specific a response can be expected, the initial query is stated rather broadly, thereby extending the field to include less related material. The extent of responses obtained on this basis is a valuable indication of the amount of attention devoted to the subject area in the past. The material obtained would then be subjected to increasingly more specific searches in order to get the closest match possible. Also, material uncovered by this approach may lead to the discovery of unsuspected, but pertinent, other related information.
FIGURE 3. [Diagram: specific terms appearing under the overlapping fields of key term A and key term B.]
In particular, the scheme of broadening the field of response consists of asking that a fixed fraction of the given terms be met by the records. This procedure is quite different from that used when broadening a generic search by dropping subclasses. The effect is illustrated by the following diagrams, Figure 4, showing progressively broader fields formed by 5 terms. Using the proportions of the example previously given and assuming an evenly distributed population of topics, the relative probability of response is expressed by the factors listed below each fraction. While applied to an idealized situation, the results are nevertheless indicative of the advantages the method of identification has over other methods of indexing information.
| Fraction of terms matched | 5/5 | 4/5 | 3/5 | 2/5 | 1/5 |
|---|---|---|---|---|---|
| Probability factor of response | 1 | 96 | 4,656 | 152,096 | 3,764,376 |

FIGURE 4.
THE USE OF THE UNIVAC FAC-TRONIC SYSTEM IN THE LIBRARY REFERENCE FIELD

HERBERT F. MITCHELL, JR.*

*Director, UNIVAC Applications Department, Remington Rand, Inc.
The tremendous increase in the volume of technical literature of all kinds and fields is presenting the librarian with an almost impossible reference task. The sheer volume of these documents is creating a filing problem of the first magnitude. When this volume is combined with the fact that many documents cut across classification lines, the problem of providing reference bibliographies is made that much more difficult.
Several persons concerned with the furnishing of reference material have approached those of us engaged in the manufacture and utilization of digital computers to see if these machines might be of assistance to the librarian. Such an occasion arose a little over a year ago when the Centralized Air Document Office in Dayton, Ohio, approached Remington Rand to ascertain the suitability of our equipment for this work.
A study was made to see how the UNIVAC Fac-Tronic System might be applied to the task of obtaining all possible documents from a large file which could answer a specific query submitted to this office. The model studied envisioned a library of 1,000,000 documents. Each document was identified by an eight-digit shelf number. A master reference file was to be compiled, each item of which would consist of the shelf number followed by a series of coded approaches. Each such approach would represent some pertinent feature of the document, such as: author, date, contract number, and descriptors of the subject or subjects treated by the document. It was anticipated that each document would have an average of fifteen approaches with a maximum of thirty.
In order to obtain a list of all documents which might possibly answer a given query, the computer would be supplied with the appropriate coded approaches included in the query. It would then search through the entire master file and select all document items which contain the approaches given in the query.
For such a system as the above to be workable […]

## Discussion

### Intro

In 1953, Hans Peter Luhn, a researcher at IBM, published "A New Method of Recording and Searching Information," a landmark paper in the field of information retrieval. This work addressed the growing challenge of managing and accessing the rapidly expanding volume of scientific and technical literature in the 1950s, when traditional classification systems were becoming inadequate. Luhn proposed a revolutionary approach that broke away from rigid classification, introducing a flexible method that uses sets of identifying terms or "criteria" to describe documents. This laid the groundwork for modern search concepts like keyword searching and relevance ranking. Luhn's method anticipated the need for machine-assisted searching and addressed the subjectivity in how different people describe the same information. His idea of representing documents as sets of terms conceptually paved the way for modern vector search.

The term "legend" refers to a set of descriptive information or metadata about a document. This includes key details such as keywords, summaries, titles, and other relevant descriptors that provide an overview of the document's content.

### Conventional Indexing Approach

Imagine a comprehensive research document that covers various topics within the field of cancer treatment, including chemotherapy, immunotherapy, radiation therapy, and surgical procedures. The indexer might decide that "chemotherapy" is the most important aspect of the document because it is the most detailed and occupies a significant portion of the content. Consequently, the document is indexed under "chemotherapy". Other important aspects, such as "immunotherapy", "radiation therapy", and "surgical procedures", are referenced under their respective categories but are given less emphasis. These topics might appear in cross-references within the index but are not the primary entry points for the document.

One drawback of this approach is that the importance of topics is subjective and can change over time. For instance, if immunotherapy becomes the most promising and researched area in cancer treatment, the document indexed primarily under "chemotherapy" might not be easily found by researchers interested in immunotherapy. If the indexer later decides that "immunotherapy" has gained more relevance and that the document has become pivotal in the "immunotherapy" body of work, they would need to reclassify the document, which is inefficient.

Luhn addresses a fundamental challenge in information retrieval: balancing specificity with findability. He recognizes that using very specific terms to describe topics can lead to mismatches between how information is recorded and how it is searched for, since different people may use different specific terms for the same concept. To solve this, Luhn proposes a counterintuitive approach: use broader, more general terms, and use more of them. This strategy of employing multiple broad terms, even to the point of redundancy, increases the likelihood of a match between the recorder's and the inquirer's language.

The 75 million patterns come from a combinatorial calculation: all possible ways to select 5 items from a set of 100, where order doesn't matter. With 100 concepts and choosing 5 at a time, we get $C(100,5) = \frac{100!}{5!(100-5)!} = \frac{100!}{5!\,95!} = 75{,}287{,}520$.

**Calculating the Probability Factors:**

- When you require a match on all 5 specific terms, there is only 1 possibility.
- When you need 4 specific matches out of 5, the one unconstrained term can be any of the $100 - 4 = 96$ remaining concepts.
- When you require 3 specific matches out of 5, the other two terms can be chosen from the $100 - 3 = 97$ remaining concepts in $\frac{97 \times 96}{2!} = 4{,}656$ ways (since order doesn't matter).
- When you only need 2 matches, the other three terms can be chosen in $\frac{98 \times 97 \times 96}{3!} = 152{,}096$ ways.
- When you only require 1 specific match, the other four terms can be chosen in $\frac{99 \times 98 \times 97 \times 96}{4!} = 3{,}764{,}376$ ways.
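These factors are easy to verify. Below is a minimal Python sketch (not part of the paper) that reproduces the pattern count and the Figure 4 probability factors using the same reasoning: if $k$ of the 5 query terms are fixed, the remaining $5 - k$ positions can be filled from the $100 - k$ unused concepts.

```python
from math import comb

VOCAB = 100  # assumed vocabulary of 100 concepts
TERMS = 5    # a topic is identified by 5 conceptual terms

# Total number of distinct 5-term patterns: C(100, 5)
print(f"{comb(VOCAB, TERMS):,}")  # 75,287,520 (~75 million)

# Probability factor when only k of the 5 query terms must match:
# the 5 - k unconstrained positions are filled from the 100 - k
# remaining concepts, giving C(100 - k, 5 - k) qualifying patterns.
for k in range(TERMS, 0, -1):
    print(f"{k}/5 match -> factor {comb(VOCAB - k, TERMS - k):,}")
# 5/5 -> 1, 4/5 -> 96, 3/5 -> 4,656, 2/5 -> 152,096, 1/5 -> 3,764,376
```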
This would seem to prefigure vector space search by some 22 years. Luhn is proposing a way to visualize the relationship between terms and topics: think of each term as a dimension in a multi-dimensional space. A topic is then a point in this space, and the coordinates of that point are the values of the terms that define the topic. For example, if we have a five-dimensional space defined by the terms "cat", "dog", "pet", "animal", and "mammal", then the topic "domestic cat" would be located at the point (1,0,1,1,1), since it is defined by all of these terms except "dog". The topic "dog" would be located at the point (0,1,1,1,1).

Luhn's idea is that related topics will be located near each other in this space. For example, the topics "domestic cat" and "dog" are more closely related to each other than either is to the topic "rock", since they share the dimensions "pet", "animal", and "mammal". This is because they agree in some of the identifying terms and therefore share some of the concept fields.

A specialized dictionary serves as a crucial tool in Luhn's information retrieval system. The dictionary normalizes terminology by mapping specific terms to broader key terms, treating the specific terms as synonyms of a higher order than ordinarily considered. It treats terms as conceptual fields, allowing a single specific term to be linked to multiple key terms for a more nuanced representation. For example, the term "dolphin" might be linked to key terms like "mammal", "cetacean", and "predator". This approach reduces terminology mismatches between those recording information and those searching for it. If one person uses "dolphin" and another searches for "whale", the system can still make the connection through their shared link to "cetacean". The dictionary doesn't enforce strict hierarchies; instead, it allows for overlapping concepts, as shown in Figure 3.
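To tie these ideas together, here is a small illustrative Python sketch; the dictionary entries, document legends, and terms are invented for illustration, and this is not Luhn's implementation. It combines the normalizing dictionary with the Figure 4 idea of requiring only a fixed fraction of the query's key terms to be met by a record.

```python
# Hypothetical normalizing dictionary: specific term -> broader key terms.
KEY_TERMS = {
    "dolphin": {"mammal", "cetacean", "predator"},
    "whale":   {"mammal", "cetacean"},
    "cat":     {"mammal", "pet", "predator"},
    "granite": {"rock", "mineral"},
}

def normalize(terms):
    """Replace specific terms with their key terms; unknown terms pass through unchanged."""
    keys = set()
    for term in terms:
        keys |= KEY_TERMS.get(term, {term})
    return keys

# A tiny record collection: document id -> normalized legend (identifying key terms).
RECORDS = {
    "doc1": normalize({"whale", "ocean"}),     # {'mammal', 'cetacean', 'ocean'}
    "doc2": normalize({"cat", "house"}),       # {'mammal', 'pet', 'predator', 'house'}
    "doc3": normalize({"granite", "quarry"}),  # {'rock', 'mineral', 'quarry'}
}

def search(query_terms, fraction):
    """Return documents whose legends meet at least `fraction` of the normalized query terms."""
    query = normalize(query_terms)
    needed = fraction * len(query)
    return [doc for doc, legend in RECORDS.items() if len(query & legend) >= needed]

# The query {"dolphin", "ocean"} normalizes to {'mammal', 'cetacean', 'predator', 'ocean'}.
print(search({"dolphin", "ocean"}, fraction=3/4))  # ['doc1']          (specific search)
print(search({"dolphin", "ocean"}, fraction=2/4))  # ['doc1', 'doc2']  (broadened search)
```

Relaxing the required fraction from 3/4 to 2/4 pulls in the more remote record, mirroring the progressively broader response fields Luhn describes.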