A NEW METHOD OF RECORDING AND SEARCHING INFORMATION

H. P. LUHN*

*International Business Machines Corporation, Engineering Laboratory, Poughkeepsie, New York.
This method applies to the procedures required to record a legend concerning a document and to enable an inquirer to locate this document by means of the legend, if it is related to a specified subject.
The conventional methods of indexing and classifying attempt to evaluate the relative importance of a plurality of aspects contained in a document and make the most important one the key for locating the document within an orderly scale of a certain dimension. Subordinated aspects are covered by way of reference in appropriate other locations of the scale.
One of the disadvantages of the conventional system is that the standard of value on which the indexer bases his decision may change, and what suddenly is considered an aspect of major significance may not have been included in the classification or index at the time, even though it was contained in a document.
Another drawback is that it becomes difficult for an inquirer to reverse the process of classification or indexing and pose his query in a form matching to a reasonable degree the values of a potential reference.
The new method uses the principle of characterizing a topic by a set of identifying elements or criteria. These elements may be of any dimension and as many may be recorded as is desirable. Also, they are not weighted and no significance need be implied by the order in which they are given.
One of the main functions of the new method is that of producing a response to an inquiry in all cases, even if the reference appears to be remote, it being the understanding that it is the closest available.
The elements enumerated by recorders to identify a topic will necessarily vary, as no two recorders will view a topic in identical fashion. Similarly, no two inquirers, when referring to the same subject, will state their query in identical fashion.
It is therefore important that a system recognizes that these variations arise and that they cannot be controlled. It must then become the function of the system to overcome these variations to a reasonable degree.
When identifying a topic by a set of criteria or identifying terms, the more terms are stated the more specifically the topic is delineated. Each term in turn may be a concept which in itself may vary as to specificity. If we consider a concept as being a field in a multi-dimensional array, we may then visualize a topic as being located in that space which is common to all the concept fields stated. It may further be visualized that related topics are located more or less adjacent to each other depending on the degree of similarity, and that this is so because they agree in some of the identifying terms and therefore share some of the concept fields.
Figure 1 is a diagrammatic illustration.

FIGURE 1. [Diagram: the topic identified, with other topics adjacent.]
In order to understand the nature of the arrangement, let us assume a vocabulary of 100 concepts and let us identify a topic by five conceptual terms. By using all possible combinations of five terms, a total of 75 million patterns of criteria result, each of these patterns having a fixed location within the system. If then a topic is identified by five terms of the vocabulary, it is thereby assigned to a definite one of these fixed locations.
While assuming that there is an ideal and true location where a topic belongs, it is un- […] were to do the same job. There will result a deviation from the true location proportionate to the degree of disagreement of either.
For instance, one recorder may diverge to the extent of matching only 3 of the 5 criteria while the other matches 4. The resultant displacement is shown in diagram Figure 2.
FIGURE 2. [Diagram: 3/5 and 4/5 matches displaced from the ideal (5/5) location.]
Such disagreements will be the more pronounced the more specific the conceptual terms are, and it is a further function of the new method to minimize variations by broadening the concept used in the terms and by using as large a number of broadened criteria as possible, even to the extent of redundancy. This approach is based on the realization that an inquirer could not match excessive specificity when stating his query and that his position is similar to that of the recorder.
The process of broadening the concept involves the compilation of a dictionary wherein key terms of desired broadness may be found to replace unduly specific terms, the latter being treated as synonyms of a higher order than ordinarily considered. Translating criteria into these key terms is a process of normalization which will eliminate many disagreements in the choice of specific terms amongst recorders, amongst inquirers, and amongst the two groups, by merging the terms at issue into a single key term. However, the dictionary does not classify or index but maintains the idea of terms being fields and applies the identification principle to the terms in the manner it is applied to the topics, even though to a lesser degree.
A specific term may appear under the heading of several key terms and, if according to its application an overlapping of concepts exists, then the term is represented by the several key terms involved, as shown diagrammatically in Figure 3 for 'b'.
The manner in which an inquirer approaches the process of searching for desired information becomes one similar to that performed by the recorder. He first states his query in as many and as specific terms as he desires. Then with the aid of the special dictionary he normalizes the conceptual terms of identification to arrive at a statement adjusted to the requirements of the system.
The actual process of searching involves the comparing of his statements with all the statements contained in the collection of records prepared by the recorder. This task, being beyond human capability, may be performed automatically by a scanning machine which is capable of not only matching similar portions of information but of doing this in accordance with any conceivable pattern of conditions.
As indicated earlier, the intended purpose of a search is to produce a response to a query. Because it is not usually known how specific a response can be expected, the initial query is stated rather broadly, thereby extending the field to include less related material. The extent of responses obtained on this basis is a valuable indication of the amount of attention devoted to the subject area in the past. The material obtained would then be subjected to increasingly more specific searches in order to get the closest match possible. Also, material uncovered by this approach may lead to the discovery of unsuspected, but pertinent, other related information.
FIGURE 3. [Diagram: specific terms appearing under the overlapping fields of key term A and key term B.]
In particular, the scheme of broadening the field of response consists of asking that a fixed fraction of the given terms be met by the records. This procedure is quite different from that used when broadening a generic search by dropping subclasses. The effect is illustrated by the following diagrams, Figure 4, showing progressively broader fields formed by 5 terms. Using the proportions of the example previously given and assuming an evenly distributed population of topics, the relative probability of response is expressed by the factors listed below each fraction. While applied to an idealized situation, the results are nevertheless indicative of the advantages the method of identification has over other methods of indexing information.
| Fraction of terms matched | 5/5 | 4/5 | 3/5 | 2/5 | 1/5 |
|---|---|---|---|---|---|
| Probability factor of response | 1 | 96 | 4,656 | 152,096 | 3,764,376 |

FIGURE 4.
THE USE OF THE UNIVAC FAC-TRONIC SYSTEM IN THE LIBRARY REFERENCE FIELD

HERBERT F. MITCHELL, JR.*

*Director, UNIVAC Applications Department, Remington Rand, Inc.
The tremendous increase in the volume of technical literature of all kinds and fields is presenting the librarian with an almost impossible reference task. The sheer volume of these documents is creating a filing problem of the first magnitude. When this volume is combined with the fact that many documents cut across classification lines, the problem of providing reference bibliographies is made that much more difficult.
Several persons concerned with the furnishing of reference material have approached those of us engaged in the manufacture and utilization of digital computers to see if these machines might be of assistance to the librarian. Such an occasion arose a little over a year ago when the Centralized Air Document Office in Dayton, Ohio, approached Remington Rand to ascertain the suitability of our equipment for this work.
A study was made to see how the UNIVAC Fac-Tronic System might be applied to the task of obtaining all possible documents from a large file which could answer a specific query submitted to this office. The model studied envisioned a library of 1,000,000 documents. Each document was identified by an eight-digit shelf number. A master reference file was to be compiled, each item of which would consist of the shelf number followed by a series of coded approaches. Each such approach would represent some pertinent feature of the document, such as: author, date, contract number, and descriptors of the subject or subjects treated by the document. It was anticipated that each document would have an average of fifteen approaches with a maximum of thirty.
In order to obtain a list of all documents which might possibly answer a given query, the computer would be supplied with the appropriate coded approaches included in the query. It would then search through the entire master file and select all document items which contain the approaches given in the query.
For such a system as the above to be workable […]

## Discussion

### Intro

In 1953, Hans Peter Luhn, a researcher at IBM, published "A New Method of Recording and Searching Information," a landmark paper in the field of information retrieval. This work addressed the growing challenge of managing and accessing the rapidly expanding volume of scientific and technical literature in the 1950s, when traditional classification systems were becoming inadequate. Luhn proposed a revolutionary approach that broke away from rigid classification, introducing a flexible method that uses sets of identifying terms or "criteria" to describe documents. This laid the groundwork for modern search concepts like keyword searching and relevance ranking. Luhn's method anticipated the need for machine-assisted searching and addressed the subjectivity in how different people describe the same information. His idea of representing documents as sets of terms conceptually paved the way for modern vector search.

The term "legend" refers to a set of descriptive information or metadata about a document. This includes key details such as keywords, summaries, titles, and other relevant descriptors that provide an overview of the document's content.

### Conventional Indexing Approach

Imagine a comprehensive research document that covers various topics within the field of cancer treatment, including chemotherapy, immunotherapy, radiation therapy, and surgical procedures. The indexer might decide that "chemotherapy" is the most important aspect of the document because it is the most detailed and occupies a significant portion of the content. Consequently, the document is indexed under "chemotherapy". Other important aspects, such as "immunotherapy", "radiation therapy", and "surgical procedures", are referenced under their respective categories but are given less emphasis. These topics might appear in cross-references within the index but are not the primary entry points for the document.

One drawback of this approach is that the importance of topics is subjective and can change over time. For instance, if immunotherapy becomes the most promising and researched area in cancer treatment, the document indexed primarily under "chemotherapy" might not be easily found by researchers interested in immunotherapy. If the indexer later decides that "immunotherapy" has gained more relevance and that the document has become pivotal in the "immunotherapy" body of work, they would need to reclassify the document, which is inefficient.

Luhn addresses a fundamental challenge in information retrieval: balancing specificity with findability. He recognizes that using very specific terms to describe topics can lead to mismatches between how information is recorded and how it is searched for, since different people may use different specific terms for the same concept. To solve this, Luhn proposes a counterintuitive approach: use broader, more general terms, and use more of them. This strategy of employing multiple broad terms, even to the point of redundancy, increases the likelihood of a match between the recorder's and the inquirer's language.

The 75 million patterns come from a combinatorial calculation: all possible ways to select 5 items from a set of 100, where order doesn't matter. With 100 concepts and choosing 5 at a time, we get $C(100,5) = \frac{100!}{5!(100-5)!} = \frac{100!}{5!\,95!} = 75{,}287{,}520$.

**Calculating the Probability Factors:**

- When you require a match on all 5 specific terms, there is only 1 possibility.
- When you need 4 specific matches out of 5, the one unconstrained term can be any of the $100 - 4 = 96$ remaining concepts.
- When you require 3 specific matches out of 5, the other two terms can be chosen from the $100 - 3 = 97$ remaining concepts in $\frac{97 \times 96}{2!} = 4{,}656$ ways (since order doesn't matter).
- When you only need 2 matches, the other three terms can be chosen in $\frac{98 \times 97 \times 96}{3!} = 152{,}096$ ways.
- When you only require 1 specific match, the other four terms can be chosen in $\frac{99 \times 98 \times 97 \times 96}{4!} = 3{,}764{,}376$ ways.
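These factors are easy to verify. Below is a minimal Python sketch (not part of the paper) that reproduces the pattern count and the Figure 4 probability factors using the same reasoning: if $k$ of the 5 query terms are fixed, the remaining $5 - k$ positions can be filled from the $100 - k$ unused concepts.

```python
from math import comb

VOCAB = 100  # assumed vocabulary of 100 concepts
TERMS = 5    # a topic is identified by 5 conceptual terms

# Total number of distinct 5-term patterns: C(100, 5)
print(f"{comb(VOCAB, TERMS):,}")  # 75,287,520 (~75 million)

# Probability factor when only k of the 5 query terms must match:
# the 5 - k unconstrained positions are filled from the 100 - k
# remaining concepts, giving C(100 - k, 5 - k) qualifying patterns.
for k in range(TERMS, 0, -1):
    print(f"{k}/5 match -> factor {comb(VOCAB - k, TERMS - k):,}")
# 5/5 -> 1, 4/5 -> 96, 3/5 -> 4,656, 2/5 -> 152,096, 1/5 -> 3,764,376
```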
This would seem to prefigure vector space search by some 22 years. Luhn is proposing a way to visualize the relationship between terms and topics: think of each term as a dimension in a multi-dimensional space. A topic is then a point in this space, and the coordinates of that point are the values of the terms that define the topic. For example, if we have a five-dimensional space defined by the terms "cat", "dog", "pet", "animal", and "mammal", then the topic "domestic cat" would be located at the point (1,0,1,1,1), since it is defined by all of these terms except "dog". The topic "dog" would be located at the point (0,1,1,1,1).

Luhn's idea is that related topics will be located near each other in this space. For example, the topics "domestic cat" and "dog" are more closely related to each other than either is to the topic "rock", since they share the dimensions "pet", "animal", and "mammal". This is because they agree in some of the identifying terms and therefore share some of the concept fields.

A specialized dictionary serves as a crucial tool in Luhn's information retrieval system. The dictionary normalizes terminology by mapping specific terms to broader key terms, treating the specific terms as synonyms of a higher order than ordinarily considered. It treats terms as conceptual fields, allowing a single specific term to be linked to multiple key terms for a more nuanced representation. For example, the term "dolphin" might be linked to key terms like "mammal", "cetacean", and "predator". This approach reduces terminology mismatches between those recording information and those searching for it. If one person uses "dolphin" and another searches for "whale", the system can still make the connection through their shared link to "cetacean". The dictionary doesn't enforce strict hierarchies; instead, it allows for overlapping concepts, as shown in Figure 3.
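To tie these ideas together, here is a small illustrative Python sketch; the dictionary entries, document legends, and terms are invented for illustration, and this is not Luhn's implementation. It combines the normalizing dictionary with the Figure 4 idea of requiring only a fixed fraction of the query's key terms to be met by a record.

```python
# Hypothetical normalizing dictionary: specific term -> broader key terms.
KEY_TERMS = {
    "dolphin": {"mammal", "cetacean", "predator"},
    "whale":   {"mammal", "cetacean"},
    "cat":     {"mammal", "pet", "predator"},
    "granite": {"rock", "mineral"},
}

def normalize(terms):
    """Replace specific terms with their key terms; unknown terms pass through unchanged."""
    keys = set()
    for term in terms:
        keys |= KEY_TERMS.get(term, {term})
    return keys

# A tiny record collection: document id -> normalized legend (identifying key terms).
RECORDS = {
    "doc1": normalize({"whale", "ocean"}),     # {'mammal', 'cetacean', 'ocean'}
    "doc2": normalize({"cat", "house"}),       # {'mammal', 'pet', 'predator', 'house'}
    "doc3": normalize({"granite", "quarry"}),  # {'rock', 'mineral', 'quarry'}
}

def search(query_terms, fraction):
    """Return documents whose legends meet at least `fraction` of the normalized query terms."""
    query = normalize(query_terms)
    needed = fraction * len(query)
    return [doc for doc, legend in RECORDS.items() if len(query & legend) >= needed]

# The query {"dolphin", "ocean"} normalizes to {'mammal', 'cetacean', 'predator', 'ocean'}.
print(search({"dolphin", "ocean"}, fraction=3/4))  # ['doc1']          (specific search)
print(search({"dolphin", "ocean"}, fraction=2/4))  # ['doc1', 'doc2']  (broadened search)
```

Relaxing the required fraction from 3/4 to 2/4 pulls in the more remote record, mirroring the progressively broader response fields Luhn describes.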