Indexing: some 21st century thoughts by Kevin P. Jones

In 1976 Jones considered that the world of information science needed a theory of indexing and a note entitled Towards a theory of indexing appeared in Journal of Documentation. This paper was in a folder in the loft of his North Norfolk bungalow (and has been scanned) as for copyright reasons he does not have access to his own work held by some profiteering online database. This is reproduced below with all the problems inherent in scanning.

THE RECENT correspondence in this journal,1-3 nominally about PRECIS, has served to introduce what Professor Swift2 correctly terms a 'major blind spot': namely that few attempts have been made to establish the nature of the indexing process. Most studies claiming to be about indexing are, in fact, about indexes. Even practical guides to index construction rarely progress beyond concrete topics like alphabetization, variant spellings, and proof correction. Few writers achieve the level of Anderson's simple but helpful comments4 which guide the novice indexer to the more potentially useful sections within a text such as chapter and section headings. Anderson also observes that the preface 'rarely needs indexing' and that footnotes should not be ignored. This is extremely helpful, practical advice which is frequently missing elsewhere, such as in the British Standard,5 but it scarcely amounts to theory. Nevertheless, it does give some indication as to the way that a theory might develop.

At least within the present context, indexing is (1) the process of extracting words from text; (2) augmenting these words when necessary by other words not in, but which are related to, the original text; (3) if necessary the transcription of these words into some form of controlled language (many technologies attempt to employ standardized terminologies and abbreviations—the indexer may decide to comply with these even though the author/s may not have done so); (4) the use of words to form an index or as input to an information retrieval system and (5) the construction of cross references and thesauri if considered to be necessary. The first process is essential and it is with this that the present note will be mainly concerned.

Unfortunately, the correspondents1-3 mentioned earlier fail to draw a clear distinction between indexing as defined above and document classification. The former is a word-based process and is based on closely identifying all the topics mentioned within a document, whereas the latter seeks to establish what a document is 'about'. Whilst not wishing to denigrate the occupation of book classification, it is frequently possible to establish what a document is about by fairly superficial examination. In many cases the title is capable of giving a correct indication of subject content—indeed, if this was not so the procedure would not be castigated as being dangerous.

Indexing can be considered on a number of levels, the most primitive of which corresponds with the concordance. A crude concordance merely consists of references to all the words in the original text, arranged in alphabetical order. More refined listings can be achieved by excluding uninformative words like articles, pronouns, and prepositions. This is the level at which KWIC indexes operate.
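The concordance level can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list below is hypothetical (the note names classes of uninformative words, not a list), and references are given as word offsets rather than page numbers.

```python
import re
from collections import defaultdict

# Hypothetical stop list: the note names only articles, pronouns and
# prepositions as classes, so these particular words are an assumption.
STOP_WORDS = frozenset({"a", "an", "the", "it", "he", "she", "they",
                        "of", "in", "on", "to", "from", "by", "with"})

def concordance(text, stop_words=frozenset()):
    """Map each word to the 0-based word offsets at which it occurs."""
    entries = defaultdict(list)
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    for position, word in enumerate(words):
        if word not in stop_words:
            entries[word].append(position)
    # Alphabetical order, as in a printed concordance.
    return dict(sorted(entries.items()))

sample = "A bar is a spit which extends from one headland to another"
crude = concordance(sample)                # every word is listed
refined = concordance(sample, STOP_WORDS)  # uninformative words excluded
```

The crude listing keeps an entry for "a" (offsets 0 and 3); the refined listing drops it along with "from" and "to", which is roughly the filtering that a KWIC index applies.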

An indexer is more likely to accept a word for indexing if it occurs fifty, rather than 500, or five widely scattered times within a text. If a word occurs very frequently, then it is unlikely to be selected; the document is 'all about that'. The word rock may occur frequently in books about geology, but is seldom likely to be used as an index-entry in that environment. Some rare words are obvious candidates for selection particularly when they are highly pertinent to the main theme (this will be discussed in greater depth subsequently), but many rarely occurring words are introduced solely as aids in support of arguments, or as illustrations. For instance, in indexing this note it would not be helpful to extract the words rock or geology (above). Similarly, the words aeroplane and bombing might occur once in a document about volcanoes, but only the latter would be selected (bombing of lava flows). The former is merely the agent involved in bombing and unlikely to be of interest to geologists.
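The frequency argument can be illustrated by a simple band filter. The sketch below is an assumption-laden illustration, not a rule from the note: the thresholds `low` and `high` are hypothetical, and a count alone cannot distinguish a pertinent rare word (bombing) from an incidental one (aeroplane), which is why the later levels are needed.

```python
from collections import Counter

def frequency_candidates(words, low=6, high=100):
    """Return, alphabetically, the words whose count lies in low..high.

    The default thresholds are hypothetical; the note gives only the
    rough contrast of fifty occurrences against 500 or five.
    """
    counts = Counter(words)
    return sorted(w for w, n in counts.items() if low <= n <= high)

# A toy "geology text": one very frequent word, one moderately
# frequent word, and one word that occurs only once.
words = ["rock"] * 500 + ["tombolo"] * 50 + ["aeroplane"]
candidates = frequency_candidates(words)
```

With these toy counts only "tombolo" survives: "rock" is too frequent (the document is 'all about that') and "aeroplane" too rare.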

Word length and letter frequency also seem to influence the potentiality for selection. Short words are likely to be more common, and therefore less likely to be selected. Very long words are physically difficult to incorporate in indexes. Unusual letter combinations (e.g. fj in fjord), bold and italic type (used in moderation) and side-headings are all likely to catch the indexer's attention. These factors are quantifiable; therefore there would seem to be an information theoretic element within indexing. It is necessary to add that there seems to be a critical density for word selection. An author may introduce a new term, define it, use it relatively frequently in expanding and illustrating the definition (it is here that it is selected for indexing), then use it merely as part of a larger vocabulary (when it is no longer selected) and perhaps reintroduce it with an expanded definition, or in relation to something else at a later stage (when it will be selected once again).
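The notion of a critical density can be made concrete with a sliding window. The following is a minimal sketch, assuming that density means "at least `min_hits` occurrences within `window` words"; both parameters and the function name are hypothetical and are not taken from the note.

```python
def dense_positions(words, term, window=20, min_hits=3):
    """Offsets of `term` that have at least `min_hits` occurrences
    (the occurrence itself included) within `window` words either side."""
    hits = [i for i, w in enumerate(words) if w == term]
    return [i for i in hits
            if sum(1 for j in hits if abs(i - j) <= window) >= min_hits]

# Toy text: the term is defined and illustrated (offsets 0-10), used
# once in passing (offset 100), then reintroduced (offsets 200-208).
text_words = ["x"] * 210
for offset in (0, 5, 10, 100, 200, 204, 208):
    text_words[offset] = "bar"

selected = dense_positions(text_words, "bar")
```

The isolated use at offset 100 is not selected, whereas the two clusters are, which mirrors the pattern of definition, routine use and reintroduction described above.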

To follow Anderson4 some parts of a document are especially rewarding to the indexer. Opening paragraphs (in chapters or sections) and opening and closing sentences of paragraphs seem to be particularly rich in indexable words. Definitions are particularly important and marked by a distinctive syntax. This can be illustrated by reference to example 1. The paragraph begins with a definition: 'A bar is a spit...'. The importance of this textbook definition is stressed by italicization. The definition does not end with the first sentence; other definitory elements are introduced by called, as in 'called "haffs"', and 'a similar construction ... is a connecting bar'. Other linguistic clues for definitions include:

X may be defined as
The notion of X ness
The idea of X ness
The concept of X ness

and so on. Definitions are important for a number of reasons. Firstly they act as markers for the introduction of new ideas and the word or words associated with these ideas. They also provide clues for structuring indexes or thesauri (see example 1a).
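The linguistic clues listed above lend themselves to pattern matching. The sketch below is illustrative only: the regular expressions are hypothetical approximations of those clues, they will miss many definitions and accept some non-definitions, and the sample passage is adapted from example 1.

```python
import re

# Hypothetical cue patterns, loosely based on the clues listed above.
DEFINITION_CUES = [
    re.compile(r"\b(?P<term>\w[\w-]*) may be defined as\b", re.IGNORECASE),
    re.compile(r"\bAn? (?P<term>\w[\w-]*) is an? \w+", re.IGNORECASE),
    re.compile(r"\bcalled ['\"](?P<term>[^'\"]+)['\"]", re.IGNORECASE),
    re.compile(r"\bthe (?:notion|idea|concept) of (?P<term>\w[\w-]*)",
               re.IGNORECASE),
]

def defined_terms(text):
    """Return terms found next to a definitional cue, in text order."""
    found = []
    for pattern in DEFINITION_CUES:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group("term").lower()))
    return [term for _, term in sorted(found)]

passage = ("A bar is a spit which extends from one headland to another. "
           "There are extensive lagoons, called 'haffs', on the landward side.")
terms = defined_terms(passage)
```

On this passage the function picks out "bar" and "haffs", the two terms that example 1 marks as introduced by definition.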

The example considered above introduces the linguistic element into indexing. Traditional syntax, and within the present context Chomskyan linguistics6 must be regarded as traditional, has been confined to studies of the sentence and has tended to be rigidly divorced from semantics. Both, but especially the former, restrict the applicability of syntax to an indexing theory which must be capable of explaining how meaningful words are extracted from large units of text.

EXAMPLE 1

Index words                 Text
bar/spit                    A bar is a spit which extends from
headland                    one headland to another, or nearly so. If
bay                         the bay inside is completely enclosed it
shore-line lake             becomes a shore-line lake. More usually,
                            however a narrow channel is kept open
tidal scour                 by tidal scour and outflowing drainage.
                            Between Danzig and Memel on the south
Baltic coast                Baltic coast, there are two very long
bar</dunes/sand             bars, surmounted by sand dunes, with
lagoons/haffs               extensive lagoon, called 'haffs', on
                            the landward side (Fig. 162). A beautiful
bar</Iceland                example of bar in Iceland is illustrated
                            in Fig. 163. A similar construction of
sand</shingle/headland<     sand or shingle connecting a headland to
island<                     an island, or one island to another is
bar<connecting bar/tombolo  a connecting bar or tombolo
(A. Holmes, Principles of Physical Geology. London: 1944.)

Example 1A showed potential entries for a thesaurus of geomorphology. As this aspect is not relevant to the Kindle Age it has not been reproduced.

Fortunately, linguists are increasingly interrelating structure and meaning, and a few workers, admittedly mainly those aware of problems in indexing or artificial intelligence (notably Heaps,7 Winter,8 Dea,9 Masterman,10 Fisher,11 and Wilks12), are exploring structure in large units of text.

It has been shown that definitions can be identified and used to produce index structure. Similarly, synonyms can be identified through syntax: 'a connecting bar or tombolo' (example 1). Syntax does not provide mutually exclusive categories. Thus synonymy can be expressed within the same syntactic framework as that used for definitions:

tombolos are connecting bars
or connecting bars are (also known as) tombolos
or connecting bars or tombolos.

This reinforces Lyons' claim13 that synonymy is a reversible form of hyponymy (the hierarchal relationship within linguistics). All connectives, but particularly prepositions, are capable of exerting a strong influence on meaning. An extreme example should make the point self-evident: there is a vast difference between the two undermentioned phrases, yet only the prepositions are different:

the aircraft over the sea
the aircraft under the sea.

The above prepositions are antonymous, but substitution of others (near, for, on, etc.) still produces marked shifts in meaning.

Thus far the theory has been limited to considering individual words or phrases. The next level to be identified poses problems of nomenclature; nevertheless, its nature can be readily outlined. Authors are instructed to approach their task in an organized manner by producing skeletal structures which are subsequently clothed in text. In example 1 one can imagine how the author originally possessed a sheet of paper on which the word bar appeared as a sub-heading of coasts; beneath bar, words like haffs, Baltic, and tombolos were probably jotted. It may be argued that some authors write without a physical outline, but it is probable that most of these employ some form of mental framework. The successful indexer needs to disinter this skeleton and this can be achieved by searching for clues on the surface. Some of the ways that this may be achieved have been described at the Informatics conferences. Masterman's rhetorical figures10 break down text into manageable units and illuminate the inherent structure through its repetitory patterns. Related work by Winter8 and Dea9 attempts to identify the markers which hold a text together and accentuate its key elements. In example 1, 'a similar construction' draws the reader's attention back up the page to the original definition.

Another approach has been provided by Fisher,11 whose work is significant to information retrieval for a number of reasons. In the present context its importance lies in the stress which Fisher places on verbs. In information systems verbs tend to be ignored. Even linguists frequently fail to stress the pivotal nature of verbs sufficiently; Chafe14 is a notable exception, however. Fisher classifies performative verbs (actions) into nine interactive groups. Each group is characterized by a presupposition (wants is one characteristic) and these can be negated. This can be illustrated simply: to ask for presupposes that someone 'wants' and to advocate that presupposes 'good' and implies 'wants'.

From these approaches plus personal observation, it would seem that there should be an inferential level in an indexing theory. From the fourth sentence in example 2, it is possible to infer that oaks are trees and that birch and ash may also be trees. It is possible to infer that regeneration is related to tree growth and that woods are in some unstated way greater than stands. These inferences can be made because the paragraph/sentence structure suggests that they may be made. If the third sentence is stripped of detail it reduces to: 'Many parts are pure oak, but others are mixed with birch and ash.' From the first and fourth sentences it is clear that an oak is a tree: therefore, it is probable that birch and ash are also trees. This inference is assisted by the quasi-antonymous: pure/mixed. Without an inferential level it would be impossible for the indexer to operate in novel subject areas.

EXAMPLE 2

Index entries               Text
oak woods/Letterewe         The oak woods of Letterewe in
Wester Ross                 Wester Ross are the most northerly
stands                      of this series apart from very small stands
                            in Strathoykell. Mention has already
                            been made in chapter 4 of their connection
iron-smelting/Highlands     with the iron-smelting era in the Highlands.
Letterewe/oak<              Many parts of Letterewe are of pure oak but
birch/ash                   in others it is mixed with birch and ash
understorey                 and there is little or no understorey of
hazel scrub/oaks<           hazel scrub. The oaks are even-aged
symmetry/spacing            and there is a symmetry and spacing of
trees                       trees which suggests that they may have
planted                     been planted in the early 19th century.
regeneration                There is very limited regeneration and
                            the owner is co-operating with the
Nature Conservancy          Nature Conservancy with a view to improving
growth/young trees          the growth of young trees

The text chosen was F. Fraser Darling and J. Morton Boyd's The Highlands and Islands. London. 1964. In retrospect (2012) this was not a good example. Too many words were chosen and Strathoykell should have been selected.

Five levels have been identified. These are:
1. Concordance
2. Information theoretic
3. Linguistic (syntactic/semantic)
4. Textual framework or skeletal
5. Inferential

The indexer may relate what he reads about to what he has read about elsewhere, or seen, or heard about, but this experiential condition is very secondary: a deaf man could index a book about music. The indexer must be capable of learning and may be influenced by his immediate environment (whether it is hot or cold or noisy), but these are transient conditions. No consideration has been given to indexing non-text materials such as illustrations. Most commonly these materials will be pre-captioned, in which case the usual techniques for word extraction will apply. More rarely the indexer has to cope with realia and illustrations of realia without captions, in which case the rules of classification rather than of indexing will apply.15

The present theory is very incomplete and far from tested. Nevertheless, it is presented in the hope that it may stimulate thought and discussion. At present most comment is trivial; and there is even a tendency to consider that theories of indexing should come from outside information science.

REFERENCES
1. MOSS, R. PRECIS (letter). Journal of Documentation, 31, 1975, pp. 116-17.
2. SWIFT, D. F. PRECIS (letter). Journal of Documentation, 31, 1975, pp. 117-18.
3. AUSTIN, D. PRECIS (letter). Journal of Documentation, 31, 1975, pp. 118-20.
4. ANDERSON, M. D. Book indexing. Cambridge, University Press, 1971.
5. BRITISH STANDARDS INSTITUTION. Recommendations for the preparation of indexes for books, periodicals and other publications. (BS 3700: 1964).
6. CHOMSKY, N. Syntactic structures. The Hague, Mouton, 1957.
7. HEAPS, D. and INGRAM, W.D. Computer recognition and graphical reproduction of pattern in scientific and technical style. Proceedings of the American Society for Information Science, 8, 1971, pp. 257-61.
8. WINTER, E.O. What do words mean? A consideration of the role of words in information structure. Informatics 3: Proceedings (to be published).
9. DEA, W. Beyond the sentence: clause relations and textual relations. Informatics 3: Proceedings (to be published).
10. MASTERMAN, M. Chasing the enthymeme: natural language handling with an online text-editor. Informatics 1: Proceedings of a conference held by the Aslib Co-ordinate Indexing Group on 11-13 April 1973 at Durham University. London, Aslib, 1974: Informatics 2: Proceedings. London, Aslib, 1975.
11. FISHER, J.N.D. Computational semantics: a study of a class of verbs. Birmingham, University of Aston, 1974 (Ph.D. thesis).
12. WILKS, Y.A. Grammar, meaning and the machine analysis of language. London, Routledge & Kegan Paul, 1972.
13. LYONS, J. Introduction to theoretical linguistics. Cambridge, University Press, 1968.
14. CHAFE, W.L. Meaning and structure of language. Chicago, University Press, 1970.
15. JONES, K.P. The environment of classification. Journal of the American Society for Information Science, 24, 1973, 157-63; 25, 1974, 44-51.

The above appeared in Journal of Documentation, 1976, 32 (Part 2), 118-23 and if Kevin had really thought that this would be read in the 21st century he would have made hypertext references within the text.

Kevin still considers indexing to be a vital component of book production, but remains uncertain about the extent of its potential in the age of (1) the electronic book (as per Kindle) and (2) the "widespread" availability of texts searchable on the Internet. The growth of the latter is disappointing and will endanger the first as a means of communication. The following is an extract (pp. 66-7) from Brian Reed's excellent 150 years of British steam locomotives which is the sort of book which should be in electronic form if Kindle use is to advance beyond light fiction. It is part of Chapter 8 From Iron to Steel and it was quite a shock to discover that the steamindex section on locomotive design had no entry under frames despite their key influence on locomotive construction.

In the extract some attempt has been made to insert anchor tags, but this was far more difficult to achieve than had been anticipated as the text had not been constructed with hypertext being envisaged. There were the same old problems with compound words, the relationship to other sections of the text (not reproduced) and where to stop.

Perfection of forging processes and improvement in steel increased the number of inside-cylinder locomotives in the late Victorian age until in 1897 there was scarcely a railway building outside-cylinder locomotives when only two cylinders were used; but with the larger engines constructed from 1899-1900 increasing cylinder diameters and piston thrusts caused a reversion to outside cylinders or the adoption of multi-cylinder propulsion, because when cylinders were increased above 19½in dia space could not be found for adequate axle and crankpin journals and web widths plus four eccentrics. With the increasing stresses doubt also arose as to crank axle steel characteristics, some lines like the Midland favouring a 'soft' steel of low tensile strength and great ductility while others like the NER preferred harder steels of 38-40 tons/sq in ultimate strength. To get full advantage of the latter, and to ease the whole manufacture, built-up crank axles were introduced on some railways. They were made practicable by the higher standard of shop methods, and here again Webb was a pioneer for he introduced the first one in the late 1890s and is believed to have used a low nickel steel. Large rolled iron plates transformed main frame design, strength and manufacture principally because numerous bolted and rivetted connections along the side plates could be eliminated and no working could occur between individual frame and horn plates. In this aspect the plate frame was ahead of the bar frame, for not until the casting of a whole side frame in one piece in the 20th Century could a main bolted joint at a highly stressed location be eliminated from bar frame construction. By the 1860s single rolled iron plates to suit 2-4-0s and 0-6-0s were available though not in general use, but by then in the larger works the old type of sectional frame was being welded up under 10cwt steam hammers. 
Advantage was sometimes taken of the sectional method to enlarge the area round driving and coupled axleboxes to get the box central in the frames, and even, as in the Met tanks at a width of 5in, to act as the axlebox guide thrust face. In some designs this thickened section was used to change the distance between frames fore and aft of it to get an extra two or three inches firebox width or to suit cylinder spacing. By the time of these developments frame conditions themselves had eased through the elimination of the frame-firebox tie-in and the substitution of expansion brackets, so that despite greater locomotive size and power the frames from the 1860s were a far more rigid and better maintenance proposition than those of the 1840s and 1850s.

With the increase in size of rolled plates, progressive steps had to be made in machine tools that could handle them, but probably not until the Smith & Beecroft machine introduced 1858-9 was there a frame slotter of accuracy able to take more than one set of frame plates. For long years thereafter the method of frame contouring remained awkward. From the rectangular plate the contour was shaped roughly by punching or drilling overlapping holes round a template, annealing and straightening to remove any strains caused by punching, and then putting six to 10 plates together on a slotter for the final machining, after which the plater and his mate with large hammers gave a final straightening to individual plates laid on large cast iron slabs.

Steel plates without any thickening over iron permitted higher piston loads to be absorbed and greater weight supported, but in the 20th Century with larger 4-4-2, 4-6-0 and 4-6-2 locomotives the almost standard 1in was thickened whenever weight permitted to 1⅛in and even 1¼in; so small an addition as 1/16in was appreciated by some designers. Cross staying was the weak point, though alleviated by the use of steel castings for inside motion plates. Some of these castings, as on GCR 4-4-2, 4-6-0, 2-8-0 and 2-6-4T classes, were used to give great support at the location where the frame plates were set in at the front to clear the side movement of guiding wheels.

Only in the twilight of steam were horizontal or racking stays adopted to any extent. They were difficult to apply with inside cylinders or inside motion, and an advantage of outside-cylinders with outside valves and motion was always the possible stronger frame structure – if designers were so minded. They were not always so minded, because long decades of inside cylinders and motion and flabby frames brought designers to a self-defensive postulate that frames ought to be made deliberately with a little lateral flexibility. So frames became the weak point in large 20th Century multi-cylinder locomotives such as the GWR four-cylinder types and the LMS Royal Scots. By 1939 not one of the 79 Gresley 180lb and 220lb three-cylinder Pacifics of 1922-34 was still running with its original frames and the rate of Royal Scot frame cracks had more than quadrupled in six years.

Not until the Bulleid Pacifics of 1941 did a designer show how ample horizontal bracing could be provided with an inside cylinder, crank throw and motion. The later BR two-cylinder 4-6-2s and 2-10-0s with clear space between the frames also had full racking stays. In all three classes just mentioned was revived the old Beyer-Met practice of the frame centred above the axleboxes; and the BR types also followed the practice evolved on the LMS of link-and-pin cross tie-rods between the frames at the horns, and manganese steel liners for boxes and horns that greatly reduced the wear.

Towards the end of steam the composition of the usual 26/30-ton mild steel for frames was adjusted to suit oxy-acetylene flame cutting, for that process reduced considerably the time taken in preparation and machining. Then steel suitable for welding was introduced with small quantities of copper, chromium and manganese, and ultimate strengths up to 35-40 tons/sq in. By this means many bolted and rivetted connections could be obviated and the whole frame structure made up as one piece, an idea foreshadowed in England in 1869 when Webb had proposed that 'the frames, cross stays and back carriages are cast in steel with the necessary horn blocks enclosed and fixed in one piece'. Undeveloped foundry and machining techniques at that time prevented practical application.

From 1920 alloy steels were used increasingly for driving parts partly to cope with rising piston thrusts but mainly to reduce the weight of reciprocating and revolving parts that had to be balanced by weights in the wheels, and to decrease hammer blow. Only the GWR retained carbon steel for these constituents, but by heat treatment increased the strength to 38-40 tons/sq in. This railway also kept to rectangular sections for coupling rods. Other Group lines used 3 per cent nickel steel and nickel-chrome-molybdenum steels up to 60-65 tons/sq in ultimate strength, and favoured pronounced I-sections for connecting and coupling rods, sometimes with webs only 3/8in thick. To reduce weight and eliminate joints Gresley forged the piston and rod in one piece of nickel steel, a practice tried by Hackworth in 1849 and by McConnell in the 1850s with wrought iron.

Efforts to balance revolving weights are believed to have been made late in 1837 by Dawson, running foreman on the London & Southampton at Nine Elms with a panel between two spokes opposite the crank, and this was noted and adopted by Sharp Roberts in 1838 but with weights in the rims. In the latter year Heaton, who had been balancing stationary engines from 1810-12, made suggestions for balance weights in the wheels and some Bury four-wheelers on the London & Birmingham were so treated. Working independently, Hunt on the North Union in 1839-40 fitted weights to the driving wheels of Bury 2-2-0s primarily to try and ease pronounced fore-and-aft surging that was breaking intermediate drawbars. In 1844 Gray balanced the wheels of his large 0-6-0s on the Hull & Selby. From this time the practice of adding weights to the wheel rims or boss opposite the crank arm grew slowly. The early efforts related only to revolving weights. The first man to suggest balance of reciprocating parts was J. G. Bodmer in a patent of 1834 for an opposed-piston engine, and in the 1840s two or three of his locomotives were tried on the London & Brighton and South Eastern railways but with an ingenious arrangement that brought the drive from both pistons out of the back end of the cylinders. First to make the compromise of balancing reciprocating parts by revolving weights in the wheels was J. Fernihough, ex-Bury locomotive superintendent of the Eastern Counties in 1844. He may have been led to this by a letter from T. R. Crampton to The Railway Times in 1843 drawing attention to the 'unbalance' arising from reciprocating parts but offering no suggestions. The practice of balancing reciprocating parts really extended only after the publication in 1855 of D. K. Clark's Railway Machinery in which he gave the rules and proportions to be observed.

Clark's recommendation that two-thirds of the reciprocating weights should be balanced in the wheels derived from his experience with 2-2-2s, 2-4-0s, 0-4-2s and 0-6-0s of 20 to 30 tons weight, but was largely perpetuated in Britain until after World War I, no more understanding of the compromise principle having been attained than in the case of slide valves and valve motions. The LMS class 5 4-6-0s with 6ft wheels and 66 per cent of the reciprocating weights balanced in the wheels showed a clear wheel lift of 2in at speed of 7.8 revs/sec, corresponding to 100mph, and of course the wheel came down with a corresponding blow.
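The correspondence between 7.8 revs/sec and 100mph for a 6ft wheel can be checked with a little arithmetic. The sketch below assumes a wheel rolling without slip and takes 6ft as the diameter on the tread; the function name is hypothetical.

```python
import math

WHEEL_DIAMETER_FT = 6.0   # from the extract
REVS_PER_SEC = 7.8        # from the extract
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600

def speed_mph(diameter_ft, revs_per_sec):
    """Road speed of a wheel rolling without slip (an assumption)."""
    circumference_ft = math.pi * diameter_ft
    feet_per_sec = circumference_ft * revs_per_sec
    return feet_per_sec * SECONDS_PER_HOUR / FEET_PER_MILE

mph = speed_mph(WHEEL_DIAMETER_FT, REVS_PER_SEC)
```

The result is a little over 100mph, which agrees with the figure quoted in the extract.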

Only the investigations of the Bridge Stress Committee in the early 1920s showed this proportion to be needlessly high in the large engines of the time and that it could be reduced without deleteriously affecting the riding and with benefit to the bridges, because at 5 revs/sec the hammer blow from driving wheels of some of the engines tested was 50 per cent more than the static axle load. An immediate result was that civil engineers raised from 20 to 22 tons the permissible axle load for carefully balanced multi-cylinder locomotives, but close attention had to be paid to low hammer blow per wheel, per axle, per rail and per engine.