Thesaurus: what it is. A thesaurus is a dictionary that is more than a dictionary. Algorithm for compiling a thesaurus

SAMPLE

Syn: model, specimen, example, sample, standard, norm, measurement, specimen, standard, typical representative, template, stencil, prototype, drawing, construction, drawing, pattern, gestalt, frame

Thesaurus of the Russian language. 2012


One of the new basic concepts to emerge from the development of machine methods of information processing (in particular machine translation, the search for scientific and technical information, and the creation of an information model of an enterprise in automated control systems) is the concept of an information system thesaurus. The term "thesaurus" implies a body of knowledge about the external world; this is the so-called thesaurus of the world, T. All concepts of the external world expressed in natural language constitute this thesaurus, from which particular thesauri can be derived either by hierarchical division, taking into account the subordination of individual concepts, or by separating out parts of the general thesaurus of the world. In information retrieval systems, the thesaurus plays an important role in finding the required document by keywords. Constructing a thesaurus is therefore a difficult and crucial task, but one that can be automated.

Classification, in its most general definition, is the partitioning and ordering of sets: the distribution of objects into classes on the basis of a common feature that is inherent in those objects or phenomena and distinguishes them from the objects and phenomena that make up other classes. Each class can be divided into subclasses as needed. A rubricator is a special kind of classification, so rubricators are created on the basis of the same general provisions:
 a scientific basis for building the classification;
 reflection of the current level of development of science;
 a system of links and references, as well as a reference and cross-reference apparatus (CCA).

However, the rubricator is a pragmatic classification based on information flows and the needs of specialists. This is its difference from a priori classifications such as UDC and IPC.

The main functions of the classifications and, in particular, the rubricator are the following:
 thematic delimitation of information subsystems;
 formation of information arrays by any criteria;
 systematization of information materials and publications;
 current and retrospective search;
 indexing of documents and queries;
 relationship with other classification schemes;
 normative functions.

Classifications are built by dividing concepts (the objects of classification) on the basis of established relationships between the attributes of these objects, in accordance with certain logical principles. The criterion by which the division is made is called the basis of division of the classification. Deduction and induction are widely used in classifications to fix groups and classes and to identify the connections between them; this is typical of hierarchical classifications. The depth of a classification (the number of levels in the hierarchy) may vary depending on its purpose. One widely used rubricator is the State Rubricator of Scientific and Technical Information (GRNTI, also abbreviated SRSTI).

The GRNTI rubricator is designed so that it can be used together with other classifications such as UDC and IPC. The Universal Decimal Classification (UDC) has existed for more than 70 years, but it still has no equal in breadth of distribution and is used in many countries of the world. UDC covers the entire universe of knowledge and is successfully used for systematizing and subsequently searching a wide variety of information sources.

In addition to the UDC, the Library and Bibliographic Classification (LBC) is widely used in practice. LBC is built on the principles of logical subordination and is a classification of the applied type.
In the Russian Federation, the International Patent Classification (IPC) is used to classify inventions and to systematize domestic collections of invention descriptions. It is a rather complex multidimensional classification built on the functional and sectoral principle: the same technical concepts can be found in the IPC either in special classes (by industry) or in functional classes (by principle of operation). The sectoral principle of distributing concepts classifies objects according to their application in a particular historically established branch of engineering and technology.

Comparative characteristics of the GRNTI rubricator, UDC, LBC and IPC are shown in Table 1.

Table 1
Characteristics of the GRNTI rubricator, UDC, LBC and IPC

Name | Structure | Principle of arrangement of divisions | Partitioning scheme
(name not preserved) | Hierarchical | Industry | From general to specific
(name not preserved) | Hierarchical | Thematic | (not preserved)
IPC | Hierarchical | Functional and industry | From general to specific
LBC for scientific libraries | Hierarchical | Industry | From general to specific, by species


Thus, the main distinctive features of rubricators and classifiers can be identified:
 they are applied in nature and have a sectoral focus;
 they are open systems that depend on the development of science and technology and on the needs and requests of specialists;
 they are inorganic systems: objects arise and develop in the environment and enter the system from it, and elements are able to exist independently outside the system. This feature is closely related to the second one;
 the minimum element is a concept related to the environment, and a concept is itself a system of definitions;
 concepts are connected both vertically (genus-species, whole-part) and horizontally (species-species, part-part), which indicates the hierarchical nature of these systems.

Consequently, the structure and principles of organizing classifications and rubrics make it possible to automate the process of constructing subject area thesauri using the deduction method. The algorithm for constructing a thesaurus using the deduction method is shown in Fig. 1.

The basis for the formation of the thesaurus is the search image of a document, a task or an application for information search, filled in by the operator. Therefore, the first step is to research and analyze the application. At the first stage, the operator indicates the topic or problem of interest, possible keywords and their synonyms. As a result, we get a superficial understanding of the subject area.

Fig. 1. Algorithm for constructing a thesaurus using the deduction method

In addition, a thesaurus of keywords (KS) is formed using the deduction method, which requires:
 the KS array set by the user, designated MP in Fig. 1;
 the KS array extracted from the search task, designated MZ.

However, for a more complete and in-depth understanding of the subject area, the existing rubricators and classification schemes (GRNTI, UDC, LBC, IPC) are used. To maximize coverage of the subject area, all available ones must be viewed. The array of rubricators is designated MR. The deduction search algorithm consists of two steps:
1. Finding generic concepts (Fig. 2);
2. Finding specific terms within generic concepts (Fig. 3).


Fig. 2. Processing a generic concept

The first rubricator is loaded from the array, and a loop checks whether the KS entered by the user occur in the rubricator. Each KS is looked up in the list of headings and compared with a generic concept, or "nest", and then a condition is checked: does the concept refer to specific terms? If it does, the KS is compared with those specific terms; if no references are found, the algorithm moves on to the next generic concept. Once the KS entered by the operator have been processed, the algorithm turns to the array of KS extracted from the task. The verification procedure is the same: it looks for KS that correspond to generic concepts and then follows their references to specific terms.


Fig. 3. Processing of specific terms

Note that within each generic concept it is important to review all available specific terms in order to get the fullest understanding of the problem area. The result is an array of keywords (KS) that forms a complete thesaurus corresponding to the information search task or to the search image of a document.
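The two-step deduction search described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data model (a rubricator as a mapping from a generic concept to its list of specific terms), the case-insensitive exact matching, and the function name `build_thesaurus` are hypothetical and are not taken from the source; only the array names MP, MZ and MR follow the text.

```python
from typing import Dict, List

# Hypothetical data model: a rubricator maps each generic concept ("nest")
# to the list of specific terms it refers to (possibly empty).
Rubricator = Dict[str, List[str]]


def build_thesaurus(mp: List[str], mz: List[str], mr: List[Rubricator]) -> List[str]:
    """Expand keywords from MP (user) and MZ (search task) using rubricators MR."""
    thesaurus: List[str] = []

    def add(term: str) -> None:
        # Keep first-seen order and avoid duplicates (case-insensitive).
        if term.lower() not in (t.lower() for t in thesaurus):
            thesaurus.append(term)

    for rubricator in mr:                      # view every available rubricator
        for keyword in list(mp) + list(mz):    # user keywords first, then task keywords
            key = keyword.lower()
            for generic, specifics in rubricator.items():
                hit_generic = key == generic.lower()
                hit_specific = key in (s.lower() for s in specifics)
                if hit_generic or hit_specific:
                    add(generic)
                    for s in specifics:        # review all specific terms of the nest
                        add(s)
    for keyword in list(mp) + list(mz):        # original keywords are always kept
        add(keyword)
    return thesaurus
```

A real system would probably need stemming or fuzzy matching rather than exact comparison; the exact match here is chosen only to keep the sketch short.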

On the basis of a complete set of search images of documents (denote), it is possible to create branch thesauri and a unified classifier of the library. Obviously, the complete set  itself represents the simplest thesaurus.

However, using the selection criterion
, (1)
we can build industry-specific thesauri. At the same time, the set of all industry-specific thesauri forms a complete thesaurus
, (2)
whose sections can be hierarchically structured in accordance with the requirements of the GOST standards according to the main classifiers (GRNTI, UDC, LBC, IPC) or according to a single internal classifier.

Automation of the process of building a thesaurus and classification makes it possible to maximally facilitate the work of an operator working with distributed information resources.

In addition to constructing a thesaurus, based on a document search image, the proposed approach can be used for automatic document summarization and text clustering.

Abstracting of documents is one of the tasks aimed at providing experts with the reliable information they need to make a managerial decision about the value of documents received from the Internet. Abstracting is the process of transforming documentary information that ends with the preparation of an abstract. An abstract is a semantically adequate presentation of the main content of the primary document; it is characterized by economical notation and by constant linguistic and structural characteristics, and it is designed to perform a variety of information and communication functions in the system of scientific communication. The algorithm for summarizing documents is shown in Fig. 4.


Fig. 4. Algorithm for summarizing documents

In general, the algorithm includes the following main stages.
1. Sentences are extracted from the document (downloaded from the Internet and held in the data storage) by splitting at punctuation marks, and are saved in an array.
2. Each sentence is split into words at separator characters, and the words are saved in an array, with a separate array for each sentence.
3. For each word of each sentence, the number of occurrences of that word in the other sentences (those before and after it) is counted. The sum of these counts over all words of the sentence is the weight of the sentence.
4. The specified number of sentences with the highest weights are selected for the abstract, in the order in which they appear in the text.
The proposed model for constructing a thesaurus and thematic catalogs of an information system is a theoretical basis for automating semantic search. It allows an expert not only to carry out search work but also to abstract, in an automated mode, the documents obtained as a result of a search in distributed information systems on the Internet.


Computational Technologies

Volume 12, Special Issue 2, 2007

TECHNOLOGY FOR CREATING A THESAURUS OF A SUBJECT AREA BASED ON THE SUBJECT INDEX OF AN ENCYCLOPEDIA

V. B. Barakhnin

Institute of Computational Technologies SB RAS, Novosibirsk, Russia

V. A. Nekhaeva

Novosibirsk State University, Russia

This work describes a technology for creating a subject-domain thesaurus based on the subject index of a specialized encyclopedia. The technology provides a high-quality description of the subject domain using reliable terms, which makes it possible to build the first stage of a thesaurus with minimal involvement of experts in the particular field of knowledge. The proposed technology also includes a thesaurus-building algorithm and a web-based application implementing this algorithm.

Introduction

One of the most important factors ensuring the successful implementation of integration research projects is effective scientific and information support. In particular, the joint work of researchers from several (and not always related) specialties requires careful coordination of the terminology used, because the same concept can be denoted by different terms in different fields of science, and the same term can denote different concepts.

Another task of information support for projects is the creation of an integrated card index of bibliographic descriptions of documents (articles, books, etc.) on the subject of the project, compiled by combining the resources of the collaborating researchers, each of whom has accumulated a card index on a particular topic over the years (at present, such card indexes are stored, as a rule, on electronic media). To facilitate searching the card index, it is desirable that the keywords characterizing the documents be selected, whenever possible, from a single dictionary. For the automatic classification of documents that are included in the card index, or that may be entered into it from electronic databases of scientific publications such as databases of abstract journals, "Current Contents", etc., it seems appropriate to use the coordinate indexing algorithm. This algorithm takes into account the classification features of the terms (words and phrases) in the text that characterize a particular subject area.

© Institute of Computational Technologies, Siberian Branch of the Russian Academy of Sciences, 2007.

The tasks listed above cannot be solved without creating a dictionary of terms of the subject area in which the connections between terms are established and the terms are classified. Such a dictionary is called a thesaurus (see details in). A thesaurus (or normative thesaurus) is a reference dictionary containing all the lexical units of an information retrieval language, namely the descriptors (together with the keywords that are treated as synonyms of these descriptors within the given information retrieval system); the descriptors in the dictionary must be systematized by meaning, and the semantic connections between them must be expressed explicitly.

However, compiling a thesaurus from scratch may require a very significant amount of work from experts, who must collect all the terms that cover the subject area sufficiently, agree on their meanings, establish links and carry out a classification. Such difficulties, arising in an important but still auxiliary task, negatively affect the prospects for solving it.

We have developed and implemented a technology for creating a thesaurus based on the subject index of specialized encyclopedias. This technology provides a highly qualified description of the subject area using reliably verified terms and allows the first stage of building a thesaurus to be carried out with minimal involvement of experts in the given subject area. A detailed presentation and justification of the algorithm are given in the work. Below is a brief description of the algorithm and of the web application that implements it.

1. Algorithm for creating a thesaurus

It is proposed to use the subject index of a specialized encyclopedia (or several encyclopedias) as a list of keywords and phrases for the thesaurus. The choice of a particular encyclopedia is made by a subject matter specialist and depends on the goals pursued in creating the thesaurus. For example, to solve complex environmental problems, it is advisable to use encyclopedias (or, in their absence, encyclopedic dictionaries) of physics, chemistry, geology, biology, medicine, mathematics, etc., at least as a basic list of keywords to be replenished as necessary.

Subject indexes of most of the encyclopedias are structured in a similar way - they contain terms that are the names of the encyclopedia's articles, terms, the definitions of which are given in the articles, as well as the most important results mentioned in the articles.

The names of the encyclopedia's articles are taken as descriptors (i.e., terms that name classes of similar concepts), and the words from the subject index that occur in the corresponding articles are taken as the terms associated with them. The main advantage of this method is that one does not need to be an expert in the subject area to establish the types of relationships between terms: general knowledge sufficient to understand the text of the encyclopedia is enough, and any more specific information required when classifying concepts can always be gleaned from the relevant article.

Since the thesaurus is designed to work over the Z39.50 protocol, the types of links are established in accordance with the recommendations of the Zthes schema, which distinguishes the following types:

BT - a link to a parent term, that is, a term with a broader meaning;

NT - a link to a child term, that is, a term with a narrower meaning. The BT and NT relationships are reciprocal;

USE - a link to the term that is used instead;

UF - the reciprocal of USE;

RT - a link to a related term;

LE - a relationship between linguistically equivalent terms;

FE - a relationship between fully equivalent terms.
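The reciprocity of the link types listed above can be made concrete with a small in-memory structure. This is a sketch only: the class `Thesaurus`, its methods, and the choice to treat RT, LE and FE as symmetric are assumptions made for illustration and do not reproduce the Zthes record format or the authors' implementation.

```python
from collections import defaultdict

# Reciprocal of each link type as described above: BT/NT and USE/UF are
# mutual inverses; RT, LE and FE are treated here as symmetric (an assumption).
INVERSE = {"BT": "NT", "NT": "BT", "USE": "UF", "UF": "USE",
           "RT": "RT", "LE": "LE", "FE": "FE"}


class Thesaurus:
    def __init__(self):
        # links[term][relation] -> set of related terms
        self.links = defaultdict(lambda: defaultdict(set))

    def add_link(self, term: str, relation: str, other: str) -> None:
        """Record `term --relation--> other` and the reciprocal link."""
        if relation not in INVERSE:
            raise ValueError(f"unknown relation type: {relation}")
        self.links[term][relation].add(other)
        self.links[other][INVERSE[relation]].add(term)

    def related(self, term: str, relation: str) -> set:
        # Use .get to avoid creating empty entries on lookup.
        return set(self.links.get(term, {}).get(relation, set()))
```

Storing the reciprocal link at insertion time means that an expert only has to enter each relationship once, from either side.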

Further, the classification of descriptors is carried out in accordance with the sections of this subject area. The choice of a specific classifier, as well as the choice of an encyclopedia, is carried out by an expert, and in the case of using several encyclopedias from different subject areas, it is possible to use several specialized classifiers. Links of the form NT, RT, LE (FE) are established between descriptors and sections of the classifier, while the classification should use, if possible, sections of the lowest possible level.

After that, keywords associated with the descriptor by the relations BT, USE, RT, LE and FE are assigned the same classification number as the descriptor. However, if the descriptor has been assigned to a class that is not of the lowest level, then in the expert's subsequent work the terms associated with the descriptor by BT and USE relations may be assigned to a lower-level class. In this case, these terms themselves become descriptors.

As a result, all terms included in the subject index are classified in accordance with the sections of this subject area.
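The propagation of classification numbers from descriptors to their associated keywords might look like the sketch below. The triple layout `(keyword, relation, descriptor)`, the function name `propagate_classes` and the class codes used in the usage example are hypothetical placeholders, not values from any real classifier.

```python
from typing import Dict, List, Tuple

# Relation types whose keywords inherit the descriptor's class, as listed above.
INHERITING = {"BT", "USE", "RT", "LE", "FE"}


def propagate_classes(
    descriptor_class: Dict[str, str],
    links: List[Tuple[str, str, str]],
) -> Dict[str, str]:
    """links holds (keyword, relation, descriptor) triples."""
    result = dict(descriptor_class)
    for keyword, relation, descriptor in links:
        if relation in INHERITING and descriptor in descriptor_class:
            # Do not overwrite a class an expert has already set explicitly.
            result.setdefault(keyword, descriptor_class[descriptor])
    return result
```

For example, with `descriptor_class = {"Abelian group": "C1"}` and a link `("Divisible Abelian group", "BT", "Abelian group")`, the keyword receives the code `"C1"`. A keyword that an expert later moves to a lower-level class appears in `descriptor_class` itself and therefore keeps its own code.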

2. Description of the web application operation

Nevertheless, the process of constructing a thesaurus in accordance with this methodology involves a large amount of routine work and, in addition, requires the participation of a person with programming skills. Therefore, in addition to the methodology, a web application was developed that has a user-friendly interface and supports the following functions:

1) automatic transfer of information from digitized pages of the subject index into a database table;

2) highlighting descriptors in the general list of terms;

3) search for terms associated with a given descriptor and setting the types of links in accordance with the Zthes schema.

It is important to note that programming skills are not required to complete all of the above operations.

The developed application is universal, i.e. it can be used to create thesauri for various subject areas. At present, a programmer adapts the program when moving from the subject index of one encyclopedia to that of another (and this is the only stage at which the processes of constructing thesauri for different subject areas can differ); however, work is underway to add functions that will let a user without programming skills carry out this operation.

The application functions as follows. Digitized index pages are processed automatically. The user specifies the location of the text file with the data, after which it is read line by line and the terms themselves are entered into the database, as well as information about the numbers of the pages of the encyclopedia where they are located (Fig. 1).
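The line-by-line loading step could be sketched as follows. The source does not give the file layout or the database schema, so the line format `<term> | <volume>, <first page>[-<last page>]`, the table name `terms` and the use of SQLite are all assumptions made for this illustration.

```python
import re
import sqlite3
from typing import Iterable

# Hypothetical line format: "<term> | <volume>, <first page>[-<last page>]"
LINE_RE = re.compile(
    r"^(?P<term>.+?)\s*\|\s*(?P<vol>\d+),\s*(?P<first>\d+)(?:-(?P<last>\d+))?\s*$"
)


def load_index(lines: Iterable[str], db: sqlite3.Connection) -> int:
    """Read index lines and store terms with their page ranges; return rows added."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS terms "
        "(term TEXT, volume INTEGER, first_page INTEGER, last_page INTEGER)"
    )
    added = 0
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not fit the assumed format
        first = int(m["first"])
        last = int(m["last"]) if m["last"] else first
        db.execute(
            "INSERT INTO terms VALUES (?, ?, ?, ?)",
            (m["term"], int(m["vol"]), first, last),
        )
        added += 1
    db.commit()
    return added
```

In practice the function would be called with an open text file, for example `load_index(open(path, encoding="utf-8"), db)`, where `path` is the location specified by the user.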

Descriptors are selected from the general list of keywords by the user, who marks the required terms in the list displayed on the screen. The web application also supports correcting possible errors (Fig. 2). Recall that all terms found in the encyclopedia article devoted to a descriptor are considered to be associated with that descriptor.

To facilitate the search for related terms, the user is shown only the keywords located on the same page as the chosen descriptor (this is why not only the terms but also their page numbers were entered into the database). Of course, since an article may not occupy the whole page, the list will include some unnecessary terms. When establishing links, the user selects only some of the keywords from the proposed list; nevertheless, this automation significantly reduces the amount of routine work (Fig. 3).

Fig. 1. Entry of text files with terms from the index

Fig. 2. List of keywords and highlighting of descriptors (screenshot of the web application showing a numbered list of index terms, from "Abacus" to "Abelian group, periodic part", with volume and page numbers)

Fig. 3. Choice of related terms

Fig. 4. Establishing the types of connections

The type of relationship between the descriptor and the keyword is specified by filling out the appropriate form (Fig. 4).
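The page-based narrowing of candidate terms described above can be sketched as a simple filter. The in-memory index layout (term mapped to volume, first page, last page) and the function name `candidate_terms` are hypothetical; treating "the same page" as any overlap of page ranges within one volume is also an assumption.

```python
from typing import Dict, List, Tuple

# term -> (volume, first page, last page); a hypothetical in-memory index
Index = Dict[str, Tuple[int, int, int]]


def candidate_terms(index: Index, descriptor: str) -> List[str]:
    """Terms whose page range overlaps the descriptor's pages in the same volume."""
    vol, first, last = index[descriptor]
    return sorted(
        term
        for term, (v, f, l) in index.items()
        if term != descriptor and v == vol and f <= last and l >= first
    )
```

The list returned this way is only a set of candidates: as noted above, the user still decides which of them are actually related to the descriptor and with which link type.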

Conclusion

The efficiency of the algorithm and the web application was tested by creating a thesaurus for a number of sections of the "Mathematics" subject area ("Differential Equations", "Partial Differential Equations", "Numerical Analysis", "Fluid Mechanics", etc.) on the basis of the subject index of the "Encyclopedia of Mathematics". It was established that a bachelor's qualification is sufficient for classifying the terms and establishing links between them (provided that, in rare cases, an expert with an academic degree is consulted). This demonstrates the high efficiency of the developed algorithm.

Bibliography

Mikhailov A.I., Chernyi A.I., Gilyarevsky R.S. Fundamentals of Informatics. Moscow: Nauka, 1968.

Barakhnin V.B. Development of the thesaurus of the subject area "Mathematics" // Proc. conf. "Computing and Information Technology in Science, Technology and Education". Part 1. Novosibirsk; Almaty; Ust-Kamenogorsk, 2003. P. 111-115.

Zthes: a Z39.50 Profile for Thesaurus Navigation. http://lcweb.loc.gov/z3950/agency/profiles/zthes-04.html

3.1. Thesaurus concept

Thesaurus (from the Greek θησαυρός - treasure, reserve), or ideographic dictionary (from the Greek idea - concept, representation, idea, and grapho - I write, describe), in modern linguistics means: 1) a special kind of dictionary of general or special vocabulary that indicates the semantic relations between lexical items; 2) a dictionary for finding a word by its semantic connection with other words; 3) a certain way of organizing (arranging) words in a dictionary; 4) a way of organizing the lexical composition that makes it possible to economically “model the world”.

In its first, original meaning (repository, treasury), the term thesaurus was used by L.V. Shcherba in the article "The Experience of General Lexicography" (the third opposition: thesaurus versus ordinary (explanatory or translation) dictionary). The scientist writes: “When people say thesaurus nowadays, they most often mean the "Thesaurus linguae latinae", an undertaking of five German academies, begun in 1900 and so far brought, with gaps, only as far as the letter M. The characteristic feature of this type of dictionary is that it contains absolutely all the words that have occurred in a given language at least once, and that under each word absolutely all the quotations from the texts available in that language are given. The above opposition (thesaurus versus ordinary explanatory or translation dictionary) is based on the opposition of "linguistic material" and "linguistic system", concepts that I tried to substantiate in my article "On the threefold aspect of linguistic phenomena and on experiment in linguistics".”

The second meaning of this term is associated with the widely known thesaurus dictionary "Thesaurus of English Words and Phrases" by P.M. Roget (Roget's Thesaurus of English Words and Phrases, 1852) and its continuation, the dictionary of O.V. Baranov.

In this interpretation, the term thesaurus denotes a certain way of organizing, placing the lexical composition in the dictionary (see the third meaning of the term).

The fourth meaning of the term thesaurus is associated with the general recognition of such a way of organizing the lexical composition, which makes it possible to economically “model the world”. From this point of view, a thesaurus-dictionary is "a systematic ordering of the vocabulary of any scientific or technical field, and in the most general form - general literary vocabulary, and moreover, the entire vocabulary of a given language."

According to Yu.N. Karaulov, a general-language thesaurus, by fixing in the structure and relationships of its headings, sections, zones and areas the wide possibilities of non-verbal connection of ideas, ensures that human values are taken into account.

A.N. Baranov and D.O. Dobrovolsky, in the preface "From the Editors" to their "Dictionary-thesaurus of modern Russian idioms", define the thesaurus as a special kind of dictionary that differs from others (in particular, explanatory, bilingual, etc.) in the way the linguistic material is organized. In a thesaurus, language units are not presented in alphabetical order, as in a regular dictionary, but are grouped by meaning.

L.P. Krysin calls the thesaurus (ideographic dictionary) an explanatory dictionary of a special kind, a dictionary "in reverse". “If in the explanatory dictionary,” the scientist writes, “the 'entry point' to a dictionary entry is the word, and the content of the entry is the interpretation of the meaning of this word, then in the ideographic dictionary the 'entry point' is the meaning, the idea (hence the name of this type of dictionary - ideographic), and the content of a dictionary entry is a list of words expressing the given meaning. And if the explanatory dictionary is an indispensable tool for understanding a text, then the ideographic dictionary can be used when generating a text: very often a person wants to express a certain thought but cannot find suitable words for it; an ideographic dictionary makes this search easier.”

There are two main types of thesauri:

linguistic thesaurus - a dictionary containing a list of natural language words selected as a result of meaningful analysis of texts and systematized in accordance with the adopted classification system;

statistical thesaurus is an information retrieval dictionary containing a list of words selected as a result of statistical analysis of texts on a specific topic and grouped into dictionary entries based on the frequency of joint occurrence of these words in the same texts.
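The grouping principle of a statistical thesaurus can be illustrated with a minimal sketch. It assumes that "joint occurrence" simply means that two words appear in the same text, and that a pair belongs in one entry once it has been seen together in at least `min_count` texts. The function names, the threshold and the sample texts are all hypothetical and serve only as an illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(texts):
    """For every unordered word pair, count the texts that contain both words."""
    counts = Counter()
    for text in texts:
        # Each word is counted once per text; sorting makes pair keys canonical.
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts


def related_words(counts, word, min_count=2):
    """Return the words that co-occur with `word` in at least `min_count` texts."""
    related = []
    for (a, b), n in counts.items():
        if n >= min_count and word in (a, b):
            related.append(b if a == word else a)
    return sorted(related)


# Invented sample "corpus" on a single topic (assumption, not real data).
texts = [
    "rope carabiner harness",
    "rope carabiner tent",
    "tent stove sleeping",
    "rope harness",
]
counts = cooccurrence_counts(texts)
entry = related_words(counts, "rope")
print(entry)  # ['carabiner', 'harness']
```

A real statistical thesaurus would normally use a weighted association measure rather than a raw count, but the structure of the result is the same: a head word with the words that regularly accompany it.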

Information retrieval thesauri (IRT) facilitate the search for information during its automatic processing. An IRT reveals the semantic relations between lexical units as fully as possible. As stated in the State Standard (GOST) for IRT, "a monolingual information retrieval thesaurus is a controlled and changing dictionary of lexical units based on the vocabulary of one natural language, displaying semantic relations between lexical units and intended for information processing and retrieval."

The basic units of an IRT are descriptor terms. The alphabetic, lexico-semantic part of the IRT is a collection of descriptor entries.

Descriptive dictionaries are intended to describe the vocabulary of a certain area in full and to record all its uses; they register every relevant case available. A typical example of a descriptive dictionary is the Explanatory Dictionary of the Living Great Russian Language by V.I. Dahl (the first edition in four volumes was published in 1863-1866). The goal of its creator was not to standardize the language, but to describe in full the entire variety of Great Russian speech, including its dialectal and vernacular forms.

Each descriptor entry begins with the head descriptor. Below it, within the entry, the GOST prescribes listing the synonyms of this descriptor as well as other lexical units linked to the head descriptor by generic or associative relations.
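A descriptor entry of this kind can be modelled as a small record. The sketch below is only an illustration: the field names (`synonyms`, `broader`, `narrower`, `related`) and the output labels are assumptions chosen for readability and do not reproduce the notation prescribed by the standard.

```python
from dataclasses import dataclass, field


@dataclass
class DescriptorEntry:
    """One thesaurus entry: a head descriptor plus its semantic links."""
    descriptor: str
    synonyms: list = field(default_factory=list)   # equivalence relation
    broader: list = field(default_factory=list)    # generic relation, upward
    narrower: list = field(default_factory=list)   # generic relation, downward
    related: list = field(default_factory=list)    # associative relation

    def render(self):
        """Print the head descriptor first, then its links grouped by relation."""
        lines = [self.descriptor.upper()]
        groups = (
            ("syn", self.synonyms),
            ("broader", self.broader),
            ("narrower", self.narrower),
            ("related", self.related),
        )
        for label, items in groups:
            for item in sorted(items):
                lines.append(f"  {label}: {item}")
        return "\n".join(lines)


# Invented sample entry (assumption, not taken from any real thesaurus).
entry = DescriptorEntry(
    "tent",
    synonyms=["shelter"],
    broader=["camping equipment"],
    related=["sleeping bag"],
)
text = entry.render()
print(text)
```

Keeping each relation type in its own field makes it easy to check an entry for missing links before it is added to the alphabetical part of the thesaurus.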

Thus, thesauri, especially in electronic format, are one of the most effective tools for describing individual subject areas.

A pure thesaurus is rare. In real thesauri, the initial idea is either simplified or supplemented with additional information that is potentially necessary for the user. The best known today are the "Russian Semantic Dictionary" by Yu.N. Karaulov, the dictionary of the same name by N.Yu. Shvedova, the "Thematic Dictionary of the Russian Language" by L.G. Sayakhova and others.

Summary. L.V. Shcherba used the term thesaurus for a dictionary that records, as far as possible, all the contexts in which a given word occurs. A characteristic feature of thesauri is that they contain all the words that have occurred in a given language at least once, and under each word all the quotations from the texts available in the given language. The content of the thesaurus dictionary is linguistic material, and the content of an ordinary dictionary is linguistic material and linguistic system (terms of L.V. Shcherba).

This characteristic is complemented by cross-links of all sorts, most often paradigmatic ones (synonymic or antonymic), which indicate the commonality or opposition of meanings. In addition, various associative (i.e. syntagmatic) links are indicated.

Thus, the task of the thesaurus (ideographic dictionary) is to give an idea of the semantic organization of a certain slice of linguistic material, showing the main semantic fields, their internal structure and external connections. The thesaurus is a clear demonstration of the systemic nature of the language, allowing you to see the many types of relationships that connect individual linguistic units and groups of units.

3.2. The history of the presentation of conceptual knowledge about the world in the form of a thesaurus

The need to arrange words by similarity, contiguity, analogy of their meanings was felt throughout the observable history of human thought.

To trace the origin of the idea of representing conceptual knowledge about the world in the form of a thesaurus, it helps to turn to the history of the compilation of thesauri (ideographic dictionaries).

So, at the dawn of civilization, when people could express their thoughts in writing only with the help of ideograms and symbols, the only possible dictionary was probably one in which words were arranged in thematic groups. It was simply difficult for a lexicographer at that time to find another criterion for the classification of words, except for the relations existing in reality itself.

Unfortunately, we have no evidence of whether the peoples who used ideographic writing really had such dictionaries. Among the most ancient attempts at ideographic classification known to us is the Attikai Lexeis of the Greek grammarian and director of the Library of Alexandria, Aristophanes of Byzantium (died 180 BC).

In the 2nd century AD a major work, the "Onomasticon", appeared, compiled on the material of the Greek language by the lexicographer and sophist Julius Pollux (real name Polydeuces), a native of the Egyptian city of Naucratis. Pollux wrote several works, but only the "Onomasticon" has come down to us (Pollux Y. Onomasticon. M., 1956).


The Onomasticon consists of 10 books. The books are essentially separate treatises, each containing the most important words related to a particular topic. Thus, the first book deals with gods and kings; the second with people, their life and physiological structure; the third with kinship and civil relations, etc. The words in the dictionary are accompanied by brief explanations. In modern times, the dictionary was first published in 1502 in Venice.

Between the 2nd and 3rd centuries AD the remarkable Sanskrit dictionary "Amarakosha" appeared (Amarakosha. Paris, 1839). Its author is the ancient Indian poet, grammarian and lexicographer Amara Sinha, who was called "one of the nine pearls that adorn the Vikramaditya throne." Amarakosha, in translation, means Amara's treasury. The dictionary contains 10 thousand words. To make the interpretations of word meanings easier to memorize, the dictionary entries are composed in verse. All material of the dictionary is divided into 3 books. Each book includes several chapters, and a chapter, in turn, is divided into sections where necessary. The first book is dedicated to the sky, the gods and everything directly related to them. The second book contains words related to the earth, settlements, plants, animals and man (first, man is considered as a living being, and then as a social being; the entire caste structure of the society of that time appears before our eyes: priests, as God's confidants, are at the very top, below them are the military and kings, lower still are the landowners, and at the very bottom are artisans, jugglers, servants, etc.). The third book is properly linguistic, as is evident from the titles of its six chapters.

The dictionary became known to European scholars only at the end of the 18th century, when in 1798 its first part was published in Rome. It was published in full with a translation into English in 1808 by the English Sanskrit scholar H.T. Colebrooke. In 1839, its French translation by A.L. Deslongchamps appeared. The further development of the idea of semantic classification of vocabulary is associated with the problem of the so-called world language.

Summary. This is, in the most general terms, the first stage in the development of the tradition of the ideographic classification of vocabulary. This stage can be called the prehistory of ideographic dictionaries. Now it is advisable to turn to the modern classification of thesaurus dictionaries.

It is easy to see how much the works described differ from alphabetical dictionaries. If in alphabetical dictionaries the presentation of words is regulated by such a conventional and highly neutral instrument as the alphabet, then in the construction of an ideographic dictionary the worldview of the lexicographer himself acquires decisive importance.

3.3. Principles for the classification of thesaurus dictionaries

As already shown above, the problem of classifying thesauri is not new and for several decades has attracted the attention of a number of domestic and foreign linguists (K. Marello, V.V. Morkovkin, L.P. Stupin, V.V. Dubichinsky, etc.). Research in this area has produced alternative classifications of these lexicographic works. One of the latest classifications is based on the following criteria: 1) the type of semantic connections between vocabulary units; 2) the volume of the vocabulary; 3) generalized vocabulary; 4) development of the meaning of lexemes; 5) grammatical and stylistic qualification of lexemes; 6) demonstration of the functioning of lexemes; 7) the number of languages represented; 8) the type of semiotic means used for the semantization of lexemes. This classification builds on the classifications created earlier by O.M. Karpova and I. Burkhanov (Burchanov I. On the Ideographic Description of Stylistically and Pragmatically Relevant Aspects of Lexical Meanings. London, 1996); the terminology used in it was introduced into the lexicographic apparatus by V.V. Morkovkin, Yu.N. Karaulov and K. Marello. The classification criteria were formulated by O.M. Karpova. At the same time, K. Marello distinguishes three types of thesauri:

cumulative, which are groupings of words without defining their meanings;

definitive, interpreting each lexical unit of word grouping;

bilingual and multilingual thesauri for travelers (Marello C. The Thesaurus // W.D.D. 1990. V. 2. P. 1083).

Cumulative thesauri not only help the user, once inside a certain semantic field, to find a clearer, more accurate or stylistically more appropriate word, but also serve as the basis for building thematic computer data banks.

Definitive thesauri can include, along with the definition of meaning, etymological information and citations from literary works, which shows the direct encyclopedic orientation of this type of thesauri. In addition, dictionaries of this type introduce the user to the necessary system of concepts, explain the essence, similarities and differences of concepts, their paradigmatic and syntagmatic connections, sometimes provide information about the pronunciation, grammatical, word-formation and other possibilities of lexical units denoting these concepts.

Bilingual and multilingual thesauri for travelers are usually organized by thematic sections (numbers, food, transport, hotel, etc.) and give the equivalents in two or more languages.

For the most complete display of the types of existing thesaurus dictionaries, a multilevel classification is created. First, according to the type of semantic connections between units of the vocabulary, thesauri are divided into three large classes:

1. Associative thesaurus (terminology of Yu.N. Karaulov).

2. Analogous thesaurus (terminology of V.V. Morkovkin).

3. Ideographic (ideological) thesaurus (terminology of L.V. Shcherba and V.V. Morkovkin).

These three types of thesaurus reflect, respectively, the following types of semantic connections between lexemes:

1. Semantic-syntactic relations, on the basis of which words are combined into groups or pairs whose origin and existence are determined by a twofold link: semantic and syntactic. Connections of this kind are established mainly between verbs and adjectives that perform a predicative function in a sentence, and nouns, for example:

a) between the action and the organ (instrument) with the help of which it is performed: grab - hand, see - eye, swim - boat, etc .;

b) between verbs of action that require a particular subject and that subject: bark - dog, neigh - horse, etc.;

c) between verbs and a particular grammatical object that they require: chop wood, eat food, etc.

Hence, an associative thesaurus is a dictionary-thesaurus that organizes lexical units on the basis of semantic and syntactic connections existing between them and arranges groups in accordance with the graphic form of word centers.

2. Lexical and semantic connections. Unification into groups with this type of connection occurs according to the main attribute for words - lexical meaning. This also takes into account the lexico-grammatical connections, in the form of which the individual meanings of words are realized.

Thus, an analogous thesaurus is a lexicographic reference book, the basic unit of the macrostructure of which is the lexical-semantic group; the groups are systematized in alphabetical order of semantic dominants.

3. Subject or thematic connections, where words are combined into one group because the objects and processes they denote are similar or share a function: household objects, body parts, types of clothing, buildings, etc.

Thus, an ideographic thesaurus is a lexicographic work that represents lexical units within subject (thematic) groups and organizes them into a hierarchical structure designed to represent conceptualized knowledge about the world.

Within the framework of the same criterion, we carry out a further subdivision of types. So, the ideographic thesaurus is represented by the following 4 types:


An ideographic thesaurus itself.

Thematic dictionary.

Systematic dictionary.

Thematic-systematic dictionary.


The ideographic thesaurus itself is a special type of ideographic dictionary whose macrostructure is organized in accordance with an a priori synoptic scheme superimposed on the lexical composition of the language. Unlike other types of ideographic dictionary, the ideographic thesaurus itself is characterized by a logical and strictly ordered classification structure based on scientific taxonomy, even if it is the general vocabulary that is subject to lexicographic description (New Webster's Thesaurus. Landoll, 1991).

A thematic dictionary is a special type of ideographic thesaurus, the main unit of the macrostructure of which is a thematic group, which includes lexemes combined on the basis of the classification of their denotations (referents) and considered from the point of view of their relevance to a specific topic.

A systematic dictionary is a special type of ideographic thesaurus, the classification structure of which is intended to represent the actual semantic relations that exist between the lexical units of the language. At its core, the classification structure represents the lexical and grammatical classification of the vocabulary, in other words, its paradigmatic structure, described in terms of subordination and composition.

A thematic-systematic dictionary is a special type of ideographic dictionary, which is a combination of a thematic and systematic dictionary.

Summary. The classification of linguistic thesauri considered here includes the following types of dictionaries: analogous thesaurus (terminology of V.V. Morkovkin); ideographic (ideological) thesaurus (terminology of L.V. Shcherba and V.V. Morkovkin); associative thesaurus (terminology of Yu.N. Karaulov). Popular thesauri are presented next and their features are described.

3.4. Popular thesauri and their features

The most famous of the available thesaurus dictionaries, to which the term itself owes its existence, was created on the basis of the English language; it is the constantly reprinted thesaurus by P.M. Roget, Roget's Thesaurus of English Words and Phrases (1852).

It is important to note that the author of the Thesaurus of English Words and Phrases made full use of the experience available at that time. “The principle that I followed when classifying words,” writes P.M. Roget, “is the same one that is used to classify individuals in various areas of natural history. Therefore, the sections highlighted by me correspond to the natural families of botany and zoology, and the rows of words are cemented by the same relationships that unite the natural rows of plants and animals.”

P.M. Roget believed that a convincing classification of words according to their meanings is impossible until the objects of reality named by these words are properly studied and organized. Therefore, he begins his work by dividing the conceptual field of the English language into four large classes: abstract relations, space, matter and spirit (mind, will, feelings). These classes are further subdivided into a number of genera, which in turn are subdivided into a certain number of species.

Scholars list the following among the shortcomings of P.M. Roget's ideographic dictionary: 1) a not entirely convincing nomenclature of basic conceptual classes; 2) the predominance of abstract consistency over the natural connections of words; 3) relative inconvenience of use (to a large extent, this shortcoming is corrected in subsequent editions).

In modern Russian lexicography there are several dictionaries that should be classified as thesaurus dictionaries (ideographic dictionaries). These include, for example, the "Russian Semantic Dictionary" created under the leadership of Yu.N. Karaulov; the "Russian Semantic Dictionary" edited by N.Yu. Shvedova; the "Thematic Dictionary of the Russian Language" by L.G. Sayakhova, D.M. Khasanova and V.V. Morkovkin; the "Dictionary of lexical-semantic groups of Russian verbs", ed. E.V. Kuznetsova; the "Ideographic Dictionary of the Russian Language" by O.S. Baranov; "The concept of the inner world of man in the Russian language" by V.I. Ubiyko; and the comprehensive educational dictionary "Lexical basis of the Russian language" under the guidance of V.V. Morkovkin.

Let's get acquainted with some of them.

The "Dictionary-thesaurus of modern Russian idioms", edited by A.N. Baranov and D.O. Dobrovolsky, includes four main parts: 1) the Synopsis; 2) the Legend; 3) the Main Corpus of the Thesaurus Dictionary; 4) the Indexes. The purpose of the Synopsis is to provide an overview of the structure of the Main Corpus. It lists all taxa with their subtaxa and the corresponding paradigmatic references. The Main Corpus is a collection of dictionary entries combined into groups (taxa) and subgroups (subtaxa) in accordance with the meaning of the idioms described in them. Each entry contains an idiom and examples of its use in modern Russian. The Synopsis, the Legend and the Indexes are service parts of the Dictionary-thesaurus that allow the user to work quickly and efficiently. The Legend is used when examples of the use of idioms are not needed, because it reproduces all information except the examples. In fact, it is the word list of the Dictionary. The units of the word list are lemmas. A lemma in this case is an idiom in its original (dictionary) form and includes, as far as possible, all of its essential variants. For example, the idiom "to stand still" is part of the lemma "to mark time, stand still, slip in place".

The dictionary contains two indexes. At the end of the book there is an article, "The theoretical concept of the Dictionary-thesaurus of modern Russian idiomatics", which analyzes in detail the scientific features of this project.

The "Russian Semantic Dictionary", created under the guidance of Yu.N. Karaulov, includes 10 thousand Russian words, which are divided into 1600 conceptual groups. The selection of groups is based on repeating elements of the interpretation of words in explanatory dictionaries: for example, "action", "property", "instrument", etc.

"Russian Semantic Dictionary", created under the leadership of Academician N.Yu. Shvedova, is based on slightly different principles typical for the compilation of both ideographic and explanatory dictionaries. Firstly, all words of the language are divided here into four classes: 1) indicating units (pronouns), 2) naming (significant words), 3) actually connecting (conjunctions, prepositions, linking verbs), 4) classifying (modal words, particles, interjections). Secondly, within each class, all words are divided into parts of speech. Thirdly, within each part of speech, sets and subsets are identified based on thematic proximity or, conversely, opposition of word meanings.

DUDEN is a book with pictures (drawings) on the left-hand side, covering various subject areas, with every detail numbered, down to the smallest. On the right-hand side, the numbered list is accompanied by the names of the details (sometimes in two languages). For example, a whole page shows railway equipment, stations and tracks; on the right are the names of the switches, semaphores, rail spikes, etc.

The "Thematic dictionary of the Russian language" by L.G. Sayakhova, D.M. Khasanova and V.V. Morkovkin contains 25 thousand lexical units, grouped into three large classes, "Man", "Society" and "Nature", which branch stepwise into smaller subclasses. For example, the class "Man" contains the subclasses "Human body and organism", "Human life", "Appearance of a person", "Emotional appearance of a person", etc. Each of the subclasses, in turn, is divided into still narrower ones: "The emotional world of a person" - "Mental properties of a person" - "Temperament", "Character" - "General character traits", etc. The meaning and use of the words belonging to each class are illustrated by the most common phrases. For example, the word "laughter", which is in the subgroup "expression of feelings, emotions" of the class "Man", is accompanied by such combinations as cheerful laughter, joyful laughter, child's laughter, bursting into laughter, etc.

Summary. One of the most effective tools for describing individual subject areas, especially in electronic format, are thesauri.

The term thesaurus has long been widely used in linguistics to denote a special type of dictionary that reflects, to one degree or another, the "picture of the world", "the linguistic model of the world" (according to Yu.N. Karaulov). The thesaurus as a "treasury" has grown in semantic scope and acquired a new meaning. The term came to denote a dictionary that not only absorbs all the lexical wealth of the language, but also organizes it in a certain logical and systematic way. In the thesaurus dictionary, words are brought together into groups, and this combination occurs on the basis of the ability of a word to convey a certain concept.

The thesaurus dictionary has always been regarded in linguistics as a kind of universal system that ensures the storage of collective (for a particular society) knowledge about the world in verbal form. Unlike other dictionaries, the thesaurus dictionary stores this knowledge in a structured form that reflects our ideas about the "structure of the world."

The most famous and popular thesauri at present are the English Roget's Thesaurus, the Ideographic Dictionary of the Russian Language by O.S. Baranov, the Russian Semantic Dictionary by Yu.N. Karaulov, the Russian Semantic Dictionary of Academician N.Yu. Shvedova, DUDEN, and the Thematic Dictionary of the Russian Language by L.G. Sayakhova, D.M. Khasanova and V.V. Morkovkin.

In accordance with the conclusions of Chapter 1, the thesaurus to whose compilation and study our work is devoted is the ideographic thematic dictionary "Mountain and Hiking Tourism". It will consist of vocabulary from the Russian and Spanish languages.

Thus, in order to compose a thesaurus, it is necessary to solve a number of problems:

Highlight terms that describe the subject area;

Carry out a logical division of terms into semantic groups;

Compare the terms of the Russian and Spanish languages;

Arrange groups alphabetically.

Methods and algorithm for manual thesaurus compilation

An information retrieval thesaurus is a dictionary compiled by hand by an expert linguist, a specialist in the field of building dictionaries and semantic resources. When compiling such a dictionary, the task is to obtain a thesaurus description of one or several subject areas; often there is a corpus of texts that serves as the basis for creating the dictionary. The expert analyzes the text corpus and, guided by the technology of manual thesaurus construction, compiles a list of terms describing the given subject area and includes them in the thesaurus as descriptors. After that, the terms are grouped into concepts, and hierarchical and associative relationships are established between them.

The process of manually creating a thesaurus has such disadvantages as the high cost and long duration of creating the resource, the dependence of the result on the expert's qualifications, the impossibility of manually analyzing the entire corpus of texts, and some others. Obviously, when manually compiling a thesaurus, an expert needs to use existing information retrieval methods and Internet search engines.

First of all, a bilingual thesaurus does not consist of word-by-word translations. Its structure is a chain: a list of Russian lexemes grouped by closeness of meaning - a definition of the concept in Russian - a definition of the concept in the foreign language - a list of textual variants in the foreign language. The lists of lexical units should be as complete as possible on each side, including expressions that are usually absent from dictionaries because they seem obvious to a person.
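The four-part structure of a bilingual thesaurus entry described here can be sketched as a record with one field per link of the chain. The class name, the field names and the sample Russian and Spanish data are invented for illustration and are not taken from the thesaurus being compiled.

```python
from dataclasses import dataclass


@dataclass
class BilingualEntry:
    """One concept: source lexemes, two definitions, target-language variants."""
    source_lexemes: list      # Russian lexemes close in meaning
    source_definition: str    # definition of the concept in Russian
    target_definition: str    # definition of the concept in the foreign language
    target_variants: list     # textual variants in the foreign language


def find_variants(entries, lexeme):
    """Return the foreign-language variants of the concept that contains `lexeme`."""
    for entry in entries:
        if lexeme in entry.source_lexemes:
            return entry.target_variants
    return []


# Invented sample entry (assumption): the concept "backpack".
entries = [
    BilingualEntry(
        source_lexemes=["рюкзак", "заплечный мешок"],
        source_definition="сумка для переноски вещей на спине",
        target_definition="bolsa para llevar cosas a la espalda",
        target_variants=["mochila", "morral"],
    )
]
print(find_variants(entries, "рюкзак"))  # ['mochila', 'morral']
```

Because the lookup goes through the concept rather than through a single headword, every Russian lexeme in the group leads to the same full list of foreign variants, which is the difference from a word-by-word bilingual dictionary.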

When creating traditional bilingual dictionaries, the main goal is to provide a set of the most frequent translations of a word in various texts. Translations are given with a margin, as it were: the list includes exact translations as well as translations with a narrower or broader meaning (this is why Spanish-Russian and Russian-Spanish dictionaries are not reversible). It is assumed that the reader will understand from the context which translation to choose.

The main steps in compiling a thesaurus are as follows:

1) Pre-processing of the text corpus in order to highlight keywords.

2) Formation of a set of words and phrases for inclusion in the thesaurus and the study of relationships between descriptors of the thesaurus. The expert, guided by this set, makes a list of the key concepts of the subject area.

3) Allocation of hierarchical relations between descriptors (in our case - alphabetical order) and their classification (in our study, the classification is based on semantic relations between descriptors).

4) Building a set of associative relationships between descriptors in Russian and Spanish.
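The steps listed above can be sketched in a few lines. This is only an illustration under stated assumptions: keywords are selected by a simple frequency threshold (the text does not specify the pre-processing method), the semantic grouping and the Spanish equivalents are supplied by the expert as ready-made dictionaries, and all names and sample data (in English and Spanish, for readability) are hypothetical.

```python
from collections import Counter

# Hypothetical stop list; a real one would be much longer.
STOPWORDS = {"the", "a", "and", "of", "to", "is", "in", "on"}


def extract_keywords(corpus, min_freq=2):
    """Step 1: keep words that occur at least `min_freq` times in the corpus."""
    counts = Counter(
        word
        for text in corpus
        for word in text.lower().split()
        if word not in STOPWORDS
    )
    return sorted(word for word, n in counts.items() if n >= min_freq)


def build_thesaurus(keywords, group_of, translations):
    """Steps 2-4: group descriptors, order them alphabetically, link equivalents."""
    groups = {}
    for term in keywords:
        group = group_of.get(term, "unclassified")  # expert's semantic grouping
        groups.setdefault(group, []).append((term, translations.get(term)))
    # Groups and the descriptors inside them are arranged alphabetically.
    return {group: sorted(groups[group]) for group in sorted(groups)}


# Invented sample data (assumption).
corpus = [
    "the tent is in the backpack",
    "a rope and a tent",
    "rope on the backpack",
    "the stove",
]
group_of = {"tent": "camp", "backpack": "gear", "rope": "gear"}
translations = {"tent": "tienda de campaña", "rope": "cuerda", "backpack": "mochila"}

keywords = extract_keywords(corpus)
thesaurus = build_thesaurus(keywords, group_of, translations)
print(thesaurus)
```

In manual compilation the grouping step remains the expert's decision; the code only makes the bookkeeping around that decision explicit.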