A few definitions so that we understand each other.

Tom McArthur defines «corpus» (a Latinism in habitual use in the jargon, plural «corpora»)[1] as

  1. A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse.

  2. In linguistics and lexicography, a body of texts, utterances or other specimens considered more or less representative of a language, and usually stored as an electronic database. Currently, computer corpora may store many millions of running words, whose features can be analysed by means of «tagging» (the addition of identifying and classifying tags[2] to words and other formations) and the use of «concordancing programs».

    «Corpus linguistics» studies data in any such corpus.

Corpus markup responds to the need for what is called «text annotation», that is, adding linguistic information at levels such as:

  1. Part-of-speech (POS) tagging

  2. Syntactic annotation (parsed corpora)

  3. Pragmatic annotation

  4. Rhetorical information

  5. Discourse structure
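For level (1), the most common convention attaches a tag to each word form. A minimal sketch (the tagset labels here are illustrative, in the style of the Penn Treebank, not taken from any specific corpus):

```python
# POS-tagged text in the common word/TAG convention (tags are
# illustrative, Penn-Treebank-like; not from a specific corpus).
tagged = "the/DT corpus/NN contains/VBZ many/JJ words/NNS"

# Parse the annotation back into (word, tag) pairs.
pairs = [tuple(token.split("/")) for token in tagged.split()]

print(pairs)
```

A concordancer or frequency counter can then filter or group on the tag half of each pair.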


A «concordance» is an index (usually alphabetical) of the words of a text, in which the word under analysis appears at the centre of a line, surrounded to the right and left by the other words with which it occurs in a given context.

The “Tutorial: Concordances and Corpora”[3] continues:

The most common form of concordance today is the «Keyword-in-Context (KWIC) index», in which each word is centered in a fixed-length field (e.g., 80 characters).
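The KWIC idea is simple enough to sketch in a few lines. The following is an illustrative toy, not a real concordancer: each occurrence of the keyword is centred between fixed-width left and right context fields (20 characters here rather than the 80 of the definition):

```python
# Minimal KWIC sketch: one output line per occurrence of the keyword,
# with the keyword aligned in a fixed column.
def kwic(text, keyword, context=20):
    words = text.split()
    lines = []
    for i, w in enumerate(words):
        if w.lower() == keyword.lower():
            left = " ".join(words[:i])
            right = " ".join(words[i + 1:])
            # Pad/trim the left field so every keyword lands in the same column.
            lines.append(f"{left[-context:]:>{context}}  {w}  {right[:context]}")
    return lines

sample = "a corpus is a body of texts and a corpus is usually stored electronically"
for line in kwic(sample, "corpus"):
    print(line)
```

Because the left field has a fixed width, the keyword column lines up across all occurrences, which is what makes a KWIC index scannable.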

«Concordance programs (concordancers)»[4]:

Concordance programs are basic tools for the corpus linguist. Since most corpora are incredibly large, it is a fruitless enterprise to search a corpus without the help of a computer. Concordance programs turn the electronic texts into databases which can be searched. Usually (1) word queries are always possible, but most programs also offer (2) the possibility of searching for word combinations within a specified range of words and (3) of looking up parts of words (substrings, in particular affixes, for example). If the program is a bit more sophisticated, it might also provide its user with (4) lists of collocates (colocaciones) or (5) frequency lists.
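Features (4) and (5) of that list can also be sketched with nothing but the standard library. This is a toy illustration, not the behaviour of any particular concordance program:

```python
# Sketch of a frequency list and of collocate extraction within a
# +/- window of positions around a node word.
from collections import Counter

def freq_list(words):
    """Token frequency list, most frequent first."""
    return Counter(w.lower() for w in words).most_common()

def collocates(words, node, window=2):
    """Count words co-occurring with `node` within `window` positions."""
    counts = Counter()
    for i, w in enumerate(words):
        if w.lower() == node:
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[words[j].lower()] += 1
    return counts

tokens = "the corpus and the corpus query tools".split()
print(freq_list(tokens))
print(collocates(tokens, "corpus").most_common(2))
```

Real concordancers add substring/affix queries (feature 3) and ranked collocation statistics (mutual information, t-score, etc.) on top of counts like these.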

The following text by Melamed is interesting:

A «bitext» consists of two texts that are mutual translations. A bitext map is a fine-grained description of the correspondence relation between elements of the two halves of a bitext. Finding such a map is the first step to building translation models. It is also the first step in applications like automatic detection of omissions in translations.

Alignments (rendered in the Spanish technical literature as ‘alineaciones’, ‘alineamientos’, ‘emparejamientos’ or ‘correspondencias’) are “watered-down” bitext maps that we can derive from general bitext maps.
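A deliberately naive sketch of the idea, assuming the two halves are already segmented and pairing segments 1:1 by position; a length-ratio check serves as a crude stand-in for the omission detection Melamed mentions (real bitext mapping, e.g. Gale-Church or Melamed's own SIMR, is far more involved):

```python
# Toy 1:1 aligner: pair the i-th segment of each half and flag pairs
# whose character-length ratio looks suspicious (possible omission).
def naive_align(src_segments, tgt_segments, max_ratio=2.0):
    pairs = []
    for s, t in zip(src_segments, tgt_segments):  # assumes equal segment counts
        ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
        pairs.append((s, t, ratio <= max_ratio))  # False may signal an omission
    return pairs

src = ["Hello world.", "How are you?"]
tgt = ["Hola mundo.", "¿Cómo estás?"]
for s, t, ok in naive_align(src, tgt):
    print(s, "<->", t, "OK" if ok else "CHECK")
```

Anything beyond this (crossing correspondences, 1:2 and 2:1 pairings, omitted passages) is exactly what a fine-grained bitext map has to capture and this toy cannot.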

The Final Report of the POINTER project strives, and in my view succeeds, to clarify the terms ‘lexicology’, ‘lexicography’, ‘terminology’ and ‘terminography’. The quotation is long, but I think none of it is wasted.

While lexicology is the study of words in general, terminology is the study of special-language words or terms associated with particular areas of specialist knowledge[5]. Neither lexicology nor terminology is directly concerned with any particular application. Lexicography, however, is the process of making dictionaries, most commonly of general-language words, but occasionally of special-language words (i.e. terms). Most general-purpose dictionaries also contain a number of specialist terms, often embedded within entries together with general-language words. Terminography (or often misleadingly "terminology"), on the other hand, is concerned exclusively with compiling collections of the vocabulary of special languages. The outputs of this work may be known by a number of different names —often used inconsistently— including "terminology", "specialised vocabulary", "glossary", and so on.

Dictionaries are word-based: lexicographical work starts by identifying the different senses of a particular word form. The overall presentation to the user is generally alphabetical, reflecting the word-based working method. Synonyms (different form, same meaning) are therefore usually scattered throughout the dictionary, whereas polysemes (related but different senses) and homonyms (same form, different meaning) are grouped together.

While a few notable attempts have been made to produce conceptually-based general-language dictionaries, or "thesauri", the results of such attempts are bound to vary considerably according to the cultural and chronological context of the author.

By contrast, high-quality terminologies are always in some sense concept-based, reflecting the fact that the terms which they contain map out an area of specialist knowledge in which encyclopaedic information plays a central role. Such areas of knowledge tend to be highly constrained (e.g. "viticulture"; "viniculture"; "gastronomy"; and so on, rather than "food and drink"), and therefore more amenable to a conceptual organisation than is the case with the totality of knowledge covered by general language. The relations between the concepts which the terms represent are the main organising principle of terminographical work, and are usually reflected in the chosen manner of presentation to the user of the terminology. Conceptually-based work is usually presented in the paper medium in a thesaurus-type structure, often mapped out by a system of classification (e.g. UDC) accompanied by an alphabetical index to allow access through the word form as well as the concept. In terminologies, synonyms therefore appear together as representations of the same meaning (i.e. concept), whereas polysemes and homonyms are presented separately in different entries.

Dictionaries of the general language are descriptive in their orientation, arising from the lexicographer's observation of usage. Terminologies may also be descriptive in certain cases (depending on subject field and/or application), but prescription (also: "normalisation" or "standardisation") plays an essential role, particularly in scientific, technical and medical work where safety is a primary consideration. Standardisation is normally understood as the elimination of synonymy and the reduction of polysemy/homonymy, or the coinage of neologisms to reflect the meaning of the term and its relations to other terms.

«Terminology management», itself a neologism, was coined to emphasise the need for a methodology to collect, validate, organise, store, update, exchange and retrieve individual terms or sets of terms for a given discipline. This methodology is put into operation through the use of computer-based information management systems called «Terminology Management Systems» (TMS).
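The concept-based organisation described above can be made concrete with a toy data structure. This is an illustration of the principle only, not the schema of any real TMS; all identifiers and field names here are invented:

```python
# Minimal concept-based termbase sketch: entries are keyed by concept,
# synonyms live together in one entry (unlike a word-based dictionary),
# and a reverse index allows retrieval through any term form.
termbase = {
    "C001": {
        "domain": "viticulture",                      # hypothetical domain label
        "terms": ["vine training", "vine trellising"],  # synonyms grouped
        "definition": "shaping vines on a support system",
    },
}

def term_index(tb):
    """Map each term form back to the concept entry that contains it."""
    return {term: cid for cid, entry in tb.items() for term in entry["terms"]}

index = term_index(termbase)
print(index["vine trellising"])
```

Note how this inverts the dictionary model: synonyms share one entry keyed by concept, while the alphabetical index is a secondary access path, mirroring the thesaurus-plus-index presentation described in the POINTER report.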

Martínez de Sousa writes, sub voce terminología, in the Diccionario de lexicografía práctica:

Today terminology is a well-structured science whose task is to create the lexical catalogues proper to the sciences, technical fields, trades, etc., starting from coherent systems established by national and international bodies.

The SALT project distinguishes between «lexbases» and «termbases»: the former are intended for use in machine translation, the latter as translation-support resources; EAGLES speaks of «termbanks».

EAGLES-I provides the following definition of «translation memory»[6]:

a multilingual text archive containing (segmented, aligned, parsed and classified) multilingual texts, allowing storage and retrieval of aligned multilingual text segments against various search conditions.
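A minimal sketch of that definition, assuming the segments are already aligned: the memory is a list of source/target pairs, and retrieval uses `difflib`'s similarity ratio as a crude stand-in for the "various search conditions" (real TM systems use richer fuzzy matching):

```python
# Toy translation memory: aligned segment pairs plus fuzzy lookup.
import difflib

memory = [
    ("The file could not be opened.", "No se pudo abrir el archivo."),
    ("Save the file before closing.", "Guarde el archivo antes de cerrar."),
]

def lookup(segment, tm, threshold=0.7):
    """Return (source, target, score) matches above the threshold, best first."""
    hits = []
    for src, tgt in tm:
        score = difflib.SequenceMatcher(None, segment.lower(), src.lower()).ratio()
        if score >= threshold:
            hits.append((src, tgt, score))
    return sorted(hits, key=lambda h: -h[2])

print(lookup("The file could not be opened", memory)[0][1])
```

Even this toy shows why segmentation and alignment come first: without stored pairs at segment granularity there is nothing for the fuzzy search to retrieve.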

[1] McArthur, Tom. «Corpus». In: McArthur, Tom (ed.), 1992. The Oxford Companion to the English Language. Oxford, pp. 265-266.

[2] In Spanish these are called etiquetas, marquillas or anotaciones.

[3] By Catherine Ball, of Georgetown University.

[5] Abaitua speaks of «lenguajes de especialidad» (‘special languages’).

[6] The terminological confusion around the concept is evident: people speak of ‘translation databases’ and of ‘catálogos’ (kbabel), ‘compendia’ (gettext), ‘learn buffers’ (gtranslator).