SEL3053: Analyzing Geordie
Lecture 4. Digital electronic corpora

This lecture introduces digital electronic natural language corpora, one of which will be analyzed subsequently. The discussion is in four parts: the first part distinguishes language from language representation, the second sketches the history of language representation technology to the present day, the third shows how language is electronically represented, and the fourth outlines the development and current state of printed and electronic text and text collections.
1. Language and language representation

Language and language represented as text are often confused, and many people aren't even aware of the distinction. There is, however, a fundamental distinction: Language is a genetically determined aspect of human cognition. No one knows when this cognitive faculty developed beyond the communicative capabilities of other animals, but humans have certainly had it for tens of thousands of years.
Representation of language as text is a humanly-invented technology. It works by identifying the phonemic structure of the language of interest and associating each phoneme with a symbol: the English phoneme /c/ is represented by the symbol C, the phoneme /a/ by A, and /t/ by T, thereby permitting the representation of the word /cat/ as CAT in writing or print. Such language representation is referred to as 'alphabetic' to distinguish it from, for example, pictographic systems, which do not represent language but physical reality.
The distinction between language and language representation is easily seen in young children and non-literate adults. Both have language but are incapable of representing it; the ability to do so must be explicitly learned.
2. Outline history of language representation technology

2.1 Mesopotamia

As far as we know, the idea of representing the phonemic structure of language symbolically arose only once, in southern Mesopotamia (present-day Iraq) about 4000 years ago, and all the world's alphabetic writing systems derive from the Mesopotamian one. The symbol system used to represent the early Mesopotamian language, Sumerian, is known as cuneiform, and consisted of marks made by pressing the end of a triangular stylus into a wet clay surface.

[Figure: examples of cuneiform]
2.2 Egypt

Egypt had a well-developed pictographic system known as hieroglyphic, but gradually supplemented and eventually replaced it with an alphabetic system based on the Mesopotamian one.

[Figure: examples of hieroglyphic pictograms]
2.3 The Mediterranean world in Antiquity

The Greeks and, later, the Romans adopted and further developed the originally Mesopotamian alphabetic system; by Roman times the alphabet we currently use had been developed.
2.4 The medieval West

During the Middle Ages the Roman alphabetic system was used.
2.5 The advent of printing

From Mesopotamian times until the fifteenth century, language representation involved a human putting the symbols of the alphabetic system onto some physical surface, be it clay (Mesopotamia), papyrus (Egypt, Greece, Rome), or parchment (European Middle Ages). Then, around 1440, Johannes Gutenberg invented print technology, which allowed for much faster book production. It was based on using individual letters cast in lead, which were assembled into matrices that were then placed into a printing press and inked, thereby leaving an impression of the text matrix on a piece of paper.
Print was the primary language representation technology for the five centuries between the fifteenth and the mid-twentieth century. It has since then been increasingly superseded by electronic language representation.
3. Digital electronic representation of language

To understand digital electronic representation of language it is necessary to be clear about the nature of symbols and how language can be symbolized.

  What a symbol is: some physical thing
  What a symbol does: representation
  The arbitrariness of a symbol relative to what it represents
Since the invention of language representation technology, language has been symbolized using visible marks on some surface: stone, clay, papyrus, parchment, paper. But, given the nature of symbols, this is not in principle the only way to symbolize language. Any physical medium will do, and that includes electricity.
3.1 The first step: the telegraph and Morse Code

i. History

Scientists had been working on an electrical device for communication, the telegraph, since the mid-18th century, but it was an American named Samuel Morse who proposed the first workable system in 1838, and with it the idea of electronic representation of language. The usefulness of this invention for fast, long-distance communication was quickly appreciated. Western Union was founded in 1851, by 1854 there were 23,000 miles of telegraph wire in operation in the US, and in 1866 the first lasting trans-Atlantic cable link was established.
ii. How Morse Code works

In an alphabetic writing system, language is represented, or encoded, by assigning a symbol to every phoneme of a language. In the West, this has for many centuries been done using the familiar alphabet:

  /a/ is represented as A
  /b/ is represented as B
  and so on.

But the shape of the symbols used to represent phonemes is entirely arbitrary, and the result of a particular historical development. Morse's idea was to use a different representation. For every letter in the conventional alphabet, he proposed a corresponding symbol consisting of dots and dashes.

[Table: the Morse Code alphabet]
Using this system, with * for a dot and - for a dash, the word CAB would look like this in International Morse Code:

  C: - * - *
  A: * -
  B: - * * *
This recoding of phonemes looks superfluous at best (we already have a perfectly good alphabetic system) and silly at worst, but in fact it is fundamental to computational language representation technology, as we shall see.
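The recoding described above can be sketched as a simple lookup table. This is an illustrative sketch only: it assumes International Morse Code, covers just four letters, and the names `MORSE` and `encode_morse` are made up for this example.

```python
# International Morse Code for a few letters only (an assumption for
# illustration; a full table covers the whole alphabet and the digits).
# '*' stands for a dot and '-' for a dash, as in the lecture's notation.
MORSE = {
    "A": "*-",
    "B": "-***",
    "C": "-*-*",
    "D": "-**",
}


def encode_morse(word):
    """Return the Morse symbol for each letter of word, separated by spaces."""
    return " ".join(MORSE[letter] for letter in word.upper())


print(encode_morse("CAB"))  # -*-* *- -***
```

The point of the sketch is that the mapping from letter to symbol is an arbitrary table: swapping in a different table changes the code without changing the text it represents.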
iii. How a telegraph works

[Figure: diagram of a telegraph]
iv. Telegraph and Morse code combined

The key insight in this marriage stems once again from the nature of symbols, and in particular from the arbitrariness of symbols relative to what they represent. We have seen that, for each letter in the conventional alphabet, Morse proposed a symbol consisting of a sequence of dots and dashes.
Now, there is no particular reason why the dots and dashes should be marks on a piece of paper rather than electrical pulses: a dot could be a short pulse, and a dash a long pulse. In other words, Morse Code can be translated from a visual code directly into an electronic code. This is the crucial step. For the first time, there was an alternative to the traditional representation of language as visible marks on some surface, and that alternative was an electronic representation.
And how can such an electronic representation be generated? By using a telegraph.
By pressing the finger press for a short time, so that the electrical contacts come together only briefly, this device generates a short electrical pulse, and by holding it down for longer, it generates a long one. For a short pulse, the buzzer sounds briefly, and for a longer one it sounds for longer.
Thus, the telegraph version of the Morse Code for the letter D (a dash followed by two dots) looks (or rather sounds) like this:

  beeeeeeeep beep beep
An operator who is familiar with Morse Code can therefore encode and send any text message as a sequence of beeeeeeeep and beep keystrokes. All one needs is a network of electrical lines that the electronic pulses can travel along. In fact, such a network was quickly constructed in 19th-century America, and a cable was laid across the Atlantic to allow electronic communication with Europe, as already noted.
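The translation from dots and dashes to pulses can be sketched as follows. The pulse lengths (1 and 3 time units) are assumed values chosen for illustration, and the function names are hypothetical.

```python
# Assumed pulse lengths in arbitrary time units: short for a dot, long for a dash.
SHORT, LONG = 1, 3

# How each pulse sounds on the buzzer, using the lecture's own spelling.
SHORT_BEEP, LONG_BEEP = "beep", "beeeeeeeep"


def to_pulses(code):
    """Translate a string of '*' and '-' into a list of pulse lengths."""
    return [SHORT if symbol == "*" else LONG for symbol in code]


def to_sound(code):
    """Translate a string of '*' and '-' into what the buzzer would sound like."""
    return " ".join(SHORT_BEEP if symbol == "*" else LONG_BEEP for symbol in code)


letter_d = "-**"  # International Morse Code for D: dash, dot, dot
print(to_pulses(letter_d))  # [3, 1, 1]
print(to_sound(letter_d))
```

The same symbol sequence is rendered in two different physical forms here, which is the arbitrariness of symbols in practice.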
3.2 Generalization of Morse Code: ASCII

ASCII has been a standard encoding scheme for the representation of text in computers since the 1960s. It differs from Morse in two ways:

  It uses 0 and 1 instead of dots and dashes to make letter codes
  The code length is constant: each character occupies 8 places (a 7-bit code conventionally stored in one 8-bit byte), whereas in Morse the number of dots and dashes varies
Though different in detail, however, ASCII is no different in principle from Morse.
In ASCII, the word CAB looks like this:

  C: 01000011
  A: 01000001
  B: 01000010
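The ASCII codes for a word can be checked with a few lines of Python. This is a minimal sketch; the function name `to_ascii_bits` is made up for this example, while `ord` and `format` are standard Python built-ins.

```python
def to_ascii_bits(word):
    """Return the 8-place binary code for each character of word."""
    # ord() gives the character's numeric code; "08b" formats that number
    # as binary, padded with leading zeros to 8 places.
    return [format(ord(character), "08b") for character in word]


print(to_ascii_bits("CAB"))  # ['01000011', '01000001', '01000010']
```

Unlike the Morse sketch, no lookup table has to be written out here, because the ASCII table is already built into the language.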
3.3 Text storage in computers

A computer is an electronic device, and can only store data in electronic form in its memory. A computer memory is, in essence, just a very long sequence of numbered storage bins.
Each bin, or slot, on the right-hand side of the memory can contain one piece of electronic data. The computer gets at that piece of data by going to the corresponding address. How the computer knows the address, and what it does with the data once it has it, leads into the issue of how computers work, which is both beyond the scope of this module and unnecessary for present purposes.
We have seen that ASCII codes can be converted to electronic form by interpreting 1 as 'electrical on' and 0 as 'electrical off', and also that a computer memory is a sequence of storage slots, where each slot contains one item of electronic data. That data can be ASCII codes. Storing text in a computer memory is therefore simply a matter of putting the relevant codes in known memory locations in the right sequence. Thus, the word CAB would occupy three consecutive slots in memory, one ASCII code per slot.

[Figure: the ASCII codes for C, A, and B in consecutive memory slots]
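The idea of memory as a sequence of numbered slots can be modelled with a list, where the list index plays the role of the address. This is a toy sketch: the memory size of 16 slots, the starting address 4, and the function names are all assumptions made for illustration, and real memory hardware is not organized as Python strings.

```python
# A toy memory: 16 slots, each holding one 8-place code.
# The position in the list (0 to 15) is the slot's address.
memory = ["00000000"] * 16


def store_text(memory, start, text):
    """Put the ASCII code of each character into consecutive slots from start."""
    for offset, character in enumerate(text):
        memory[start + offset] = format(ord(character), "08b")


def load_text(memory, start, length):
    """Read length slots from start and turn the codes back into characters."""
    return "".join(chr(int(bits, 2)) for bits in memory[start:start + length])


store_text(memory, 4, "CAB")
for address in range(4, 7):
    print(address, memory[address])

print(load_text(memory, 4, 3))  # CAB
```

Reading the text back requires knowing both where it starts and how long it is, which is why the codes must sit in known locations and in the right sequence.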
3.4 How text gets into computer memory

Text gets into computer memory by means of an input device. There are various such devices, but the most familiar and commonly-used is the keyboard, so we look at that. As with memory itself, the operation of a computer keyboard is conceptually very simple: every time a letter key is pressed, the electronic ASCII code corresponding to the key is generated and sent up the wire connecting the keyboard to the computer. When it arrives at the computer, it is placed into the memory.
4. Corpora

4.1 Print corpora

4.1.1 Naturally-evolving corpora

i. Accumulation of printed documents
ii. Examples: library collections, historical archives, the law, the canon of English literature, etc.
4.1.2 Explicitly-designed corpora

i. Motivated by the appearance of scientific linguistics
ii. Research agendas: historical, dialectological etc.
iv. The nature of print-based corpora: document collections
4.2 Electronic corpora

4.2.1 The current position

  Worldwide generation of text leading to implicitly constructed corpora
  Explicit construction of corpora for linguistic research
  Standards: XML
  Examples
4.3 Advantages of electronic over print corpora

i. Efficiency of production
  Keyboard, scanner, voice recognition
  Contrast with print and manuscript production
ii. Efficiency of storage
  Capacity of electronic media
  Contrast with storage of books
iii. Efficiency of reference
  Locating and searching electronic text
  Contrast with locating and searching of books
iv. Efficiency of transmission
  Electronic dissemination of text
  Contrast with physical dissemination of books
v. Cost: electronic text is VERY cheap
vi. Suitability for analysis
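The points about efficiency of reference and suitability for analysis can be illustrated with a small sketch. The two "documents" below are made-up sample sentences standing in for a corpus, not real data, and the function names are hypothetical.

```python
import re
from collections import Counter

# Made-up sample documents standing in for a corpus; not real data.
corpus = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog chased the cat",
}


def tokenize(text):
    """Split text into lower-case words."""
    return re.findall(r"[a-z']+", text.lower())


def find_documents(corpus, word):
    """Return the names of all documents that contain word."""
    return [name for name, text in corpus.items() if word in tokenize(text)]


def word_frequencies(corpus):
    """Count how often each word occurs across the whole corpus."""
    counts = Counter()
    for text in corpus.values():
        counts.update(tokenize(text))
    return counts


print(find_documents(corpus, "dog"))    # ['doc2']
print(word_frequencies(corpus)["the"])  # 4
```

The same two operations on a shelf of printed books would mean reading every page by hand, which is the contrast the list above draws.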