LIS618 lecture 2: preparing and preprocessing Thomas Krichel 2011-04-21
today • text files and non-text files • characters • lexical analysis
preliminaries • Here we discuss some preparatory stages that shape the way the IR system is built. • This may sound unduly technical. • But the reality is that often, when querying IR systems • we don’t know how they have been built • we can’t find something that is not there • we have to make guesses and try several things
key aid: index • An index is a list of features, each with a list of locations where the feature is to be found. • The feature may be a word. • The locations depend on where we need to find the feature.
example: book • In a book, the index links terms to locations, and the locations are page numbers. • Index terms are selected manually by the author or a specially trained indexer. • Once the indexing terms are chosen, the index can be built by computer. • Note that any index term may appear on several pages.
example: computer files • Here the feature can be a word. • And a location can be the location of the file, together with the position where the term starts within the file. • The extent of the term can be found by taking that start position and adding the length of the term.
working with files • Usually, when indexing takes place, we study the contents of documents that are stored in files. • Usually you know something about the files • by the name of the directory • by commonly used file name parts, e.g. the extension • JHOVE is a tool that tries to guess file types when you don’t know much about them.
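As an illustration, here is a minimal sketch of extension-based type guessing with Python's standard mimetypes module; JHOVE itself inspects the file contents and does far more than this.

import mimetypes

# Guess a file's type from its name alone; returns (mime type, encoding).
for name in ["report.pdf", "song.mp3", "photo.jpeg", "notes.txt"]:
    mime, _ = mimetypes.guess_type(name)
    print(name, "->", mime)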
character extraction • The first thing to do is to extract the characters out of these documents. • That means translating the file content, a series of 0s and 1s, into a sequence of characters. • For some file types, the textual information is limited.
image files, music files • Even if you have image files and music files, there usually is some textual data that is stored. • jpeg files have exif metadata. • mp3 files have id3 metadata. • These data take the form of simple attribute: value pairs.
example for id3 version 1, start • header: "TAG" • title: 30 bytes of the title • artist: 30 bytes of the artist name • album: 30 bytes of the album name • year: 4 bytes, a four-digit year • comment: 30 bytes of comment • … more fields
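A minimal sketch of reading these fields in Python, assuming an ID3 version 1 tag, which occupies the last 128 bytes of the file:

def read_id3v1(path):
    # The ID3v1 tag sits in the last 128 bytes of an mp3 file.
    with open(path, "rb") as f:
        f.seek(-128, 2)               # seek relative to the end of the file
        tag = f.read(128)
    if tag[:3] != b"TAG":
        return None                   # no ID3v1 tag present
    def field(a, b):
        # Fields are padded with NULs or spaces; decode permissively.
        return tag[a:b].rstrip(b"\x00 ").decode("latin-1")
    return {
        "title":   field(3, 33),
        "artist":  field(33, 63),
        "album":   field(63, 93),
        "year":    field(93, 97),
        "comment": field(97, 127),
    }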
mixed contents files • If you look at an MS Word, spreadsheet or PDF file, they contain text surrounded by non-textual code that the software reading them can understand. • Such code contains descriptions of document structure, as well as formatting instructions.
markup? • Markup is the non-textual contents of a text document. • Markup is used to provide structure to a document and to prepare for its formatting. • An easy example of markup is any page written in HTML
example from a very fine page <li><a tabindex="400" class="int" href="travel_schedule.html" accesskey="t">travel schedule</a></li> <li><a tabindex="500" class="int" accesskey="l" href="links.html">favorite links</a>.</li> </ul> <div class="hidden"> This text is hidden. </div>
text extracted • The extracted text is “travel schedule favorite links. This text is hidden.” • The phrase “This text is hidden” is not normally shown on the page but it is still extracted and indexed by Google. • Well, I think it is.
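A minimal sketch of this kind of extraction, using Python's standard html.parser to keep the character data and drop the tags:

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collect only the character data, discarding all markup.
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)
    def text(self):
        # Join the chunks and collapse runs of whitespace.
        return " ".join(" ".join(self.chunks).split())

p = TextExtractor()
p.feed('<li><a href="links.html">favorite links</a>.</li>'
       '<div class="hidden">This text is hidden.</div>')
print(p.text())   # favorite links . This text is hidden.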
tokenization • Tokenization refers to the splitting of the source to be indexed into tokens that can be used in a query language. • In the most common case, we think about splitting the text into words. • Hey that is easy: “travel” “schedule” “favorite” “links” “this” “text” “is” “hidden” • Spot a problem?
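A minimal sketch of the naive approach, splitting on whitespace only, makes one of the problems visible:

text = "travel schedule favorite links. This text is hidden."
tokens = text.split()
print(tokens)
# ['travel', 'schedule', 'favorite', 'links.', 'This', 'text', 'is', 'hidden.']
# Punctuation sticks to 'links.' and 'hidden.', and 'This' keeps its capital.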
text files • Some files contain only text, meant for the end user. • Others contain text, plus some textual instructions that are meant for processing software. Such instructions are called markup. • Typical examples for markup are web pages.
key aid: index terms • The index term is a part of the document that has a meaning on its own. • It is usually a word. Sometimes it is a phrase. • Let us stick with the idea of the indexing term being a word.
the representation of text • In principle, every file is just a sequence of on/off electrical signals. • How can these signals be translated into text? • Well, text is a sequence of characters, and every character needs to be recognized. • Character recognition leads to text recognition.
bits and bytes • A bit is a unit of digital information. It can take the values 0 and 1. • A byte (respelling of “bite” to avoid confusion) is a unit of storage. In principle its size is hardware dependent, but de facto it is 8 bits.
characters and computers • Computers cannot deal with characters directly. They can only deal with numbers. • Therefore we need to associate a number with every character that we want to use in an information encoding system. • A character set combines characters with numbers.
characters • When we are working with characters, two concepts appear. • The character set. • The character encoding. • To properly recognize digital data, we need to know both.
The character set • The character set is an association of characters with numbers. Example: • ‘ ’ (the space) 32 • bell 7 • ‘→’ 8594 • Without such a concordance the processing of text by computer would be impossible.
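In Python, a glimpse of this concordance is given by the built-ins ord() and chr(), which map between a character and its number:

print(ord(" "))     # 32, the space
print(chr(7))       # the bell character
print(chr(8594))    # →, the rightwards arrow
print(ord("→"))     # 8594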
the character encoding • This regards the way the character number is written out as a sequence of bytes. • There are several ways to do this. Going into details here would be a bit too technical. • Let us look at some examples.
ascii • The most famous, but old, character set is the American Standard Code for Information Interchange (ASCII). • It contains 128 characters. • Its representation in bytes has every byte start with a 0, followed by the seven bits that make up the character number.
iso-8859-1 (iso-latin-1) • This is an extension of ASCII that has additional characters that are suitable for the Western European languages. • There are 256 characters. • Every character occupies a byte.
Windows 1252 • This is an extension of iso-latin-1. • It is a legacy character set used on the Microsoft Windows operating system. • Some code positions of iso-latin-1 have been filled with characters that Microsoft liked.
UCS / Unicode • UCS is a universal character set. • It is maintained by the International Organization for Standardization (ISO). • Unicode is an industry standard for characters. It is better documented than UCS. • For what we discuss here, UCS and Unicode are the same.
Basic multilingual plane • This is a name for the first 65536 characters in Unicode. • Each of these characters fits into two bytes and is conveniently represented by four hexadecimal digits. • Even for these characters, there are numerous complications.
ascii and unicode • The first 128 characters of UCS/Unicode are the same as the ones used by ASCII. • So you can think of UCS/Unicode as an extension of ASCII.
in foreign languages • Everything becomes difficult. • As an example, consider the characters • o • ő • ö • The latter two can be considered o with diacritics, or as separate characters.
most problematic: encoding • One issue is how to map characters to numbers. • This is complicated for languages other than English. • But let us assume UCS/Unicode has solved this. • The mapping is not the main problem that we have when working with non-ASCII character data.
encoding • The encoding determines how the numbers of each character should be put into bytes. • If you have a character set that has one byte for each character, you have no encoding issue. • But then you are limited to 256 characters in your character set.
fixed-length encoding • If you have a fixed-length encoding, all characters take the same number of bytes. • Say, for the basic multilingual plane of Unicode, you need two bytes for each character, and then you are limited to that plane. • If you are writing only ASCII, this appears a waste.
variable-length encoding • The most widely used scheme to encode Unicode is a variable-length scheme called UTF-8. • It is important to understand that the encoding needs to be known and correct.
basics of UTF-8 • Every ASCII character, represented as a byte, starts with a zero. • Characters that are beyond ASCII require two to four bytes to be complete. • The first byte will tell you how many bytes are coming to make the character complete.
byte shapes in UTF-8 encoding • 0??????? ASCII • 110????? first octet of a two-byte character • 1110???? first byte of a three-byte character • 11110??? first octet of a four-byte character • 10?????? byte that is not the first byte • as you can see, there are sequences of bytes that are not valid
hex range to UTF-8 • 0000 to 007F 0??????? • 0080 to 07FF 110????? 10?????? • 0800 to FFFF 1110???? 10?????? 10??????
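A minimal sketch that makes these byte shapes visible, encoding a one-byte, a two-byte and a three-byte character:

for ch in "A", "é", "€":
    octets = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in octets)
    print(f"U+{ord(ch):04X} {ch}  {bits}")
# U+0041 A  01000001
# U+00E9 é  11000011 10101001
# U+20AC €  11100010 10000010 10101100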
in the simplest case • In the simplest case, let us assume that we index a bunch of text files as full text. • These documents contain text written in some sort of a human language. • They may also contain some markup, but we assume we can strip the markup.
lexical analysis • Lexical analysis is the process of turning a sequence of characters into a sequence of words. • These candidate words can then be used as index terms (or what I have called features).
text direction • Text is not necessarily linear. • When right-to-left writing is used, you sometimes have to jump in the opposite direction when a left-to-right expression appears. Example: • حققت الجزائر استقلالها عام 1962 بعد 132 سنة من الاحتلال الفرنسي. (Algeria achieved its independence in 1962, after 132 years of French occupation.)
spaces • One important delimiter between words is the space. • The blank, the newline, the carriage return and the tab characters are collectively known as whitespace. • We can collapse whitespace and get the words. • But there are special cases.
text without spaces: Japanese • Japanese uses three scripts: kanji, hiragana and katakana. • All foreign words are written with katakana. Any continuous katakana sequence is a word. • A Japanese word has at most 4 kanji/hiragana. A dictionary lookup has to search backward, from 4 characters down to 1, to see whether a word exists.
punctuation • Normally lexical analysis discards punctuation. But some punctuation can be part of a word, say “333B.C.” • What we have to understand here is that probably at query time, the same removal of punctuation occurs. • The user searches for “333B.C”, the query is translated to 333BC and the occurrence is found.
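A minimal sketch of such symmetric normalization, applied to both the document and the query so that the two still match:

import re

def normalize(s):
    # Strip punctuation the same way at index time and at query time.
    return re.sub(r"[^\w\s]", "", s).lower()

print(normalize("333B.C."))                         # 333bc
print(normalize("333B.C") == normalize("333B.C."))  # True: the occurrence is found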
hyphen • It is a particularly problematic punctuation mark. • Breaking at hyphens, as for example in “state-of-the-art”, is good. • But breaks introduced by end-of-line hyphenation may not be ends of words. • Some hyphens are part of a word, e.g. UTF-8. I should use the non-breaking hyphen.
apostrophe • Example: • Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing. • Splitting at the apostrophe gives “O Neill” … “aren t” • The first seems OK, but the second looks silly.
stop words • Some words are considered to carry no meaning because they are too frequent. • Sometimes they are completely removed, and not recorded in the index. • Then you can’t search for the phrase “to be or not to be”. • Stop words are very language dependent.
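A minimal sketch of stop word removal, with a tiny hand-made stop list (real systems use longer, language-specific lists):

STOP_WORDS = {"to", "be", "or", "not", "the", "a", "is"}

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("to be or not to be".split()))
# [] -- every token is gone, so the phrase can no longer be found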
stemming • Frequently, text contains grammatical forms that a user may not search for. In English, e.g. • plurals • gerund forms • past tense suffixes • The most famous algorithm for English stemming is Martin Porter’s stemming algorithm. • Evidence on its usefulness is mixed.
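A minimal sketch using the Porter stemmer as implemented in the NLTK library (this assumes NLTK is installed, e.g. via pip install nltk):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["libraries", "indexing", "searched", "pony"]:
    print(word, "->", stemmer.stem(word))
# libraries -> librari
# indexing -> index
# searched -> search
# pony -> poni
# Note that stems such as 'librari' are not words; they only need to match.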
grammatical variants • These are much more important in other languages. E.g., in Russian • “My name is Irina.” “Меня зовут Ирина.” • “I love Irina.” “Я люблю Ирину.” • Text in a query in Russian must be matched against a variety of grammatical forms, some of which are very tricky to derive.
casing • Normalization usually means that we treat uppercase and lowercase the same, meaning that the uppercase and lowercase variants of a token become the same term. • However in English • “US” vs “us” • “IT” vs “it”…
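A minimal sketch of case folding, showing the collision just mentioned:

for word in ["US", "us", "IT", "it"]:
    print(word, "->", word.lower())
# After folding, 'US' (the country) and 'us' (the pronoun) become the same
# term, as do 'IT' (information technology) and 'it'.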
treatment of diacritics • There are rules in Unicode for how diacritics can be stripped. • These involve decomposing characters with diacritics. • This is helpful for people who find the input of diacritics tough. • But there are lots of other tough choices.
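A minimal sketch of diacritic stripping via Unicode decomposition: NFD normalization splits a character into a base letter plus combining marks, which can then be dropped.

import unicodedata

def strip_diacritics(s):
    # Decompose, then drop the combining marks (Unicode category 'Mn').
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(strip_diacritics("ő"))   # o
print(strip_diacritics("ö"))   # o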