1 / 52

LIS6 18 lecture 2 preparing and preprocessing

LIS6 18 lecture 2 preparing and preprocessing. Thomas Krichel 2011-04-21. today. text files and non-text files characters lexical analysis. preliminaries. Here we discuss some preparatory stages that has the way the IR system is built. This may sound unduly technical.

talbot
Download Presentation

LIS6 18 lecture 2 preparing and preprocessing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. LIS618 lecture2preparing and preprocessing Thomas Krichel 2011-04-21

  2. today • text files and non-text files • characters • lexical analysis

  3. preliminaries • Here we discuss some preparatory stages that has the way the IR system is built. • This may sound unduly technical. • But the reality is that often, when querying IR systems • we don’t know how they have been built • we can’t find something that is not there • we have to make guesses and try several things

  4. key aid: index • An index is a list of features, with a list of locations where the feature is to be found. • The feature may be a word • The locations depend on where we need to find the feature at.

  5. example: book • In a book the index links terms and locations are page numbers. • Index terms are selected manually by the author or a specially trained indexer. • Once indexing terms are chosen the index can be built by computer. • Note that any index terms may be on several page.

  6. example: computer files • Here the feature can be a word. • An a location can be the location of the file with the location of the start term within the file. • The location of the term can be found by looking at the start and add the length of the term.

  7. working with files • Usually, when indexing takes place, we study the contents of documents that are stored in files. • Usually you know something about the files • by the name of the directory • by commonly used file name part, e.g. extension • JHOVE is a tool that tries to guess file types when you don’t know much about them.

  8. character extraction • The first thing to do is to extract the characters out of these documents. • That means translating the file content, a series of 0s and 1s into a sequence of characters. • For some types the textual information is limited.

  9. image file, music files • Even if you have image files and music files, there usually is some textual data that is store. • jpeg files have exif metadata. • mp3 file have id3 metadata. • These data has simple attribute: value data.

  10. example for id3 version, start • header: "TAG" • title: 30 bytes of the title • artist: 30 bytes of the artist name • album: 30 bytes of the album name • year: 4 A four-digit year • comment: 30 bytes of comment • … more fields

  11. mixed contents files • If you look at MS word, spreadsheet or PDF file, they contain text surrounded by not textual code that the software reading it can understand. • Such code contains descriptions of document structure, as well as formatting instructions.

  12. markup? • Markup is the non-textual contents of a text documents. • Markup is used to provide structure to a document and to prepare for its formatting. • An easy example for markup is any page written in HTML

  13. example from a very fine page <li><a tabindex="400" class="int" href="travel_schedule.html" accesskey="t">travel schedule</a></li> <li><a tabindex="500" class="int" accesskey="l" href="links.html">favorite links</a>.</li> </ul> <div class="hidden"> This text is hidden. </div>

  14. text extracted • The extracted text is “travel schedule favorite links. This text is hidden.” • The phrase “This text is hidden” is not normally shown on the page but it is still extracted and indexed by Google. • Well, I think it is.

  15. tokenization • Tokenization refers to the splitting of the source to be indexed into tokens that can be used in a query language. • In the most common case, we think about splitting the text into words. • Hey that is easy: “travel” “schedule” “favorite” “links” “this” “text” “is” “hidden” • Spot a problem?

  16. text files • Some files can either contain text, meant for the end user. • Some others contain text, plus some textual instruction that are means for processing software. Such instructions are called markup. • Typical examples for markup are web pages.

  17. key aid: index terms • The index term is a part of the document that has a meaning on its own. • It is usually a word. Sometimes it is a phrase. • Let us stick with the idea of the indexing term being a word.

  18. the representation of text • In principle, every file is just a sequence of on/off electrical signals. • How can these signals translated into text? • Well text is a sequence of characters and every character needs to be recognized. • Character recognition leads to text recognition.

  19. bits and bytes • A bit is a unit of digital information. It can take the values 0 and 1. • A byte (respelling of “bite” to avoid confusion) is a unit of storage. It should be hardware dependent but it is de facto 8 bits.

  20. characters and computer • Computers can not deal with characters directly. They can only deal with numbers. • There we need to associate a number with every character that we want to use in an information encoding system. • A character set combines characters with number.

  21. characters • When we are working with characters, two concepts appear. • The character set. • The character encoding. • To properly recognize digital data, we need to know both.

  22. The character set • The character set is an association of characters with numbers. Example • ‘ ‘  32 • bell  7 • ‘’  8594 • Without such a concordance the processing of text by computer would be impossible.

  23. the character encoding • This regards the way the character number is written out as a sequence of bytes. • There are several ways to do this. Going into details here would be a bit too technical. • Let us look at some examples.

  24. ascii • The most famous, but old character set is the American Standard for Information Interchange ASCII. • It contains 128 characters. • Its representation in bytes has every byte start with a 0 and then the seven bits that the character number makes up.

  25. iso-8859-1 (iso-latin-1) • This is an extension of ASCII that has additional characters that are suitable for the Western European languages. • There are 256 characters. • Every character occupies a byte.

  26. Windows 1252 • This is an extension of iso-latin-1. • It is a legacy character set used on the Microsoft Windows operating system. • Some code positions of iso-latin-1 have been filled with characters that Microsoft liked.

  27. UCS / Unicode • UCS is a universal character set. • It is maintained by the International Standards Organization. • Unicode is an industry standard for characters. It is better documented than UCS. • For what we discuss here, UCS and Unicode are the same.

  28. Basic multilingual plane • This is a name for the first 65536 characters in Unicode. • Each of these characters fits into two bytes and is conveniently represented by four hex numbers. • Even for these characters, there are numerous complications associated with them.

  29. ascii and unicode • The first 128 characters of UCS/Unicode are the same as the ones used by ASCII. • So you can think of UCS/Unicode as an extension of ASCII.

  30. in foreign languages • Everything becomes difficult. • As an example consider the characters • o • ő • ö • The latter two can be considered o with diarcitics or as separate characters.

  31. most problematic: encoding • One issue is how to map characters to numbers. • This is complicated for languages other than English. • But assume UCS/Unicode has solved this. • But this is not the main problem that we have when working with non-ASCII character data.

  32. encoding • The encoding determines how the numbers of each character should be put into bytes. • If you have a character set that is has one byte for each character, you have no encoding issue. • But then you are limited to 256 characters in your character set.

  33. fixed-length encoding • If you have a fixed length encoding, all characters take the same number of bytes. • Say for the basic-multilingual plane of unicode, you need two bytes for each character, and then you are limited to that. • If you are writing only ASCII, it appears a waste.

  34. variable length encoding • The most widely used scheme to encode Unicode is a variable length scheme, called UTF-8. • It is important to understand that the encoding needs to known and correct.

  35. bascis of UTF-8 • Every ASCII character, represented as a byte, starts with a zero. • Characters that are beyond ASCII require two or three bytes to be complete. • The first byte will tell you how many bytes are coming to make the character complete.

  36. byte shapes in UTF-8 encoding • 0?????? ASCII • 110???? first octet of two-byte character • 1110???? first byte of three-byte character • 11110??? first octet of four-byte character • 10??????? byte that is not the first byte • as you can see, there are sequences of bytes that are not valid

  37. hex range to UTF-8 • 0000 to 007F 0??????? • 0080 to 07FF 110????? 10?????? • 0800 to FFFF 1110???? 10?????? 10??????

  38. in the simplest case • In the simplest case, let us assume that we index a bunch of text files as full text. • These documents contain text written in some sort of a human language. • They may also contain some markup, but we assume we can strip the markup.

  39. lexical analysis • Lexical analysis is the process of preparing a sequence or characters into a sequence of words. • These candidate words can then be used as index term (or what I have called features).

  40. text direction • Text is not necessarily linear. • When right to left writing is used, you sometimes have to jump in the opposite direction when a left to right expression appears. Example: • حققت الجزائر استقلالها عام 1962 بعد 132 سنة من الاحتلال الفرنسي.

  41. spaces • One important delimiter between words are spaces. • The blank, the newline, the carriage return and the tab characters are collectively known as whitespace • We can collapse whitespace and get the words. • But there are special cases.

  42. text without spaces: Japanese • Japanese uses 3 scripts, kanji, hiragana and katakana. • All foreign words are written with katakana. Any conitunous katakana sequence is a word. • A Japanese word has at most 4 kanji/hiragana. A dictionary has to find backward from 4 to 1 if a word exists.

  43. punctuation • Normally lexical analysis discards punctuation. But some punctuation can be part of a word, say “333B.C.” • What we have to understand here is that probably at query time, the same removal of punctuation occurs. • The user searches for “333B.C”, the query is translated to 333BC and the occurrence is found.

  44. hyphen • It’s a particularly problematic punctuation mark. • Breaking them, like, for example, in “state-of-the-art” is good. • End of line hyphenation breaks at a line end may not be ends of words. • Some hyphenation is part of a word, e.g. UTF-8. I should use the non-breaking hyphen.

  45. apostrophe • Example: • Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing. • O Neil … aren t • The first seem ok, but second looks silly.

  46. stop words • Some words are considered to carry no meaning because they are too frequent. • Sometimes, they are completely removed, and not removed in the index. • Then you can’t search for the phrase “to be or not to be”. • Stop word are very language dependent.

  47. stemming • Frequently, text contains grammatical forms that a user may not look at. In English, e.g. • plurals • gerund form • past tense suffixes • The most famous algorithm for English stemming is Martin Porter’s stemming algorithm. • Evidence on usefulness is mixed.

  48. grammatical variants • These are much more important in other languages. E.g., in Russian • “My name is Irina.”  “Меня зовут Ирина.” • “I love Irina.”  “Я люблю Ирину.” • Text in a query in Russian must be matched against a variety of grammatical forms, so of which are very tricky to derive.

  49. casing • Normalization usually means that we treat uppercase and lowercase the same, meaning that the uppercase and lowercase variants of a token become the same term. • However in English • “US” vs “us” • “IT” vs “it”…

  50. treatment of diacritics • There are rules in Unicode how diacritics can be stripped. • These involve decomposing diacritic • This is helpful for people who find the input of diacritics tough. • But there are lots of other tough choices.

More Related