Automatic Text Processing

Automatic Text Processing: Representing a Text Collection

Representation methods and formats. Index. Databases.

Full-text search
This C function scans the string big from left to right and, for each position x, starts a character-by-character comparison with the substring little being searched for. To do so, it advances two pointers, y and z, in step and compares the characters pairwise. If the end of the substring is reached without a mismatch, the substring has been found.

    char* strstr(char *big, char *little)
    {
        char *x, *y, *z;
        for (x = big; *x; x++) {                    /* every start position in big */
            for (y = little, z = x; *y; ++y, ++z) {
                if (*y != *z) break;                /* mismatch (also hit at the end of big) */
            }
            if (!*y) return x;                      /* reached the end of little: match */
        }
        return 0;                                   /* not found */
    }

Full-text search
• One can load the text into Word and search there: Edit > Find
• What will we find?
• Find: «дом» ("house")
• The form «дом», or a part of any word that contains the letter sequence «дом», e.g. народом
• The program looks for exactly the substring we give it (exact match)
• ??? How do we find дома, доме, домом, etc.?
• One can use a special pattern language: «дом.*»
• What will we find?
• Дома, доме, etc. + домашний, домовой, домолоть …
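The two searches on this slide can be tried out with a small sketch. It assumes a POSIX system with regex.h and UTF-8 source text; the helper name matches is hypothetical and is not taken from the lecture. A bare pattern matches anywhere inside a word, while ^ anchors it to the start of the word, which is how a pattern like «дом.*» is applied to whole tokens.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the POSIX extended regular expression `pattern` matches
   somewhere in `word`, 0 if it does not, and -1 if the pattern is invalid.
   Hypothetical helper for illustration only. */
int matches(const char *pattern, const char *word)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int found = (regexec(&re, word, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

With this helper, the pattern "дом" matches народом (plain substring search), while "^дом.*" rejects народом but accepts both дома and домашний, so the prefix pattern finds the inflected forms and over-generates at the same time.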

Index. Full-text search
Although scanning all texts directly is rather slow, it would be wrong to think that direct-search algorithms are not used on the internet. The Norwegian search engine Fast (www.fastsearch.com) used a chip that implemented direct search for simplified regular expressions [fastpmc], and placed 256 such chips on a single board. This allowed Fast to serve a fairly large number of queries per unit of time. (I. Segalovich)

"Puzzles" (“backtracking”)
• Searching the Leeds corpora:
• How do we find «Пока!»?
• Searching COCA:
• Find all forms of the verb «tell»
• Searching the Russian National Corpus (НКРЯ):
• How do we find words that begin with пере- and end in -вываться?
• WHY IS THAT?

What comes after the token?
• How should annotations be represented?
• How should annotations be stored?
• How can navigation through the corpus (and its annotations) be provided?

"Packaging" the corpus. XML markup
<S>
  <sent_text></sent_text>
  <tree>
    <token>
      <lexem lex_text="Он" ID="1" father="3" link="от сказуемого к подлежащему, Гл. – местоим.-сущ." lemma="он" grval="Pron, PronounPersonal, Sg, Masc, Nom"></lexem>
    </token>
    <token>
      <lexem lex_text="так" ID="2" father="3" link="примыкание, Гл. - наречие" lemma="так" grval="Adv, NotOAdverb"></lexem>
    </token>
    <token>
      <lexem lex_text="любит" ID="3" father="-1" link="связь от корня" lemma="любить" grval="Verb, Finit, Praes, _3rd, Sg, Trans, Imperfect, GenC_No, DatC_No, AccC_AnyAnym, InstrC_NAnim, LocC_No, Unreflexive"></lexem>
    </token>
    <token>
      <lexem lex_text="эту" ID="4" father="5" link="согласование, Сущ. - атрибут. ч. р." lemma="этот" grval="Pron, PronAdj, Sg, Fem, Acc"></lexem>
    </token>
    <token>
      <lexem lex_text="квартиру" ID="5" father="3" link="управление, Гл. - сущ." lemma="квартира" grval="Noun, Nanim, Nverbal, Fem, Acc, Sg"></lexem>
    </token>
  </tree>
</S>

"Packaging" the corpus. Markup
------line1------
1 Он synt_tag=<subj> gov_by=<3> antec=<>
2 так synt_tag=<spec> gov_by=<3> antec=<>
3 любит synt_tag=<pred> gov_by=<> antec=<>
4 эту synt_tag=<amod> gov_by=<5> antec=<>
5 квартиру. synt_tag=<obj> gov_by=<3> antec=<>
------line2------
1 Судьба synt_tag=<subj> gov_by=<2> antec=<>
2 дала synt_tag=<pred> gov_by=<> antec=<>
3 мне synt_tag=<comp> gov_by=<2> antec=<>
4 эту synt_tag=<amod> gov_by=<5> antec=<>
5 возможность. synt_tag=<obj> gov_by=<2> antec=<>
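To make the storage question concrete, here is a minimal sketch of reading one line of the column format above into a record. The struct TokenAnnotation and the function parse_token_line are hypothetical names introduced for illustration. The sketch assumes that an empty gov_by=<> marks the root of the sentence, and it ignores the antec field, which is empty throughout the sample.

```c
#include <stdio.h>

/* One annotated token; field names mirror the sample format (assumed layout). */
typedef struct {
    int  id;            /* position of the token in the sentence */
    char form[64];      /* word form as written, punctuation included */
    char synt_tag[16];  /* syntactic relation label, e.g. "subj" */
    int  head;          /* id of the governing token, 0 for the root */
} TokenAnnotation;

/* Parses a line such as "1 Он synt_tag=<subj> gov_by=<3> antec=<>".
   Returns 1 on success and 0 if the line does not fit the format. */
int parse_token_line(const char *line, TokenAnnotation *t)
{
    int n = sscanf(line, "%d %63s synt_tag=<%15[^>]> gov_by=<%d>",
                   &t->id, t->form, t->synt_tag, &t->head);
    if (n == 3) {       /* gov_by=<> is empty: no governor, treat as root */
        t->head = 0;
        return 1;
    }
    return n == 4;
}
```

Storing the head as an integer id keeps the dependency tree navigable: following head from any token eventually reaches the token whose head is 0.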

Index. The inverted file
This is the simplest of data structures. It is familiar to any literate person and to any database programmer, even one who has never dealt with full-text search. The first group knows it from "concordances": alphabetically ordered, exhaustive lists of the words of a single text or a single author (for example, the "Concordance to the Poems of A. S. Pushkin" or the "Dictionary-Concordance of F. M. Dostoevsky's Journalism"). The second group deals with some form of inverted list whenever they build or use a "database index on a key field".

%% word          tag    morph            edge  parent  secedge  comment
#BOS 1 1 985275570 1
Mögen            VMFIN  3.Pl.Pres.Konj   HD    508
Puristen         NN     Masc.Nom.Pl.*    NK    505
aller            PIDAT  *.Gen.Pl         NK    500
Musikbereiche    NN     Masc.Gen.Pl.*    NK    500
auch             ADV    --               MO    508
die              ART    Def.Fem.Akk.Sg   NK    501
Nase             NN     Fem.Akk.Sg.*     NK    501
rümpfen          VVINF  --               HD    506
,                $,     --               --    0
#500             NP     --               GR    505
#501             NP     --               OA    506
#EOS 1

Full-text search vs. ???
• How is navigation within books organized? With an index

Index

Index. A little about information retrieval
• Which plays of Shakespeare contain the words BRUTUS AND CAESAR, but not CALPURNIA?
• One could grep all of Shakespeare’s plays for BRUTUS and CAESAR, then strip out lines containing CALPURNIA
• Why is grep not the solution?
• Slow (for large collections)
• grep is line-oriented, IR is document-oriented
• “NOT CALPURNIA” is non-trivial
• Other operations (e.g., find the word ROMANS near COUNTRYMAN) not feasible

Index. A little about information retrieval
Entry is 1 if term occurs. Example: CALPURNIA occurs in Julius Caesar.
Entry is 0 if term doesn’t occur. Example: CALPURNIA doesn’t occur in The Tempest.

Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query BRUTUS AND CAESAR AND NOT CALPURNIA:
• Take the vectors for BRUTUS, CAESAR, and CALPURNIA
• Complement the vector of CALPURNIA
• Do a (bitwise) AND on the three vectors
• 110100 AND 110111 AND 101111 = 100100
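The query evaluation above maps directly onto machine-level bit operations. The following is a minimal sketch, assuming each incidence vector fits into one unsigned integer with one bit per document; the function name and_and_not is hypothetical. The CALPURNIA vector 010000 is not printed on the slide and is inferred here from its complement 101111.

```c
#include <limits.h>

/* Evaluates "A AND B AND NOT C" over incidence vectors packed into the
   low n_docs bits of an unsigned integer (one bit per document). */
unsigned and_and_not(unsigned a, unsigned b, unsigned c, unsigned n_docs)
{
    /* Mask that keeps only the bits corresponding to real documents. */
    unsigned mask = (n_docs >= sizeof(unsigned) * CHAR_BIT)
                        ? ~0u
                        : ((1u << n_docs) - 1u);
    unsigned not_c = ~c & mask;   /* complement, restricted to n_docs bits */
    return a & b & not_c;
}
```

Calling and_and_not(0x34, 0x37, 0x10, 6) uses BRUTUS = 110100, CAESAR = 110111 and CALPURNIA = 010000 and yields 0x24, that is 100100, the answer given on the slide. The mask matters because complementing a full machine word would otherwise set bits for documents that do not exist.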

0/1 vector for BRUTUS

Can’t build the incidence matrix
• M = 500,000 × 10^6 = half a trillion 0s and 1s.
• But the matrix has no more than one billion 1s.
• Matrix is extremely sparse.
• What is a better representation?
• We only record the 1s.
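The degree of sparsity follows from the two figures on this slide alone; as a quick check, the share of cells that can hold a 1 is at most:

```latex
\frac{10^{9}}{500{,}000 \times 10^{6}} = \frac{10^{9}}{5 \times 10^{11}} = 0.002 = 0.2\%
```

So at least 99.8% of the matrix consists of 0s, which is why recording only the 1s pays off.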

Inverted Index
• For each term t, we store a list of all documents that contain t.
(figure labels: dictionary, postings)

Inverted index construction
• Collect the documents to be indexed.
• Tokenize the text, turning each document into a list of tokens.
• Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.
• Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
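The tokenization and normalization steps can be illustrated with a deliberately crude sketch. It assumes ASCII text, treats every maximal run of letters as a token, and uses lowercasing as the only normalization; real linguistic preprocessing (lemmatization, handling of Cyrillic, and so on) is far richer. The name tokenize and the constant MAX_TOKEN are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

#define MAX_TOKEN 32   /* longest stored token, including the final '\0' */

/* Splits `text` into lowercase alphabetic tokens and stores up to
   `max_tokens` of them in `tokens`. Returns the number of tokens stored.
   Letters beyond MAX_TOKEN - 1 in an overlong token are dropped. */
size_t tokenize(const char *text, char tokens[][MAX_TOKEN], size_t max_tokens)
{
    size_t n = 0, len = 0;
    for (const char *p = text; ; p++) {
        unsigned char c = (unsigned char)*p;
        if (isalpha(c)) {
            if (n < max_tokens && len < MAX_TOKEN - 1)
                tokens[n][len++] = (char)tolower(c);   /* normalize case */
        } else {
            if (len > 0) {             /* a non-letter ends the current token */
                tokens[n][len] = '\0';
                n++;
                len = 0;
            }
            if (c == '\0')
                break;
        }
    }
    return n;
}
```

For the sample string "Friends, Romans, countrymen." (chosen here only as an illustration) this produces the three indexing terms friends, romans and countrymen.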

Generate postings

Sort postings

Create postings lists, determine document frequency

Split the result into dictionary and postings file
(figure labels: dictionary, postings)
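The four preceding steps (generate postings, sort them, build postings lists with document frequencies, split into dictionary and postings) can be sketched in a few lines for a collection that fits in memory. The types Posting and DictEntry and the function build_index are hypothetical names, and the fixed-size term buffer is a simplifying assumption; terms must be shorter than MAX_TERM characters.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_TERM 32

/* One (term, docID) pair produced by the "generate postings" step. */
typedef struct {
    char term[MAX_TERM];
    int  doc_id;
} Posting;

/* One dictionary entry: the term, its document frequency, and the offset
   of its postings list inside the shared docs[] array. */
typedef struct {
    char term[MAX_TERM];
    int  doc_freq;
    int  first;
} DictEntry;

/* Orders pairs by term, then by docID. */
static int cmp_posting(const void *a, const void *b)
{
    const Posting *p = a, *q = b;
    int c = strcmp(p->term, q->term);
    if (c != 0)
        return c;
    return (p->doc_id > q->doc_id) - (p->doc_id < q->doc_id);
}

/* Sorts `pairs` in place, then fills `dict` (the dictionary) and `docs`
   (the postings file). Both output arrays need room for n entries.
   Returns the number of distinct terms. */
size_t build_index(Posting *pairs, size_t n, DictEntry *dict, int *docs)
{
    size_t n_terms = 0, n_docs = 0;
    qsort(pairs, n, sizeof pairs[0], cmp_posting);
    for (size_t i = 0; i < n; i++) {
        int new_term = (i == 0) || strcmp(pairs[i].term, pairs[i - 1].term) != 0;
        if (new_term) {
            strcpy(dict[n_terms].term, pairs[i].term);
            dict[n_terms].doc_freq = 0;
            dict[n_terms].first = (int)n_docs;
            n_terms++;
        } else if (pairs[i].doc_id == pairs[i - 1].doc_id) {
            continue;   /* same term again in the same document: skip */
        }
        docs[n_docs++] = pairs[i].doc_id;
        dict[n_terms - 1].doc_freq++;
    }
    return n_terms;
}
```

Because the pairs are sorted by term and then by docID, every postings list comes out sorted, which is what later makes linear-time intersection possible.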

Later in this course
• Index construction: how can we create inverted indexes for large collections?
• How much space do we need for dictionary and index?
• Index compression: how can we efficiently store and process indexes for large collections?
• Ranked retrieval: what does the inverted index look like when we want the “best” answer?

Outline
• Introduction
• Inverted index
• Processing Boolean queries
• Query optimization

Simple conjunctive query (two terms)
• Consider the query: BRUTUS AND CALPURNIA
• To find all matching documents using inverted index:
• Locate BRUTUS in the dictionary
• Retrieve its postings list from the postings file
• Locate CALPURNIA in the dictionary
• Retrieve its postings list from the postings file
• Intersect the two postings lists
• Return intersection to user

Intersecting two postings lists
• This is linear in the length of the postings lists.
• Note: This only works if postings lists are sorted.
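A minimal sketch of the merge-style intersection shows where both remarks come from. It assumes postings lists are stored as ascending arrays of integer docIDs; the function name intersect is hypothetical.

```c
#include <stddef.h>

/* Intersects two postings lists sorted in ascending docID order.
   Writes the common docIDs to `out` (room for min(n1, n2) entries)
   and returns how many were written. */
size_t intersect(const int *p1, size_t n1, const int *p2, size_t n2, int *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < n1 && j < n2) {
        if (p1[i] == p2[j]) {          /* docID present in both lists */
            out[k++] = p1[i];
            i++;
            j++;
        } else if (p1[i] < p2[j]) {    /* advance the list with the smaller docID */
            i++;
        } else {
            j++;
        }
    }
    return k;
}
```

Each iteration advances at least one of the two indices, so the loop runs at most n1 + n2 times. Skipping the smaller docID is safe only because the lists are sorted: the skipped value cannot occur later in the other list.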
