
Modeling the Internet and the Web: Text Analysis



  1. Modeling the Internet and the Web: Text Analysis

  2. Outline • Indexing • Lexical processing • Content-based ranking • Probabilistic retrieval • Latent semantic analysis • Text categorization • Exploiting hyperlinks • Document clustering • Information extraction

  3. Information Retrieval • Analyzing the textual content of individual Web pages • given user’s query • determine a maximally related subset of documents • Retrieval • index a collection of documents (access efficiency) • rank documents by importance (accuracy) • Categorization (classification) • assign a document to one or more categories

  4. Indexing • Inverted index • effective for very large collections of documents • associates lexical items to their occurrences in the collection • Terms ω • lexical items: words or expressions • Vocabulary V • the set of terms of interest

  5. Inverted Index • The simplest example: a dictionary • each key is a term ω ∈ V • the associated value b(ω) points to a bucket (posting list) • a bucket is a list of pointers marking all occurrences of ω in the text collection

  6. Inverted Index • Bucket entries: • document identifier (DID) • the ordinal number within the collection • separate entry for each occurrence of the term • DID • offset (in characters) of term’s occurrence within this document • present a user with a short context • enables vicinity queries

  7. Inverted Index

  8. Inverted Index Construction • Parse documents • Extract terms ωi • if ωi is not present, insert ωi in the inverted index • Insert the occurrence in the bucket
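A minimal Python sketch of the construction loop on this slide; the dictionary-of-buckets layout, the whitespace tokenization, and the sample documents are illustrative assumptions rather than a reference implementation.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to its bucket: a list of (DID, character offset) entries."""
    index = defaultdict(list)                    # term -> posting list
    for did, text in enumerate(documents):
        offset = 0
        for token in text.split():               # naive whitespace tokenization
            index[token.lower()].append((did, offset))   # one entry per occurrence
            offset += len(token) + 1              # skip past the token and the space
    return index

docs = ["web graphs and web pages", "modeling the web"]
index = build_inverted_index(docs)
print(index["web"])                              # [(0, 0), (0, 15), (1, 13)]
```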

  9. Searching with Inverted Index • To find a term ω in an indexed collection of documents • obtain b(ω) from the inverted index • scan the bucket to obtain the list of occurrences • To find k terms • get k lists of occurrences • combine lists by elementary set operations
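Continuing the sketch above (and assuming the same `index` structure), an AND query over k terms reduces to intersecting the sets of document identifiers drawn from the k buckets:

```python
def search_all(index, terms):
    """AND query: DIDs of documents containing every term (set intersection)."""
    did_sets = [{did for did, _ in index.get(term, [])} for term in terms]
    return set.intersection(*did_sets) if did_sets else set()

print(search_all(index, ["web", "graphs"]))      # {0}
```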

  10. Inverted Index Implementation • Size = Θ(|V|) • Implemented using a hash table • Buckets stored in memory • construction algorithm is trivial • Buckets stored on disk • impractical due to disk access time • use specialized secondary memory algorithms

  11. Bucket Compression • Reduce memory for each pointer in the buckets: • for each term sort occurrences by DID • store as a list of gaps - the sequence of differences between successive DIDs • Advantage – significant memory saving • frequent terms produce many small gaps • small integers encoded by short variable-length codewords • Example: the sequence of DIDs: (14, 22, 38, 42, 66, 122, 131, 226 ) a sequence of gaps: (14, 8, 16, 4, 24, 56, 9, 95)
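A sketch of the gap transformation on the slide's example DIDs, plus one common variable-length codeword scheme (variable-byte, 7 data bits per byte); the slide does not prescribe a particular code, so the encoding below is an assumption for illustration.

```python
def gaps(dids):
    """Sorted DIDs -> first DID followed by the differences between successive DIDs."""
    return [dids[0]] + [b - a for a, b in zip(dids, dids[1:])]

def vbyte(n):
    """Variable-byte code: 7 data bits per byte, high bit set on the final byte."""
    out = []
    while True:
        out.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    out[-1] |= 0x80
    return bytes(out)

dids = [14, 22, 38, 42, 66, 122, 131, 226]
print(gaps(dids))                                     # [14, 8, 16, 4, 24, 56, 9, 95]
print(len(b"".join(vbyte(g) for g in gaps(dids))))    # 8 bytes: every gap fits in one byte
```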

  12. Lexical Processing • Performed prior to indexing or converting documents to vector representations • Tokenization • extraction of terms from a document • Text conflation and vocabulary reduction • Stemming • reducing words to their root forms • Removing stop words • common words, such as articles, prepositions, non-informative adverbs • 20-30% index size reduction

  13. Tokenization • Extraction of terms from a document • stripping out • administrative metadata • structural or formatting elements • Example • removing HTML tags • removing punctuation and special characters • folding character case (e.g. all to lower case)
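A minimal, regex-based sketch of the steps listed on this slide (tag removal, punctuation stripping, case folding); real tokenizers also handle HTML entities, scripts, hyphenation, and similar details.

```python
import re

def tokenize(html):
    """Strip HTML tags, drop punctuation/special characters, fold case, split on whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)               # remove HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # punctuation removal + case folding
    return text.split()

print(tokenize("<p>Web graphs, and Web pages!</p>"))
# ['web', 'graphs', 'and', 'web', 'pages']
```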

  14. Stemming • Want to reduce all morphological variants of a word to a single index term • e.g. a document containing words like fish and fisher may not be retrieved by a query containing fishing (no fishing explicitly contained in the document) • Stemming - reduce words to their root form • e.g. fish – becomes a new index term • Porter stemming algorithm (1980) • relies on a preconstructed suffix list with associated rules • e.g. if suffix=IZATION and prefix contains at least one vowel followed by a consonant, replace with suffix=IZE • BINARIZATION => BINARIZE
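A sketch of just the single Porter-style rule quoted on the slide (IZATION → IZE when the prefix contains a vowel followed by a consonant); the full Porter algorithm applies many such suffix rules in ordered steps.

```python
import re

def strip_ization(word):
    """Apply one Porter-style rule: IZATION -> IZE if the prefix has a vowel then a consonant."""
    w = word.lower()
    if w.endswith("ization"):
        prefix = w[: -len("ization")]
        if re.search(r"[aeiou][^aeiou]", prefix):      # vowel followed by a consonant
            return prefix + "ize"
    return w

print(strip_ization("BINARIZATION"))                   # binarize
```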

  15. Content-Based Ranking • A Boolean query often matches a very large number of documents • e.g., the Google query ‘Web AND graphs’ returns about 4,040,000 matches • Problem • the user can examine only a fraction of the results • Content-based ranking • arrange the results in order of relevance to the user

  16. Choice of Weights • What weights retrieve the most relevant pages?

  17. Vector-space Model • Text documents are mapped to a high-dimensional vector space • Each document d is represented as a sequence of terms ω(t): d = (ω(1), ω(2), ω(3), …, ω(|d|)) • The unique terms in the set of documents determine the dimension of the vector space

  18. Example • Boolean representation of vectors: • V = [ web, graph, net, page, complex ] • V1 = [1 1 0 0 0] • V2 = [1 1 1 0 0] • V3 = [1 0 0 1 1]

  19. Vector-space Model • ω1, ω2 and ω3 are terms in the documents; x′ and x″ are document vectors • Vector-space representations are sparse: |V| >> |d|

  20. Term frequency (TF) • A term that appears many times within a document is likely to be more important than a term that appears only once • nij - number of occurrences of term ωj in document di • Term frequency: TF_ij = n_ij / Σ_k n_ik

  21. Inverse document frequency (IDF) • A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents • nj - number of documents that contain term ωj • n - total number of documents in the set • Inverse document frequency: IDF_j = log(n / n_j)

  22. Inverse document frequency (IDF)

  23. Full Weighting (TF-IDF) • The TF-IDF weight of term ωj in document di is x_ij = TF_ij · IDF_j
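A short sketch that combines the TF and IDF definitions above into document weight vectors; it assumes the length-normalized TF and natural-logarithm IDF written on the preceding slides, and the toy corpus is invented for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: TF-IDF weight} dict per document."""
    n = len(documents)
    df = Counter()                         # n_j: number of documents containing term j
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)              # n_ij
        total = sum(counts.values())       # sum_k n_ik
        weights.append({t: (c / total) * math.log(n / df[t]) for t, c in counts.items()})
    return weights

docs = [["web", "graph", "web"], ["web", "page"], ["complex", "graph"]]
for w in tf_idf(docs):
    print({t: round(v, 3) for t, v in w.items()})
```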

  24. Document Similarity • Ranks documents by measuring the similarity between each document and the query • The similarity between two documents d′ and d″ is a function s(d′, d″) ∈ R • In a vector-space representation, the cosine coefficient of two document vectors is a measure of similarity

  25. Cosine Coefficient • The cosine of the angle formed by two document vectors x′ and x″ is cos(x′, x″) = x′ · x″ / (||x′|| ||x″||) • Documents with many common terms will have vectors closer to each other than documents with fewer overlapping terms
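A sketch of the cosine coefficient on sparse term-weight vectors; the three example vectors correspond to the Boolean vectors V1, V2 and V3 from the earlier example slide.

```python
import math

def cosine(x1, x2):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * x2.get(t, 0.0) for t, w in x1.items())
    n1 = math.sqrt(sum(w * w for w in x1.values()))
    n2 = math.sqrt(sum(w * w for w in x2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

v1 = {"web": 1, "graph": 1}
v2 = {"web": 1, "graph": 1, "net": 1}
v3 = {"web": 1, "page": 1, "complex": 1}
print(round(cosine(v1, v2), 3), round(cosine(v1, v3), 3))   # 0.816 0.408
```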

  26. Retrieval and Evaluation • Compute document vectors for a set of documents D • Find the vector associated with the user query q • Using s(xi, q), i = 1, …, n, assign a similarity score to each document • Retrieve the top-ranking documents R • Compare R with R* - the documents actually relevant to the query

  27. Retrieval and Evaluation Measures • Precision (π) - fraction of retrieved documents that are actually relevant: π = |R ∩ R*| / |R| • Recall (ρ) - fraction of relevant documents that are retrieved: ρ = |R ∩ R*| / |R*|
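A minimal sketch of the two measures, treating R and R* as sets of document identifiers (the example ID sets are invented):

```python
def precision_recall(retrieved, relevant):
    """retrieved: set R returned by the system; relevant: set R* of truly relevant documents."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall({1, 2, 3, 4}, {2, 4, 5}))   # (0.5, 0.666...)
```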

  28. Probabilistic Retrieval • Probabilistic Ranking Principle (PRP) (Robertson, 1977) • rank the documents in order of decreasing probability of relevance to the user query • probabilities are estimated as accurately as possible on the basis of the available data • the overall effectiveness of such a system will be the best obtainable

  29. Probabilistic Model • PRP can be stated by introducing a Boolean variable R (relevance) for a document d, for a given user query q as P(R | d,q) • Documents should be retrieved in order of decreasing probability • d - document that has not yet been retrieved

  30. Latent Semantic Analysis • Why need it? • serious problems for retrieval methods based on term matching • vector-space similarity approach works only if the terms of the query are explicitly present in the relevant documents • rich expressive power of natural language • often queries contain terms that express concepts related to text to be retrieved

  31. Synonymy and Polysemy • Synonymy • the same concept can be expressed using different sets of terms • e.g. bandit, brigand, thief • negatively affects recall • Polysemy • identical terms can be used in very different semantic contexts • e.g. bank • repository where important material is saved • the slope beside a body of water • negatively affects precision

  32. Latent Semantic Indexing(LSI) • A statistical technique • Uses linear algebra technique called singular value decomposition (SVD) • attempts to estimate the hidden structure • discovers the most important associative patterns between words and concepts • Data driven

  33. LSI and Text Documents • Let X denote a term-document matrix X = [x1 . . . xn]^T • each row is the vector-space representation of a document • each column contains the occurrences of a term in each document of the dataset • Latent semantic indexing • compute the SVD of X: X = U Σ V^T • Σ - singular value matrix • set to zero all but the largest K singular values, obtaining Σ̂ • obtain the reconstruction of X by X̂ = U Σ̂ V^T
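A sketch of the rank-K reconstruction using numpy's SVD (numpy is an assumed dependency; the toy matrix is invented, with rows as documents as on this slide):

```python
import numpy as np

def lsi_reconstruct(X, K):
    """Rank-K reconstruction of a document-term matrix X via truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_hat = np.zeros_like(s)
    s_hat[:K] = s[:K]                       # keep only the K largest singular values
    return U @ np.diag(s_hat) @ Vt

X = np.array([[1, 1, 0, 0],                 # rows = documents, columns = terms
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]], dtype=float)
print(np.round(lsi_reconstruct(X, K=2), 2))
```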

  34. LSI Example • A collection of documents: d1: Indian government goes for open-source software d2: Debian 3.0 Woody released d3: Wine 2.0 released with fixes for Gentoo 1.4 and Debian 3.0 d4: gnuPOD released: iPOD on Linux… with GPLed software d5: Gentoo servers running at open-source mySQL database d6: Dolly the sheep not totally identical clone d7: DNA news: introduced low-cost human genome DNA chip d8: Malaria-parasite genome database on the Web d9: UK sets up genome bank to protect rare sheep breeds d10: Dolly’s DNA damaged

  35. LSI Example • The term-document matrix X^T
               d1 d2 d3 d4 d5 d6 d7 d8 d9 d10
  open-source   1  0  0  0  1  0  0  0  0  0
  software      1  0  0  1  0  0  0  0  0  0
  Linux         0  0  0  1  0  0  0  0  0  0
  released      0  1  1  1  0  0  0  0  0  0
  Debian        0  1  1  0  0  0  0  0  0  0
  Gentoo        0  0  1  0  1  0  0  0  0  0
  database      0  0  0  0  1  0  0  1  0  0
  Dolly         0  0  0  0  0  1  0  0  0  1
  sheep         0  0  0  0  0  1  0  0  0  0
  genome        0  0  0  0  0  0  1  1  1  0
  DNA           0  0  0  0  0  0  2  0  0  1

  36. LSI Example • The reconstructed term-document matrix after projecting on a subspace of dimension K=2
  Σ = diag(2.57, 2.49, 1.99, 1.9, 1.68, 1.53, 0.94, 0.66, 0.36, 0.10)
                  d1    d2    d3    d4    d5    d6    d7    d8    d9    d10
  open-source    0.34  0.28  0.38  0.42  0.24  0.00  0.04  0.07  0.02  0.01
  software       0.44  0.37  0.50  0.55  0.31 -0.01 -0.03  0.06  0.00 -0.02
  Linux          0.44  0.37  0.50  0.55  0.31 -0.01 -0.03  0.06  0.00 -0.02
  released       0.63  0.53  0.72  0.79  0.45 -0.01 -0.05  0.09 -0.00 -0.04
  Debian         0.39  0.33  0.44  0.48  0.28 -0.01 -0.03  0.06  0.00 -0.02
  Gentoo         0.36  0.30  0.41  0.45  0.26  0.00  0.03  0.07  0.02  0.01
  database       0.17  0.14  0.19  0.21  0.14  0.04  0.25  0.11  0.09  0.12
  Dolly         -0.01 -0.01 -0.01 -0.02  0.03  0.08  0.45  0.13  0.14  0.21
  sheep         -0.00 -0.00 -0.00 -0.01  0.03  0.06  0.34  0.10  0.11  0.16
  genome         0.02  0.01  0.02  0.01  0.10  0.19  1.11  0.34  0.36  0.53
  DNA           -0.03 -0.04 -0.04 -0.06  0.11  0.30  1.70  0.51  0.55  0.81

  37. Probabilistic LSA • Aspect model (aggregate Markov model) • let an event be the occurrence of a term ω in a document d • let z ∈ {z1, … , zK} be a latent (hidden) variable associated with each event • the probability of each event (ω, d) is P(ω, d) = P(d) Σ_z P(z|d) P(ω|z), corresponding to the generative process: • select a document d from a density P(d) • select a latent concept z with probability P(z|d) • choose a term ω, sampling from P(ω|z)

  38. Aspect Model Interpretation • In a probabilistic latent semantic space • each document is a vector • uniquely determined by the mixing coordinates P(zk|d), k = 1,…,K • i.e., rather than being represented through terms, a document is represented through latent variables that in turn are responsible for generating terms.

  39. Analogy with LSI • P = U Σ V^T holds for the matrix P of all n × m document-term joint probabilities, with • uik = P(di|zk) • vjk = P(ωj|zk) • Σkk = P(zk) • P is a properly normalized probability distribution • its entries are nonnegative

  40. Fitting the Parameters • Parameters estimated by maximum likelihood using EM • E step: P(z|d,ω) = P(z|d) P(ω|z) / Σ_z′ P(z′|d) P(ω|z′) • M step: P(ω|z) ∝ Σ_d n(d,ω) P(z|d,ω) and P(z|d) ∝ Σ_ω n(d,ω) P(z|d,ω)
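A compact numpy sketch of the EM iteration for the aspect model, under the P(d), P(z|d), P(ω|z) parameterization of the earlier slide; the update equations are the standard pLSA ones, and the random initialization and toy count matrix are assumptions.

```python
import numpy as np

def plsa(N, K, iters=100, seed=0):
    """N: (n_docs x n_terms) term-count matrix. Returns estimates of P(z|d) and P(w|z)."""
    rng = np.random.default_rng(seed)
    n_d, n_w = N.shape
    Pz_d = rng.random((n_d, K)); Pz_d /= Pz_d.sum(1, keepdims=True)   # P(z|d)
    Pw_z = rng.random((K, n_w)); Pw_z /= Pw_z.sum(1, keepdims=True)   # P(w|z)
    for _ in range(iters):
        # E step: responsibilities P(z|d,w), shape (docs, topics, terms)
        joint = Pz_d[:, :, None] * Pw_z[None, :, :]
        Pz_dw = joint / joint.sum(1, keepdims=True).clip(1e-12)
        # M step: re-estimate P(w|z) and P(z|d) from expected counts n(d,w) * P(z|d,w)
        counts = N[:, None, :] * Pz_dw
        Pw_z = counts.sum(0); Pw_z /= Pw_z.sum(1, keepdims=True).clip(1e-12)
        Pz_d = counts.sum(2); Pz_d /= Pz_d.sum(1, keepdims=True).clip(1e-12)
    return Pz_d, Pw_z

N = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 3, 1]], dtype=float)
Pz_d, Pw_z = plsa(N, K=2)
print(np.round(Pz_d, 2))
```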

  41. Text Categorization • Grouping textual documents into different fixed classes • Examples • predict a topic of a Web page • decide whether a Web page is relevant with respect to the interests of a given user • Machine learning techniques • k nearest neighbors (k-NN) • Naïve Bayes • support vector machines

  42. k Nearest Neighbors • Memory based • learns by memorizing all the training instances • Prediction of x’s class • measure distances between x and all training instances • return a set N(x,D,k) of the k points closest to x • predict a class for x by majority voting • Performs well in many domains • asymptotic error rate of the 1-NN classifier is always less than twice the optimal Bayes error
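A small sketch of the memory-based prediction step; the squared Euclidean distance and the toy training set are arbitrary choices for illustration (in text categorization the vectors would typically be TF-IDF representations).

```python
from collections import Counter

def knn_predict(x, train, k=3):
    """train: list of (vector, label) pairs. Majority vote among the k nearest points."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    neighbours = sorted(train, key=lambda item: dist(x, item[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [((1, 1, 0), "web"), ((1, 0, 1), "web"), ((0, 1, 1), "bio"), ((0, 0, 1), "bio")]
print(knn_predict((1, 1, 1), train, k=3))      # 'web'
```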

  43. Naïve Bayes • Estimates the conditional probability of the class given the document, P(c|d,θ) ∝ P(c|θ) P(d|c,θ) • θ - parameters of the model • P(d) – normalization factor (Σc P(c|d) = 1) • classes are assumed to be mutually exclusive • Assumption: the terms in a document are conditionally independent given the class • false, but often adequate • gives a reasonable approximation • we are interested in discrimination among classes

  44. Bernoulli Model • An event – a document as a whole • a bag of words • words are attributes of the event • each vocabulary term ω is a Bernoulli attribute • 1, if ω is in the document • 0, otherwise • binary attributes are mutually independent given the class • the class is the only cause of appearance of each word in a document

  45. Bernoulli Model • Generating a document • tossing |V| independent coins • the occurrence of each word in a document is a Bernoulli event • xj = 1 [0] - ωj does [does not] occur in d • P(ωj|c) – probability of observing ωj in documents of class c • P(d|c) = Πj [ xj P(ωj|c) + (1 - xj)(1 - P(ωj|c)) ]
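A sketch of the Bernoulli document (log-)likelihood written as a product over the vocabulary; the vocabulary and parameter values below are invented for illustration.

```python
import math

def bernoulli_log_likelihood(doc_terms, theta_c, vocabulary):
    """log P(d|c) for the Bernoulli model: one independent coin per vocabulary term.
    theta_c[w] = P(w|c); doc_terms: the set of terms present in the document."""
    log_p = 0.0
    for w in vocabulary:
        p = theta_c[w]
        log_p += math.log(p if w in doc_terms else 1.0 - p)
    return log_p

vocab = ["web", "graph", "genome", "dna"]
theta_web = {"web": 0.8, "graph": 0.6, "genome": 0.1, "dna": 0.1}
print(round(bernoulli_log_likelihood({"web", "graph"}, theta_web, vocab), 3))   # -0.945
```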

  46. Multinomial Model • Document – a sequence of events W1,…,W|d| • Takes into account • the number of occurrences of each word • the length of the document • the serial order among words would be significant (it could be modeled with a Markov chain) • but word occurrences are assumed independent – the bag-of-words representation

  47. Multinomial Model • Generating a document • throwing a die with |V| faces |d| times • the occurrence of each word is a multinomial event • nj is the number of occurrences of ωj in d • P(ωj|c) – probability that ωj occurs at any position t ∈ [1,…,|d|] • P(d|c) = G Πj P(ωj|c)^nj • G – normalization constant

  48. Learning Naïve Bayes • Estimate the parameters θ from the available data • The training data set is a collection of labeled documents { (di, ci), i = 1,…,n }

  49. Learning Bernoulli Model • θc,j = P(ωj|c), j = 1,…,|V|, c = 1,…,K • estimated as θ̂c,j = (1/Nc) Σ{i: ci=c} xij • Nc = |{ i : ci = c }| • xij = 1 if ωj occurs in di • class prior probabilities θc = P(c) • estimated as θ̂c = Nc / n

  50. Learning Multinomial Model • Generative parameters θc,j = P(ωj|c) • must satisfy Σj θc,j = 1 for each class c • Distributions of terms given the class: θ̂c,j = (αqj + Σ{i: ci=c} nij) / (α + Σj Σ{i: ci=c} nij) • qj and α are hyperparameters of the Dirichlet prior • nij is the number of occurrences of ωj in di • Unconditional class probabilities: θ̂c = Nc / n
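A sketch of multinomial Naïve Bayes training with smoothed term estimates; it uses a uniform q_j with a single α (Laplace-style smoothing, one concrete instance of the Dirichlet prior mentioned above), and the toy corpus is invented.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: one class label per document.
    Returns class log-priors and smoothed log P(w|c) (uniform q_j, add-alpha smoothing)."""
    vocab = {w for d in docs for w in d}
    log_prior, log_theta = {}, defaultdict(dict)
    for c in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))       # N_c / n
        counts = Counter(w for d in class_docs for w in d)          # sum_i n_ij over class c
        total = sum(counts.values())
        for w in vocab:
            log_theta[c][w] = math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
    return log_prior, log_theta

docs = [["web", "graph", "web"], ["web", "page"], ["genome", "dna"], ["dna", "sheep"]]
labels = ["tech", "tech", "bio", "bio"]
log_prior, log_theta = train_multinomial_nb(docs, labels)
print(round(math.exp(log_theta["tech"]["web"]), 3))                 # smoothed P(web|tech) = 0.364
```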
