
A Neural Network Approach to Topic Spotting


Presentation Transcript


  1. A Neural Network Approach to Topic Spotting Presented by: Loulwah AlSumait INFS 795 Spec. Topics in Data Mining 4.14.2005

  2. Article Information • Published in • Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, 1995 • Authors • Wiener, E. • Pedersen, J.O. • Weigend, A.S. • 54 citations

  3. Summary • Introduction • Related Work • The Corpus • Representation • Term Selection • Latent Semantic Indexing • Generic LSI • Local LSI • Cluster-Directed LSI • Topic-Directed LSI • Relevancy Weighting LSI

  4. Summary • Neural Network Classifier • Neural Networks for Topic Spotting • Linear vs. Non-Linear Networks • Flat Architecture vs. Modular Architecture • Experimental Results • Evaluating Performance • Results & Discussion

  5. Introduction • Topic Spotting = Text Categorization = Text Classification • The problem of identifying which of a set of predefined topics are present in a natural language document. [Diagram: a single document mapped to Topic 1, Topic 2, …, Topic n]

  6. Introduction • Classification Approaches • Expert system approach: • manually construct a system of inference rules on top of a large body of linguistic and domain knowledge • can be extremely accurate • very time consuming • brittle to changes in the data environment • Data-driven approach: • induce a set of rules from a corpus of labeled training documents • better in practice

  7. Introduction – Related Work • Key observations from the related work: • A separate classifier was constructed for each topic. • A different set of terms was used to train each classifier.

  8. Introduction – The Corpus • Reuters-22173 corpus of Reuters newswire stories from 1987 • 21,450 stories • 9,610 for training • 3,662 for testing • mean length: 90.6 words, SD 91.6 • 92 topics appeared at least once in the training set • mean of 1.24 topics/doc (up to 14 topics for some documents) • 11,161 unique terms after preprocessing: • inflectional stemming • stop-word removal • conversion to lower case • elimination of words that appeared in fewer than three documents (a sketch of this pipeline follows below)
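
Below is a minimal Python sketch of this preprocessing pipeline. It is illustrative only: the NLTK Porter stemmer, stop-word list, and whitespace tokenizer are stand-ins for whatever tooling the authors actually used, and `preprocess` / `build_vocabulary` are hypothetical helper names.

```python
# Sketch of the corpus preprocessing described above; the NLTK stemmer and
# stop-word list are stand-ins, not necessarily what the authors used.
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))
STEM = PorterStemmer().stem

def preprocess(doc: str) -> list[str]:
    """Lower-case, remove stop words, and stem a raw document."""
    tokens = doc.lower().split()
    return [STEM(t) for t in tokens if t not in STOP]

def build_vocabulary(docs: list[list[str]], min_docs: int = 3) -> list[str]:
    """Keep only terms that appear in at least `min_docs` documents."""
    doc_freq = Counter(t for doc in docs for t in set(doc))
    return sorted(t for t, df in doc_freq.items() if df >= min_docs)
```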

  9. Representations • Starting point: • Document Profile: a term-by-document matrix containing word-frequency entries

  10. 3/ 33 1/ 33 1/ 33 2/ 33 Representation Thorsten Joachims. 1997. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. http://citeseer.ist.psu.edu/joachims97text.html

  11. Representation - Term Selection • Select the subset of the original terms that is most useful for the classification task • It is difficult to select terms that discriminate between 92 classes while keeping the set small enough to serve as the feature set for a neural network • Divide the problem into 92 independent classification tasks • Search for the terms that best discriminate between documents with the topic and those without

  12. Representation - Term Selection • Relevancy Score: measures how unbalanced a term is across documents with and without the topic: rs(k, t) = log [ P(k | t) / P(k | not t) ], where P(k | t) = (No. of docs w/ topic t that contain term k) / (Total No. of docs w/ topic t), and P(k | not t) is the same fraction over documents without the topic • Highly positive and highly negative scores indicate useful terms for discrimination • Using about 20 terms yielded the best classification performance (a sketch follows below)
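
A small sketch of this score, assuming the log-odds form implied by the slide's two fractions; the smoothing constant `eps` is my assumption to avoid log(0) and is not specified on the slide.

```python
import numpy as np

def relevancy_score(A, topic_mask, k, eps=1e-4):
    """Log-odds of term k appearing in documents with vs. without topic t.

    A:          term-by-document matrix (row k marks term k's occurrences)
    topic_mask: boolean vector, True where a document carries topic t
    eps:        smoothing constant (an assumption; the slide omits it)
    """
    present = A[k] > 0
    p_with = present[topic_mask].mean()      # docs w/ topic & term / docs w/ topic
    p_without = present[~topic_mask].mean()  # same fraction for docs w/o topic
    return np.log((p_with + eps) / (p_without + eps))
```

Since both strongly positive and strongly negative scores mark useful discriminators, terms would be ranked by |rs| and the top ~20 kept.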

  13. Representation - Term Selection

  14. Representation - Term Selection • Advantages: • little computation is required • the resulting features have direct interpretability • Drawbacks: • many of the best individual predictors contain redundant information • a term that appears to be a very poor predictor on its own may turn out to have great discriminative power in combination with other terms, and vice versa • e.g. Apple vs. Apple Computers • Selected Term Representation (TERMS) with 20 features

  15. Representation – LSI • Transform the original documents to a lower-dimensional space by analyzing the correlational structure of terms in the document collection • (Training set): apply a singular-value decomposition (SVD) to the original term-by-document matrix to get U, Σ, V • (Test set): transform document vectors by projecting them into the LSI space • Property of LSI: higher dimensions capture less of the variance of the original data, so they can be dropped with minimal loss • Found: performance continues to improve up to at least 250 dimensions • Improvement slows rapidly after about 100 dimensions • Generic LSI Representation (LSI) with 200 features (see the sketch below)
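
A minimal NumPy sketch of generic LSI: truncate the SVD of the training matrix, then fold test documents into the LSI space. The `fit_lsi` / `project` names are hypothetical; the fold-in formula Σ⁻¹Uᵀd is the standard one for LSI.

```python
import numpy as np

def fit_lsi(A, k=200):
    """Truncated SVD of the term-by-document matrix A (terms x docs)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k]           # term-space basis and singular values

def project(U_k, s_k, d):
    """Fold a (test) document vector d into the k-dim LSI space."""
    return (U_k.T @ d) / s_k         # standard fold-in: Sigma^-1 U^T d
```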

  16. Representation – LSI [Diagram: the full Reuters corpus (topics such as Wool, Wheat, Money-supply, Gold, Barley, Zinc) passed through a single SVD to produce the Generic LSI Representation with 200 features]

  17. Representation – Local LSI • Global LSI performs worse as topic frequency decreases • infrequent topics are usually indicated by infrequent terms, and infrequent terms may be projected out of the LSI space as mere noise • The authors propose two task-directed methods that make use of prior knowledge of the classification task

  18. Representation – Local LSI • What is Local LSI? • model only the local portion of the corpus related to the topics of interest • include documents that use terminology related to the topics (not necessarily having any of the topics assigned) • perform the SVD over only the local set of documents • the representation is more sensitive to small, localized effects of infrequent terms • the representation is more effective for classifying topics related to that local structure

  19. Representation – Local LSI • Types of Local LSI: • Cluster-Directed representation • 5 meta-topics (clusters): Agriculture, Energy, Foreign Exchange, Government, and Metals • How to construct the local regions? • Break the corpus into 5 clusters, each containing all documents on the corresponding meta-topic • Perform an SVD for each meta-topic region • Cluster-Directed LSI Representation (CD/LSI) with 200 features

  20. Representation – Local LSI [Diagram: as before, the whole Reuters corpus (Wool, Wheat, Money-supply, Gold, Barley, Zinc) feeding a single SVD]

  21. Representation – Local LSI [Diagram: the Reuters corpus split into five meta-topic clusters (Government, Agriculture, Foreign Exchange, Metal, Energy), with a separate SVD applied to each, e.g. Agriculture (Wool, Wheat, Barley), Foreign Exchange (Money-supply), Metal (Gold, Zinc); the result is the Cluster-Directed LSI Representation (CD/LSI) with 200 features]

  22. Representation – Local LSI • Types of Local LSI: • Topic-Directed representation • a more fine-grained approach to local LSI • a separate representation for each topic • How to construct the local region? • Use the 100 most predictive terms for the topic • Pick the N most similar documents, N = 5 × No. of documents containing the topic, clipped to 110 ≤ N ≤ 350 • Final documents in the topic region = the N documents + 150 random documents • Topic-Directed LSI Representation (TD/LSI) with 200 features (see the sketch below)
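
A sketch of the topic-region construction under stated assumptions: the slide does not say how "most similar" is measured, so documents are scored here by dot product with the topic centroid over the 100 predictive terms, and `topic_region` is a hypothetical helper name.

```python
import numpy as np

def topic_region(A, topic_mask, top_terms, rng, n_random=150):
    """Select the documents for one topic-directed local LSI region.

    A:          term-by-document matrix
    topic_mask: boolean vector, True where a document carries the topic
    top_terms:  row indices of the 100 most predictive terms for the topic
    rng:        a numpy Generator, e.g. np.random.default_rng(0)
    """
    # N = 5 x (No. of documents containing the topic), clipped to [110, 350]
    N = int(np.clip(5 * topic_mask.sum(), 110, 350))
    # Rank documents by similarity over the predictive terms; the dot product
    # with the topic centroid is an assumption, not the paper's stated measure
    centroid = A[top_terms][:, topic_mask].mean(axis=1)
    sims = centroid @ A[top_terms]
    local = np.argsort(-sims)[:N]
    # Final region = the N most similar documents + 150 random documents
    extra = rng.choice(A.shape[1], size=n_random, replace=False)
    return np.union1d(local, extra)
```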

  23. Representation – Local LSI [Diagram: as before, the whole Reuters corpus (Wool, Wheat, Money-supply, Gold, Barley, Zinc) feeding a single SVD]

  24. Representation – Local LSI [Diagram: a separate SVD per topic (Wool, Wheat, Money-supply, Zinc, Barley, Gold), yielding the Topic-Directed LSI Representation (TD/LSI) with 200 features]

  25. Representation – Local LSI • Drawbacks of Local LSI: • The narrower the region, the lower the flexibility of the representation for modeling the classification of multiple topics • High computational overhead

  26. Representation - Relevancy Weighting LSI • Use term weights to emphasize the importance of particular terms before applying the SVD • IDF weighting • increases the importance of low-frequency terms • decreases the importance of high-frequency terms • Assumes low-frequency terms are better discriminators than high-frequency terms

  27. Representation - Relevancy Weighting LSI • Relevancy Weighting • tunes the IDF assumption • emphasizes terms in proportion to their estimated topic-discrimination power • Global Relevancy Weighting of term k (GRWk) • Final weighting of term k = IDF² × GRWk • all low-frequency terms are pulled up by IDF • poor predictors are pushed down • leaving only the relevant low-frequency terms with high weights • Relevancy Weighted LSI Representation (REL/LSI) with 200 features (see the sketch below)
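
A sketch of the final weighting, assuming a common IDF variant; the exact IDF and GRWk definitions used in the paper are not reproduced on the slide, so the GRW values are taken as given here.

```python
import numpy as np

def final_weights(doc_freq, n_docs, grw):
    """Final weight of term k = IDF(k)^2 * GRW(k).

    doc_freq: number of documents containing each term
    grw:      global relevancy weight per term (definition not on the slide)
    """
    idf = np.log(n_docs / doc_freq)   # a common IDF variant (an assumption)
    return idf**2 * grw
```

Each term's row of the term-by-document matrix would then be scaled by its final weight before the SVD is applied.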

  28. Neural Network Classifier (NN) • NN consists of: • processing units (Neurons) • weighted links connecting neurons

  29. Neural Network Classifier (NN) • major components of NN model: • architecture: defines the functional form relating input to output • network topology • unit connectivity • activation functions: e.g. Logistic regression fn.

  30. Neural Network Classifier (NN) • Logistic regression function: p = 1 / (1 + e^(-z)) • z is a linear combination of the input features • p ∈ (0, 1) • can be converted to a binary classification method by thresholding the output probability

  31. Neural Network Classifier (NN) • Major components of the NN model (cont.): • search algorithm: the search in weight space for a set of weights that minimizes the error between the actual output and the expected output (the TRAINING PROCESS) • backpropagation method • mean squared error • cross-entropy error performance function: C = -Σ over all cases and outputs of [ d · log(y) + (1 - d) · log(1 - y) ], where d is the desired output and y is the actual output (a sketch follows below)
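
A minimal sketch of the logistic unit from slide 30 and the cross-entropy cost as written above; the `eps` clipping is my addition to guard against log(0).

```python
import numpy as np

def logistic(z):
    """p = 1 / (1 + exp(-z)), with z a linear combination of the inputs."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(d, y, eps=1e-12):
    """C = -sum over cases/outputs of d*log(y) + (1-d)*log(1-y)."""
    y = np.clip(y, eps, 1 - eps)      # guard against log(0); my addition
    return -np.sum(d * np.log(y) + (1 - d) * np.log(1 - y))
```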

  32. NN for Topic Spotting • Network outputs are estimates of the probability of topic presence given the feature vector of a document • Generic LSI representation: each network uses the same representation • Local LSI representation: a different representation for each network

  33. NN for Topic Spotting • Linear NN • Output units with logistic activation and no hidden layer [Diagram: input features feeding output units 1, 2, …, n directly]

  34. NN for Topic Spotting • Non-Linear NN • Simple networks with a single hidden layer of 6 to 15 logistic sigmoid units (see the sketch below)
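
A sketch of the forward pass of such a network: one hidden layer of logistic sigmoid units feeding a single logistic output that estimates the probability of topic presence. The parameter names are mine.

```python
import numpy as np

def forward(x, W1, b1, w2, b2):
    """Single hidden layer of logistic sigmoid units (6-15 in the paper),
    one logistic output estimating P(topic | document features)."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))     # hidden activations
    return 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))  # topic probability
```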

  35. NN for Topic Spotting • Flat Architecture • A separate network for each topic • The entire training set is used to train each topic's network • Overfitting is avoided by: • adding a penalty term to the cross-entropy cost function to encourage the elimination of small weights • early stopping based on cross-validation

  36. NN for Topic Spotting • Modular Architecture • decomposes the learning problem into smaller problems • Meta-topic network trained on the full training set • estimates the probability that each of the five meta-topics is present in a document • uses 15 hidden units

  37. NN for Topic Spotting • Modular Architecture • five groups of local topic networks • one local topic network for each topic in the meta-topic • each network is trained only on its meta-topic region

  38. NN for Topic Spotting • Modular Architecture • five groups of local topic networks (cont.) • Example: the wheat network is trained on the Agriculture meta-topic region • It focuses on finer distinctions, e.g. wheat vs. grain • It does not waste time on easier distinctions, e.g. wheat vs. gold • Each local topic network uses 6 hidden units

  39. NN for Topic Spotting • Modular Architecture • To compute topic predictions for a given document: • present the document to the meta-topic network • present the document to each of the local topic networks • output of the meta-topic network × estimate of the local topic network = final topic estimate (see the sketch below)
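
A one-line sketch of this combination rule, reading the slide's product as gating the local estimate by the meta-topic estimate:

```python
def final_estimate(meta_prob: float, local_prob: float) -> float:
    """Final topic estimate = P(meta-topic | doc) * P(topic | doc, region)."""
    return meta_prob * local_prob

# e.g. if the meta-topic network gives P(Agriculture | doc) = 0.9 and the
# local wheat network gives 0.7, the final wheat estimate is 0.63
```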

  40. Experimental Results • Evaluating Performance • The mean squared error between actual and predicted values is a poor measure of classification performance • Instead, compute precision and recall from contingency tables constructed over a range of decision thresholds • How to get the decision thresholds?

  41. Experimental Results • Evaluating Performance • How to get the decision Thresholds? • Proportional assignment

  42. Experimental Results • Evaluating Performance • How to get the decision thresholds? • Fixed recall level approach: • determine a set of target recall levels • analyze the ranked documents to determine which decision thresholds lead to the desired recall levels (see the sketch below)
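
A sketch of fixed-recall thresholding, assuming documents are ranked by network output for one topic; `threshold_for_recall` is a hypothetical helper.

```python
import numpy as np

def threshold_for_recall(scores, labels, target):
    """Lowest decision threshold whose recall meets the target level.

    scores: network outputs for all test documents (one topic)
    labels: true binary topic assignments (numpy array of 0/1)
    """
    order = np.argsort(-scores)           # rank documents by score
    hits = np.cumsum(labels[order])       # relevant documents found so far
    recall = hits / labels.sum()          # nondecreasing down the ranking
    i = np.searchsorted(recall, target)   # first rank reaching the target
    return scores[order][min(i, len(scores) - 1)]
```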

  43. Experimental Results • Performance by Microaveraging • add all the contingency tables together across topics at a given threshold • compute precision and recall • proportional assignment is used for picking decision thresholds • does not weight the topics evenly • used for comparison with previously reported results • the breakeven point is used as a summary value (see the sketch below)
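
A sketch of microaveraging, assuming each topic's contingency table is summarized as (tp, fp, fn) counts at one threshold:

```python
def microaverage(tables):
    """Sum per-topic contingency tables, then compute precision and recall.

    tables: list of (tp, fp, fn) counts, one per topic, at one threshold
    """
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    fn = sum(t[2] for t in tables)
    return tp / (tp + fp), tp / (tp + fn)   # (precision, recall)
```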

  44. Experimental Results • Performance by Macroaveraging • compute precision and recall for each topic • take the average across topics • a fixed set of recall levels is used • summary values for particular topics are obtained by averaging precision over the 19 evenly spaced recall levels between 0.05 and 0.95 (see the sketch below)
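
A sketch of the macroaveraged summary value, assuming each topic provides a precision-at-recall function (e.g. from its interpolated precision-recall curve); `macro_summary` is a hypothetical helper.

```python
import numpy as np

def macro_summary(precision_at_recall):
    """Average precision over the 19 recall levels 0.05, 0.10, ..., 0.95
    per topic, then average across topics.

    precision_at_recall: dict mapping topic -> callable(recall) -> precision
    """
    levels = np.linspace(0.05, 0.95, 19)
    per_topic = [np.mean([p(r) for r in levels])
                 for p in precision_at_recall.values()]
    return float(np.mean(per_topic))
```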

  45. Experimental Results • Microaveraged performance • Breakeven points compared with the best previously reported algorithm, a rule induction method based on heuristic search with a breakeven point of 0.789 • The four representations reached breakeven points of 0.775, 0.795, 0.801, and 0.82 [chart of per-representation breakeven points]

  46. Experimental Results • Macroaveraged performance • TERMS appears much closer to the other three • The relative effectiveness of the representations at low recall levels is reversed at high recall levels

  47. Performance of the six techniques on the 54 most frequent topics • considerable variation in performance across topics • the relative ups and downs are mirrored in both plots • slight improvement from the nonlinear networks • LSI performance degrades relative to TERMS as topic frequency (ft) decreases

  48. Experimental Results • Performance of combinations of techniques and their improvements [Chart: match color and shape in the legend to identify each experiment]

  49. Experimental Results • Flat Networks

  50. Experimental Results • Modular Networks • only 4 of the clusters were used • average precision was recomputed for the flat networks for comparison
