A Neural Network Approach to Topic Spotting Presented by: Loulwah AlSumait INFS 795 Spec. Topics in Data Mining 4.14.2005
Article Information • Published in: Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, 1995 • Authors: Wiener, E., Pedersen, J.O., Weigend, A.S. • 54 citations
Summary • Introduction • Related Work • The Corpus • Representation • Term Selection • Latent Semantic Indexing • Generic LSI • Local LSI • Cluster-Directed LSI • Topic-Directed LSI • Relevancy Weighting LSI
Summary • Neural Network Classifier • Neural Networks for Topic Spotting • Linear vs. Non Linear Networks • Flat Architecture vs. Modular Architecture • Experiment Results • Evaluating Performance • Results & discussions
Introduction • Topic Spotting = Text Categorization = Text Classification • The problem of identifying which of a set of predefined topics are present in a natural language document. • Diagram: a document is mapped to one or more of topics 1 through n.
Introduction • Classification Approaches • Expert system approach: • manually construct a system of inference rules on top of a large body of linguistic and domain knowledge • can be extremely accurate • very time consuming • brittle to changes in the data environment • Data driven approach: • induce a set of rules from a corpus of labeled training documents • more practical in most settings
Introduction – Related Work • Key observations from the related work: • A separate classifier was constructed for each topic. • A different set of terms was used to train each classifier.
Introduction – The Corpus • Reuters-22173 corpus of Reuters newswire stories from 1987 • 21,450 stories • 9,610 for training • 3,662 for testing • mean length: 90.6 words, SD 91.6 • 92 topics appeared at least once in the training set. • The mean is 1.24 topics/doc. (up to 14 topics for some documents) • 11,161 unique terms after preprocessing: • inflectional stemming • stop word removal • conversion to lower case • elimination of words that appeared in fewer than three documents
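A minimal sketch of the preprocessing steps listed above. The stop-word list, the toy suffix-stripping stemmer (a stand-in for a real inflectional stemmer), and the function names are illustrative assumptions, not the authors' code:

    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "for", "on"}  # tiny illustrative list

    def naive_stem(word):
        # stand-in for a real inflectional stemmer (e.g. a Porter-style stemmer)
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def preprocess(raw_docs, min_docs=3):
        """Lower-case, stem, drop stop words, drop terms seen in fewer than min_docs documents."""
        docs = []
        for text in raw_docs:
            tokens = [naive_stem(w) for w in text.lower().split()
                      if w.isalpha() and w not in STOP_WORDS]
            docs.append(tokens)
        doc_freq = Counter(term for tokens in docs for term in set(tokens))
        vocab = sorted(t for t, df in doc_freq.items() if df >= min_docs)
        return docs, vocab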
Representations • Starting point: • Document Profile: a term-by-document matrix containing word frequency entries
Representation • Example document profile: a vector of within-document relative term frequencies (from Thorsten Joachims, 1997, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, http://citeseer.ist.psu.edu/joachims97text.html)
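A sketch of the document-profile matrix described on the previous slide, with entries normalized by document length as in the Joachims example; the docs/vocab inputs follow the preprocessing sketch earlier and are illustrative only:

    import numpy as np

    def term_by_document_matrix(docs, vocab):
        """Rows = terms, columns = documents; entry = term frequency / document length."""
        index = {term: i for i, term in enumerate(vocab)}
        A = np.zeros((len(vocab), len(docs)))
        for j, tokens in enumerate(docs):
            if not tokens:
                continue
            for term in tokens:
                i = index.get(term)
                if i is not None:
                    A[i, j] += 1.0
            A[:, j] /= len(tokens)     # relative frequency within the document
        return A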
Representation - Term Selection • Select the subset of the original terms that is most useful for the classification task. • It is difficult to select a term set that discriminates among 92 classes while remaining small enough to serve as the feature set for a neural network • Divide the problem into 92 independent classification tasks • Search for the terms that best discriminate between documents with the topic and those without
Representation - Term Selection • Relevancy Score • compares the fraction (no. of documents with topic t that contain term k) / (total no. of documents with topic t) against the corresponding fraction for documents without the topic • measures how unbalanced the term is across documents with and without the topic • highly positive and highly negative scores indicate terms that are useful for discrimination • using about 20 terms yielded the best classification performance
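A sketch of a relevancy-style score built from the two fractions on the slide. The paper's exact functional form is not reproduced here; this log-ratio with smoothing is an assumption that matches the qualitative behavior (large positive or negative scores mark discriminative terms):

    import numpy as np

    def relevancy_score(term_in_doc, topic_labels, smoothing=0.01):
        """term_in_doc: boolean array (n_docs,), True if the document contains term k.
           topic_labels: boolean array (n_docs,), True if the document has topic t."""
        frac_with_topic = (term_in_doc & topic_labels).sum() / max(topic_labels.sum(), 1)
        frac_without_topic = (term_in_doc & ~topic_labels).sum() / max((~topic_labels).sum(), 1)
        return np.log((frac_with_topic + smoothing) / (frac_without_topic + smoothing))

    # Per topic, keep the ~20 terms with the largest |relevancy_score| as the TERMS features.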
Representation - Term Selection • Advantages: • little computation is required • the resulting features have direct interpretability • Drawbacks: • many of the best individual predictors contain redundant information • a term that appears to be a very poor predictor on its own may turn out to have great discriminative power in combination with other terms, and vice versa • Apple vs. Apple Computers • Selected Term Representation (TERMS) with 20 features
Representation – LSI • Transform the original documents to a lower-dimensional space by analyzing the correlational structure of terms in the document collection • (Training set): apply a singular-value decomposition (SVD) to the original term-by-document matrix to get U, Σ, V • (Test set): transform document vectors by projecting them into the LSI space • Property of LSI: higher dimensions capture less of the variance of the original data, so they can be dropped with minimal loss • Found: performance continues to improve up to at least 250 dimensions • improvement slows rapidly after about 100 dimensions • Generic LSI Representation (LSI) with 200 features
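A minimal numpy sketch of the generic LSI step and the folding-in of test documents; the 200-dimension truncation follows the slide, and the helper names are illustrative:

    import numpy as np

    def fit_lsi(A, k=200):
        """A: term-by-document matrix of the training set.
           Returns the term basis, singular values, and the k-dim training-document vectors.
           (For a matrix of this corpus's size a truncated/sparse SVD would be used in practice.)"""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        U_k, s_k = U[:, :k], s[:k]
        train_lsi = (np.diag(s_k) @ Vt[:k, :]).T    # one k-dim row per training document
        return U_k, s_k, train_lsi

    def project(U_k, s_k, d):
        """Fold a new (test) document vector d into the LSI space."""
        return (d @ U_k) / s_k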
Representation – LSI • Figure: a single SVD is applied to the entire Reuters corpus, producing one generic LSI representation (200 features) shared by all topics (e.g. Wool, Wheat, Money-supply, Gold, Barley, Zinc)
Representation – Local LSI • Global LSI performs worse as topic frequency decreases • infrequent topics are usually indicated by infrequent terms, and infrequent terms may be projected out of the LSI representation and treated as mere noise • The paper proposes two task-directed methods that make use of prior knowledge of the classification task
Representation – Local LSI • What is Local LSI? • model only the local portion of the corpus related to the topics of interest • include documents that use terminology related to the topics (not necessarily having any of the topics assigned) • perform the SVD over only the local set of documents • the representation is more sensitive to small, localized effects of infrequent terms • the representation is more effective for classification of topics related to that local structure
Representation – Local LSI • Types of Local LSI: • Cluster-Directed representation • 5 meta-topics (clusters): • Agriculture, Energy, Foreign Exchange, Government, and Metals • How to construct the local regions? • break the corpus into 5 clusters, each containing all documents on the corresponding meta-topic • perform an SVD for each meta-topic region • Cluster-Directed LSI Representation (CD/LSI) with 200 features
Representation – Local LSI • Figure: the Reuters corpus is split into five meta-topic regions (Government, Agriculture, Foreign Exchange, Metal, Energy) and a separate SVD is performed on each region, yielding the Cluster-Directed LSI Representation (CD/LSI) with 200 features
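A sketch of the cluster-directed variant, assuming each training document has been assigned to one of the five meta-topic regions before a separate SVD is fit per region; names and the region-assignment input are illustrative:

    import numpy as np

    META_TOPICS = ["Agriculture", "Energy", "Foreign Exchange", "Government", "Metals"]

    def fit_cluster_directed_lsi(A, doc_meta_topic, k=200):
        """A: term-by-document matrix; doc_meta_topic[j] gives document j's meta-topic cluster.
           Fits one local SVD per meta-topic region; documents are later projected with the
           region's basis: (d @ U_k) / s_k."""
        models = {}
        for meta in META_TOPICS:
            cols = [j for j, m in enumerate(doc_meta_topic) if m == meta]
            if not cols:
                continue
            U, s, _ = np.linalg.svd(A[:, cols], full_matrices=False)
            dims = min(k, len(cols))
            models[meta] = (U[:, :dims], s[:dims])
        return models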
Representation – Local LSI • Types of Local LSI: • Topic-Directed representation • a more fine-grained approach to local LSI • a separate representation for each topic • How to construct the local region? • use the 100 most predictive terms for the topic • pick the N most similar documents, N = 5 × (no. of documents containing the topic), bounded so that 110 ≤ N ≤ 350 • final documents in the topic region = the N documents + 150 random documents • Topic-Directed LSI Representation (TD/LSI) with 200 features
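A sketch of building one topic's local region from the recipe above. The slide does not say how document similarity is measured, so the dot-product similarity against a mean profile of the topic's predictive terms is an assumption:

    import numpy as np

    def topic_directed_region(A, topic_labels, term_scores, rng=np.random.default_rng(0)):
        """A: term-by-document matrix; topic_labels: boolean (n_docs,);
           term_scores: per-term predictiveness scores for this topic.
           Returns the column indices of the documents forming the local region."""
        top_terms = np.argsort(-np.abs(term_scores))[:100]       # 100 most predictive terms
        n_topic = int(topic_labels.sum())
        N = int(np.clip(5 * n_topic, 110, 350))                  # 110 <= N <= 350
        # similarity of every document to the topic's predictive-term profile (an assumption)
        profile = A[top_terms, :][:, topic_labels].mean(axis=1)
        sims = profile @ A[top_terms, :]
        nearest = np.argsort(-sims)[:N]
        extra = rng.choice(A.shape[1], size=150, replace=False)  # plus 150 random documents
        return np.unique(np.concatenate([nearest, extra]))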
Representation – Local LSI • Figure: a separate local region and SVD for each topic (Wool, Wheat, Money-supply, Zinc, Barley, Gold), yielding the Topic-Directed LSI Representation (TD/LSI) with 200 features
Representation – Local LSI • Drawbacks of Local LSI: • the narrower the region, the lower the flexibility of the representation for modeling the classification of multiple topics • high computational overhead
Representation - Relevancy Weighting LSI • Use term weights to emphasize the importance of particular terms before applying the SVD • IDF weighting • increases the importance of low-frequency terms • decreases the importance of high-frequency terms • assumes low-frequency terms to be better discriminators than high-frequency terms
Representation - Relevancy Weighting LSI • Relevancy Weighting • tunes the IDF assumption • emphasizes terms in proportion to their estimated topic discrimination power • Global Relevancy Weighting of term k (GRWk) • Final weighting of term k = IDF² × GRWk • all low-frequency terms are pulled up by IDF • poor predictors are pushed down • leaving only the relevant low-frequency terms with high weights • Relevancy Weighted LSI Representation (REL/LSI) with 200 features
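A sketch of applying the final weighting before the SVD. The slide does not show the GRW formula itself, so the grw input is left as whatever per-term topic-discrimination estimate is available; only IDF and the IDF²·GRW combination follow the slide:

    import numpy as np

    def inverse_document_frequency(A):
        """Per-term IDF from a term-by-document matrix A."""
        n_docs = A.shape[1]
        df = (A > 0).sum(axis=1)
        return np.log(n_docs / np.maximum(df, 1))

    def relevancy_weighted(A, idf, grw):
        """Rescale each term row by IDF^2 * GRW_k before the SVD (REL/LSI).
           grw: per-term global relevancy weight (definition not shown on the slide)."""
        w = (idf ** 2) * grw
        return A * w[:, np.newaxis]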
Neural Network Classifier (NN) • NN consists of: • processing units (Neurons) • weighted links connecting neurons
Neural Network Classifier (NN) • Major components of an NN model: • architecture: defines the functional form relating input to output • network topology • unit connectivity • activation functions, e.g. the logistic function
Neural Network Classifier (NN) • Logistic regression function • p = 1 / (1 + e^(−z)), where z is a linear combination of the input features • p ∈ (0, 1) • can be converted to a binary classification method by thresholding the output probability
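A minimal sketch of one logistic output unit as described above; the weight/bias names are illustrative:

    import numpy as np

    def logistic_unit(x, w, b, threshold=0.5):
        """z is a linear combination of the input features; p = 1/(1 + exp(-z)) lies in (0, 1)
           and is thresholded for a binary topic decision."""
        z = x @ w + b
        p = 1.0 / (1.0 + np.exp(-z))
        return p, p >= threshold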
Neural Network Classifier (NN) • Major components of an NN model (cont.): • search algorithm: the search in weight space for a set of weights that minimizes the error between the output and the expected output (TRAINING PROCESS) • backpropagation method • mean squared error • cross-entropy error performance function: C = −Σ over all cases and outputs of [d·log(y) + (1−d)·log(1−y)], where d is the desired output and y the actual output
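The cross-entropy cost from the slide, written directly as code (the clipping epsilon is an implementation detail added here to avoid log(0)):

    import numpy as np

    def cross_entropy(d, y, eps=1e-12):
        """C = -sum over all cases and outputs of (d*log(y) + (1-d)*log(1-y)).
           d: desired outputs, y: actual network outputs, both arrays of the same shape."""
        y = np.clip(y, eps, 1.0 - eps)
        return -np.sum(d * np.log(y) + (1.0 - d) * np.log(1.0 - y))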
NN for Topic Spotting • Network outputs are estimates of the probability of topic presence given the feature vector of a document • Generic LSI representation: each network uses the same representation • Local LSI representations: a different representation for each network
NN for Topic Spotting • Linear NN • output units with logistic activation and no hidden layer
NN for Topic Spotting • Non-linear NN • simple networks with a single hidden layer of 6–15 logistic sigmoid units
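A forward-pass sketch of the two architectures just described (weight shapes and names are assumptions; training would fit them with backpropagation on the cross-entropy cost):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def linear_net(x, W_out, b_out):
        """Linear network: logistic output unit(s) applied directly to the feature vector."""
        return sigmoid(x @ W_out + b_out)

    def nonlinear_net(x, W_hid, b_hid, W_out, b_out):
        """Non-linear network: a single hidden layer of 6-15 logistic sigmoid units."""
        h = sigmoid(x @ W_hid + b_hid)
        return sigmoid(h @ W_out + b_out)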
NN for Topic Spotting • Flat Architecture • a separate network for each topic • the entire training set is used to train each topic's network • overfitting is avoided by: • adding a penalty term to the cross-entropy cost function to encourage elimination of small weights • early stopping based on cross-validation
NN for Topic Spotting • Modular Architecture • decompose the learning problem into smaller problems • Meta-Topic Network, trained on the full training set • estimates the presence probability of each of the five meta-topics in a document • uses 15 hidden units
NN for Topic Spotting • Modular Architecture • five groups of local topic networks • each group consists of a local topic network for each topic in the meta-topic • each network is trained only on its meta-topic region
NN for Topic Spotting • Modular Architecture • five groups of local topic networks (cont.) • Example: the wheat network is trained on the Agriculture meta-topic region • focus on finer distinctions, e.g. wheat vs. grain • don't waste capacity on easier distinctions, e.g. wheat vs. gold • each local topic network uses 6 hidden units
NN for Topic Spotting • Modular Architecture • To compute topic predictions for a given document: • present the document to the meta-topic network • present the document to each of the topic networks • final topic estimate = output of the meta-topic network × output of the corresponding topic network
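A sketch of combining the two stages, under the assumption that the meta-topic network returns a probability per meta-topic and each local network returns a topic probability (the dict-based interface is illustrative):

    def modular_prediction(x, meta_net, topic_nets):
        """topic_nets: {meta_topic: {topic: network}}. The final estimate for each topic is
           (meta-topic network output) * (local topic network output)."""
        meta_probs = meta_net(x)                      # dict: meta-topic -> probability
        estimates = {}
        for meta, nets in topic_nets.items():
            for topic, net in nets.items():
                estimates[topic] = meta_probs[meta] * net(x)
        return estimates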
Experimental Results • Evaluating Performance • The mean squared error between actual and predicted values is not an adequate measure of classification performance • Instead, compute precision and recall based on contingency tables constructed over a range of decision thresholds • How are the decision thresholds obtained?
Experimental Results • Evaluating Performance • How are the decision thresholds obtained? • proportional assignment
Experimental Results • Evaluating Performance • How are the decision thresholds obtained? • fixed recall level approach • determine a set of target recall levels • analyze the ranked documents to determine which decision thresholds lead to the desired set of recall levels
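A sketch of the fixed-recall-level approach for one topic: rank documents by score and find the score threshold at which each target recall is first reached (function and variable names are illustrative):

    import numpy as np

    def thresholds_for_recall(scores, labels, target_recalls):
        """scores: numpy array of network outputs; labels: 0/1 numpy array of true topic labels;
           target_recalls: e.g. np.linspace(0.05, 0.95, 19)."""
        order = np.argsort(-scores)
        sorted_scores, sorted_labels = scores[order], labels[order]
        cum_hits = np.cumsum(sorted_labels)
        total = max(labels.sum(), 1)
        thresholds = {}
        for r in target_recalls:
            idx = np.searchsorted(cum_hits, r * total)   # first rank reaching the target recall
            idx = min(idx, len(sorted_scores) - 1)
            thresholds[r] = sorted_scores[idx]
        return thresholds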
Experimental Results • Performance by Microaveraging • add all contingency tables together across topics at a given threshold • compute precision and recall from the pooled table • proportional assignment is used for picking decision thresholds • does not weight the topics evenly • used for comparisons to previously reported results • the breakeven point is used as a summary value
Experimental Results • Performance by Macroaveraging • compute precision and recall for each topic • take the average across topics • uses the fixed set of recall levels • summary values are obtained for particular topics by averaging precision over the 19 evenly spaced recall levels between 0.05 and 0.95
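A small sketch contrasting the two averaging schemes, assuming per-topic contingency counts (tp, fp, fn) at a single decision threshold:

    def micro_macro(tables):
        """tables: {topic: (tp, fp, fn)}. Microaveraging pools the counts across topics first;
           macroaveraging computes precision/recall per topic and then averages."""
        tp = sum(t[0] for t in tables.values())
        fp = sum(t[1] for t in tables.values())
        fn = sum(t[2] for t in tables.values())
        micro_p = tp / max(tp + fp, 1)
        micro_r = tp / max(tp + fn, 1)
        per_topic_p = [t[0] / max(t[0] + t[1], 1) for t in tables.values()]
        per_topic_r = [t[0] / max(t[0] + t[2], 1) for t in tables.values()]
        macro_p = sum(per_topic_p) / len(tables)
        macro_r = sum(per_topic_r) / len(tables)
        return (micro_p, micro_r), (macro_p, macro_r)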
Experimental Results • Microaveraged performance • Breakeven points compared to the best previously reported algorithm, a rule-induction method based on heuristic search with a breakeven point of 0.789 • the reported breakeven points for the neural network representations are 0.82, 0.801, 0.795, and 0.775
Experimental Results • Macroaveraged performance • TERMS appears much closer to the other three representations • the relative effectiveness of the representations at low recall levels is reversed at high recall levels
Performance of the six techniques on the 54 most frequent topics • considerable variation of performance across topics • the relative ups and downs are mirrored in both plots • slight improvement of non-linear networks • LSI performance degrades compared to TERMS as topic frequency decreases
Experimental Results • Performance of combinations of techniques and the improvement each provides • Chart legend: matching color and shape identify each experiment
Experimental Results • Flat Networks
Experimental Results • Modular Networks • only 4 of the clusters were used • average precision was recomputed for the flat networks