Text Analysis: Methods for Searching, Organizing, Labeling and Summarizing Document Collections Danny Dunlavy Computer Science and Informatics Department (1415) Sandia National Laboratories July 16, 2008 CSRI Student Seminar Series SAND2008-4999P
Outline • Introduction • Motivational Problems • Data • Analysis Pipeline • Transformation, Analysis, and Post-processing • Hybrid Systems • Examples • Conclusions
Introduction • Knowledge discovery • Goal of text analysis • Data → information → knowledge • Challenges • Too much information to process manually • Data ambiguity • Word sense, multilingual, errors, weak signals • Heterogeneous data sources • Interpretability • Goals of this talk • Exposure to research in text analysis at Sandia • Focus on methods based on mathematical principles
Example 1: Information Retrieval • Problem: ambiguous queries lead to information overload and topic confusion • Solutions: optimization, linear algebra, machine learning, and probabilistic modeling • [Figure: search results for an ambiguous name query — the basketball player vs. the mathematician at ranks 5, 50, 109, …; the jazz musician at rank > 200]
Example 2: Spam Detection • Problem: ambiguity and deceit in term meaning/usage create confusion between spam and legitimate e-mail • Solutions: optimization, linear algebra, machine learning, probabilistic modeling • Techniques and representative products: Bayesian statistics (SpamAssassin, Cloudmark Authority, MailSweeper Business Suite); signature/hash analysis (Cloudmark SpamNet, IM Message Inspector); lexical analysis (Brightmail Anti-Spam, Tumbleweed Email Firewall); neural networks (SurfControl E-mail Filter, AntiSpam for SMTP); heuristic patterns (McAfee SpamKiller, Brightmail Anti-Spam) • [Figure: headers added to a real message by spam filters, e.g., X-TMWD-Spam-Summary, X-TMWD-IP-Reputation, X-PMX-Version, and X-PerlMx-Spam with Probability=7%] [S. Ali and Y. Xiang (2007), "Spam Classification Using Adaptive Boosting Algorithm," Proc. ICIS 2007.]
Example 3: Topic Detection and Association • Problem: determine topics in text collections and identify the most important, novel, or significant relationships • Clustering and visualization are key analysis methods • Solutions: optimization, linear algebra, machine learning, probabilistic modeling, and visualization • http://www.kartoo.com • http://cloud.clusty.com
Text Data • Text collection(s): corpus (corpora) • Structured: database fielded data • Semi-structured: XML, HTML (e-mail, web pages, blogs, etc.) • Unstructured (reports, newswire, etc.) • Formal: newspaper articles, scientific articles, business reports, … • Informal: e-mail, chat, code comments, … • Other characteristics: incomplete, noisy (errors, ambiguity), multilingual • [Diagram: an e-mail as semi-structured data — headers (to, from, date, subject, etc.), message body, attachments — plus metadata (raw source index, date collected, source reliability, etc.); unstructured text processing adds processing metadata (tool, parameters used, date processed) and extracted data (named entities, relationships, facts, events) for analysis]
Text Analysis Pipeline • Ingestion: file readers (ASCII, UTF-8, XML, PDF, …) • Pre-processing: tokenization, stemming, part-of-speech tagging, named entity extraction, sentence boundaries • Transformation: data model, dimensionality reduction, feature weighting, feature extraction/selection • Analysis: information retrieval, clustering, summarization, classification, pattern recognition, statistics • Post-processing: visualization, filtering, summary statistics • Archiving: database, file, web site (a sketch of the first stages follows)
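To make the stage boundaries concrete, here is a minimal Python sketch of the first three stages; the function names and the naive tokenizer are illustrative choices, not part of any Sandia tool.

```python
import re

def ingest(path):
    """Ingestion: read raw text from a file (UTF-8 assumed)."""
    with open(path, encoding="utf-8") as f:
        return f.read()

def preprocess(text):
    """Pre-processing: naive tokenization and lowercasing. A real
    pipeline would also stem, tag parts of speech, and detect
    sentence boundaries and named entities."""
    return re.findall(r"[a-z0-9]+", text.lower())

def transform(tokens, vocabulary):
    """Transformation: bag-of-words count vector over a fixed
    vocabulary (one column of a term-document matrix).
    vocabulary maps each term to its row index."""
    vec = [0] * len(vocabulary)
    for tok in tokens:
        if tok in vocabulary:
            vec[vocabulary[tok]] += 1
    return vec

# Analysis (retrieval, clustering, summarization, classification) and
# post-processing (visualization, filtering) consume these vectors.
```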
Vector Space Model • Vector Space Model for Text • Terms (features): $t_1, \dots, t_m$ • Documents (objects): $d_1, \dots, d_n$ • Term-document matrix: $A = [a_{ij}] \in \mathbb{R}^{m \times n}$ • $a_{ij}$: measure of importance of term $i$ in document $j$ • Term Examples • Sentence: “Danny re-sent $1.” • Words: danny, sent, re [# chars?], $ [sym?], 1 [#?], re-sent [-?] • n-grams (3): dan, ann, nny, ny_, _re, re-, e-s, sen, ent, nt_, … • Named entities (people, orgs, money, etc.): danny, $1 • Document Examples • Documents, paragraphs, sentences, fixed-size chunks [G. Salton, A. Wong, and C. S. Yang (1975), "A Vector Space Model for Automatic Indexing," Comm. ACM, 18(11), 613–620.]
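A small hypothetical sketch of building a raw-count term-document matrix from word terms; the documents are those of the LSA example later in the talk, and the stopword list is chosen to match that slide's three terms.

```python
import re
from collections import Counter

docs = [
    "Hurricane. A hurricane is a catastrophe.",
    "An example of a catastrophe is a hurricane.",
    "An earthquake is bad.",
    "Earthquake. An earthquake is a catastrophe.",
]
# Toy stopword list, padded so only the slide's three terms survive.
stopwords = {"a", "an", "is", "of", "the", "bad", "example"}

def tokenize(doc):
    return [w for w in re.findall(r"[a-z]+", doc.lower())
            if w not in stopwords]

counts = [Counter(tokenize(d)) for d in docs]   # one Counter per document
terms = sorted(set().union(*counts))            # the m terms (features)

# A[i][j] = raw frequency of term i in document j (one weighting choice).
A = [[c[t] for c in counts] for t in terms]
# terms == ['catastrophe', 'earthquake', 'hurricane']
# A     == [[1, 1, 0, 1], [0, 0, 1, 2], [2, 1, 0, 0]]
```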
Feature Weighting • Term-document matrix scaling: $a_{ij} = f_{ij} \cdot g_i$, a local weight (e.g., term frequency $f_{ij}$) times a global term weight (e.g., inverse document frequency $g_i = \log(n/n_i)$), often followed by document-length normalization
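A sketch of one common instance of this scaling, tf-idf, applied to a raw-count matrix like the one above; the log-based global weight is one standard choice among many.

```python
import math

def tfidf(A):
    """Scale a term-by-document matrix A of raw counts: local weight
    (term frequency) times global weight (inverse document frequency)."""
    n_docs = len(A[0])
    weighted = []
    for row in A:
        df = sum(1 for f in row if f > 0)           # document frequency of term
        idf = math.log(n_docs / df) if df else 0.0  # global weight g_i
        weighted.append([f * idf for f in row])     # a_ij = f_ij * g_i
    return weighted
```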
Feature Extraction: Dimension Reduction • Goal: find a new, smaller set of features (dimensions) that best captures variability, correlations, or structure in the data • Methods • Principal component analysis (PCA) • Eigenvalue decomposition of the covariance matrix of the data • Pre-processing: mean of each feature is 0 • Equivalently, singular value decomposition of the centered data matrix • Locally linear embedding (LLE) • Express points as combinations of neighbors and embed points into a lower dimensional space (preserving neighbors) • Multidimensional scaling (MDS) • Preserve pairwise distances in the lower dimensional space • ISOMAP (nonlinear) • Extends MDS to use geodesic distances on a weighted graph
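A minimal NumPy sketch of SVD-based reduction, the workhorse behind PCA and LSA; `reduce_rank` is an illustrative name, and mean-centering is shown as the optional PCA pre-processing step.

```python
import numpy as np

def reduce_rank(A, k, center=False):
    """Map each column of A to k dimensions via the top-k left singular
    vectors. With center=True (feature means set to 0), this is PCA."""
    A = np.asarray(A, dtype=float)
    if center:
        A = A - A.mean(axis=1, keepdims=True)   # mean of each feature -> 0
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k].T @ A                       # k x n reduced representation
```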
Analysis Tasks in This Talk • Information retrieval • Goal: find documents most related to a query • Challenges: pseudonyms, synonyms, stemming, errors • Methods: LSA (later), Boolean search, probabilistic retrieval • Clustering • Goal: find a set of partitions that best separates groups of like objects • Challenges: distance metrics, number of clusters, uniqueness • Methods: k-means (later), agglomerative, graph-based • Summarization • Goal: find a compact representation of text with the same meaning • Challenges: single- vs. multi-document summaries, subjectivity • Methods: HMM+QR (later), probabilistic • Classification • Goal: predict labels/categories of data instances (documents) • Challenges: data overfitting • Methods: HEMLOCK (S. Gilpin, later), decision trees, naïve Bayes, SVM
Other Analysis Tasks • Machine translation • Speech recognition • Cross-language information retrieval • Word sense disambiguation • Determining the sense of ambiguous words from context • Lexical acquisition • Filling in gaps in dictionaries built from text corpora • Concept drift detection • Change in general topics in streaming data • Association analysis • Discovering novel relationships hidden in text
Hybrid Systems • Any combination of data analytic tools, often developed independently: parser, data modeler, feature extractor, clustering tool • Rules + statistics/probabilities • Entity extraction (persons, organizations, locations) • Rules: list of common names, capitalization • Probabilities: chance a name occurs given the sequence of words
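A toy sketch of the rules-plus-probabilities idea for name extraction; the name list, the stand-in probability model, and the threshold are all invented for illustration.

```python
import re

KNOWN_NAMES = {"Danny", "Dianne", "Tammy"}   # rule: list of common names

def name_probability(word, prev):
    """Stand-in for a learned model of P(name | surrounding words)."""
    base = 0.9 if word in KNOWN_NAMES else 0.4
    # Sentence-initial capitalization is weak evidence on its own.
    return base * (0.5 if prev in ("", ".") else 1.0)

def extract_names(text, threshold=0.35):
    """Rule proposes capitalized tokens; probability model filters them."""
    tokens = re.findall(r"[A-Za-z]+|\.", text)
    names, prev = [], ""
    for tok in tokens:
        if tok[0].isupper() and name_probability(tok, prev) >= threshold:
            names.append(tok)
        prev = tok
    return names

print(extract_names("The seminar was given by Danny. Slides follow."))
# -> ['Danny']  ("The" and "Slides" are rejected as sentence-initial caps)
```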
Hybrid System Development • Data model • Cross-system, cross-platform accessibility • Accommodation of multiple data structures • System • Modularized framework (plug-and-play capabilities) • Compatible interfaces • Multiple user interfaces • TITAN: customizable front-ends to analysis pipelines • YALE: required parameters vs. complete set of parameters • Performance, Verification & Validation • Tests for independent systems and overall system • Compatible test data and benchmarks • Analysis of parameter dependencies across individual systems
Hybrid System Example Query, Cluster, Summarize
Motivation • Query: “methods plasma physics” • Retrieval • General: Google, 7.8×10⁶ of >2.5×10¹⁰ documents • Targeted: arXiv, 9,000 of >403,000 documents • Problems • Too much information • Redundant information • Results: link, title, abstract, snippet (?), etc. • Ordering of results (meaning of “best” match?)
Problems to Solve • QCS (Query, Cluster, Summarize) • Unstructured text parsing (common representation) • Data fusion (cleaning, assimilating, normalizing) • Natural language processing (sentences, POS) • Document retrieval (ranking) • High-dimensional clustering (data organization) • Automatic text summarization (data reduction) • Data representation/visualization (multiple perspectives)
Query: Latent Semantic Analysis (LSA) • SVD: $A = U \Sigma V^T$ • Truncated SVD: $A_k = U_k \Sigma_k V_k^T$ (rank $k$; columns of $U_k$ span term “concepts,” columns of $V_k$ span document “concepts,” and the singular values in $\Sigma_k$ weight them) • Query scores (query as a new “doc”): $s = q^T A_k$ • LSA ranking: order documents by their scores in $s$ [Deerwester, S. C., et al. (1990). Indexing by latent semantic analysis. J. Am. Soc. Inform. Sci. 41(6), 391–407.]
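A NumPy sketch of this ranking, not the QCS implementation; it assumes the query vector q is expressed in the same term space as A.

```python
import numpy as np

def lsa_scores(A, q, k):
    """Score documents (columns of A) against query vector q using the
    rank-k truncated SVD A_k = U_k S_k V_k^T."""
    U, s, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]   # rank-k approximation of A
    return q @ A_k                      # q^T A_k: one score per document
```

Applying this with k = 2 to the normalized matrix in Example 1 on the next slide should reproduce the $q^T A_2$ scores shown there.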
LSA Example 1 • Documents • d1: Hurricane. A hurricane is a catastrophe. • d2: An example of a catastrophe is a hurricane. • d3: An earthquake is bad. • d4: Earthquake. An earthquake is a catastrophe. • Pre-processing: remove stopwords; normalization only (columns scaled to unit length) • Term counts (rows: hurricane, earthquake, catastrophe; columns: d1–d4): (2, 1, 0, 0); (0, 0, 1, 2); (1, 1, 0, 1); query q = hurricane = (1, 0, 0) • Normalized matrix A: hurricane (.89, .71, 0, 0); earthquake (0, 0, 1, .89); catastrophe (.45, .71, 0, .45) • $q^T A$ = (.89, .71, 0, 0): no match to doc 4 • Rank-2 approximation $A_2$: hurricane (.78, .78, −.11, .11); earthquake (−.03, .02, .96, .92); catastrophe (.59, .60, .15, .30) • $q^T A_2$ = (.78, .78, −.11, .11): the rank-2 approximation captures the link to doc 4
LSA Example 2 • Concept terms: policy, planning, politics, tomlinson, 1986 • Documents: Sport in Society: Policy, Politics and Culture, ed. A. Tomlinson (1990); Policy and Politics in Sport, PE and Leisure, eds. S. Fleming, M. Talbot and A. Tomlinson (1995); Policy and Planning (II), ed. J. Wilkinson (1986); Policy and Planning (I), ed. J. Wilkinson (1986); Leisure: Politics, Planning and People, ed. A. Tomlinson (1985) • Concept terms: parker, lifestyles, 1989, part • Documents: Work, Leisure and Lifestyles (Part 2), ed. S. R. Parker (1989); Work, Leisure and Lifestyles (Part 1), ed. S. R. Parker (1989) [Leisure Studies of America Data: 97 documents, 335 terms]
Cluster: Generalized Spherical K-Means (gmeans) • The Players • Documents: unit vectors $x_1, \dots, x_n$ • Partition/disjoint sets: $\{\pi_1, \dots, \pi_k\}$ • Concept vectors (centroids): $c_j = \sum_{x \in \pi_j} x \,/\, \|\sum_{x \in \pi_j} x\|$ • The Game • Maximize $\sum_{j=1}^{k} \sum_{x \in \pi_j} x^T c_j$ • The Rules • Adaptive, but bounded k • Similarity estimation • First variation (stochastic perturbation) [Dhillon, I. S., et al. (2002). Iterative clustering of high dimensional text data augmented by local search. Proc. IEEE ICDM.]
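A minimal sketch of the core spherical k-means iteration, without gmeans's adaptive k or first-variation local search; rows of X are assumed unit-normalized.

```python
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    """X: n x m array of unit-norm document vectors (rows). Alternates
    cosine-similarity assignment with concept-vector updates, (locally)
    maximizing sum_j sum_{x in pi_j} x . c_j."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()  # initial centroids
    for _ in range(iters):
        labels = np.argmax(X @ C.T, axis=1)      # nearest concept vector
        for j in range(k):
            members = X[labels == j]
            if len(members):
                m = members.sum(axis=0)
                C[j] = m / np.linalg.norm(m)     # renormalized centroid
    return labels, C
```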
Summarize: Hidden Markov Model + Pivoted QR • Single-document summarization • Mark summary sentences in training documents • Build probabilistic model • Markov chain observations • log(# subject terms + 1): terms showing up in titles, topics, subject descriptions, etc. • log(# topic terms + 1): terms above a threshold using a mutual information statistic • Hidden Markov Model (HMM) • Hidden states: {summary, non-summary} • Score sentences in each document • Probabilities of a sentence being a summary sentence [Conroy, J. M., et al. (2001). Text summarization via hidden markov models and pivoted QR matrix decomposition.]
Summarize: Hidden Markov Model + Pivoted QR • Multi-document summarization • Goal: generate w-word summaries • Use HMM scores to select candidate sentences (~2w words) • Terms as sentence features: build a term-by-sentence matrix B whose columns are the candidate sentences • Scaling: column j is scaled by the HMM score of sentence j • Pivoted QR (see the sketch below) • Choose the column with maximum norm • Subtract components along it from the remaining columns • Stop: chosen sentences (columns) total ~w words • Removes semantic redundancy
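A sketch of the pivoted-QR selection step on a term-by-sentence matrix whose columns have been scaled by HMM scores; stopping by column count stands in for the ~w-word budget, and none of this is the QCS code itself.

```python
import numpy as np

def select_sentences(B, max_sentences):
    """Greedy pivoted-QR column selection: repeatedly take the column of
    largest norm, then deflate the rest to remove redundant content."""
    B = np.asarray(B, dtype=float).copy()
    chosen = []
    for _ in range(max_sentences):
        norms = np.linalg.norm(B, axis=0)
        j = int(np.argmax(norms))
        if norms[j] < 1e-12:
            break                               # nothing distinctive left
        chosen.append(j)
        v = B[:, j] / norms[j]
        B -= np.outer(v, v @ B)   # subtract components along chosen column
    return chosen
```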
QCS: Evaluation • Document Understanding Conference (DUC) • Automatic evaluation of summarizers (ROUGE) • Measures how well a system agrees with human summaries • QCS finds subtopics and outliers • [Figure: ROUGE-2 scores vs. summarizers (human, QCS, S only) for Cluster 1 and Cluster 2]
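For reference, a sketch of the ROUGE-2 recall computation (overlapping bigrams divided by reference bigrams); the DUC evaluations used the ROUGE toolkit itself.

```python
from collections import Counter

def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))

def rouge2_recall(candidate_tokens, reference_tokens):
    """Fraction of the reference summary's bigrams found in the candidate."""
    cand, ref = bigrams(candidate_tokens), bigrams(reference_tokens)
    overlap = sum(min(n, cand[g]) for g, n in ref.items())
    return overlap / max(sum(ref.values()), 1)
```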
QCS: Evaluation • Document Understanding Conference (DUC) • Scoring as a function of QCS cluster size (k) • Best results for different clusters use different k • [Figure: ROUGE-2 scores vs. number of clusters for QCS and S-only summaries, Cluster 1 and Cluster 2]
Benefits of QCS • Dynamic data organization and compression • Subset of documents relevant to a query • Topic clusters, single summary per cluster • Multiple perspectives (analyses) • Relevance ranking, topic clusters, summaries • Efficient use of computation • Parsing, term counts, natural language processing, etc.
ParaText™: Scalable Text Analysis • ParaText™ Lite • Serial client-server text analysis • Parser, vector space model, SVD, data similarities/graph creation • Built on vtkTextEngine (Titan) • Works with ~10K–100K documents • ParaText™ • End-to-end scalable text analysis • Challenge 1: Parsing [parallel string hashing, hierarchical agglomeration] • Challenge 2: Text modeling [initial Trilinos/Titan integration complete] • Challenge 3: Load balancing [initial: documents; goal: Isorropia/Zoltan] • Impact • Available in ThreatView 1.2.0+ directly or through ParaText™ server • Plans to interface to LSAView, OverView (1424), Sisyphus (1422)
ParaText™ Architecture • [Diagram: a ParaText™ client talks over XML/HTTP to a master ParaText™ server backed by 1 or 2 DB servers (artifact DB, matrices DB); the master coordinates ParaText™ servers on processes P0, P1, …, Pk, each running a parallel pipeline of reader, parser, matrix construction, and SVD stages on an HPC resource (cluster, multicore server, etc.)]
LSAView: Algorithm Analysis/ Development • LSAView • Analysis and exploration of impact of informatics algorithms on end-user visual analysis of data • Aids in discovery process of optimal algorithm parameters for given data and tasks • Features • Side-by-side comparison of visualizations for two sets of parameters • Small multiple view for analyzing 2+ parameter sets simultaneously • Linked document, graph, matrix, and tree data views • Interactive, zoomable, hierarchical matrix and matrix-difference views • Statistical inference tests used to highlight novel parameter impact • Impact • Used in developing and understanding ParaText™ and LSALIB algorithms
LSAView Impact • What is the best scaling for document similarity graph generation? • Document similarities from the truncated SVD: inner product view $A_k^T A_k = V_k \Sigma_k^2 V_k^T$ • Scaled inner product view: replace $\Sigma_k^2$ by other powers of $\Sigma_k$ • [Figure: similarity graphs under four scalings: original scaling, no scaling, inverse square root, inverse] [Leisure Studies of America Data: 97 documents, 335 terms]
E-Mail Classification • LSN Assistant / Sandia Categorization Framework • Yucca Mountain: categorize e-mail (Relevant, Federal Record, Privileged) • Machine learning library and GUI for document categorization & review • For review of existing categorizations, recommendations for new documents • Balanced learning for skewed class distributions • Importance • Solved an important, real problem • ~400K e-mails incorrectly categorized • Foundation for LSN Online Assistant, a real-time system for recommendations • Impact • Dong Kim, lead of DOE/OCRWM LSN certification, is “very impressed with the LSN Assistant Tool and the approach to doing the review.” • Factor of 3 speedup over manual-only categorization review
Conclusions • Text analysis relies heavily upon mathematics • Linear algebra, optimization, machine learning, probability theory, statistics, graph theory • Hybrid system development is a challenge • More than just gluing pieces together • Large-scale analysis is important • Storing and processing large amounts of data • Scaling algorithms up • Developing new algorithms for large data • Useful across many application domains
Collaborations • QCS • Dianne O’Leary (Maryland), John Conroy & Judith Schlesinger (IDA/CCS) • LSALIB • Tammy Kolda (8962) • ParaText™ • Tim Shead & Pat Crossno (1424) • LSAView • Pat Crossno (1424) • Sandia Categorization Framework • Justin Basilico (6341) and Steve Verzi (6343) • HEMLOCK • Sean Gilpin (1415)
Thank You Text Analysis: Methods for Searching, Organizing, Labeling and Summarizing Document Collections Danny Dunlavy dmdunla@sandia.gov http://www.cs.sandia.gov/~dmdunla