Untangling Text Data Mining

Untangling Text Data Mining Marti Hearst UC Berkeley SIMS ACL’99 Plenary Talk June 23, 1999

Outline • Untangling several different fields • DM, CL, IA, TDM • TDM examples • TDM as Exploratory Data Analysis • New Problems for Computational Linguistics • Our current efforts

Classifying Application Types

What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97) • Fitting models to or determining patterns from very large datasets. • A “regime” which enables people to interact effectively with massive data stores. • Deriving new information from data.

Why Data Mining? • Because the data is there. • Because • larger disks • faster CPUs • high-powered visualization • networked information are becoming widely available.

The Knowledge Discovery from Data Process (KDD) KDD: The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. (Fayyad, Piatetsky-Shapiro, & Smyth, CACM 96) Note: data mining is just one step in the process.

DM Touchstone Applications (CACM 39 (11) Special Issue) • Finding patterns across data sets: • Reports on changes in retail sales • to improve sales • Patterns of sizes of TV audiences • for marketing • Patterns in NBA play • to alter, and so improve, performance • Deviations in standard phone calling behavior • to detect fraud • for marketing

What is Data Mining? Potential point of confusion: • The extracting ore from rock metaphor does not really apply to the practice of data mining • If it did, then standard database queries would fit under the rubric of data mining • In practice, DM refers to: • finding patterns across large datasets • discovering heretofore unknown information

What is Text Data Mining? • Many people’s first thought: • Make it easier to find things on the Web. • But this is information retrieval!

Needles in Haystacks The emphasis in IR is on finding documents that already contain answers to questions.

Information Retrieval: A restricted form of Information Access • The system has available only pre-existing, “canned” text passages. • Its response is limited to selecting from these passages and presenting them to the user. • It must select, say, 10 or 20 passages out of millions.

What is Text Data Mining? • The metaphor of extracting ore from rock: • Does make sense for extracting documents of interest from a huge pile. • But does not reflect notions of DM in practice: • finding patterns across large collections • discovering heretofore unknown information

Real Text DM What would finding a pattern across a large text collection really look like?

Bill Gates + MS-DOS in the Bible! From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil (William Gates, agitator, leader)

Real Text DM • The point: • Discovering heretofore unknown information is not what we usually do with text. • (If it weren’t known, it could not have been written by someone!) • However: • There is a field whose goal is to learn about patterns in text for their own sake ...

Computational Linguistics! • Goal: automated language understanding • this isn’t possible • instead, go for subgoals, e.g., • word sense disambiguation • phrase recognition • semantic associations • Common current approach: • statistical analyses over very large text collections

Why CL Isn’t TDM • A linguist finds it interesting that “cloying” co-occurs significantly with “Jar Jar Binks” ... • … But this doesn’t really answer a question relevant to the world outside the text itself.

Why CL Isn’t TDM • We need to use the text indirectly to answer questions about the world • Direct: • Analyze patent text; determine which word patterns indicate various subject categories. • Indirect: • Analyze patent text; find out whether private or public funding leads to more inventions.

Why CL Isn’t TDM • Direct: • Cluster newswire text; determine which terms are predominant • Indirect: • Analyze newswire text; gather evidence about which countries/alliances are dominating which financial sectors

Nuggets vs. Patterns • TDM: we want to discover new information … • … As opposed to discovering which statistical patterns characterize occurrence of known information. • Example: WSD • not TDM: computing statistics over a corpus to determine what patterns characterize Sense S. • TDM: discovering the meaning of a new sense of a word.

Nuggets vs. Patterns • Nugget: a new, heretofore unknown item of information. • Pattern: distributions or rules that characterize the occurrence (or non-occurrence) of a known item of information. • Application of rules can create nuggets in some circumstances.

Example: Lexicon Augmentation • Application of a lexico-syntactic pattern: NP0 such as NP1, {NP2 …, (and | or) NPi}, i >= 1, implies that for all NPi, i >= 1, hyponym(NPi, NP0) • Extracts a new hyponym relation: • “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.” • implies hyponym(“Gelidium”, “red algae”) • However, this fact was already known to the author of the text.
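
The "such as" pattern on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: it assumes an upstream noun-phrase chunker has already wrapped each noun phrase in square brackets, and the names `SUCH_AS` and `extract_hyponyms` are hypothetical.

```python
import re

# Assumption: every noun phrase arrives pre-chunked as [bracketed text].
NP = r"\[([^\]]+)\]"
# Separators allowed between items of the hyponym list.
SEP = r"(?:, and |, or |, | and | or )"
# NP0 such as NP1 {, NP2 ..., (and | or) NPi}
SUCH_AS = re.compile(NP + r",? such as ((?:\[[^\]]+\]" + SEP + r"?)+)")


def extract_hyponyms(chunked_sentence):
    """Return (hyponym, hypernym) pairs licensed by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(chunked_sentence):
        hypernym = match.group(1)
        # Every bracketed noun phrase in the list is a hyponym of NP0.
        for hyponym in re.findall(NP, match.group(2)):
            pairs.append((hyponym, hypernym))
    return pairs


sentence = ("Agar is [a substance] prepared from [a mixture] of [red algae], "
            "such as [Gelidium], for [laboratory or industrial use].")
print(extract_hyponyms(sentence))
```

Leaving chunking to a separate step keeps the pattern itself short; without reliable noun-phrase boundaries, a plain regular expression over raw text tends to pull in trailing material such as "for laboratory".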

The Quandary • How do we use text to both • Find new information not known to the author of the text • Find information that is not about the text itself

Idea: Exploratory Data Analysis • Use large text collections to gather evidence to support (or refute) hypotheses • Not known to author: links across many texts • Not self-referential: work within the domain of discourse

Example: Etiology • Given • medical titles and abstracts • a problem (incurable rare disease) • some medical expertise • find causal links among titles • symptoms • drugs • results

Swanson Example (1991) • Problem: Migraine headaches (M) • stress associated with M • stress leads to loss of magnesium • calcium channel blockers prevent some M • magnesium is a natural calcium channel blocker • spreading cortical depression (SCD) implicated in M • high levels of magnesium inhibit SCD • M patients have high platelet aggregability • magnesium can suppress platelet aggregability • All extracted from medical journal titles
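
The chain of links on this slide can be read as a search for intermediate terms that connect two concepts never linked directly. The sketch below encodes the slide's statements by hand as undirected associations (a simplifying assumption; it does not extract them from titles) and looks for terms associated with both endpoints. The names `ASSOCIATIONS` and `bridging_terms` are hypothetical.

```python
from collections import defaultdict

# Hand-encoded from the slide; treating each statement as an undirected
# association is an assumption made for illustration only.
ASSOCIATIONS = [
    ("stress", "migraine"),
    ("stress", "magnesium"),
    ("calcium channel blockers", "migraine"),
    ("magnesium", "calcium channel blockers"),
    ("spreading cortical depression", "migraine"),
    ("magnesium", "spreading cortical depression"),
    ("platelet aggregability", "migraine"),
    ("magnesium", "platelet aggregability"),
]


def bridging_terms(a, c, associations):
    """Return the terms associated with both a and c, in sorted order."""
    neighbors = defaultdict(set)
    for x, y in associations:
        neighbors[x].add(y)
        neighbors[y].add(x)
    return sorted(neighbors[a] & neighbors[c])


print(bridging_terms("magnesium", "migraine", ASSOCIATIONS))
```

No pair in the list links magnesium and migraine directly; the connection only appears when the separate statements are combined.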

How to Automate This? • Idea: mixed-initiative interaction • User applies tools to help explore the hypothesis space • System runs suites of algorithms to help explore the space, suggest directions

Our Proposed Approach • Three main parts • UI for building/using strategies • Backend for interfacing with various databases and translating different formats • Content analysis/machine learning for figuring out good hypotheses/throwing out bad ones

How to find functions of genes? • Important problem in molecular biology • Have the genetic sequence • Don’t know what it does • But … • Know which genes it coexpresses with • Some of these have known function • So … Infer function based on function of co-expressed genes • This is new work by Michael Walker and others at Incyte Pharmaceuticals
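
One simple way to picture "infer function based on function of co-expressed genes" is a vote among the co-expressed genes whose function is known. The sketch below is an assumption-laden illustration, not the Incyte method: the function `infer_function`, the gene identifiers, and the function labels are all hypothetical placeholders.

```python
from collections import Counter


def infer_function(gene, coexpressed, known_function):
    """Rank candidate functions for `gene` by votes from co-expressed genes.

    coexpressed: dict mapping a gene to the genes it co-expresses with.
    known_function: dict mapping a gene to its known function label.
    """
    votes = Counter(
        known_function[partner]
        for partner in coexpressed.get(gene, [])
        if partner in known_function  # partners with unknown function abstain
    )
    return votes.most_common()


# Hypothetical toy data, for illustration only.
coexpressed = {"mystery_gene": ["gene_1", "gene_2", "gene_3", "gene_4"]}
known_function = {"gene_1": "function_x", "gene_2": "function_x", "gene_3": "function_y"}
print(infer_function("mystery_gene", coexpressed, known_function))
```

The ranked list is only a candidate hypothesis; as the later slides note, it still needs support from the literature and from lab tests.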

Make use of the literature • Look up what is known about the other genes. • Different articles in different collections • Look for commonalities • Similar topics indicated by Subject Descriptors • Similar words in titles and abstracts adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...

Developing Strategies • Different strategies seem needed for different situations • First: see what is known about Kallikrein. • 7341 documents. Too many • AND the result with “disease” category • If result is non-empty, this might be an interesting gene • Now get 803 documents • AND the result with PSA • Get 11 documents. Better!
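
The narrowing steps on this slide amount to repeated Boolean AND over result sets. A minimal sketch, assuming an in-memory inverted index that maps each term or category to a set of document ids; the function `narrow` and the toy index are hypothetical and do not reproduce the 7341, 803, and 11 counts from the slide.

```python
def narrow(index, terms):
    """AND the posting sets term by term, recording the size after each step."""
    result = None
    trace = []
    for term in terms:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
        trace.append((term, len(result)))
    return result, trace


# Hypothetical toy index, for illustration only.
index = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 7},
    "PSA": {3, 4, 8},
}
docs, trace = narrow(index, ["kallikrein", "disease", "PSA"])
print(trace)
```

Keeping the trace of intermediate sizes mirrors the slide's reasoning: each step shows whether the result is still too large, usefully small, or empty.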

Developing Strategies • Look for commonalities among these documents • Manual scan through ~100 category labels • Would have been better if • Automatically organized • Intersections of “important” categories scanned for first
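
The wish for automatically organized category labels can be illustrated by counting how often each label, and each pair of labels, occurs across the retrieved documents. This is a sketch under stated assumptions: `rank_categories` is a hypothetical helper, frequency stands in for "importance", and the assignment of labels to documents below is invented for illustration.

```python
from collections import Counter
from itertools import combinations


def rank_categories(doc_labels):
    """Rank single labels and label pairs by the number of documents carrying them."""
    singles = Counter()
    pairs = Counter()
    for labels in doc_labels.values():
        unique = sorted(set(labels))  # count each label once per document
        singles.update(unique)
        pairs.update(combinations(unique, 2))
    return singles.most_common(), pairs.most_common()


# Hypothetical label assignments, for illustration only.
doc_labels = {
    "doc_1": ["prostatic neoplasms", "tumor markers"],
    "doc_2": ["prostatic neoplasms", "tumor markers", "antibodies"],
    "doc_3": ["prostatic neoplasms"],
}
singles, pairs = rank_categories(doc_labels)
print(singles[0], pairs[0])
```

Scanning the top-ranked pairs first is one reading of "intersections of important categories scanned for first"; other importance measures would slot into the same structure.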

Try a new tack • Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests • New tack: intersect search on all three known genes • Hope they all talk about diagnostics and prostate cancer • Fortunately, 7 documents returned • Bingo! A relation to regulation of this cancer

Formulate a Hypothesis • Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer • New tack: do some lab tests • See if mystery gene is similar in molecular structure to the others • If so, it might do some of the same things they do

Strategies again • In hindsight, combining all three genes was a good strategy. • Store this for later • Might not have worked • Need a suite of strategies • Build them up via experience and a good UI

The System • Doing the same query with slightly different values each time is time-consuming and tedious • Same goes for cutting and pasting results • IR systems don’t support varying queries like this very well. • Each situation is a bit different • Some automatic processing is needed in the background to eliminate/suggest hypotheses

The UI part • Need support for building strategies • Mixed-initiative system • Trade off between user-initiated hypotheses exploration and system-initiated suggestions • Information visualization • Another way to show lots of choices

[Figure: interface panels labeled Candidate Associations, Suggested Strategies, and Current Retrieval Results]

Summary • The future: analyzing what the text is about • We don’t know how; text is tough! • Idea: bring the user into the loop. • Build up piecewise evidence to support hypotheses • Make use of partial domain models. • The Truth is Out There!

Summary • Text Data Mining: • Extracting heretofore undiscovered information from large text collections • Information Access ≠ TDM • IA: locating already known information that is currently of interest • Finding patterns across text is already done in CL • Tells us about the behavior of language • Helps build very useful tools!
