Cross-Corpus Analysis with Topic Models
Analysis, Exploration, and Retrieval of Information across Multiple Corpora
Padhraic Smyth, Mark Steyvers, Dave Newman, Chaitanya Chemudugunta
University of California, Irvine

Example Cross-Corpus Retrieval

Query, NYT article: "the four remaining directors of the enron corporation who oversaw the company's rise to the top 10 of the fortune 500 and its collapse into bankruptcy last year quit today. the resignations of the directors robert a. belfer, norman p. blake jr., wendy l. gramm and herbert s. winokur jr. were accepted unanimously by the board, enron said. the board announced in february 2002 its intent to conduct an orderly transition to a board composed of new independent directors …"

Best matching Enron email: "earlier today we issued a press release announcing further progress on the transition of enron's board of directors. at a board meeting this morning the board accepted the resignations of the remaining four long-standing directors: robert a. belfer, norman p. blake, dr. wendy l. gramm and herbert s. winokur jr. the board also elected raymond s. troubh as interim chairman of the board and noted that they have identified three independent director candidates whose election is pending feedback from the creditors committee …"

[Figure: query article and retrieved email shown with their inferred topic mixtures P(topic); high-probability words of the shared topics include board, directors, members, committee, ceo, chairman, president, ken_lay, enron_stock, gisb]

• Example: two corpora, Enron emails and New York Times articles that mention "Enron"
• Problem: how to find Enron emails relevant to a New York Times article (or vice versa)?
• Approach:
  1) Train two separate topic models
  2) Map the query into the topic space of the other corpus
  3) Calculate relevance by proximity in topic space (e.g. using Jensen-Shannon divergence)
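The proximity step in the approach above can be sketched in a few lines. This is a minimal illustration assuming the topic mixtures P(topic) have already been inferred for the query and the candidate documents; the function names are mine, not from the poster.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic mixtures (discrete distributions)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # KL divergence, summing only over entries with nonzero mass in a
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_by_topic_proximity(query_mix, corpus_mixes):
    """Rank corpus documents by JS divergence to the query mixture (smaller = more relevant)."""
    scores = [js_divergence(query_mix, d) for d in corpus_mixes]
    return np.argsort(scores)
```

JS divergence is symmetric and bounded (in [0, 1] with base-2 logs), which makes it a convenient distance between the query's mixture under corpus B's topics and each of B's documents.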
Data sets: emails, intelligence reports, news articles. We looked at:
• New York Times articles: 3,000 articles that mention "Enron"
• Enron email data: 500,000 emails, 11k different authors, 1999-2002
• PubMed: 15,000,000 articles; queries can return 100k or more articles

Applications:
• Corpus comparison: automatically compare topics across two different corpora
• Cross-corpus retrieval: given a document in corpus A, find similar documents in corpus B
• "GateKeeper": given a document in corpus A, compute the likelihood of finding matching documents in corpus B, without looking at individual document records

GateKeeper
• Application: an analyst wants to check whether some report X (the query) has any similar documents in a secure database at a different agency. The analyst uses "GateKeeper" to assess whether there are any relevant documents before going through the lengthy process of securing access.
• Problem: the information retrieval model cannot have access to individual documents either; it only has summaries of topics across the whole database
• Solution: score the query document by its log likelihood under the topic model, using only the topics
• Simulation: treat Biobase documents as the secure database. Probe with (relevant) new Biobase documents or (irrelevant) computer science documents from CiteSeer. The figure shows that relevant documents can be discriminated from irrelevant ones based on this global measure.

Corpus Comparison
• What are the topical similarities and differences between two large sets of documents?
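The GateKeeper score, the log likelihood of the query using only per-topic word distributions, might be sketched as follows. The array shapes, the uniform topic weighting, and the per-word normalization are illustrative assumptions, not details from the poster.

```python
import numpy as np

def gatekeeper_score(query_word_ids, topic_word, topic_weights=None):
    """
    Average per-word log likelihood of a query document under a topic-model
    summary of a hidden corpus.

    topic_word:    (T, V) array; each row is one topic's distribution over the vocabulary.
    topic_weights: (T,) prior over topics; assumed uniform if not given.

    Only the topic summaries are used; no individual documents are accessed.
    """
    T, V = topic_word.shape
    if topic_weights is None:
        topic_weights = np.full(T, 1.0 / T)  # assumption: uniform topic weights
    word_probs = topic_weights @ topic_word  # marginal p(w) = sum_t p(t) p(w|t)
    log_lik = np.sum(np.log(word_probs[query_word_ids]))
    return log_lik / len(query_word_ids)     # normalize by query length
```

A query whose words are probable under the database's topics scores higher than one whose words are not, which is the discrimination the Biobase-vs-CiteSeer simulation illustrates.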
• Example: PubMed papers before 1980 compared with 2003 …
• Example: PubMed papers from China and Israel …

[Figure: corpus comparison results. Left panel: topics common to pre-1980 and 2003 PubMed papers (child mortality, cattle diseases, gene mutations, cell marrow, plague study, patient diagnosis, gene sequences, cases reported, proteins) versus topics specific to 2003 (SARS 11.0, ricin binding 6.1, brucellosis 4.1, biological agents 5.5, animal infections 3.7, HIV 4.5). Right panel: topics common to China and Israel PubMed papers (biological agents 24.5, cell marrow 30.0, serum levels 24.5, gene sequences 22.2, antibodies 13.5, vaccination, SARS 10.0, public health 8.2) versus country-specific topics (terrorist injuries 14.9, west nile virus 12.2, september 11 11.0, nerve motor study, acid mass detection)]

Probabilistic Topic Models
• topic = distribution over words
• document = mixture of topics
• Topic models can be learned automatically using statistical learning [e.g. Griffiths and Steyvers (2004)]

[Figure: four example topics learned from New York Times articles. Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN, QAEDA, TERRORIST_ATTACKS, … Wall Street: WALL_STREET, ANALYSTS, INVESTORS, GOLDMAN_SACHS, MERRILL_LYNCH, SALOMON_SMITH_BARNEY, INVESTMENT_BANKING, … Stock Market: DOW_JONES, 10_YR_TREASURY_YIELD, NASDAQ_COMPOSITE, STANDARD_POOR, 500_STOCK_INDEX, … Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ENRON, KMART, CHAPTER_11, RESTRUCTURING, …]

[Figure: GateKeeper simulation, log-likelihood histograms for Biobase and CiteSeer probe documents under the topic model and under a word model]

Collocation Topic Model
• For each document, choose a mixture of topics
• For every word slot, sample a topic
• If x = 0, sample a word from the topic
• If x = 1, sample a word from the distribution based on the previous word
• The new model combines frequent word combinations (collocations) with topics
• The model automatically extracts topics and word combinations
• Collocations in topics improve interpretability: e.g. "United_States", "Sept_11", "Osama_Bin_Laden"
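The generative process in the bullets above can be sketched as a toy sampler. Everything here is invented for illustration: the distributions, the fixed switch probability `p_colloc`, and the function name; in the real model these distributions and the x-switch are learned from data.

```python
import random

def generate_document(n_words, topic_mix, topics, collocations, p_colloc=0.3, seed=0):
    """
    Toy sampler for the collocation topic model sketched in the bullets above.

    topic_mix:    {topic: prob}, the document's mixture of topics
    topics:       {topic: {word: prob}}, each topic's distribution over words
    collocations: {prev_word: {word: prob}}, next-word distributions for
                  words that start frequent combinations
    """
    rng = random.Random(seed)

    def draw(dist):
        # sample one key from a {item: prob} distribution
        r, acc = rng.random(), 0.0
        for w, p in dist.items():
            acc += p
            if r <= acc:
                return w
        return w  # guard against floating-point shortfall

    words, prev = [], None
    for _ in range(n_words):
        # sample a topic for this word slot from the document's mixture
        topic = rng.choices(list(topic_mix), weights=list(topic_mix.values()))[0]
        # x = 1 (collocation route) only when the previous word starts a known collocation
        x = 1 if (prev in collocations and rng.random() < p_colloc) else 0
        if x == 1:
            word = draw(collocations[prev])   # word depends on the previous word
        else:
            word = draw(topics[topic])        # word drawn from the sampled topic
        words.append(word)
        prev = word
    return words
```

With collocation entries like `{"osama": {"bin": 1.0}, "bin": {"laden": 1.0}}`, the x = 1 route chains individual tokens into units such as "osama bin laden", which is how collocations end up inside topics.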