Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001
Overview • Classification Tools and Types • Consistent Controlled Classification Schemes Across All Content • Benefits of C.C.C.S. • Approaches to “Portable” Classification • Challenges • Examples • Q & A
Introduction • Mark Shewhart • LexisNexis • One of the early innovators in building on-line databases and search tools with classification • Currently provides a growing range of tools, solutions, and services to support the information needs of government organizations, companies, and individuals
Uncontrolled Classification • PROS • No manual development of classification algorithms or searches • Aids in knowledge discovery & taxonomy development • Adapts to changing terminology and topics • CONS • Difficulty providing meaningful labels to taxonomy • Problematic on fine grained rules • Examples • Verity, Semio, SRA’s NetOwl Extractor, InXight’s Thing Finder, LEXIS-NEXIS core-terms
Controlled Classification • Machine Learning • Provide several hundred “on-point” samples per topic • Most systems do not allow for manual intervention • Examples - Verity, Semio, Autonomy, InXight, Purple Yogi, Webmind, Fulcrum, SmartLogik. • Manually Created “Algorithms” • Human indexers manually create the algorithm for each topic • Examples - Any Boolean Search Engine, Verity, InXight Classifier, LEXIS-NEXIS SmartIndexing, Factiva Intelligent Indexing, Metacode, Sageware.
Controlled Classification • Basic search tools with complex queries created by domain experts are a form of controlled classification • Natural Language • Verity, Alta-Vista, LexisNexis, West ... • Boolean • MS Site Server, Alta-Vista, LexisNexis, West, Factiva, Dialog ... • Enhanced - additional “beyond boolean” operators/control • Verity, Semio ...
Uses for Uncontrolled Classification • Taxonomy Development • Several companies market tools focused on taxonomy development • Knowledge Discovery • Relationships between terms • New or changing terms
Consistent Classification Scheme Everywhere • Your Intranet, The Web, and Premium Content Providers • Search all three using the same taxonomy • A consistent, controlled, classification scheme facilitates • data analysis & visualization - BIZ360, I2 • Intra-document linking by taxonomy nodes • Investigative Analysis of content
Consistent Classification - One Stop Search Premium Content One Stop Search Mining Web Content Your Intranet
Consistent Classification - Locate & Link Dossier Explore LEXIS-NEXIS for Microsoft Case Law Case with Microsoft as a Party Patents Microsoft News Computing & Tech News Computer Company News Microsoft Web Site
Company Tracking and Analysis User pre-selects companies to track. Your Companies MICROSOFT CORP INTEL DELL COMPUTER CORP
Company Tracking and Analysis User selects Microsoft Corp. Higher than average coverage flagged Your Companies MICROSOFT CORP INTEL DELL COMPUTER CORP
Company Tracking and Analysis The next day - User is back again Extremely high coverage flagged Your Companies MICROSOFT CORP INTEL DELL COMPUTER CORP
Company Tracking and Analysis Click on the red circle for News Topic Analysis Your Companies MICROSOFT CORP INTEL DELL COMPUTER CORP
Company Tracking and Analysis User clicks on the “STOCKS” bar for the news Your Companies MICROSOFT CORP INTEL DELL COMPUTER CORP
Answer Set Navigation Executive Changes More Executive Changes Stocks User clicks on Topic Analysis Lawsuits More Stocks More Lawsuits
Consistent Classification - Trending • Trend Analysis of Metadata • NEXercise User Selected Indexing Terms: Online Trading, Electronic Commerce, Internet Crime • Download into Excel Spreadsheet
Consistent Classification - Press Trending • Trending in the News • The Washington Times, May 05, 2000, … "A Nexis search of 'extreme right' over the past month scored 212 mentions; a Nexis search of 'extreme left' over the past month yielded 58 items. • MC Technology Marketing Intelligence, December 1, 1999 … We looked at such quantitative data as stock performance in 1999 and the number of press mentions (as shown in a Lexis-Nexis search), • Fortune, October 12, 1998, … Just how addicted to cliches are financial media editors? Here's a list of fave words and the number of stock market stories in which they appeared, generated by a Lexis-Nexis search from the end of August to Sept. 11: Turmoil: 1,559; plunge: 1,260; crash: 965; correction: 860; bear market: 750; ... • International Herald Tribune (Neuilly-sur-Seine, France), July 4, 2000, Tuesday … The National Security Agency certainly features regularly in Mr. Gertz's coverage. A Lexis-Nexis search lists 132 Gertz stories in The Washington Times going back to 1989 that have mentioned the agency. • The Washington Post, June 28, 2000, ... easily discern one of the issues of greatest concern to voters: George W. Bush's position on the death penalty. A Nexis search Monday for stories mentioning Bush at least three times and the words "death penalty" or "executions" or "capital punishment" at least three … • The New York Times, June 14, 2000, ... tally the Hotline political tip sheet keeps of how often possible vice-presidential choices merit a major media mention. Mr. Danforth had 10 mentions, compared with 49 for Gov. Tom Ridge of Pennsylvania, No. 1 on the 53-name list.
Consistent Classification - Source Suggestion • Automatic Suggestion of Sources • What are these?
LEXIS-NEXIS Suggest-a-Source User Selected Indexing Term: IPOs
LEXIS-NEXIS top Sources for IPOs: Cable News Network F, M&A Journal, AFX-Extel News, PR Newswire, Business Wire, Phillips Newsletter, Financial Times, Institutional Invest, IAC News, Business Times, Cable News Network, Asia Intelligence Wire, Financial Post, New York Post
LEXIS-NEXIS Suggest-a-Source User Selected Indexing Term: Denver Broncos
LEXIS-NEXIS top Sources for Denver Broncos: Rocky Mountain News, Denver Post, Sports Network, Associated Press, Seattle Post-Intelligencer, USA Today, Washington Post, Orlando Sentinel, Kansas City Star, Regal-fort Worth Star, San Diego Union Tribune
Consistent Classification - More Than a Cite List • Source Analyzer • Source Analyzer highlights Common Terms
Source Analyzer™ User Selected Sources: Dayton Daily News, Washington Post, LA Times (Download into Excel Spreadsheet)
NEXIS Source Analyzer™ Dayton Daily News Topics: 2697 Sports•, 2616 Athletes, 2181 Basketball, 1871 Campaigns & Elections, 1772 College Sports, 1503 Cities, 1476 Lawyers, 1473 Baseball & Softball, 1438 High School Sports, 1345 Violent Crime, 1258 Litigation, 1207 Sentencing, 1158 Judges, 1132 American Football, 1086 Fundraising, 937 Television Programming, 931 Deaths & Obituaries, 857 Diseases & Disorders, 852 Settlements & Decisions, 837 Arrests
NEXIS Source Analyzer™ Washington Post Topics: 11410 Sports•, 8567 Campaigns & Elections, 7439 Athletes, 6415 Lawyers, 4665 Basketball, 4498 Violent Crime, 4393 Banking & Finance, 4265 Entertainment & Arts, 4155 Baseball & Softball, 3938 Judges, 3753 International Relations, 3703 Budget, 3675 College Sports, 3557 Cities, 3397 Litigation, 3384 Sentencing, 3243 Candidates, 3202 American Football, 3109 Television Programming, 2758 Fundraising
NEXIS Source Analyzer™ Los Angeles Times Topics: 6080 Sports•, 3375 Cities, 3101 Campaigns & Elections, 2915 High School Sports, 2815 Athletes, 2800 Lawyers, 2360 Basketball, 2347 Baseball & Softball, 2341 Letters & Comments, 2241 College Sports, 2188 Violent Crime, 2113 San Fernando Valley, 1918 Television Programming, 1851 Litigation, 1793 Judges, 1711 Deaths & Obituaries, 1504 Editorials & Opinions, 1410 Environment, 1391 Television Industry, 1380 Sentencing
Consistent Classification - More Than a Cite List • Source Analyzer • Source Analyzer™ highlights Common Terms
Source Analyzer™ User Selected Sources: Financial Times, USA Today (Download into Excel Spreadsheet)
NEXIS Source Analyzer™ Financial Times Topics: 61039 Banking & Finance, 32061 Mergers & Acquisitions, 18869 Telecommunications, 18112 Trade Agreements, 17499 Campaigns & Elections•, 13484 Currencies, 11458 Computing & Technology, 11121 International Relations, 11056 Exchange Rates, 11009 Privatization, 10229 Emerging Markets, 10160 Energy, 9015 Joint Ventures, 8959 Stock Indexes, 8680 Debt, 8609 Budget, 8606 Automakers, 8424 Engineering, 8347 Central Banks, 8110 Taxes
NEXIS Source Analyzer™ USA Today Topics: 30235 Sports, 17591 Athletes, 9006 Baseball & Softball, 9003 College Sports, 8989 Basketball, 8287 Television Programming, 7501 American Football, 7355 Campaigns & Elections•, 6485 Lawyers, 6370 Banking & Finance, 5662 Olympics, 4975 Entertainment & Arts, 4884 Television Industry, 4469 Polls & Surveys, 3975 Litigation, 3832 Airlines, 3363 Judges, 3335 Violent Crime, 3331 International Relations, 2933 Network Television
• The New Republic, JULY 26, 1999 … The U.S. section is lambasted for repeating what was reported in the American press. To prove it, Sullivan does a Nexis search on the topic of each article in a random issue and compares what he finds to The Economist. The results are not surprising.
Reporter Analysis • What is a reporter covering?
ByLine Analyzer™ User Selected Reporter: Steve Schmidt (Download into Excel Spreadsheet)
NEXIS ByLine Analyzer™ Steve Schmidt reported Topics: 13 CITIES, 10 NATIONAL PARKS, 10 CAMPAIGNS & ELECTIONS, 8 SUBURBS, 8 MARRIAGE, 7 THEME PARKS, 6 VIOLENT CRIME, 6 SECONDARY SCHOOLS, 5 SPORTS, 5 PUBLIC TRANSPORTATION
NEXIS ByLine Analyzer™ Steve Schmidt reported Companies: 5 MICROSOFT CORP, 1 WALT DISNEY CO INC, 1 PACIFIC LUMBER CO, 1 PACIFIC BELL, 1 MAPES HOTEL, 1 DESTINATION PALM BEACH, 1 ALTURAS CASINO, 1 ALASKA AIR GROUP INC
NEXIS ByLine Analyzer™ Steve Schmidt reported Organizations: 4 SAN DIEGO STATE UNIVERSITY, 4 FEDERAL BUREAU OF INVESTIGATION, 3 SAN DIEGO CITY COUNCIL, 3 NATIONAL PARK SERVICE, 2 WILD HORSE ORGANIZED ASSISTANCE, 2 VALLEY MIDDLE SCHOOL, 2 UNIVERSITY OF CALIFORNIA (LOS ANGELES), 2 SAN DIEGO PADRES, 2 HELIX HIGH SCHOOL, 1 YOSEMITE INSTITUTE
NEXIS ByLine Analyzer™ Steve Schmidt reported People: 4 DAVID KNIGHT, 3 SHAWN STINSON, 3 EMILIO ESTEVEZ, 3 CHARLIE SHEEN, 3 BILL GATES, 3 ALBERT GORE JR, 2 WILLIE L BROWN, 2 SCOTT HINSON, 2 PETE KNIGHT, 2 MICHAEL GONZALEZ
Topic Analysis • Who’s involved & who’s reporting on the recent rash of bacteria-related product recalls?
Topic Analyzer™ User Selected Topics: Product Recalls, Bacteria (Download into Excel Spreadsheet)
NEXIS Topic Analyzer™ Top Reporters: 2 ROBERT WALKER, 2 NICOLE BAILEY, 2 LYNNE KOZIEY, 1 SHAWN OHLER, 1 SARAH GREEN, 1 QUINTIN ELLISON, 1 MATTHEW P BLANCHARD, 1 MARTHA M. HAMILTON, 1 MARLENE HABIB, 1 MARK BROWN, 1 LYLE HARVEY, 1 KATHERINE HARDING, 1 KAREN CLARK LEPOOLE, 1 JOHN TAYLOR, 1 JESSICA HANSEN, 1 IAN MCDOUGALL, 1 FRED ANKLAM JR, 1 DONNA CASEY, 1 DINA CAPPIELLO, 1 CHU SHOWWEI, 1 CHRISTINE WINTER, 1 BILL EGBERT, 1 BARBARA DURBIN
NEXIS Topic Analyzer™ Top related Companies: 29 MOYER PACKING CO, 16 IBP INC, 12 PACKERLAND PACKING CO INC, 11 KRAFT FOODS, 6 LAKESIDE FARM INDUSTRIES, 5 PHILIP MORRIS COS INC, 5 FOOD SAFETY & INSPECTION SERVICE, 4 SNOW BRAND MILK PRODUCTS CO LTD, 3 GARDEN BOTANIKA INC, 2 XL FOODS, 2 STOP & SHOP SUPERMARKET CO, 2 LAKESIDE PACKERS, 2 GIANT FOOD STORES INC, 2 DEL GOULD MEATS INC, 2 COSTCO WHOLESALE CORP
Approaches • “ASP” Service Model • [Diagram labels: Service Provider, Internet, Documents, Categories, Customer]
Approaches • Port The Classification Application to run in user’s environment • Software • Intellectual Capital
Approaches • Port the Intellectual Capital to another classification system’s format & logic Verity Users Semio Users Autonomy Users Hummingbird Users Inxight Users
Challenges • Operator Incompatibility • Parsing vs Inverted Word Index Tools • Document Length Adjustments
Search Operator Compatibility • Many Boolean search systems do not have a frequency operator - ATLEASTn( term ) at LexisNexis • Years ago, LexisNexis noticed that many experienced searchers were simulating a frequency operator by cascading an existing proximity operator • cat W/9999 cat W/9999 cat • To simulate ATLEAST3( cat ) • How do we port an ATLEASTn() search to a system without a proximity operator or a system that does not cascade proximity operators?
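The cascading trick described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `simulate_atleast` and the default window of 9999 are assumptions taken from the slide's example, not part of any LexisNexis tool.

```python
def simulate_atleast(term: str, n: int, window: int = 9999) -> str:
    """Build a cascaded-proximity query that approximates ATLEASTn(term).

    Repeats the term n times, joined by a wide W/<window> proximity
    operator, as experienced searchers did by hand.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return f" W/{window} ".join([term] * n)


print(simulate_atleast("cat", 3))  # cat W/9999 cat W/9999 cat
```

The sketch only generates the query text. Whether the target engine accepts cascaded proximity operators is exactly the porting question the slide raises.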
Porting Boolean Searches - Verity Example • ATLEASTn Operator • LNG Boolean: ATLEASTn( expr ) • Verity: <COMPLEMENT>( <YESNO>( <COMPLEMENT>( <AND>( <MULT/[10000/n]>( <FREQ>( expr ) ) ) ) ) ) • NOTE: • ATLEASTn( expr1 or expr2 or … or exprX ) is equivalent to ATLEASTn( expr1 ) or ATLEASTn( expr2 ) or … or ATLEASTn( exprX ) • ATLEASTn( expr1 and expr2 and … and exprX ) is equivalent to ATLEASTn( expr1 ) and ATLEASTn( expr2 ) and … and ATLEASTn( exprX )
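To see why the nested Verity expression behaves like ATLEASTn, it helps to model the score arithmetic. The sketch below is a Python model, not Verity code, and it rests on assumptions inferred from the later Word Concept Buckets slides: that `<MULT/k>( <FREQ>( expr ) )` contributes k/10000 per occurrence, that a single-argument `<AND>` caps the score at 1.0, that `<YESNO>` maps any non-zero score to 1, and that `<COMPLEMENT>` is 1 minus the score.

```python
from fractions import Fraction


def complement(score: Fraction) -> Fraction:
    """Model of <COMPLEMENT>: 1 minus the score."""
    return 1 - score


def yesno(score: Fraction) -> Fraction:
    """Model of <YESNO> (assumed): any non-zero score becomes 1."""
    return Fraction(1) if score > 0 else Fraction(0)


def atleast_score(count: int, n: int) -> Fraction:
    """Score of the ported ATLEASTn expression for a document with `count` hits."""
    per_occurrence = Fraction(1, n)               # assumed effect of <MULT/[10000/n]>
    capped = min(Fraction(1), count * per_occurrence)  # assumed cap from <AND>
    # Inner complement is 0 only when count >= n; YESNO makes it binary;
    # outer complement flips it back so that 1 means "at least n".
    return complement(yesno(complement(capped)))


for hits in range(5):
    print(hits, atleast_score(hits, 3))
```

Under these assumptions the score is 0 for fewer than n occurrences and 1 otherwise, which is the Boolean behavior ATLEASTn needs.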
Automatic Stemming - Precision Issues Many search engines perform automatic stemming. Stemming is needed for depluralization, which was assumed when the Search Advisor searches were created and tested. Unfortunately, this “stemming” also lets words match morphological variants other than singular/plural forms. For example, a search on CONSTITUTION may match CONSTITUTIONAL. This causes the ported searches to retrieve documents that the LN Boolean search does not. Some possible solutions: • Do nothing. The variants are often similar in concept. Confirming this would require more detailed domain-by-domain analysis. • Some search tools allow the user to put “quotes” around terms to turn off stemming. If so, put quotes around all terms and generate additional terms in our search to simulate depluralization. • Put quotes around all terms and do NOT generate new terms. This omits depluralization as well, which would likely cause a large drop in recall.
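The second option, quoting every term and adding plural variants by hand, might look like the following sketch. The pluralization rules are deliberately naive and the `("a" OR "b")` output syntax is a hypothetical target format. Both are assumptions for illustration, and a real port would use the target engine's own syntax and a proper plural table.

```python
def naive_plural(word: str) -> str:
    """Very rough English pluralization (assumption: regular nouns only)."""
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"


def quote_with_plural(term: str) -> str:
    """Quote a term to disable stemming and OR in its plural form."""
    term = term.lower()
    return f'("{term}" OR "{naive_plural(term)}")'


print(quote_with_plural("CONSTITUTION"))
```

With stemming turned off by the quotes, CONSTITUTIONAL no longer matches, while the explicit plural keeps the depluralization that the original searches assumed.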
Porting Boolean Searches - Recall Issues Proximity operators are affected by differences in the set of non-searchable “noise” words. Porting LexisNexis searches to a system with fewer noise words will cause some documents matched by the LexisNexis search engine not to be retrieved. For example, the search ATTACHED w/5 POLE matches the following text in LN but may not match in the target system: “cable attached to the hopper which the gin-pole”. This also occurs with W/1 searches, which are really phrases. We may miss documents on the term SURETY CONTRACT when LN matched it in the phrase SURETY TO THE CONTRACT. Possible solution - Increase n by 1 or 2 in the ported search. This could have precision impacts.
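The effect of noise words on proximity counts can be illustrated with a small Python model. Everything here is an assumption made for the example: the noise list (`to`, `the`, `which`), the tokenization (hyphens split words), and the reading of W/n as "positions differ by at most n searchable words".

```python
import re


def searchable_words(text, noise_words):
    """Lower-case the text, split on non-alphanumerics, and drop noise words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in noise_words]


def within(text, left, right, n, noise_words=frozenset()):
    """True if `left` and `right` occur within n searchable words of each other."""
    words = searchable_words(text, noise_words)
    left_pos = [i for i, w in enumerate(words) if w == left]
    right_pos = [i for i, w in enumerate(words) if w == right]
    return any(abs(i - j) <= n for i in left_pos for j in right_pos)


LN_NOISE = frozenset({"to", "the", "which"})  # hypothetical noise list
TEXT = "cable attached to the hopper which the gin-pole"

print(within(TEXT, "attached", "pole", 5, LN_NOISE))  # noise words skipped
print(within(TEXT, "attached", "pole", 5))            # every word counted
print(within(TEXT, "attached", "pole", 7))            # widened window
```

In this model the first call matches, the second does not, and widening the window by 2 restores the match. A wider window also admits documents the original search would have rejected, which is the precision cost the slide mentions.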
Porting Uncontrolled Classification Tools To Yours • Many companies market uncontrolled classification tools that automatically create categories • Many cluster terms and assign weights different than TFIDF • Natural Language Search: cat, dog, puppy, mouse • Weights: .4 cat, .2 dog, .3 puppy, .4 mouse • Natural Language Search: cat, cat, cat, cat, dog, dog, puppy, puppy, puppy, mouse, mouse, mouse, mouse • New Weighted Natural Language Search that does not use TFIDF: cat(0.4), dog(0.2), puppy(0.3), mouse(0.4)
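The repeated-term search on this slide appears to encode each weight as a repetition count (0.4 becomes four copies, 0.2 becomes two). A minimal Python sketch of that expansion follows. The function name and the scale factor of 10 are assumptions chosen to reproduce the slide's example.

```python
def expand_weights(weights: dict, scale: int = 10) -> str:
    """Turn term weights into a natural language query with repeated terms.

    Each term is repeated round(weight * scale) times, so an engine that
    has no explicit weighting syntax still sees heavier terms more often.
    """
    parts = []
    for term, weight in weights.items():
        parts.extend([term] * round(weight * scale))
    return ", ".join(parts)


weights = {"cat": 0.4, "dog": 0.2, "puppy": 0.3, "mouse": 0.4}
print(expand_weights(weights))
```

For engines that do accept explicit weights, the last form on the slide, `cat(0.4), dog(0.2), ...`, avoids the repetition entirely.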
LN Topical Indexing to Verity Example
#SUBJECT:
#CVTS:
#SUBJ=CATS & DOGS EXAMPLE
#TERMS:
#WEIGHT=1
#THRESH=5
#FREQLMT=4 {fl01 = 4}
#TERM01=cat
#TERM01=cats
#FREQLMT=4 {fl02 = 4}
#TERM02=dog
#TERM02=dogs
Word Concept Buckets • The #TERM01 word concept counts with a frequency limit of 4 on a scale of 0.0 to 1.0 can be represented in Verity as: • <SUM>( <AND>( <MULT/2500>( <FREQ>( “cat” ) ) ), <AND>( <MULT/2500>( <FREQ>( “cats” ) ) ) ) • The #TERM02 word concept counts with a frequency limit of 4 on a scale of 0.0 to 1.0 can be represented in Verity as: • <SUM>( <AND>( <MULT/2500>( <FREQ>( “dog” ) ) ), <AND>( <MULT/2500>( <FREQ>( “dogs” ) ) ) )
Word Concept Buckets • Examples of the TERM01 word concept counts (FL=4)
Expression: <SUM>( <AND>( <MULT/2500>( <FREQ>( “cat” ) ) ), <AND>( <MULT/2500>( <FREQ>( “cats” ) ) ) )
# cat/cats    Score
0             0.00
1             0.25
2             0.50
3             0.75
4             1.00
5+            1.00
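The bucket scores in this table can be reproduced with a short Python model. It is a sketch under stated assumptions, not Verity's implementation: each occurrence is taken to contribute 1/FL (the effect of `<MULT/2500>` when FL=4), each `<AND>` is taken to cap its term at 1.0, and `<SUM>` is assumed to saturate at 1.0 as well.

```python
from fractions import Fraction


def bucket_score(counts, freq_limit):
    """Score of one word concept bucket given per-term hit counts.

    counts: occurrences of each term in the bucket, e.g. [cat, cats].
    freq_limit: the #FREQLMT value; hits beyond it add nothing.
    """
    one = Fraction(1)
    per_term = [min(one, Fraction(c, freq_limit)) for c in counts]  # <AND>( <MULT>( <FREQ> ) )
    return min(one, sum(per_term, Fraction(0)))                     # <SUM>, assumed to cap at 1


for hits in range(6):
    print(hits, f"{float(bucket_score([hits, 0], 4)):.2f}")
```

Using `Fraction` keeps the arithmetic exact, so the printed values line up with the table's 0.25 steps without floating-point noise.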
Blocking Effect
#SUBJECT:
#CVTS:
#SUBJ=CAT DOG EXAMPLE
#TERMS:
#THRESH=4
#FREQLMT=5 {fl01 = 5}
#TERM01=cat dog
#FREQLMT=3 {fl02 = 3}
#TERM02=cat
#TERM02=dog
#BLOCK=cat food
#BLOCK=dog food
• In SmartIndexing, we do not count “cat” if it is in the phrase “cat dog” • This is the Blocking Effect • This is not natural in search systems based on an inverted word index • Very unnatural - “cats and dogs, sleeping together - total hysteria”
Blocking Effect • Verity has the <FREQ> operator, which counts term frequency without the Blocking Effect. • So the “cat” in “cat dog” is counted • But … • <LN-FREQ>(“cat”) = • <FREQ>(“cat”) - <FREQ>(“cat dog”) - <FREQ>(“cat food”) • We have term counts with the blocking effect …. • … Whoops! Verity does not have a <SUBTRACT> operator!
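The `<LN-FREQ>` identity on this slide can be checked with plain counts in Python. The sketch below is illustrative only: the tokenizer and the helper names `phrase_count` and `ln_freq` are assumptions, and it works on raw occurrence counts rather than Verity scores.

```python
import re


def phrase_count(tokens, phrase):
    """Count occurrences of a word or multi-word phrase in a token list."""
    words = phrase.split()
    n = len(words)
    return sum(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def ln_freq(text, term, blocking_phrases):
    """Term frequency with the blocking effect: hits inside blocking phrases are not counted."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    blocked = sum(phrase_count(tokens, p) for p in blocking_phrases)
    return phrase_count(tokens, term) - blocked


text = "The cat ate cat food while a cat dog watched"
print(ln_freq(text, "cat", ["cat dog", "cat food"]))
```

Here the raw count of “cat” is 3, and one hit each is removed for “cat dog” and “cat food”, leaving a single unblocked occurrence.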
Learning to Subtract • Introducing <LNG_SUBTRACT>( b , a ), defined as b - a = • <COMPLEMENT>( <SUM>( <COMPLEMENT>( b ) , a ) ) • Where 0 <= a <= b <= 1 • Follow the math .... • <COMPLEMENT>( <SUM>( <COMPLEMENT>( b ) , a ) ) = • <COMPLEMENT>( <COMPLEMENT>( b ) + a ) = • <COMPLEMENT>( 1 - b + a ) = • 1 - ( 1 - b + a ) = • 1 - 1 + b - a = • b - a
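The derivation can be verified numerically with a small Python model of the two operators. This is a sketch, not Verity code: `<COMPLEMENT>` is modeled as 1 - x, and `<SUM>` is assumed to saturate at 1.0. Under that assumption the condition a <= b matters, because it keeps 1 - b + a at or below 1 so the sum is never clipped.

```python
from fractions import Fraction


def complement(x):
    """Model of <COMPLEMENT>: 1 minus the score."""
    return 1 - x


def verity_sum(*scores):
    """Model of <SUM> (assumed to saturate at 1)."""
    return min(Fraction(1), sum(scores, Fraction(0)))


def lng_subtract(b, a):
    """b - a built only from <COMPLEMENT> and <SUM>, valid for 0 <= a <= b <= 1."""
    if not (0 <= a <= b <= 1):
        raise ValueError("requires 0 <= a <= b <= 1")
    return complement(verity_sum(complement(b), a))


print(lng_subtract(Fraction(3, 4), Fraction(1, 4)))  # 1/2
```

Exact fractions are used so the identity holds without rounding error.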
Actual Results from CATS & DOGS EXAMPLE
• Cats & Dogs Test Summary expected results
Score (Doc_)   0 cat/cats     1 cat/cats     2 cat/cats     3 cat/cats     4 cat/cats     5+ cat/cats
0 dog/dogs     0.0 (CD1)      0.125 (CD7)    0.25 (CD11)    0.375 (CD14)   0.50 (CD16)    0.50 (CD17)
1 dog/dogs     0.125 (CD2)    0.25 (CD8)     0.375 (CD12)   0.50 (CD15)    0.625 (CD27)   0.625 (CD32)
2 dog/dogs     0.25 (CD3)     0.375 (CD9)    0.50 (CD13)    0.625 (CD23)   0.750 (CD28)   0.750 (CD33)
3 dog/dogs     0.375 (CD4)    0.50 (CD10)    0.625 (CD20)   0.750 (CD24)   0.875 (CD29)   0.875 (CD34)
4 dog/dogs     0.50 (CD5)     0.625 (CD18)   0.750 (CD21)   0.875 (CD25)   1.00 (CD30)    1.00 (CD35)
5+ dog/dogs    0.50 (CD6)     0.625 (CD19)   0.750 (CD22)   0.875 (CD26)   1.00 (CD31)    1.00 (CD36)
• Cats & Dogs Test Actual Results
Score (Doc_)   0 cat/cats     1 cat/cats     2 cat/cats     3 cat/cats     4 cat/cats     5+ cat/cats
0 dog/dogs     0.0000 (CD1)   0.1247 (CD7)   0.2494 (CD11)  0.3746 (CD14)  0.4997 (CD16)  0.5000 (CD17)
1 dog/dogs     0.1247 (CD2)   0.2494 (CD8)   0.3742 (CD12)  0.4993 (CD15)  0.6244 (CD27)  0.6247 (CD32)
2 dog/dogs     0.2494 (CD3)   0.3742 (CD9)   0.4989 (CD13)  0.6240 (CD23)  0.7492 (CD28)  0.7494 (CD33)
3 dog/dogs     0.3746 (CD4)   0.4993 (CD10)  0.6240 (CD20)  0.7492 (CD24)  0.8743 (CD29)  0.8746 (CD34)
4 dog/dogs     0.4997 (CD5)   0.6244 (CD18)  0.7492 (CD21)  0.8743 (CD25)  0.9994 (CD30)  0.9997 (CD35)
5+ dog/dogs    0.5000 (CD6)   0.6247 (CD19)  0.7494 (CD22)  0.8743 (CD26)  0.9997 (CD31)  1.0000 (CD36)
• Verity Threshold = THRESH/MAX = 5/8 = 0.625
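The expected-results grid follows from the CATS & DOGS topic definition: two buckets, each with a frequency limit of 4 and weight 1, giving a maximum raw score of 8. The Python sketch below reproduces the expected values and the 5/8 threshold. Treating the threshold comparison as "greater than or equal" is an assumption, as are the function names.

```python
from fractions import Fraction

FREQ_LIMIT = 4   # #FREQLMT for both buckets
THRESH = 5       # #THRESH from the topic definition
MAX_SCORE = 8    # MAX from the slide: THRESH/MAX = 5/8


def expected_score(cats: int, dogs: int) -> Fraction:
    """Expected normalized score for a document with the given hit counts."""
    raw = min(cats, FREQ_LIMIT) + min(dogs, FREQ_LIMIT)
    return Fraction(raw, MAX_SCORE)


def passes(cats: int, dogs: int) -> bool:
    """Assumed rule: a document is indexed when its score reaches THRESH/MAX."""
    return expected_score(cats, dogs) >= Fraction(THRESH, MAX_SCORE)


# Print the expected grid, rows = dog/dogs hits, columns = cat/cats hits.
for dogs in range(6):
    print(" ".join(f"{float(expected_score(cats, dogs)):.3f}" for cats in range(6)))
```

The model only covers the expected grid. The actual Verity scores in the second table differ from it slightly in the third or fourth decimal place.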
Q & A Mark Shewhart Consulting Research Scientist LexisNexis mark.shewhart.3@lexis-nexis.com 937-865-6800 x4717