eClassifier: Tool for Taxonomies Scott Spangler spangles@almaden.ibm.com IBM Almaden Research Center San Jose, CA
Assertions on Taxonomy Generation • Manual methods are too labor-intensive, limit scope and scale, and are not maintainable • Canned taxonomies are a niche solution • There are many “natural” or “right” taxonomies, even on the same collection • Clustering, canned taxonomies and other methods are good starting points, but are not enough on their own
Salient Features of eClassifier • Clustering algorithm independent • bias towards speed for interaction • Classification algorithm independent • evaluate multiple algorithms for given taxonomy • pick best algorithm for each level in taxonomy • Multiple methods to seed taxonomy: • import, clustering, query based • Multiple methods for evaluating, editing and validating taxonomies • Given a taxonomy, analysis/discovery against structured and unstructured information
eClassifier Principles • Apply multiple text mining algorithms to textual data sets in a practical manner. • Provide consistently good results; the goal is not perfection. • Utilize domain expertise by giving the user control over the mining process. • Provide tools, metrics and reports to draw useful conclusions from the analysis.
The Mining Process • Create a dictionary of terms (words and phrases) • Prune the dictionary (remove irrelevant terms) • Cluster documents based on this dictionary • Examine the resulting taxonomy, modifying it based on domain expertise • Create multiple taxonomies (divide and conquer) • Do deeper analysis by creating keyword classifications, comparing taxonomies, inspecting dictionary co-occurrence, and examining recent trends
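The first three steps of the mining process can be sketched in a few lines of Python. This is a minimal illustration, not eClassifier's implementation: the function names, the document-frequency pruning thresholds, the cosine-based k-means and its naive seeding from the first k documents are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_dictionary(docs, min_df=2, max_df_ratio=0.9):
    """Create the dictionary, then prune terms that are too rare or too common.
    The thresholds are hypothetical defaults chosen for this sketch."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    limit = max_df_ratio * len(docs)
    return sorted(t for t, n in df.items() if min_df <= n <= limit)

def vectorize(doc, dictionary):
    """Represent a document as term counts over the dictionary."""
    counts = Counter(tokenize(doc))
    return [counts[t] for t in dictionary]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, iterations=10):
    """Very small k-means using cosine similarity.
    Seeding with the first k documents is an assumption, kept for determinism."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

docs = [
    "printer driver install error",
    "password reset email",
    "printer driver crash",
    "password reset link expired",
]
dictionary = build_dictionary(docs)
vectors = [vectorize(d, dictionary) for d in docs]
labels = kmeans(vectors, k=2)
```

On this toy collection the pruned dictionary keeps only terms shared by at least two documents, and the two resulting classes would then be inspected and edited by a domain expert, as the later steps describe.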
The Class Table For viewing and understanding each level in a taxonomy
Understanding Class Metrics • Class Naming Convention • Shortest possible name that covers the examples • “,” => OR • “&” => AND • X_Y => X followed by Y • NONE => no useful text • Miscellaneous => No easy description • Cohesion • A measure of similarity between documents in the same class (0 = different terms, 100 = same terms) • Distinctness • A measure of dissimilarity between documents in different classes (0 = very similar, 100 = very distinct)
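The slide gives only the 0 to 100 scales for cohesion and distinctness, not the formulas. The sketch below shows one plausible way to compute metrics with those scales, assuming cosine similarity over term-count vectors: cohesion as the mean similarity of a class's documents to the class centroid, and distinctness as one minus the similarity to the nearest other centroid. Both definitions are assumptions for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Mean vector of a class."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cohesion(class_vectors):
    """Assumed definition: mean cosine similarity to the class centroid, scaled to 0-100."""
    c = centroid(class_vectors)
    return 100.0 * sum(cosine(v, c) for v in class_vectors) / len(class_vectors)

def distinctness(class_vectors, other_classes):
    """Assumed definition: 100 * (1 - similarity to the nearest other class centroid)."""
    c = centroid(class_vectors)
    nearest = max(cosine(c, centroid(o)) for o in other_classes)
    return 100.0 * (1.0 - nearest)

class_a = [[1, 0, 1, 0], [1, 0, 1, 0]]   # two documents with identical terms
class_b = [[0, 1, 0, 1], [0, 1, 0, 1]]   # shares no terms with class_a
cohesion_a = cohesion(class_a)
distinctness_a = distinctness(class_a, [class_b])
```

With these toy vectors class_a scores at the top of both scales, since its documents use exactly the same terms and share none with class_b.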
Dictionary Tool • Edit -> Dictionary Tool • Use this to edit the features on which the taxonomy is based • Delete irrelevant or ambiguous terms • Generate and edit synonyms
Dictionary Generation Files • StopWords • words excluded from the dictionary • Synonyms • different forms of the same semantic term • IncludeWords • words that always appear in dictionary • Stock Phrases • text to be ignored in creating dictionary • Synonyms and Stock Phrases can be automatically generated and then edited
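To make the roles of the four files concrete, here is a minimal sketch of how stop words, synonyms, include words and stock phrases might be applied during dictionary generation. The actual file formats are not described in the slides, so the in-memory sets and dicts, the function names and the order of operations are all assumptions.

```python
import re

def normalize_terms(text, stop_words, synonyms, stock_phrases):
    """Turn raw text into dictionary candidate terms.
    Stock phrases are blanked out first, synonyms are mapped to one
    canonical form, and stop words are dropped."""
    lowered = text.lower()
    for phrase in stock_phrases:
        lowered = lowered.replace(phrase.lower(), " ")
    terms = []
    for token in re.findall(r"[a-z]+", lowered):
        token = synonyms.get(token, token)
        if token not in stop_words:
            terms.append(token)
    return terms

def select_dictionary(candidate_terms, include_words, pruned):
    """Include words always appear in the dictionary, even if absent or pruned."""
    return sorted((set(candidate_terms) - set(pruned)) | set(include_words))

# Hypothetical contents of the four files.
stop_words = {"the", "are", "not"}
synonyms = {"printers": "printer", "printing": "print"}
stock_phrases = ["thank you for calling"]
include_words = {"toner"}

terms = normalize_terms(
    "Thank you for calling. The printers are not printing",
    stop_words, synonyms, stock_phrases,
)
final_dictionary = select_dictionary(terms, include_words, pruned={"print"})
```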
Refinement of Classes • Subclass Classes • Subdivide an existing class into multiple subclasses at the next level in the taxonomy • Merge Classes • Delete Classes • Rename Class • Undo • Don’t be afraid to try things • Save • .obj files contain all information eClassifier uses • .class files contain class membership • Read
Class View • For understanding the concepts and contents of a given class • View the text • Most typical • Least typical • View the source Web page • View distinguishing terms • View deduced rules for classification and related documents
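The slides do not say how "most typical" and "least typical" are computed. One common approach, shown here purely as an assumption, is to rank a class's documents by cosine similarity to the class centroid, so the most typical document comes first and the least typical last.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_by_typicality(vectors):
    """Return document indices ordered from most to least typical.
    Typicality is assumed to be similarity to the class centroid."""
    center = [sum(col) / len(vectors) for col in zip(*vectors)]
    return sorted(range(len(vectors)),
                  key=lambda i: cosine(vectors[i], center),
                  reverse=True)

# Term-count vectors for three documents in one hypothetical class.
class_vectors = [[1, 0, 1, 0], [1, 1, 1, 0], [0, 1, 0, 1]]
ranking = rank_by_typicality(class_vectors)
```

Reading the documents at both ends of such a ranking is a quick way to see what a class is about and which members may belong elsewhere.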
Keyword Searching • Edit->Keyword Search • Search for Dictionary terms • Use “and”, “or” and “_” • Searching within a class • Related Words • Look at Trends • Create new Classes • See where the matching documents occur via Class Table
Document/Page Viewer • Sorting Documents • Most typical • Least typical • View distinguishing terms • Representative use of important words • Moving documents • Trend • Reports
Keyword Class Generation • Execute->Classify by Keywords • Open queries (KCG files) • One query per line • .AND., .OR., (, ) • Add, Rename, Delete queries • Prioritize – Move up and down • Multiple/only one class • Ambiguous/first matching class • Run Queries • Save Queries • Run eClassifier
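A prioritized list of boolean queries with "first matching class" assignment can be sketched as follows. The KCG file layout is not given in the slides, so the (class name, query) pairs, the parser, and the choice that .AND. binds tighter than .OR. are assumptions for illustration only.

```python
import re

TOKEN = re.compile(r"\.AND\.|\.OR\.|\(|\)|[A-Za-z0-9_]+")

def parse(query):
    """Parse a query into a nested tuple tree (recursive descent)."""
    tokens = TOKEN.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("query ended unexpectedly")
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        node = parse_and()
        while peek() == ".OR.":
            take()
            node = ("or", node, parse_and())
        return node

    def parse_and():
        node = parse_atom()
        while peek() == ".AND.":
            take()
            node = ("and", node, parse_atom())
        return node

    def parse_atom():
        tok = take()
        if tok == "(":
            node = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return node
        if tok in (")", ".AND.", ".OR."):
            raise ValueError("unexpected token: " + tok)
        return ("term", tok.lower())

    tree = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return tree

def matches(tree, terms):
    """Evaluate a parsed query against the set of terms in a document."""
    if tree[0] == "term":
        return tree[1] in terms
    left, right = matches(tree[1], terms), matches(tree[2], terms)
    return (left and right) if tree[0] == "and" else (left or right)

def classify(text, queries):
    """Return the first class whose query matches, or None if none match.
    The order of `queries` is the priority order."""
    terms = set(re.findall(r"[a-z0-9_]+", text.lower()))
    for name, query in queries:
        if matches(parse(query), terms):
            return name
    return None

# Hypothetical class names and queries, highest priority first.
queries = [
    ("Printing", "printer .AND. (driver .OR. toner)"),
    ("Accounts", "password .OR. login"),
]
```

For example, `classify("Printer driver will not install", queries)` returns the first class, while a document that mentions only a password falls through to the second. Moving a query up or down the list changes which class wins when several queries match.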
Comparing Taxonomies • File->Compare Taxonomies • File->Read Structured Information • Co-occurrence counts and affinities • Trend • View documents • Transpose • Report (CSV)
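Comparing two taxonomies over the same documents amounts to a cross-tabulation of class memberships. The sketch below counts co-occurrences and computes an affinity as the ratio of the observed count to the count expected if the two taxonomies were independent. That affinity definition is an assumption; the slides do not state the formula eClassifier uses.

```python
from collections import Counter

def cross_tabulate(labels_a, labels_b):
    """Count documents for every (class in taxonomy A, class in taxonomy B) pair."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both taxonomies must label the same documents")
    return Counter(zip(labels_a, labels_b))

def affinity(table, a, b):
    """Assumed metric: observed count divided by the count expected under independence.
    Values above 1 suggest the two classes co-occur more often than chance."""
    total = sum(table.values())
    row = sum(n for (x, _), n in table.items() if x == a)
    col = sum(n for (_, y), n in table.items() if y == b)
    expected = row * col / total if total else 0.0
    return table[(a, b)] / expected if expected else 0.0

# Hypothetical data: a text-derived taxonomy and a structured field for four documents.
topic = ["printing", "printing", "accounts", "accounts"]
region = ["east", "east", "east", "west"]
table = cross_tabulate(topic, region)
```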
Dictionary Co-occurrence • View->Dictionary Co-occurrence • Type ahead searching • Co-occurrence counts and affinities • Trend • View documents • Zoom in • Change Metric -> dependency
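Dictionary co-occurrence counts can be illustrated by counting, for each pair of dictionary terms, the number of documents that contain both. This is a minimal sketch with hypothetical names; it covers only the raw counts, not the affinity or dependency metrics the tool offers.

```python
from collections import Counter
from itertools import combinations

def term_cooccurrence(doc_terms, dictionary):
    """Count documents in which each pair of dictionary terms appears together.
    Pairs are stored in alphabetical order, e.g. ("driver", "printer")."""
    vocab = set(dictionary)
    pairs = Counter()
    for terms in doc_terms:
        present = sorted(vocab & set(terms))
        pairs.update(combinations(present, 2))
    return pairs

doc_terms = [
    ["printer", "driver", "error"],
    ["printer", "driver"],
    ["password", "reset"],
]
pairs = term_cooccurrence(doc_terms, ["driver", "password", "printer", "reset"])
```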
Advanced Features • Visualization • Subclass from Structured Information • Make Classifier • Read Template • Import Category • Add a category from another saved taxonomy • Select Metrics • Add other columns to the Class table • BIW
Visualization • Look at relationships between selected classes • Discover sub-clusters • Find “borderline” examples • View/Move Documents • Navigator • Touring