CS591CXZ Web mining: Lexical relationship mining. Mining Topic-Specific Concepts and Definitions on the Web. Bing Liu et al., KDD03.
CS591CXZ Web mining: Lexical relationship mining
Mining Topic-Specific Concepts and Definitions on the Web
Bing Liu et al., KDD03
Lexical relationship mining
• A lexical relationship is a relationship between words, such as synonymy, antonymy, hypernymy (“dog” is a hypernym of “poodle”), and hyponymy (“poodle” is a hyponym of “dog”)
• A lexical relationship is a connection between the meanings of two words in a text which helps the text to hold together. Relevant connections include (rough) synonymy (e.g. woman - person, win - victory) and connections in a field of meaning (e.g. plane - pilot).
• Thus, subtopic mining is in this category, but definition mining is not.
Information Extraction
• MUC: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/
• Information Extraction: the extraction or pulling out of pertinent information from large volumes of texts

Items of Information    Percentile Reliability
Entities                90
Attributes              80   (definition falls here)
Facts                   70
Events                  60

• Attribute: a property of an entity such as its name, alias, descriptor, or type
Mining Topic-Specific Concepts and Definitions on the Web
• Goal: systematically learn an unfamiliar topic from the Web
  • Definitions
  • Topic hierarchy
• Input: a term, e.g. “data mining”, “Web mining”
• Tasks
  • Identify sub-topics or salient concepts
  • Like building an ontology, but with no clear hierarchy. E.g.: Genetic Algorithm
  • Algorithms
  • Find and organize definition pages
  • Definition question answering
  • Concept disambiguation
Techniques
• A lot of heuristics
• Simple linguistic patterns
    {concept} {-|:} {definition}
    {concept} {refer(s) to | satisfy(ies)} …
    …
• Web page tags: <h1>,…,<h4> <b> <em> <li> …
• Frequent pattern mining
  • A classic data mining technique
Algorithm WebLearn(T)
• Submit T to a search engine and get the relevant pages
• Mine subtopics or salient concepts of T
• Find definition pages
• Output the concepts and definition pages to the user. If the user wants to know more about a subtopic T’, do WebLearn(T’)
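The control flow of WebLearn can be sketched roughly as below. This is only an illustration under assumptions: the callables `search`, `mine_concepts`, `find_definition_pages` and `wants_more` are hypothetical stand-ins (the slides name no API), and `max_depth` is added here just to guarantee termination.

```python
def web_learn(topic, search, mine_concepts, find_definition_pages,
              wants_more, max_depth=2):
    """Sketch of WebLearn(T); every callable argument is a hypothetical stub."""
    pages = search(topic)                        # submit T to a search engine
    concepts = mine_concepts(topic, pages)       # subtopics / salient concepts
    result = {
        "topic": topic,
        "concepts": concepts,
        "definition_pages": find_definition_pages(topic, pages),
        "subtopics": [],
    }
    if max_depth > 1:
        for concept in concepts:
            # Recurse only into the subtopics the user asks about.
            if wants_more(concept):
                result["subtopics"].append(
                    web_learn(concept, search, mine_concepts,
                              find_definition_pages, wants_more,
                              max_depth - 1)
                )
    return result
```

Passing the steps in as functions keeps the sketch independent of any particular search engine or mining method.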
Mining subtopic/salient concept (1)
Input: a set of top-ranked relevant documents
Steps:
1. Filter out “noisy” documents
  • Publication listing pages: “in proceeding”, “journal”
  • Forum discussion pages: “previous message”, “reply to”
  • Pages that do not contain all query terms
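The filtering step could look something like the sketch below. The cue phrases are the ones quoted on the slide; treating a single cue match as enough to discard a page, and using plain substring matching, are assumptions made for illustration.

```python
# Cue phrases quoted on the slide; the lists are illustrative, not exhaustive.
PUBLICATION_CUES = ("in proceeding", "journal")
FORUM_CUES = ("previous message", "reply to")


def is_noisy(page_text, query_terms):
    """Return True if the page should be dropped before subtopic mining."""
    text = page_text.lower()
    if any(cue in text for cue in PUBLICATION_CUES):
        return True                      # looks like a publication listing
    if any(cue in text for cue in FORUM_CUES):
        return True                      # looks like a forum discussion page
    # Drop pages that do not contain every query term.
    return not all(term.lower() in text for term in query_terms)


def filter_pages(pages, query):
    """Keep only the pages that pass all three checks."""
    terms = query.split()
    return [page for page in pages if not is_noisy(page, terms)]
```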
Mining subtopic/salient concept (2)
2. Identify important phrases in each page
  • Extract text segments in HTML emphasizing tags: <h1>,…,<h4> <b> <em> <li> …
  • Except those containing:
    • Salutation title (Mr. Dr. Professor)
    • URL or email address
    • “conference”, “journal” …
    • Digits (KDD2004)
    • Images
    • Too many words (15 words as limit)
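A minimal sketch of this extraction step, using Python's standard `html.parser`. The class and function names are hypothetical, the regular expressions are rough approximations of the listed exclusions, and the parser assumes emphasizing tags are properly closed (unclosed `<li>` tags in real pages would need extra handling).

```python
import re
from html.parser import HTMLParser

EMPHASIS_TAGS = {"h1", "h2", "h3", "h4", "b", "em", "li"}
TITLES = re.compile(r"\b(mr|dr|professor)\b", re.IGNORECASE)
URL_OR_EMAIL = re.compile(r"https?://|www\.|\S+@\S+", re.IGNORECASE)
MAX_WORDS = 15  # word limit given on the slide


class EmphasisExtractor(HTMLParser):
    """Collects the text found inside emphasizing tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # number of emphasizing tags currently open
        self.has_image = False  # an <img> was seen inside the current segment
        self.buffer = []
        self.segments = []

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.depth += 1
        elif tag == "img" and self.depth:
            self.has_image = True

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.depth:
            self.depth -= 1
            if self.depth == 0:  # outermost emphasizing tag just closed
                text = " ".join("".join(self.buffer).split())
                if text and not self.has_image:
                    self.segments.append(text)
                self.buffer, self.has_image = [], False

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def keep_segment(text):
    """Apply the exclusion rules from the slide to one text segment."""
    if len(text.split()) > MAX_WORDS:
        return False
    if TITLES.search(text) or URL_OR_EMAIL.search(text):
        return False
    if re.search(r"\d", text):
        return False
    lowered = text.lower()
    return "conference" not in lowered and "journal" not in lowered


def emphasized_segments(html):
    parser = EmphasisExtractor()
    parser.feed(html)
    parser.close()
    return [s for s in parser.segments if keep_segment(s)]
```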
Mining subtopic/salient concept (3)
3. Mine frequent phrases
  • Input: emphasized text segments
  • Mine frequent word sets using an association rule mining technique
4. Eliminate word sets unlikely to be subtopics
  • Heuristic: eliminate those that never appear alone in an emphasizing tag on any page, e.g. “process”
  • Remove generic words from the result set: “abstract”, “introduction”, “conclusion”, “research”, …
5. Rank result sets
  • By the number of pages in which they occur
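Steps 3 to 5 can be illustrated with the sketch below. It swaps in brute-force counting of small word sets for a real association rule miner such as Apriori, which is adequate only because emphasized segments are short. The function name, `min_support`, and `max_size` are assumptions, and word order is lost because phrases are treated as sets.

```python
from collections import Counter
from itertools import combinations

GENERIC_WORDS = {"abstract", "introduction", "conclusion", "research"}


def mine_subtopics(pages, min_support=2, max_size=3):
    """pages: one list of emphasized text segments per page."""
    support = Counter()   # word set -> number of pages containing it
    standalone = set()    # word sets that make up a whole segment somewhere
    for segments in pages:
        seen_on_page = set()
        for segment in segments:
            words = frozenset(segment.lower().split())
            standalone.add(words)
            for size in range(1, min(max_size, len(words)) + 1):
                for combo in combinations(sorted(words), size):
                    seen_on_page.add(frozenset(combo))
        support.update(seen_on_page)   # count each word set once per page

    candidates = [
        (word_set, count)
        for word_set, count in support.items()
        if count >= min_support              # step 3: frequent word sets
        and word_set in standalone           # step 4: appears alone somewhere
        and not word_set <= GENERIC_WORDS    # step 4: not made of generic words
    ]
    # Step 5: rank by number of pages, ties broken alphabetically.
    candidates.sort(key=lambda item: (-item[1], sorted(item[0])))
    return [(" ".join(sorted(ws)), count) for ws, count in candidates]
```

In this sketch “process” is dropped when it only ever occurs inside longer segments, which mirrors the heuristic in step 4.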
Definition Finding
• Definition identification patterns suitable for Web pages
    {concept} {-|:} {definition}
    {concept} {refer(s) to | satisfy(ies)} …
• HTML structuring clues and hyperlinks
  • If only one header <h1>, <h2>,… or one big emphasized segment at the beginning => definition page
  • Look up definition pages up to the second level of the hyperlinks, and only hyperlinks with anchor text matching the concept
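The two patterns shown above might be approximated with regular expressions as follows. This is a sketch, not the authors' pattern set: the function names are hypothetical, only the two listed patterns are covered, and matching is done per sentence.

```python
import re


def definition_patterns(concept):
    """Build regexes for the two patterns shown on the slide."""
    c = re.escape(concept)
    return [
        # {concept} {-|:} {definition}
        re.compile(rf"\b{c}\s*[-:]\s+(?P<definition>.+)", re.IGNORECASE),
        # {concept} {refer(s) to | satisfy(ies)} ...
        re.compile(
            rf"\b{c}\s+(?:refers?\s+to|satisf(?:y|ies))\s+(?P<definition>.+)",
            re.IGNORECASE,
        ),
    ]


def find_definition(concept, sentence):
    """Return the definition part of the sentence, or None if no pattern fits."""
    for pattern in definition_patterns(concept):
        match = pattern.search(sentence)
        if match:
            return match.group("definition").strip()
    return None
```

For example, `find_definition("Web mining", "Web mining refers to the use of data mining on Web data.")` returns the text after “refers to”, while a sentence that merely mentions the concept returns `None`.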
Subtopic disambiguation
• By adding context terms
  • usually parent topic or subtopics
  • context terms tend to dominate results
  • cannot work for the first (root) topic
• Heuristics to combat domination of context terms
  • only consider text segments containing the topic or subtopic
  • identify pages with a topic hierarchy: HTML list tag <li>. The hierarchy should also contain other subtopics of the parent topic
  • shallow linguistic phenomena: Topic + “approaches” / “techniques” + ( + “e.g” / “such as” / “including” + subtopics )
• Then, how does this help disambiguate?
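The shallow linguistic pattern above can be sketched as a regular expression that pulls out the enumerated subtopics. The function name is hypothetical, and splitting the enumeration on commas, “and” and “or” is an assumption about how such lists are usually written.

```python
import re


def enumerated_subtopics(topic, text):
    """Match: topic + approaches/techniques + ( + e.g./such as/including + list."""
    pattern = re.compile(
        rf"{re.escape(topic)}\s+(?:approaches|techniques)\s*\(?\s*"
        r"(?:e\.g\.?,?|such as|including)\s+(?P<items>[^.()]+)",
        re.IGNORECASE,
    )
    match = pattern.search(text)
    if not match:
        return []
    # Split the enumeration on commas and on the words "and" / "or".
    items = re.split(r",|\band\b|\bor\b", match.group("items"))
    return [item.strip() for item in items if item.strip()]
```

A page where such an enumeration names other known subtopics of the same parent topic is more likely to use the term in the intended sense, which is one possible reading of how the heuristic helps.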
Evaluation
• Use Google to get the initial set of relevant pages
• Result 1: subtopics / salient concepts
  • Looks pretty good; terms are closely relevant
  • More salient concepts than subtopics
• Result 2: definition discovery comparison
  • Precision: WebLearn vs Google vs AskJeeves
• Result 3: disambiguation
  • Seems to be useful
Analysis
• Interesting topic
  • Potentially usable in practice
• A complete system
• Techniques
  • Avoid NLP, Machine Learning
  • Apply heuristics of shallow text structures
Limitations
• Research topics, not much ambiguity
• Techniques:
  • Heuristics are empirical, by no means flawless or exhaustive, and hard to apply to other domains
How to improve? -- discussion
• Better research:
  • Do you think it is a good research topic?
• Better techniques:
  • What techniques would you like to try to solve the problem?