
Mining Domain Specific Words from Hierarchical Web Documents


Presentation Transcript


  1. Mining Domain Specific Words from Hierarchical Web Documents Jing-Shin Chang (張景新) Department of Computer Science & Information Engineering National Chi-Nan (暨南) University 1, Univ. Road, Puli, Nantou 545, Taiwan, ROC. jshin@csie.ncnu.edu.tw CJNLP-04, 2004/11/10~15, City U., H.K.

  2. TOC • Motivation • What are DSW’s? • Why DSW Mining? (Applications) • WSD with DSW’s without sense tagged corpus • Constructing Hierarchical Lexicon Tree w/o Clustering • Other applications • How to Mine DSW’s from Hierarchical Web Documents • Preliminary Results • Error Sources • Remarks

  3. Motivation • “Is there a quick and easy (engineering) way to construct a large scale WordNet or things like that … now that everyone is talking about ontological knowledge sources and X-WordNet (whatever you call it)…?” • …trigger a new view for constructing a lexicon tree with hierarchical semantic links… • …DSW identification turns out to be a key to such construction • …and can be used in various applications, including DSW-based WSD without using sense tagged corpora…

  4. What Are Domain Specific Words (DSW's) • Words that appear frequently in some particular domains: • (a) Multiple Sense Words that are frequently used with special meanings or usages in particular domains • E.g., piston: “活塞” (“piston”, mechanics) or “活塞隊” (“the Pistons”, sports) • (b) Single Sense Words that are used frequently in particular domains • Suggesting that some words in the current document might be related to this particular sense • Serving as “anchor words/tags” in the context for disambiguating other multiple sense words

  5. What to Do in DSW Mining • DSW Mining Task • Find lists of words that occur frequently in the same domain and associate each list (and the words within it) with a domain (implicit sense) tag • E.g., entertainment: ‘singer’, ‘pop songs’, ‘rock & roll’, ‘Chang Hui-Mei’ (‘Ah-Mei’), ‘album’, … • As a side effect, find the hierarchical or network-like relationships between adjacent sets of DSW’s • When applied to mining DSW’s associated with each node of a hierarchical directory/document tree • Each node being annotated with a domain tag
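To make the mining target concrete, here is a minimal sketch of the output data structure in Python; the entertainment entries echo the slide's example, while the baseball entries are hypothetical illustrations:

```python
# Mining output: each domain (implicit sense) tag maps to the list of
# DSWs associated with it. The entertainment entries echo the slide's
# example; the baseball entries are hypothetical.
dsw_lists = {
    "entertainment": ["singer", "pop songs", "rock & roll",
                      "Chang Hui-Mei (Ah-Mei)", "album"],
    "baseball": ["pitcher", "home run", "inning"],
}
```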

  6. DSW Applications (1) • Technical term extraction: • W(d) = { w | w ∈ DSW(d) } • d ∈ {computer, traveling, food, …}

  7. DSW Applications (2) • Generic WSD based on DSW’s • s* = argmax_s Σ_{d∈D} P(s|d,W)·P(d|W) = argmax_s Σ_{d∈D} P(s|d,W)·P(W|d)·P(d) (the normalizer P(W) is constant over s and d, so P(d|W) can be replaced by P(W|d)·P(d) without changing the argmax) • If a large-scale sense-tagged corpus is not available, which is often the case • Machine translation • Helps select translation lexicon candidates • E.g., money bank (when used with “payment”, “loan”, etc.), river bank, memory bank (in PC, Intel, MS Windows domains)
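A minimal sketch of this sense-selection rule, assuming the three probability tables are estimated elsewhere (all names and values below are hypothetical, and P(s|d,W) is simplified to P(s|d)):

```python
def choose_sense(senses, domains, p_s_given_d, p_W_given_d, p_d):
    # Score each sense by sum_d P(s|d) * P(W|d) * P(d), the slide's
    # objective with P(s|d,W) approximated by P(s|d).
    return max(senses,
               key=lambda s: sum(p_s_given_d.get((s, d), 0.0)
                                 * p_W_given_d.get(d, 0.0)
                                 * p_d.get(d, 0.0)
                                 for d in domains))

# Toy example: the context words W make the "basketball" domain likely,
# so the team sense of "piston" wins.
senses = ["piston (engine part)", "Pistons (NBA team)"]
domains = ["car", "basketball"]
p_s_given_d = {("piston (engine part)", "car"): 0.9,
               ("Pistons (NBA team)", "basketball"): 0.95}
p_W_given_d = {"car": 0.01, "basketball": 0.2}
p_d = {"car": 0.5, "basketball": 0.5}
print(choose_sense(senses, domains, p_s_given_d, p_W_given_d, p_d))
```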

  8. DSW Applications (2, cont.) • Generic WSD based on DSW’s • Conventional WSD needs sense-tagged corpora for training (not widely available) • Implicitly domain-tagged corpora are widely available on the web • [Slide formula: the sense score sums over the domains in which the target word w0 is a DSW]

  9. DSW Applications (3) • Document classification • N-class classification based on DSW’s • Anti-spamming (Two-class classification) • Words in spamming (uninteresting) mails vs. normal (interesting) mails help block spamming mails • Interesting domains vs. uninteresting domains • P(W|S)P(S) vs. P(W|~S)P(~S)

  10. DSW Applications (3.a) • Document classification based on DSW’s • d: document class label • w[1..n]: bag of words in the document • |D| ≥ 2: number of document classes • Anti-spamming based on DSW’s • |D| = 2 (two-class classification)
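A hedged sketch of this classification rule as a naive-Bayes decision over the bag of words, with |D| = 2 giving the anti-spamming case; the smoothing constant and input tables are assumptions, not from the slides:

```python
import math
from collections import Counter

def classify(doc_words, priors, class_counts, vocab_size, alpha=1.0):
    """Pick the class d maximizing log P(d) + sum_i log P(w_i|d).
    class_counts[d] is a Counter of word counts for class d (e.g.,
    built from the mined DSW lists); Laplace smoothing with alpha
    keeps unseen words from zeroing out a class."""
    best, best_lp = None, float("-inf")
    for d, prior in priors.items():
        counts = class_counts[d]
        total = sum(counts.values())
        lp = math.log(prior)
        for w in doc_words:
            lp += math.log((counts[w] + alpha) / (total + alpha * vocab_size))
        if lp > best_lp:
            best, best_lp = d, lp
    return best

# Two-class (anti-spamming) toy usage with hypothetical counts:
spam = Counter({"free": 30, "winner": 20})
normal = Counter({"meeting": 25, "report": 25})
print(classify(["free", "winner"], {"spam": 0.5, "normal": 0.5},
               {"spam": spam, "normal": normal}, vocab_size=4))
```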

  11. DSW Applications (4) • Building a large lexicon tree or wordnet-lookalike (semi-)automatically from hierarchical web documents • Membership: semantic links among words of the same domain encode closeness (context), similarity (synonym, thesaurus), or negated concepts (antonym) • Hierarchy: the hierarchy of the lexicon suggests some ontological relationships

  12. Conventional Methods for Constructing Lexicon Trees • Construction by Clustering • Collect words in a large corpus • Evaluate word association as a distance (or closeness) measure for all word pairs • Use clustering criteria to build the lexicon hierarchy • Adjust the hierarchy and assign semantic/sense tags to nodes of the lexicon tree • Thus assigning sense tags to members of each node

  13. Clustering Methods for Constructing Lexicon Trees [Diagram: the word lists (A0, A1, B), (A0, A1, C), (A0, A2, D), (A0, A2, E) are clustered into a binary tree; the shared words A0 (×4), A1 (×2), A2 (×2) become internal nodes over the leaves B, C, D, E]

  14. Clustering Methods for Constructing Lexicon Trees • Disadvantages • Do not take advantage of the hierarchical information of the document tree (flattened when collecting words) • Word association & clustering criteria are not related directly to human perception • Most clustering algorithms conduct binary merging (or division) in each step for simplicity • The automatically generated semantic hierarchy may not reflect human perception • Hierarchy boundaries are not clearly & automatically detected • Adjustment of the hierarchy may not be easy (since human perception is not used to guide clustering) • Pairwise association evaluation is costly

  15. Hierarchical Information Loss when Collecting Words [Diagram: flattening the document tree collapses the leaf lists (A0, A1, B), (A0, A1, C), (A0, A2, D), (A0, A2, E) into one bag A0 (×4), A1 (×2), A2 (×2), B, C, D, E, losing which directory each word came from]

  16. Clustering Methods for Constructing Lexicon Trees [Diagram: the same binary clustering of (A0, A1, B), (A0, A1, C), (A0, A2, D), (A0, A2, E), with question marks at each merge] Does the result reflect human perception? Why binary merging? Where are the hierarchy boundaries?

  17. Alternative View for Constructing Lexicon Trees • Construction by Retaining DSW’s • Preserve the hierarchical structure of the web documents as the baseline of the semantic hierarchy, which is already mildly confirmed by webmasters • Associate each node with DSW’s as members and tag each DSW with the directory/domain name • Optionally adjust the tree hierarchy and the members of each node

  18. Constructing Lexicon Trees by Preserving DSW’s [Diagram: a document tree whose nodes contain words marked O (+DSW, domain-specific, to keep) or X (−DSW, non-specific, to remove)]

  19. Constructing Lexicon Trees by Preserving DSW’s [Diagram: the same tree after removing the X-marked (−DSW) words; only the O-marked DSW’s remain at each node]

  20. Constructing Lexicon Trees by Preserving DSW’s • Advantages • The hierarchy reflects human perception • Adjustment could be easier if necessary • Directory names are highly correlated with sense tags • A domain-based model can be used if sense-tagged corpora are not available • Pairwise word association evaluation is replaced by computing “domain specificity” against domains • O(|W|×|W|) vs. O(|W|×|D|) • Requirements: • A well-organized web site • Mining DSW’s from such a site

  21. Constructing Lexicon Trees by Preserving DSW’s [Diagram: the preserved directory tree with leaf lists (A0, A1, B), (A0, A1, C), (A0, A2, D), (A0, A2, E); words within a node hold membership (closeness, similarity) relationships, including synonym and antonym links; a child node Y under a parent X suggests Y is_a X (hypernym), e.g., B is_a X (or A1)]

  22. Alternative View for Constructing Lexicon Trees • Benefits: • No similarity computation: closeness (incl. similarity) is already implicitly encoded by human judges • No binary clustering: clustering is already done (implicitly) with human judgment • Hierarchical links available: some well-developed relationships are already in place • Although not perfect…

  23. Proposed Method for Mining • Web Hierarchy as a Large Document Tree • Each document was generated by applying DSW’s to some generic document templates • Remove non-specific words from documents, leaving a lexicon tree with DSW’s associated with each node • Leaving only domain-specific words • Forming a lexicon tree from a document tree • Labeling domain-specific words • Characteristics: • Get associated words by measuring domain-specificity against a known and common domain instead of measuring pairwise association plus clustering

  24. Mining Criteria: Cross-Domain Entropy • Domain-independent terms tend to be distributed evenly across all domains. • Distributional “evenness” can be measured with the Cross-Domain Entropy (CDE): CDE(wi) = −Σj Pij log Pij • Pij = fij / Σk fik: probability of word i in domain j • fij: normalized frequency of word i in domain j

  25. Mining Criteria: Cross-Domain Entropy • Example: • Wi = “piston”, with frequencies (normalized to [0,1]) in various domains: • fij = (0.001, 0.62, 0.0003, 0.57, 0.0004) • Domain-specific (unevenly distributed) in the 2nd and the 4th domains
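A small sketch of the CDE computation, reproducing the slide's "piston" example (base-2 logs are an assumption; the slides do not fix the log base):

```python
import math

def cross_domain_entropy(freqs):
    """CDE(w_i) = -sum_j P_ij log2 P_ij, where P_ij normalizes the
    word's per-domain frequencies f_ij into a distribution."""
    total = sum(freqs)
    return -sum((f / total) * math.log2(f / total)
                for f in freqs if f > 0)

# The slide's example: mass concentrates in the 2nd and 4th domains,
# so the entropy (~1.02 bits) is far below the log2(5) ~ 2.32 maximum
# of a word spread evenly over five domains.
print(cross_domain_entropy([0.001, 0.62, 0.0003, 0.57, 0.0004]))
```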

  26. Mining Algorithm – Step1 • Step1 (Data Collection): Acquire a large collection of web documents using a web spider while preserving the directory hierarchy of the documents. Strip unused markup tags from the web pages.
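A sketch of Step 1's post-processing, assuming the site has already been mirrored to a local directory (e.g., with `wget --mirror`); `mirror_root` and the helper names are hypothetical:

```python
import os
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keep only the text content of a page, dropping all markup tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def walk_site(mirror_root):
    """Yield (domain, plain_text) pairs, using each file's directory
    path relative to the mirror root as the domain label, so the
    document hierarchy is preserved."""
    for dirpath, _dirs, files in os.walk(mirror_root):
        domain = os.path.relpath(dirpath, mirror_root)
        for name in files:
            if not name.endswith((".html", ".htm")):
                continue
            with open(os.path.join(dirpath, name),
                      encoding="utf-8", errors="ignore") as fh:
                parser = TextExtractor()
                parser.feed(fh.read())
                yield domain, " ".join(parser.chunks)
```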

  27. Mining Algorithm – Step2 • Step2 (Word Segmentation or Chunking): Identify word (or compound word) boundaries in the documents by applying a word segmentation process, such as (Chiang 92; Lin 93), to Chinese-like documents (where word boundaries are not explicit), or by applying a compound-word chunking algorithm to English-like documents, in order to identify the word entities of interest.
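A hedged sketch of Step 2; the `jieba` segmenter is a modern stand-in for the cited segmenters (Chiang 92; Lin 93), and the whitespace split merely stands in for a real compound-word chunker:

```python
import jieba  # stand-in segmenter; not the one cited on the slide

def segment(text, lang="zh"):
    """Tokenize raw text: dictionary-based segmentation for
    Chinese-like text (no explicit word boundaries), naive
    whitespace splitting for English-like text."""
    if lang == "zh":
        return [w for w in jieba.cut(text) if w.strip()]
    return text.split()
```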

  28. Mining Algorithm – Step3 • Step3 (Acquiring Normalized Term Frequencies for all Words in Various Domains): For each subdirectory dj, find the number of occurrences nij of each term wi over all the documents in that directory, and derive the normalized term frequency fij = nij/Nj by normalizing nij with the total document size, Nj = Σi nij, in that directory. The directory is then associated with a set of <wi, dj, fij> tuples, where wi is the i-th word of the complete word list for all documents, dj is the j-th directory name (referred to as the domain hereafter), and fij is the normalized relative frequency of occurrence of wi in domain dj.
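A direct sketch of Step 3 (function and variable names are mine, not the paper's):

```python
from collections import Counter, defaultdict

def term_frequencies(segmented_docs):
    """Build the <w_i, d_j, f_ij> tuples: f_ij = n_ij / N_j, where
    n_ij counts w_i in domain d_j and N_j = sum_i n_ij is the
    domain's total token count."""
    counts = defaultdict(Counter)          # domain -> word -> n_ij
    for domain, tokens in segmented_docs:
        counts[domain].update(tokens)
    freqs = {}                             # (word, domain) -> f_ij
    for domain, ctr in counts.items():
        n_total = sum(ctr.values())        # N_j
        for word, n in ctr.items():
            freqs[(word, domain)] = n / n_total
    return freqs
```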

  29. Mining Algorithm – Step3 (cont.) Input: the set of <wi, dj, fij> tuples, where fij = nij / Nj and Nj = Σi nij.

  30. Mining Algorithm – Step4 • Step4 (Removing Domain-Independent Terms): Domain-independent terms are identified as those terms which are distributed evenly across all domains, i.e., terms with a large Cross-Domain Entropy (CDE): CDE(wi) = −Σj Pij log Pij, with Pij = fij / Σk fik • Terms whose CDE is above a threshold can be removed from the lexicon tree, since such terms are unlikely to be closely associated with any domain. Terms with a low CDE are retained in the few domains where they have the highest normalized frequencies (e.g., top-1 and top-2).
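A sketch of Step 4 that ties the pieces together; top-k = 2 follows the slide's "top-1 and top-2" remark, but the entropy cutoff value is an assumption:

```python
import math
from collections import defaultdict

def extract_dsws(freqs, cde_threshold=1.5, top_k=2):
    """Drop words whose cross-domain entropy exceeds the threshold
    (evenly spread, hence domain-independent); keep each remaining
    word as a DSW in its top-k domains by normalized frequency.
    `freqs` maps (word, domain) -> f_ij as built in Step 3."""
    def cde(values):
        total = sum(values)
        return -sum((v / total) * math.log2(v / total)
                    for v in values if v > 0)

    per_word = defaultdict(dict)           # word -> {domain: f_ij}
    for (word, domain), f in freqs.items():
        per_word[word][domain] = f
    dsws = defaultdict(list)               # domain -> [word, ...]
    for word, by_domain in per_word.items():
        if cde(by_domain.values()) > cde_threshold:
            continue                       # too even: not domain-specific
        for domain in sorted(by_domain, key=by_domain.get,
                             reverse=True)[:top_k]:
            dsws[domain].append(word)
    return dsws
```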

  31. Experiments • Domains: • News articles from a local news site • 138 distinct domains • including leaf nodes of the directory tree and their parents • leaves with the same name are considered the same domain • Examples: baseball, basketball, broadcasting, car, communication, culture, digital, edu(cation), entertainment (流星+花園, “Meteor Garden”), finance, food (大魚大肉 “lavish fish-and-meat meals”, 干貝 “dried scallops”, 木耳 “wood ear fungus”, 錫箔紙 “aluminum foil”, …), … • Size: 200M bytes (HTML files) • 16K+ unique words after word segmentation

  32. Domains (hierarchy not shown) [Slide: list of the 138 domain names]

  33. Sample Output (4 Selected Domains) Table 1. Sampled domain-specific words with low entropies. [Table shown as an image; contents not reproduced in the transcript]

  34. Preliminary Results • Domain specific words and the assigned domain tags are well associated (e.g., “投手” (“pitcher”) is specifically used in the “baseball” domain.) • Extraction with the cross-domain entropy (CDE) metric is well founded. • Domain-independent (or irrelevant) words (such as those from webmasters’ advertisements) are well rejected as DSW candidates, due to their high cross-domain entropy • DSW’s are mostly nouns and verbs (open-class words)

  35. Preliminary Results • Low cross-domain entropy words (DSW’s) in a given domain are generally highly correlated (e.g., “日本職棒” (“Japanese professional baseball”), “部長” (“minister/director”)) • New usages of words, such as “活塞” (Pistons) with the “basketball” sense, could also be identified • Both are good for WSD tasks that use the DSW’s as contextual evidence

  36. Error Sources • A single CDE metric may not be sufficient to capture all characteristics of “domain-specificity” • Type II error: Some general (non-specific) words may have low entropy simply because they appear in only one domain (CDE=0) • Probably due to low occurrence counts (a kind of estimation error) • Type I error: Some multiple sense words may have too many senses and thus be mis-recognized as non-specific in every domain (although the senses are unique in their respective domains)

  37. Error Sources • The “well-organized website” assumption may not hold all the time • The hierarchical directory tags may not be appropriate representatives of the document words within a website • The hierarchies may not be consistent from website to website

  38. Future work • Use other knowledge sources, beyond the single CDE measure, to co-train the model in a manner similar to [Chang 97b, c] • E.g., with other term-weighting metrics • E.g., a stop-list acquisition metric for identifying common words (for type II errors) • Explore methods and criteria to adjust the hierarchy of a single directory tree • Explore methods to merge directory trees from different sites

  39. Concluding Remarks • A simple metric for automatic/semi-automatic identification of DSW’s • At low sense-tagging cost • Rich web resources, almost free • Implicit semantic tagging implied by the directory hierarchy (an imperfect hierarchy, but free) • A simple method to build semantic links and degrees of closeness among DSW’s • May be helpful for building large semantically tagged lexicon trees or network-linked x-wordnets • A good knowledge source for WSD-related applications • WSD, machine translation, document classification, anti-spamming, …

  40. Thanks for your attention!!
