Mining Domain Specific Words from Hierarchical Web Documents

Mining Domain Specific Words from Hierarchical Web Documents • Jing-Shin Chang (張景新) • Department of Computer Science & Information Engineering, National Chi-Nan (暨南) University, 1, Univ. Road, Puli, Nantou 545, Taiwan, ROC. • jshin@csie.ncnu.edu.tw • CJNLP-04, 2004/11/10~15, City U., H.K.

TOC • Motivation • What are DSW’s? • Why DSW Mining? (Applications) • WSD with DSW’s without sense tagged corpus • Constructing Hierarchical Lexicon Tree w/o Clustering • Other applications • How to Mine DSW’s from Hierarchical Web Documents • Preliminary Results • Error Sources • Remarks

Motivation • “Is there a quick and easy (engineering) way to construct a large scale WordNet or things like that … now that everyone is talking about ontological knowledge sources and X-WordNet (whatever you call it)…?” • …trigger a new view for constructing a lexicon tree with hierarchical semantic links… • …DSW identification turns out to be a key to such construction • …and can be used in various applications, including DSW-based WSD without using sense tagged corpora…

What Are Domain Specific Words (DSW’s) • Words that appear frequently in some particular domains: • (a) Multiple Sense Words that are used frequently with special meanings or usage in particular domains • E.g., piston: “活塞” (mechanics) or “活塞隊” (sports) • (b) Single Sense Words that are used frequently in particular domains • Suggesting that some words in the current document might be related to this particular sense • As “anchor words/tags” in the context for disambiguating other multiple sense words

What to Do in DSW Mining • DSW Mining Task • Find lists of words that occur frequently in the same domain, and associate each list (and the words within it) with a domain (implicit sense) tag • E.g., entertainment: ‘singer’, ‘pop songs’, ‘rock & roll’, ‘Chang Hui-Mei’ (‘Ah-Mei’), ‘album’, … • As a side effect, find the hierarchical or network-like relationships between adjacent sets of DSW’s • When applied to mining DSW’s associated with each node of a hierarchical directory/document tree • Each node being annotated with a domain tag

DSW Applications (1) • Technical term extraction: • $W(d) = \{\, w \mid w \in \mathrm{DSW}(d) \,\}$ • $d \in \{\text{computer}, \text{traveling}, \text{food}, \ldots\}$

DSW Applications (2) • Generic WSD based on DSW’s • $\arg\max_{s} \sum_{d} P(s \mid d, W)\, P(d \mid W) = \arg\max_{s} \sum_{d} P(s \mid d, W)\, P(W \mid d)\, P(d)$ • If a large-scale sense-tagged corpus is not available, which is often the case • Machine translation • Help select translation lexicon candidates • E.g., money bank (when used with “payment”, “loan”, etc.), river bank, memory bank (in PC, Intel, MS Windows domains)

DSW Applications • Generic WSD based on DSW’s • [Equations not preserved in the transcript; their annotations follow] • Need sense-tagged corpora for training (not widely available) • Implicitly domain-tagged corpora are widely available on the web • Sum over domains where w0 is a DSW
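The domain-based WSD decision rule, argmax over senses of a sum over domains of P(s|d,W) P(W|d) P(d), can be sketched as follows. This is a minimal illustration, not the author's implementation. It assumes two simplifications that the slides do not state: P(s|d,W) is approximated by P(s|d), and P(W|d) is a bag-of-words product over the context words. The function name `disambiguate`, the probability floor, and every number in the toy tables are hypothetical.

```python
import math

def disambiguate(context_words, senses, p_domain, p_word_given_domain,
                 p_sense_given_domain, floor=1e-6):
    """Score each sense s by sum_d P(s|d) * P(W|d) * P(d) and return the best.

    Assumptions (not from the slides): P(s|d,W) ~ P(s|d), and P(W|d) is a
    product of per-word probabilities, with `floor` for unseen words.
    """
    scores = {}
    for s in senses:
        total = 0.0
        for d, p_d in p_domain.items():
            log_p_w = sum(math.log(p_word_given_domain[d].get(w, floor))
                          for w in context_words)
            total += p_sense_given_domain[d].get(s, 0.0) * math.exp(log_p_w) * p_d
        scores[s] = total
    return max(scores, key=scores.get), scores

# Toy tables with made-up values, for the "bank" example only.
P_DOMAIN = {"finance": 0.5, "geography": 0.5}
P_WORD = {
    "finance": {"loan": 0.05, "payment": 0.04, "river": 0.0001},
    "geography": {"river": 0.05, "loan": 0.0001, "payment": 0.0001},
}
P_SENSE = {
    "finance": {"money_bank": 0.95, "river_bank": 0.05},
    "geography": {"money_bank": 0.10, "river_bank": 0.90},
}
SENSES = ["money_bank", "river_bank"]

best_finance, _ = disambiguate(["loan", "payment"], SENSES, P_DOMAIN, P_WORD, P_SENSE)
best_river, _ = disambiguate(["river"], SENSES, P_DOMAIN, P_WORD, P_SENSE)
print(best_finance, best_river)
```

Only P(W|d) and P(d) need domain-tagged text here; the per-domain sense table is the part that would still have to come from some other source, such as the domain tag itself standing in for the sense.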

DSW Applications (3) • Document classification • N-class classification based on DSW’s • Anti-spamming (Two-class classification) • Words typical of spam (uninteresting) mail vs. normal (interesting) mail help block spam • Interesting domains vs. uninteresting domains • P(W|S)P(S) vs. P(W|~S)P(~S)

DSW Applications (3.a) • Document classification based on DSW’s • d: document class label • w[1..n]: bag of words in document • |D| = n >= 2: number of document classes • Anti-spamming based on DSW’s • |D|=n=2 (two-class classification)
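The slide's equations did not survive in the transcript, so the following is one plausible reading only: a naive Bayes classifier that picks the class d maximizing P(d) times the product of P(w_k|d) over the bag of words. The function name `classify`, the smoothing floor, and all probabilities are hypothetical toy values. With two classes it reduces to the P(W|S)P(S) vs. P(W|~S)P(~S) comparison used for anti-spamming.

```python
import math

def classify(words, prior, likelihood, floor=1e-6):
    """Return argmax_d [ log P(d) + sum_k log P(w_k | d) ].

    prior:      {class: P(d)}
    likelihood: {class: {word: P(w | d)}}; unseen words get `floor`.
    """
    best_class, best_score = None, -math.inf
    for d, p_d in prior.items():
        score = math.log(p_d)
        for w in words:
            score += math.log(likelihood[d].get(w, floor))
        if score > best_score:
            best_class, best_score = d, score
    return best_class

# Two-class (anti-spam) toy model with made-up numbers.
PRIOR = {"spam": 0.4, "normal": 0.6}
LIKELIHOOD = {
    "spam": {"free": 0.05, "winner": 0.03, "meeting": 0.001},
    "normal": {"meeting": 0.04, "report": 0.03, "free": 0.002},
}

print(classify(["free", "winner"], PRIOR, LIKELIHOOD))
print(classify(["meeting", "report"], PRIOR, LIKELIHOOD))
```

Working in log space avoids underflow when the bag of words is long; restricting `likelihood` to mined DSW's would keep the tables small.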

DSW Applications (4) • Building large lexicon tree or wordnet-lookalike (semi-) automatically from hierarchical web documents • Membership: Semantic links among words of the same domain are close (context), similar (synonym, thesaurus), or negated concept (antonym) • Hierarchy: Hierarchy of the lexicon suggests some ontological relationships

Conventional Methods for Constructing Lexicon Trees • Construction by Clustering • Collect words in a large corpus • Evaluate word association as distance (or closeness) measure for all word pairs • Use clustering criteria to build lexicon hierarchy • Adjust the hierarchy and Assign semantic/sense tags to nodes of the lexicon tree • Thus assigning sense tags to members of each node

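Clustering Methods for Constructing Lexicon Trees • [Figure: the words of four documents, (A0, A1, B), (A0, A1, C), (A0, A2, D), and (A0, A2, E), are pooled into one flat set (A0 ×4, A1 ×2, A2 ×2, B, C, D, E) and then clustered into a tree]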

Clustering Methods for Constructing Lexicon Trees • Disadvantages • Do not take advantage of the hierarchical information of the document tree (flattened when collecting words) • Word association & clustering criteria are not related directly to human perception • Most clustering algorithms conduct binary merging (or division) in each step for simplicity • The automatically generated semantic hierarchy may not reflect human perception • Hierarchy boundaries are not clearly & automatically detected • Adjustment of the hierarchy may not be easy (since human perception is not used to guide clustering) • Pairwise association evaluation is costly

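Hierarchical Information Loss when Collecting Words • [Figure: a document tree with leaf documents (A0, A1, B), (A0, A1, C), (A0, A2, D), (A0, A2, E); their parents pool the words as (A0 ×2, A1 ×2, B, C) and (A0 ×2, A2 ×2, D, E), and the root as (A0 ×4, A1 ×2, A2 ×2, B, C, D, E); collecting all words at the root flattens this hierarchy]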

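Clustering Methods for Constructing Lexicon Trees • [Figure: the pooled words (A0 ×4, A1 ×2, A2 ×2, B, C, D, E) clustered into a binary tree, with the links back to the source documents (A0, A1, B), (A0, A1, C), (A0, A2, D), (A0, A2, E) marked “?”] • Reflect human perception? • Why binary? • Hierarchy?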

Alternative View for Constructing Lexicon Trees • Construction by Retaining DSW’s • Preserve the hierarchical structure of web documents as the baseline of the semantic hierarchy, which is already mildly confirmed by webmasters • Associate each node with DSW’s as members and tag each DSW with the directory/domain name • Optionally adjust the tree hierarchy and the members of each node

Constructing Lexicon Trees by Preserving DSW’s • [Figure: a document tree whose nodes hold word lists marked O (+DSW) or X (-DSW): (O,O,O,O), (O,X,O,O), (O,O,X,O), (X,O,X,O), (X,X,O,O), (O,X,O,X), (O,O,X,X)]

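Constructing Lexicon Trees by Preserving DSW’s • [Figure: the same tree after the X (-DSW) words are removed, leaving only O (+DSW) words at each node: (O,O,O,O), (O,O,O), (O,O,O), (O,O), (O,O), (O,O), (O,O)]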

Constructing Lexicon Trees by Preserving DSW’s • Advantages • Hierarchy reflects human perception • Adjustment could be easier if necessary • Directory names are highly correlated to sense tags • Domain-based model can be used if sense-tagged corpora are not available • Pairwise word association evaluation is replaced by computation of “domain specificity” against domains • O(|W|×|W|) vs. O(|W|×|D|) • Requirements: • A well-organized web site • Mining DSW’s from such a site
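To give a sense of scale for the O(|W|×|W|) vs. O(|W|×|D|) comparison, here is a rough worked estimate. It assumes the sizes reported on the Experiments slide (about 16K unique words and 138 domains) and counts one score per word pair or per word-domain pair; constant factors are ignored.

$$
|W|^2 \approx 16{,}000^2 = 2.56 \times 10^{8}
\qquad\text{vs.}\qquad
|W| \times |D| \approx 16{,}000 \times 138 \approx 2.2 \times 10^{6},
\qquad
\frac{|W|^2}{|W| \times |D|} = \frac{|W|}{|D|} \approx 116 .
$$

Under these assumptions the domain-specificity computation needs roughly two orders of magnitude fewer scores than pairwise association.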

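Constructing Lexicon Trees by Preserving DSW’s • [Figure: the document tree with root (A0 ×4, A1 ×2, A2 ×2, B, C, D, E), children (A0 ×2, A1 ×2, B, C) and (A0 ×2, A2 ×2, D, E), and leaves (A0, A1, B), (A0, A1, C), (A0, A2, D), (A0, A2, E), with a parent node labeled X and a child node labeled Y] • Within a node: membership (closeness, similarity) relationship, e.g., synonym, antonym • Between nodes: is_a, hypernym, … • Y is_a X? • B is_a X (or A1)?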

Alternative View for Constructing Lexicon Trees • Benefits: • No similarity computation: Closeness (incl. similarity) is already implicitly encoded by human judges • No binary clustering: Clustering is already done (implicitly) with human judgment • Hierarchical links available: Some well developed relationships are already done • Although not perfect…

Proposed Method for Mining • Web Hierarchy as a Large Document Tree • Each document was generated by applying DSW’s to some generic document templates • Remove non-specific words from documents, leaving a lexicon tree with DSW’s associated with each node • Leaving only domain-specific words • Forming a lexicon tree from a document tree • Label domain specific words • Characteristics: • Get associated words by measuring domain-specificity to a known and common domain instead of measuring pairwise association plus clustering

Mining Criteria: Cross-Domain Entropy • Domain-independent terms tend to be distributed evenly across all domains. • Distributional “evenness” can be measured with the Cross-Domain Entropy (CDE), defined as follows: • $H_i = -\sum_{j} P_{ij} \log P_{ij}$, where $P_{ij} = f_{ij} / \sum_{k} f_{ik}$ • $P_{ij}$: probability of word $i$ in domain $j$ • $f_{ij}$: normalized frequency of word $i$ in domain $j$

Mining Criteria: Cross-Domain Entropy • Example: • Wi = “piston”, with frequencies (normalized to [0,1]) in various domains: • fij = (0.001, 0.62, 0.0003, 0.57, 0.0004) • Domain-specific (unevenly distributed), concentrated in the 2nd and the 4th domains
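The “piston” example can be checked numerically with a short sketch. It assumes the CDE is the Shannon entropy of the frequencies renormalized across domains, and it uses base-2 logarithms, which the slides do not specify. The function name `cross_domain_entropy` is hypothetical.

```python
import math

def cross_domain_entropy(freqs):
    """Entropy (in bits) of a word's distribution over domains.

    freqs: normalized frequencies f_ij of one word in each domain j.
    They are renormalized to P_ij = f_ij / sum_k f_ik before use.
    """
    total = sum(freqs)
    if total == 0:
        return 0.0
    probs = [f / total for f in freqs]
    return -sum(p * math.log2(p) for p in probs if p > 0)

piston = [0.001, 0.62, 0.0003, 0.57, 0.0004]   # values from the slide
even = [0.2, 0.2, 0.2, 0.2, 0.2]               # hypothetical domain-independent word

h_piston = cross_domain_entropy(piston)
h_even = cross_domain_entropy(even)
print(f"piston: {h_piston:.2f} bits, even: {h_even:.2f} bits")
```

With five domains the maximum possible entropy is log2(5), about 2.32 bits, reached by the evenly spread word. The “piston” vector comes out close to 1 bit, which is what a word split almost equally between two domains should give.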

Mining Algorithm – Step1 • Step1 (Data Collection): Acquire a large collection of web documents using a web spider while preserving the directory hierarchy of the documents. Strip unused markup tags from the web pages.
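A minimal sketch of the markup-stripping part of Step1, using only the Python standard library. It assumes the pages have already been mirrored to a local directory whose layout matches the site's directory hierarchy; the crawl itself is not shown. The names `TextExtractor`, `strip_markup`, and `load_tree` are hypothetical, and skipping `<script>` and `<style>` content is an added assumption about what counts as unused markup.

```python
from html.parser import HTMLParser
from pathlib import Path

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring tags and script/style bodies."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)

def strip_markup(html):
    """Return the visible text of an HTML string, space-joined."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

def load_tree(root):
    """Map each directory (relative to root) to the plain text of its pages.

    The relative directory path is kept as the domain label, so the
    hierarchy of the site is preserved in the keys.
    """
    root = Path(root)
    docs = {}
    for path in sorted(root.rglob("*.html")):
        domain = path.parent.relative_to(root).as_posix()
        html = path.read_text(encoding="utf-8", errors="ignore")
        docs.setdefault(domain, []).append(strip_markup(html))
    return docs
```

For example, `load_tree("mirror")` on a hypothetical mirror containing `mirror/sports/baseball/a.html` would return a dictionary with the key `"sports/baseball"`.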

Mining Algorithm – Step2 • Step2 (Word Segmentation or Chunking): Identify word (or compound word) boundaries in the documents by applying a word segmentation process, such as (Chiang 92; Lin 93), to Chinese-like documents (where word boundaries are not explicit), or by applying a compound word chunking algorithm to English-like documents, in order to identify the word entities of interest.
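To make the segmentation step concrete, here is a sketch of forward maximum matching against a word list. This is a deliberately simple stand-in and not the method of (Chiang 92; Lin 93), whose details are not given in the slides. The function name `max_match`, the `max_len` limit, and the tiny lexicon are hypothetical.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position take the longest substring (up to max_len characters)
    found in `lexicon`; fall back to a single character if nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Toy lexicon built from words that appear in the slides.
LEXICON = {"日本職棒", "投手", "活塞", "活塞隊"}

print(max_match("日本職棒投手", LEXICON))
print(max_match("活塞隊投手", LEXICON))
```

Greedy matching prefers “活塞隊” over “活塞” whenever the longer word is in the lexicon, which is why a real segmenter with statistical disambiguation would be preferable for ambiguous boundaries.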

Mining Algorithm – Step3 • Step3 (Acquiring Normalized Term Frequencies for all Words in Various Domains): For each subdirectory $d_j$, find the number of occurrences $n_{ij}$ of each term $w_i$ in all the documents, and derive the normalized term frequency $f_{ij} = n_{ij}/N_j$ by normalizing $n_{ij}$ with the total document size, $N_j = \sum_i n_{ij}$, in that directory. The directory is then associated with a set of $\langle w_i, d_j, f_{ij} \rangle$ tuples, where $w_i$ is the i-th word of the complete word list for all documents, $d_j$ is the j-th directory name (referred to as the domain hereafter), and $f_{ij}$ is the normalized relative frequency of occurrence of $w_i$ in domain $d_j$.
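Step3 amounts to a per-directory word count divided by the directory's total token count. A minimal sketch follows, assuming the documents of each directory have already been segmented into a flat token list. The function name `normalized_term_frequencies` and the sample tokens are hypothetical.

```python
from collections import Counter

def normalized_term_frequencies(domain_tokens):
    """Compute f_ij = n_ij / N_j for every word i in every domain j.

    domain_tokens: {domain name d_j: list of tokens from all its documents}
    Returns a list of (w_i, d_j, f_ij) tuples.
    """
    tuples = []
    for domain, tokens in domain_tokens.items():
        counts = Counter(tokens)            # n_ij
        total = sum(counts.values())        # N_j = sum_i n_ij
        for word, n in counts.items():
            tuples.append((word, domain, n / total))
    return tuples

# Toy input with made-up token lists.
SAMPLE = {
    "baseball": ["投手", "投手", "比賽", "的"],
    "finance": ["銀行", "的"],
}

for word, domain, f in normalized_term_frequencies(SAMPLE):
    print(word, domain, f)
```

Normalizing by N_j keeps large directories from dominating: “的” gets 0.25 in the four-token domain and 0.5 in the two-token one, regardless of raw counts.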

Mining Algorithm – Step3 Input: where

Mining Algorithm – Step4 • Step4 (Removing Domain-Independent Terms): Domain-independent terms are identified as those terms that are distributed evenly across all domains, that is, terms with a large Cross-Domain Entropy (CDE), defined as follows: • $H_i = -\sum_{j} P_{ij} \log P_{ij}$, where $P_{ij} = f_{ij} / \sum_{k} f_{ik}$ • Terms whose CDE is above a threshold can be removed from the lexicon tree, since such terms are unlikely to be closely associated with any domain. Terms with a low CDE will be retained in the few domains where they have the highest normalized frequencies (e.g., top-1 and top-2).
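The filtering rule in Step4 can be sketched as below. This is an illustration under stated assumptions, not the author's code: the entropy uses base-2 logarithms, the threshold value is arbitrary, and a word is never attached to a domain where its frequency is zero. The names `cde` and `mine_dsw` and all toy frequencies are hypothetical.

```python
import math

def cde(freqs):
    """Cross-domain entropy (bits) of one word's per-domain frequencies."""
    total = sum(freqs)
    if total == 0:
        return 0.0
    return -sum((f / total) * math.log2(f / total) for f in freqs if f > 0)

def mine_dsw(freq_table, threshold, top_k=2):
    """Drop high-entropy words; keep the rest in their top_k domains.

    freq_table: {word: {domain: f_ij}}
    Returns {domain: [domain-specific words]}.
    """
    lexicon = {}
    for word, by_domain in freq_table.items():
        if cde(list(by_domain.values())) > threshold:
            continue  # evenly spread, treated as domain-independent
        ranked = sorted(by_domain, key=by_domain.get, reverse=True)[:top_k]
        for domain in ranked:
            if by_domain[domain] > 0:
                lexicon.setdefault(domain, []).append(word)
    return lexicon

# Toy frequencies (made up) over three domains.
TABLE = {
    "活塞": {"mechanics": 0.02, "basketball": 0.03, "finance": 0.0},
    "的": {"mechanics": 0.05, "basketball": 0.05, "finance": 0.05},
    "銀行": {"mechanics": 0.0, "basketball": 0.0, "finance": 0.04},
}

print(mine_dsw(TABLE, threshold=1.2))
```

In this toy run “的” is removed because its entropy equals the three-domain maximum, “活塞” survives in both of its domains, and “銀行” is kept only under finance. Note that a word seen in a single domain always has CDE = 0, so a rule like this will also retain rare general words unless counts are checked separately.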

Experiments • Domains: • News articles from a local news site • 138 distinct domains • including leaf nodes of the directory tree and their parents • leaves with the same name are considered in the same domain • Examples: baseball, basketball, broadcasting, car, communication, culture, digital, edu(cation), entertainment (流星+花園), finance, food (大魚大肉,干貝,木耳,錫箔紙,…)… • Size: 200M bytes (HTML files) • 16K+ unique words after word segmentation

Domains (Hierarchy not shown)

Sample Output (4 Selected Domains) Table 1. Sampled domain specific words with low entropies.

Preliminary Results • Domain specific words and the assigned domain tags are well associated (e.g., “投手” is specifically used in the “baseball” domain.) • Extraction with the cross-domain entropy (CDE) metric is well founded. • Domain-independent (or irrelevant) words (such as those for webmaster’s advertisements) are well rejected as DSW candidates for their high cross-domain entropy • DSW’s are mostly nouns and verbs (open-class words)

Preliminary Results • Low cross-domain entropy words (DSW’s) in the respective domain are generally highly correlated (e.g., “日本職棒”, “部長”) • New usages of words, such as “活塞” (Pistons) with the “basketball” sense, could also be identified • Both are good for WSD tasks that use the DSW’s as contextual evidence

Error Sources • A single CDE metric may not be sufficient to capture all characteristics of “domain-specificity” • Type II error: Some general (non-specific) words may have low entropy simply because they appear in only one domain (CDE=0) • Probably due to low occurrence counts (a kind of estimation error) • Type I error: Some multiple sense words may have too many senses and thus be mis-recognized as non-specific in each domain (although the senses are unique in their respective domains)

Error Sources • The “well-organized website” assumption may not always hold • The hierarchical directory tags may not be appropriate representatives for the document words within a website • The hierarchies may not be consistent from website to website

Future Work • Use knowledge sources other than the single CDE measure to co-train the model in a manner similar to [Chang 97b, c] • E.g., other term weighting metrics • E.g., a stop list acquisition metric for identifying common words (for type II errors) • Explore methods and criteria to adjust the hierarchy of a single directory tree • Explore methods to merge directory trees from different sites

Concluding Remarks • A simple metric for automatic/semi-automatic identification of DSW’s • At low sense tagging cost • Rich web resource almost free • Implicit semantic tagging implied by the directory hierarchy (imperfect hierarchy but free) • A simple method to build semantic links and degree of closeness among DSW’s • may be helpful for building large semantically tagged lexicon trees or network linked x-wordnets • Good knowledge source for WSD-related applications • WSD, Machine translation, document classification, anti-spamming, …

Thanks for your attention!!
