My Past Works
Kazunari Sugiyama
16 Apr., 2009, WING Group Meeting

Outline
I. Natural Language Processing (Disambiguation)
I-1 Personal Name Disambiguation in Web Search Results
I-2 Word Sense Disambiguation in Japanese Texts
II. Web Information Retrieval
II-1 Characterizing Web Pages Using Hyperlinked Neighboring Pages
II-2 Adaptive Web Search Based on User's Information Needs
I-1 Personal Name Disambiguation in Web Search Results [Outline] 1. Introduction 2. Our Proposed Method 3. Experiments
[Figure: an example of an ambiguous personal name in search results. Pages retrieved for the query "William Cohen" refer to distinct individuals: a politician, a professor of computer science, and a consulting company; one result is about Robert M. Gates (a different person, not "William Cohen").]
2.1 Our Proposed Method [Semi-supervised Clustering]
[Figure: clusters C1 and C2 in the vector space (axes t1, t2, t3) with centroid vectors G1, G2, and an updated centroid G'. Notation:
• p: search-result Web page; p_s: seed page
• w_i: feature vector of page p_i
• G: the centroid vector of a cluster; D(G, w): distance between a centroid and a feature vector
• w_p: feature vector of a Web page contained in a cluster that has centroid G
The method controls the fluctuation of the centroid G of a cluster as pages are merged.]
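To make the idea concrete, here is a minimal sketch of seed-anchored clustering with a damped centroid update, i.e., the "control the fluctuation of the centroid" step. It assumes cosine similarity, a merge threshold, and a damping factor; all names and parameter values are illustrative, not the paper's exact formulation.

```python
import numpy as np

def semi_supervised_cluster(pages, seeds, threshold=0.5, damping=0.5):
    """Assign search-result page vectors to clusters anchored by seed pages.

    pages: list of (page_id, feature_vector)
    seeds: list of feature vectors, one per known entity (seed pages)
    damping: how strongly the existing centroid resists drifting
             (1.0 = centroid never moves; 0.0 = plain update)
    """
    centroids = [np.array(s, dtype=float) for s in seeds]
    clusters = [[] for _ in seeds]

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    for page_id, w in pages:
        w = np.array(w, dtype=float)
        sims = [cos(g, w) for g in centroids]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            clusters[best].append(page_id)
            # Damped centroid update: the seed keeps the centroid from
            # drifting toward noisy pages (controls centroid fluctuation).
            centroids[best] = damping * centroids[best] + (1 - damping) * w
        else:
            # Low similarity to every known entity: start a new cluster.
            centroids.append(w)
            clusters.append([page_id])
    return clusters
```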
3. Experiments 3.1 Experimental Data 3.2 Evaluation Measure 3.3 Experimental Results
3.1 Experimental Data
• WePS Corpus
• Established for the "Web People Search Task" at SemEval-2007, held in conjunction with the Association for Computational Linguistics (ACL) conference
• Web pages related to 79 personal names
• Names sampled from participants in conferences on digital libraries and computational linguistics, bibliographic articles in the English Wikipedia, and the U.S. Census
• For each personal name query, the top 100 Yahoo! search results obtained via its search API
• Training sets: 49 names; test sets: 30 names (7,900 Web pages in total)
• Pre-processing for the WePS corpus
• Eliminate stopwords and perform stemming
• Determine the optimal parameter for merging similar clusters using the training set in the WePS corpus, and then apply it to the test set
3.2 Evaluation Measures
(1) Purity
(2) Inverse purity
(3) F-measure (the harmonic mean of (1) and (2)) [Hotho et al., GLDV Journal'05]
(These are the standard evaluation measures employed in the "Web People Search Task.")
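For reference, a common formulation of these measures (following Hotho et al.; here C is the set of output clusters, L the set of gold-standard categories, and n the number of Web pages; this is the standard definition, not copied from the slide):

```latex
\mathrm{Purity} = \frac{1}{n} \sum_{C_i \in C} \max_{L_j \in L} |C_i \cap L_j|,
\qquad
\mathrm{InvPurity} = \frac{1}{n} \sum_{L_j \in L} \max_{C_i \in C} |C_i \cap L_j|,
\qquad
F = \frac{2 \cdot \mathrm{Purity} \cdot \mathrm{InvPurity}}{\mathrm{Purity} + \mathrm{InvPurity}}
```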
3.3 Experimental Results
• Our result is comparable to the top score (0.78) among the top 5 participants in the "Web People Search Task."
• We could acquire useful information from sentences that characterize the entity of a person, and thus disambiguate person entities effectively.
I-2. Word Sense Disambiguation in Japanese Texts [Outline] 1. Introduction 2. Proposed Method 3. Experiments
1. Introduction (1/2)
• Word Sense Disambiguation (WSD)
• Determining the meaning of an ambiguous word in its context
"run": (1) Bob goes running every morning. → "to move fast by using one's feet"
(2) Mary runs a beauty parlor. → "to direct or control"
1. Introduction (2/2)
• Our approach for WSD
• Basically supervised WSD, with semi-supervised clustering applied by introducing sense-tagged instances
Processing flow:
(1) Feature extraction for clustering and WSD ("baseline features") from a raw corpus not assigned sense tags
(2) Semi-supervised clustering, adding sense-tagged instances ("seed instances")
(3) Feature extraction for WSD from the clustering results
(4) Supervised WSD
2. Proposed Method 2.1 Semi-supervised Clustering 2.2 Features for WSD obtained Using Clustering Results
2.1 Semi-supervised Clustering 2.1.1 Features for Clustering 2.1.2 Semi-supervised Clustering 2.1.3 Seed Instances and Constraints for Clustering
2.1.1 Features for Clustering and WSD ("baseline features")
• Morphological features
• Bag-of-words (BOW), part-of-speech (POS), and detailed POS classification
• Extracted for the target word itself and the two words to its right and left
• Syntactic features
• If the POS of the target word is a noun, extract the verb in a grammatical dependency relation with the noun
• If the POS of the target word is a verb, extract the noun in a grammatical dependency relation with the verb
• Figures in Bunrui-Goi-Hyou
• 4- and 5-digit codes for the content words to the right and left of the target word
• 地域 ("community") 社会 ("society"): "1.1720,4,1,3" → 1172 (as 4 digits), 11720 (as 5 digits)
• 5 topics inferred on the basis of LDA
• Compute the log-likelihood of a word instance ("soft-tag" approach [Cai et al., EMNLP'07])
(A minimal feature-extraction sketch follows below.)
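The sketch below illustrates the morphological part of the baseline features (a ±2-word window of BOW and POS features). It assumes tokens are already morphologically analyzed into (surface, POS) pairs, e.g., by a Japanese analyzer such as MeCab; names and feature encoding are illustrative, not the authors' code.

```python
def baseline_features(tokens, target_index, window=2):
    """tokens: list of (surface, pos) pairs; returns a dict of sparse features."""
    features = {}
    surface, pos = tokens[target_index]
    features[f"target={surface}"] = 1
    features[f"target_pos={pos}"] = 1
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        i = target_index + offset
        if 0 <= i < len(tokens):
            w, p = tokens[i]
            # Position-sensitive bag-of-words and POS features.
            features[f"bow[{offset}]={w}"] = 1
            features[f"pos[{offset}]={p}"] = 1
    return features

# Example: disambiguating "running" in a toy English sentence.
toks = [("Bob", "NOUN"), ("goes", "VERB"), ("running", "VERB"),
        ("every", "DET"), ("morning", "NOUN")]
print(baseline_features(toks, target_index=2))
```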
2.1.2 Semi-supervised Clustering
[Proposed method] Refine [Sugiyama and Okumura, ICADL'07] (originally for the Web People Search Task) so that it applies to word instances.
[Figure: clusters C1, C2, C3 and a newly generated cluster C_new in the vector space (axes t1, t2, t3), with centroids G1, G2, G_new. Notation:
• x: word instance; x_s: seed instance
• G_i: the centroid of a cluster
• Distances are measured with an adaptive Mahalanobis distance
The method controls the fluctuation of the centroid of a cluster.]
2.1.3 Seed Instances and Constraints for Clustering (1/3)
[Method I] Split the data set of word instances into a training data set (from which seed instances are drawn) and a data set for clustering. Select initial seed instances:
(I-1) randomly,
(I-2) with "KKZ" [Katsavounidis et al., IEEE Signal Processing Letters '94],
(I-3) as the centroid of a cluster generated by K-means (whose own initial instances are chosen randomly or with "KKZ").
2.1.3 Seed Instances and Constraints for Clustering (2/3)
[Method II] As in Method I, but select initial seed instances by taking the word senses s1, s2, s3 of the target word into account:
(II-1) By considering the frequency of word senses, chosen (II-1-1) randomly, (II-1-2) with "KKZ", or (II-1-3) as the centroid of a cluster generated by K-means (initialized randomly or with "KKZ");
(II-2) In proportion to the frequency of word senses (D'Hondt method), chosen (II-2-1) randomly, (II-2-2) with "KKZ", or (II-2-3) as the centroid of a cluster generated by K-means (initialized randomly or with "KKZ").
(A KKZ seeding sketch follows below.)
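KKZ is a farthest-point seeding heuristic: the first seed is the point with the largest norm, and each subsequent seed is the point farthest from its nearest already-chosen seed. A minimal sketch (illustrative, not the authors' code):

```python
import numpy as np

def kkz_seeds(X, k):
    """KKZ initialization [Katsavounidis et al., '94].

    X: (n, d) array of instance vectors; returns indices of k seeds.
    """
    seeds = [int(np.argmax(np.linalg.norm(X, axis=1)))]  # largest-norm point
    # Distance from every point to its nearest chosen seed so far.
    dist = np.linalg.norm(X - X[seeds[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))          # farthest from current seeds
        seeds.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return seeds
```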
2.1.3 Seed Instances and Constraints for Clustering (3/3)
• Constraints
• "cannot-link" only,
• "must-link" only,
• both constraints,
• "cannot-link" and "must-link" without outliers
["must-link" without outliers]
[Figure: a "must-link" constraint is put between clusters C_s and C_v when the distance between their centroids satisfies D(G_s, G_v) < Th_dis (Th_dis = 0.388); when a cluster C_new is judged an outlier with respect to this threshold, the "must-link" constraint is not added.]
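The sketch below shows how such a distance threshold might drive constraint generation between cluster centroids. Th_dis = 0.388 is from the slide; the Euclidean distance, the rule that larger distances yield "cannot-link", and all names are assumptions for illustration.

```python
import numpy as np

TH_DIS = 0.388  # distance threshold from the slide

def pairwise_constraints(centroids, th=TH_DIS):
    """Return (must_link, cannot_link) lists of cluster-index pairs,
    based only on the distance between cluster centroids."""
    must, cannot = [], []
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            d = np.linalg.norm(np.asarray(centroids[i]) - np.asarray(centroids[j]))
            (must if d < th else cannot).append((i, j))
    return must, cannot
```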
2.2 Features for WSD Obtained Using Clustering Results
(a) Inter-cluster information
• TF in a cluster (TF)
• Cluster ID (CID)
• Sense frequency (SF)
(b) Context information regarding adjacent words
• Mutual information (MI)
• T-score (T)
• Chi-square (CHI2)
(c) Context information regarding the two words to the right and left of the target word
• Information gain (IG)
*We employ (b) and (c) to reflect the concept of "one sense per collocation" [Yarowsky, ACL'95].
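For reference, the association measures in (b) are commonly defined as follows (standard collocation statistics in the style of Church and Hanks; the slide does not give its exact formulations, and CHI2 is the usual sum of (observed − expected)² / expected over the 2×2 contingency table):

```latex
\mathrm{MI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)},
\qquad
t(w_1, w_2) = \frac{f(w_1, w_2) - \frac{f(w_1)\,f(w_2)}{N}}{\sqrt{f(w_1, w_2)}}
```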
3. Experiments 3.1 Experimental Data 3.2 Semi-supervised Clustering 3.3 Word Sense Disambiguation
3.1 Experimental Data
• RWC corpus from the "SENSEVAL-2 Japanese Dictionary Task"
• 3,000 Japanese newspaper articles issued in 1994
• Sense tags from the Japanese dictionary "Iwanami Kokugo Jiten" were manually assigned to 148,558 ambiguous words
• 100 target words: 50 nouns and 50 verbs
3.2 Semi-supervised Clustering
• Comparison with other distance-based approaches
[Observations]
• Our semi-supervised clustering approach outperforms the other distance-based approaches.
• This is because our method locally adjusts the centroid of a cluster.
3.3 Word Sense Disambiguation
• Experimental Results
[Observations]
• The best accuracy is obtained when we add the features CID, MI, and IG, computed from the clustering results, to the baseline features.
• According to the results of OURS, TITECH, and NAIST, WSD accuracy is significantly improved by adding features computed from clustering results.
II-1 Characterizing Web Pages Using Hyperlinked Neighboring Pages [Outline] 1. Introduction 2. Our Proposed Method 3. Experiments
1. Introduction
Transitions of Web search engines:
The first-generation search engines
• Only terms included in Web pages were utilized as indices of Web pages.
• Peculiar features of the Web, such as hyperlink structures, were not exploited.
→ Users are not satisfied with the ease of use or the retrieval accuracy.
The second-generation search engines
• The hyperlink structures of Web pages are considered, e.g.,
(1) "optimal document granularity"-based IR systems,
(2) HITS (CLEVER project), Teoma (DiscoWeb),
(3) PageRank (Google).
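As background on (3), here is a minimal PageRank power-iteration sketch over a toy link graph. The damping factor 0.85 is the conventional choice and the dangling-page handling is simplified, so this is illustrative rather than Google's exact algorithm.

```python
import numpy as np

def pagerank(adj, d=0.85, iters=50):
    """adj[i][j] = 1 if page i links to page j; returns the score vector."""
    A = np.asarray(adj, dtype=float)
    n = len(A)
    out = A.sum(axis=1)
    # Avoid division by zero for dangling pages; in this simplified sketch
    # their mass is redistributed only via the teleportation term.
    out[out == 0] = 1.0
    M = (A / out[:, None]).T            # column-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (M @ r)   # power iteration with teleportation
    return r

# Toy graph: page 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
print(pagerank([[0, 1, 0], [0, 0, 1], [1, 1, 0]]))
```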
2. Our Proposed Methods
[Figure: a target Web page p_tgt with feature vector w_tgt in a term space (t1, t2, t3, ..., tm); the feature vectors of Web pages hyperlinked with p_tgt are used to compute a refined feature vector w'_tgt of p_tgt.]
Method I
[Figure: in the Web space, Web pages up to L(in) levels of in-links and L(out) levels of out-links from the target page p_tgt; in the vector space, their feature vectors are combined with w_tgt to produce the refined vector w'_tgt.]
Method II
[Figure: as in Method I, but clusters are generated from the groups of Web pages in each level from the target page, and the centroid vectors of those clusters are used to refine w_tgt into w'_tgt.]
Method III
[Figure: clusters are again generated from the groups of Web pages in each level from the target page, and the centroid vectors of the clusters are combined with w_tgt to produce w'_tgt.]
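A minimal sketch of the idea shared by the three methods: refine the target page's term vector with level-weighted contributions from its hyperlinked neighbors. The 1/level decay and the blending factor alpha are illustrative assumptions, not the paper's exact formulas.

```python
import numpy as np

def refine_vector(w_tgt, level_vectors, alpha=0.5):
    """w_tgt: term vector of the target page p_tgt.
    level_vectors: list indexed by level (1..L); each entry is a list of
    term vectors of pages that many hyperlinks away from the target."""
    w = np.asarray(w_tgt, dtype=float).copy()
    for level, vecs in enumerate(level_vectors, start=1):
        if not vecs:
            continue
        # Centroid of the pages at this level (Methods II/III cluster
        # each level first and use cluster centroids instead).
        centroid = np.mean(np.asarray(vecs, dtype=float), axis=0)
        w += alpha * centroid / level   # decay contribution with distance
    return w
```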
Experimental Setup • Data Set • TREC WT10g test collection (10 GB, 1.69 million Web pages) • Specification of Workstation • Sun Enterprise 450 • CPU: UltraSparc-II 480 MHz • Memory: 2 GB • OS: Solaris 8
Web Pages Used in the Experiment
[Figure: the target Web pages p_tgt used in the experiment.]
Experimental Results
Comparison of the best search accuracy obtained using each of Methods I, II, and III:

Method                               Avg. precision   vs. tfidf   vs. HITS
HITS                                 0.032            -           -
modified HITS                        0.136            -           +10.4
modified HITS with weighted links    0.138            -           +10.6
tfidf                                0.313            -           -
Method I (MI-a), L(in)=3             0.342            +2.9        +31.0
Method II (MII-a), L(in)=1, K=2      0.340            +2.7        +30.8
Method III (MIII-a), L(in)=2, K=3    0.345            +3.2        +31.3

(Improvements are in percentage points of average precision.)
The contents of a target Web page can be represented much better by emphasizing the in-linked pages of the target page.
II-2 Adaptive Web Search Based on User’s Information Needs [Outline] 1. Introduction 2. Our Proposed Method 3. Experiments
1. Introduction
• The World Wide Web (WWW)
• It has become increasingly difficult for users to find relevant information.
• Web search engines help users find useful information on the WWW.
• Web Search Engines
• Return the same results regardless of who submits the query.
→ In general, each user has different information needs for his/her query. Web search results should adapt to users with different information needs.
2. Our Proposed Method
User profile construction based on:
(1) pure browsing history,
(2) modified collaborative filtering.
(1) User Profile Construction Based on Pure Browsing History (1/3)
We assume that the preferences of each user consist of the following two aspects:
(1) persistent (long-term) preferences P_per,
(2) ephemeral (short-term) preferences P_today,
and construct the user profile P from them.
(1) User Profile Construction Based on Pure Browsing History (2/3)
[Figure: timeline of browsing histories, from N days ago to today (0 days ago).
• Persistent preferences: constructed from the browsing histories of the past N days (1, 2, ..., N days ago).
• Ephemeral preferences: constructed from today's browsing history, split into the 1st to r-th browsing histories in today (P_br^(1), ..., P_br^(r)) and the current session (P_cur).
• Notation: S: a browsing history of Web pages; a window size controls how many days of history contribute.]
(1) User Profile Construction Based on Pure Browsing History (3/3)
The user profile P is finally constructed by combining the persistent and ephemeral preferences (equations (a) and (b) on the original slide).
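A minimal sketch of the idea: build a term-weight profile as a recency-weighted sum of daily browsing-history vectors plus today's sessions. The exponential decay and mixing weight are illustrative assumptions, not the slide's equations (a)/(b); the default window of 15 days echoes the experimental finding reported later.

```python
import numpy as np

def build_profile(daily_histories, today_histories, current_session,
                  window=15, decay=0.9, mix=0.5):
    """daily_histories: list of term vectors, index 0 = yesterday,
    index 1 = two days ago, ...; today_histories: term vectors of today's
    finished browsing sessions; current_session: the ongoing session."""
    dim = len(current_session)
    # Persistent preferences: decay older days, keep only `window` days.
    p_per = np.zeros(dim)
    for age, vec in enumerate(daily_histories[:window], start=1):
        p_per += (decay ** age) * np.asarray(vec, dtype=float)
    # Ephemeral preferences: today's sessions plus the current session.
    p_today = np.asarray(current_session, dtype=float).copy()
    for vec in today_histories:
        p_today += np.asarray(vec, dtype=float)
    # Final profile: a weighted combination of the two aspects.
    return mix * p_per + (1 - mix) * p_today
```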
Overview of the Pure Collaborative Filtering Algorithm
User-item ratings matrix for collaborative filtering (? marks the item whose rating is predicted for the active user):

          item 1 | item 2 | ... | item i | ... | item I
user 1  |   2    |   3    |     |   2    |     |
user 2  |   5    |   3    |     |   1    |     |   4
user a  |   5    |        |     |   ?    |     |        ← active user
user U  |   4    |   2    |     |        |     |   5
(2) User Profile Construction Based on Modified Collaborative Filtering (1/2)
User-term weight matrix for modified collaborative filtering (term weights, rather than item ratings, are predicted).

When each user has browsed k Web pages:

          term 1 | term 2 | ... | term i | ... | term T
user 1  | 0.745  | 0.362  |     | 0.718  |     |
user 2  | 0.835  | 0.534  |     | 0.126  |     |
user a  | 0.639  |        |     | 0.485  |     |        ← active user
user U  | 0.247  | 0.461  |     | 0.928  |     |

When each user has browsed k+1 Web pages, new terms T+1, ..., T+v appear:

          term 1 | term 2 | ... | term T | term T+1 | term T+2 | ... | term T+v
user 1  | 0.745  | 0.362  |     | 0.718  |  0.451   |          |     |
user 2  | 0.835  | 0.534  |     | 0.126  |          |  0.723   |     |
user a  | 0.639  | 0.485  |     | 0.328  |  0.563   |          |     |          ← active user
user U  | 0.247  | 0.461  |     | 0.928  |  0.686   |          |     |  0.172
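A minimal sketch of user-based collaborative filtering adapted to term weights: the active user's weight for an unseen term is predicted from the weights of the most similar users. Cosine similarity over shared terms, the neighborhood size, and all names are illustrative assumptions.

```python
import numpy as np

def predict_term_weight(W, active, term, n_neighbors=3):
    """W: (users, terms) matrix of term weights (0 = unobserved).
    Predict W[active, term] from the most similar other users."""
    def cos(a, b):
        mask = (a > 0) & (b > 0)            # compare on shared terms only
        if not mask.any():
            return 0.0
        return a[mask] @ b[mask] / (
            np.linalg.norm(a[mask]) * np.linalg.norm(b[mask]) + 1e-12)

    sims = np.array([cos(W[active], W[u]) if u != active else -1.0
                     for u in range(W.shape[0])])
    neighbors = np.argsort(sims)[::-1][:n_neighbors]
    # Similarity-weighted average of the neighbors' observed weights.
    num = sum(sims[u] * W[u, term] for u in neighbors if W[u, term] > 0)
    den = sum(sims[u] for u in neighbors if W[u, term] > 0)
    return num / den if den > 0 else 0.0
```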
User Profile Construction Based on the Modified Collaborative Filtering Algorithm
User profile construction based on:
(1) a static number of users in the neighborhood,
(2) a dynamic number of users in the neighborhood.
Experiments (1/2)
• Construction of the user profile
• Explicit method: (1) relevance feedback
• Implicit methods: (2) method based on pure browsing history, (3) method based on modified collaborative filtering
• Observed browsing history
• 20 subjects
• 30 days
Experiments (2/2)
• Query
• 50 query topics employed as test topics in the TREC WT10g test collection
• Evaluation
• Compute the similarity between the user profile and the feature vector of each Web page in the search results
• R-precision (R=30); a small sketch follows below
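For reference, with R fixed at 30 as on the slide, R-precision is simply the fraction of relevant pages among the top 30 ranked results. A minimal sketch (illustrative names):

```python
def r_precision(ranked_page_ids, relevant_ids, r=30):
    """Fraction of the top-r ranked pages that are relevant."""
    top = ranked_page_ids[:r]
    return sum(1 for pid in top if pid in relevant_ids) / r
```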
User Profile Based on Pure Browsing History (Implicit Method)
Using P_per, P_br, and P_cur, the user profile P is defined as in the equation on the original slide.
Experimental Results
• A user profile that provides search results adaptive to a user can be constructed when a window size of about 15 days is used.
• This approach achieves about 3% higher precision than the relevance feedback-based user profile.
→ The user's browsing history strongly reflects the user's preferences.
User Profile Based on Modified Collaborative Filtering (Implicit Method)
Using P_per, P_br, and V_pre, the user profile P is defined as in the equation on the original slide.
User profile construction based on:
(1) a static number of users in the neighborhood,
(2) a dynamic number of users in the neighborhood.
Experimental Results (Dynamic Method)
In this method, the neighborhood of each user is determined by the centroid vectors of clusters of users, and the number of clusters differs from user to user.
• In all of our experimental results, the best precision is obtained with x = 0.129, y = 0.871.
• This allows each user to perform a more fine-grained search compared with the static method.