Product Review Summarization from a Deeper Perspective
Ly Duy Khang
Supervisor: A/P KAN Min Yen
CS4101 B.COMP. DISSERTATION
Outline
• Introduction
  • Motivation
  • Related work
  • Problem statement & Our approach
• Product Facet Identification
  • Preliminaries
  • Methodology
  • Evaluation
  • Improvement
• Subtopic Summarization
  • Overview
  • Methodology
  • Evaluation
• Discussion and Conclusion
Introduction
Product review
A medium commonly provided by online merchants for customers to review and express opinions on the products that they have purchased.
Product reviews are an important source of information:
• More and more people are shopping online as e-commerce expands.
• Reviews let customers find opinions about products easily and share them with their peers.
• Reviews give producers a certain degree of feedback.
Problems
• The number of reviews is often too large, and is still growing rapidly.
• It is difficult to locate and capture opinions effectively.
Product review summarization system
• Automatically processes a large collection of reviews.
• Identifies topics and opinions in the reviews.
• Aggregates all information and presents a concise summary to the user.
Summarization
The task of extracting and presenting the most important information from the inputs.
• News headline
• Program agenda
• Scientific paper abstract
• …
Review Summarization
• Focus on opinions (techniques from Sentiment Analysis):
  • Thumbs-up/Thumbs-down indication: [Turney02]
  • Facet-based summary: [Hu04a], [Hu04b], [Popescu05]
  • Comparative summary: [Hu05]
Product facet examples:
• Camera: “battery life”, “lens”, “flash”, “resolution”, etc.
• Music player: “sound”, “weight”, “size”, “storage”, etc.
[Screenshots: Google Product, Bing Shopping]
Problem statement
Produce a facet-based summary of product reviews that captures:
• Opinions of users.
• Evidence that supports those opinions.
Approach and Contribution
Two main components:
• Product Facet Identification
  • Re-implement the baseline from [Hu04a]
  • Contribute a new, effective heuristic to improve accuracy
• Subtopic Summarization
  • Introduce a sentence clustering solution
  • Make the necessary modifications to the sentence semantic similarity measurement (adopted from [Li06] and [Kong07])
Product Facet Identification
Why do we want to automate this task?
• It is hard or even impossible to obtain a complete list of facets.
  • e.g., iPhone’s alarm function
• Users and manufacturers/sellers use different words to describe the same facet.
  • e.g., Price vs. Value; Body vs. Case
• The manufacturer may not want to list the weak facets of its product.
  • e.g., iPhone is unable to play Flash on the Web
Explicit/Implicit product facet
Product facets can be expressed explicitly or implicitly.
• Explicit: “The pictures of this camera are very clear.”
• Implicit: “The camera fits nicely into my palm.”
We only consider explicit facets, which appear as a noun/noun phrase in the sentence.
Architecture Overview (figure)
a/ Preprocessing
• Process each input sentence with a Part-of-Speech (POS) tagger to obtain the POS label for each word.
• Remove stop words from the result.
• Stem each word to obtain its root form.
• Only nouns/noun phrases are fed to the next module.
b/ Frequent Mining
Identify all frequent nouns/noun phrases that satisfy the minimum support, defined as the minimum number of sentences that must contain the noun/noun phrase.
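The minimum-support filter can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the function name `frequent_candidates`, the input format (one list of candidate nouns/noun phrases per sentence), and the threshold value are all hypothetical.

```python
from collections import Counter


def frequent_candidates(sentences, min_support):
    """Return candidates that occur in at least `min_support` sentences.

    `sentences` is a list of iterables, each holding the noun/noun-phrase
    candidates found in one sentence (hypothetical input format).
    """
    support = Counter()
    for candidates in sentences:
        # Count each candidate once per sentence, however often it repeats.
        support.update(set(candidates))
    return {c: n for c, n in support.items() if n >= min_support}


# Toy data, invented for illustration only.
sentences = [
    ["battery life", "camera"],
    ["camera", "lens"],
    ["battery life", "camera", "camera"],
    ["hand"],
]
frequent = frequent_candidates(sentences, min_support=2)
print(frequent)
```

Support is counted per sentence rather than per occurrence, which matches the definition on the slide.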
c/ Post Processing (1/2)
• Usefulness pruning: remove single-word facets that are likely to be meaningless.
  • e.g. life → battery life
• Compactness pruning: remove facet phrases that are not compact.
  • e.g. sample photo → photo
c/ Post Processing (2/2)
Infrequent facet discovery: helps discover genuine facets that are not mentioned often.
• Gather opinion words that modify frequent facets.
• For each sentence that contains no frequent facet but one or more opinion words, include the nearest noun/noun phrase as a facet.
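The infrequent facet discovery step can be sketched as below. This is a simplified illustration, not the author's code: the function name, the `(token, is_noun)` input format, and the use of token distance to find the "nearest" noun are assumptions, and multi-word noun phrases are not handled.

```python
def discover_infrequent_facets(sentences, frequent_facets, opinion_words):
    """Collect nouns nearest to opinion words in sentences without a frequent facet.

    `sentences` is a list of token lists; each token is a (word, is_noun)
    pair (hypothetical format standing in for POS-tagger output).
    """
    found = set()
    for tokens in sentences:
        words = [word for word, _ in tokens]
        if any(word in frequent_facets for word in words):
            continue  # sentence is already covered by a frequent facet
        noun_positions = [i for i, (_, is_noun) in enumerate(tokens) if is_noun]
        if not noun_positions:
            continue
        for i, word in enumerate(words):
            if word in opinion_words:
                # Nearest noun by token distance (assumed notion of "nearest").
                nearest = min(noun_positions, key=lambda j: abs(j - i))
                found.add(words[nearest])
    return found


# Toy data, invented for illustration only.
tagged = [
    [("the", False), ("strap", True), ("is", False), ("horrible", False)],
    [("great", False), ("lens", True)],
]
extra_facets = discover_infrequent_facets(
    tagged, frequent_facets={"lens"}, opinion_words={"horrible", "great"}
)
print(extra_facets)
```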
d/ Sentence Extraction
• Sentences that contain any of the discovered product facets are labeled with the corresponding facet.
• Only opinionated sentences are passed to the next component.
a/ Experimental Data
From the same dataset as in [Hu04a]:
• 1 Digital Camera (45 reviews)
• 1 DVD Player (99 reviews)
• 1 Cell phone (41 reviews)
b/ Evaluation Measure
c/ Experimental Result (Baseline)
Improvement - Syntactic Role
• Many noisy results such as: “light”, “hand”, “time”, “month”, “hour”, etc.
• These are filtered by considering the word’s syntactic role in the sentence.
• During the preprocessing step, nouns/noun phrases that do not appear as subject or object in the sentence are not passed down to the next module.
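The syntactic-role filter can be sketched as follows, assuming each sentence has already been parsed. The dependency labels (`nsubj`, `nsubjpass`, `dobj`, `iobj`), the `(word, pos, role)` input format, and the function name are assumptions chosen for illustration; the slides do not name the parser or the label set that was used.

```python
# Dependency labels treated as "subject or object" (assumed label set).
SUBJECT_OBJECT_ROLES = {"nsubj", "nsubjpass", "dobj", "iobj"}


def filter_by_syntactic_role(parsed_sentence):
    """Keep only nouns that act as subject or object.

    `parsed_sentence` is a list of (word, pos, role) triples
    (hypothetical format standing in for parser output).
    """
    return [
        word
        for word, pos, role in parsed_sentence
        if pos == "NOUN" and role in SUBJECT_OBJECT_ROLES
    ]


# "The lens broke after a month": toy parse, invented for illustration.
parsed = [
    ("the", "DET", "det"),
    ("lens", "NOUN", "nsubj"),
    ("broke", "VERB", "root"),
    ("after", "ADP", "prep"),
    ("a", "DET", "det"),
    ("month", "NOUN", "pobj"),
]
kept = filter_by_syntactic_role(parsed)
print(kept)
```

Here "month" is a noun but only the object of a preposition, so it is dropped, while "lens" is the subject and is kept.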
Experimental Result (Baseline with Syntactic Role)
Subtopic Summarization
How often do subtopics exist?
Architecture Overview (figure)
a/ Preprocessing
• General Entity pruning
  • Product class name: “camera”, “DVD”, “phone”, etc.
  • Brand name: “Nikon”, “Canon”, “iPod”, “Kingston”, etc.
• Similarity pruning ([Kong07])
  • “picture” vs. “image”, “photo”
  • “display” vs. “monitor”
  • “Megapixel” vs. “Resolution”
b/ Sentence representation & Semantic similarity measurement (1/2)
Following [Li06], each sentence is represented with a scalable vector formulation, and the semantic similarity of two sentences is measured as the cosine similarity between their vectors.
b/ Sentence representation & Semantic similarity measurement (2/2)
S1 = The battery of my camera is very impressive.
S2 = This camera always has a long battery life.
Joint Concept Vector:
C  = {battery, camera, impressive, long, battery life}
V1 = {1.0, 1.0, 1.0, 0.25, 0.5}
V2 = {0.5, 1.0, 0.25, 1.0, 1.0}
sim(S1, S2) = (V1 · V2) / (‖V1‖ ‖V2‖) = 2.5 / 3.3125 ≈ 0.75
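The similarity value for the two example vectors can be checked with a few lines of Python. This is a minimal sketch of plain cosine similarity; the function name is illustrative, and how the vector entries themselves are derived (the [Li06] word-similarity scores) is outside its scope.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm1 * norm2)


# Vectors V1 and V2 from the slide's example.
v1 = [1.0, 1.0, 1.0, 0.25, 0.5]
v2 = [0.5, 1.0, 0.25, 1.0, 1.0]
sim = cosine_similarity(v1, v2)
print(round(sim, 2))  # 0.75
```

The dot product is 2.5 and both vectors have squared norm 3.3125, so the similarity is 2.5 / 3.3125 ≈ 0.7547.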
c/ Sentence clustering (1/2)
• Hierarchical clustering:
• Non-hierarchical clustering:
c/ Sentence clustering (2/2)
To estimate the number of clusters, we adopt the graph-based algorithm proposed in [Hat01].
d/ Compact presentation
• Sentences are now grouped into subtopics.
• Determine the orientation of every sentence in the cluster.
• For each positive/negative partition P, select the sentence with the maximum representative power to display.
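Selecting one sentence per partition can be sketched as below. The slides do not define "representative power", so this sketch assumes it is the sum of a sentence's similarities to all other sentences in the same partition; the word-overlap (Jaccard) similarity is only a stand-in for the sentence semantic similarity measure, and all names are hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity, a stand-in for the semantic similarity measure."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def most_representative(partition, sim=jaccard):
    """Return the sentence with the highest total similarity to the others.

    "Representative power" is assumed here to be that similarity sum.
    """
    if not partition:
        raise ValueError("partition must contain at least one sentence")
    indices = range(len(partition))
    best = max(
        indices,
        key=lambda i: sum(sim(partition[i], partition[j]) for j in indices if j != i),
    )
    return partition[best]


# Toy positive partition for the facet "battery", invented for illustration.
positive = [
    "battery life is long",
    "the battery life is very long",
    "battery lasts all day",
]
best_sentence = most_representative(positive)
print(best_sentence)
```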
a/ Experimental Data
From the same dataset used in the previous component, we extract a subset of the high-frequency facets for each product.
• Camera: 8 facets
• Phone: 8 facets
• DVD: 6 facets
Experiment Results – Number of subtopics (average)
b/ Evaluation Measure (1/2)
• Purity: rewards the clustering solution that introduces less noise in each cluster.
• Inverse Purity: rewards the clustering solution that gathers more elements (of the same cluster in the gold standard) into a corresponding cluster.
b/ Evaluation Measure (2/2)
• F-measure: the harmonic mean of purity and inverse purity (α = 0.5).
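The clustering measures can be sketched in code. The formulas on the original slides were not recovered, so this sketch assumes the commonly used definitions: purity is the fraction of items that fall in the best-matching gold cluster of their system cluster, inverse purity swaps the roles of system and gold clusters, and F = 1 / (α / Purity + (1 − α) / InversePurity). Function names and the toy data are illustrative.

```python
def purity(clusters, gold):
    """Fraction of items matched when each cluster maps to its best gold cluster.

    `clusters` and `gold` are lists of non-empty sets over the same items.
    """
    total = sum(len(cluster) for cluster in clusters)
    matched = sum(max(len(cluster & g) for g in gold) for cluster in clusters)
    return matched / total


def inverse_purity(clusters, gold):
    """Purity computed with the roles of system and gold clusters swapped."""
    return purity(gold, clusters)


def f_measure(clusters, gold, alpha=0.5):
    """Weighted harmonic mean of purity and inverse purity."""
    p = purity(clusters, gold)
    ip = inverse_purity(clusters, gold)
    return 1.0 / (alpha / p + (1.0 - alpha) / ip)


# Toy example: the system puts all five sentences into one cluster.
system = [{1, 2, 3, 4, 5}]
gold_standard = [{1, 2}, {3, 4, 5}]
p = purity(system, gold_standard)
ip = inverse_purity(system, gold_standard)
f = f_measure(system, gold_standard)
print(p, ip, f)
```

In the toy example a single large cluster gets perfect inverse purity but low purity, which is why the two measures are combined.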
c/ Experiment Results – Performance using SenSim (+ADJ)
Discussion and Conclusion
Limitation and Future work
• We did not conduct a human evaluation of the effectiveness of the newly proposed summary compared to current ones.
• Automatic sentiment analysis module integration.
• Better sentence semantic similarity measurement with deep analysis.
• Implicit facet handling.
• Sentence reformulation for summary output.
• Extending subtopics to other review summarization settings.
Conclusion
• We designed a complete summarization system targeting the domain of product reviews.
• We introduced an effective heuristic rule that uses syntactic role to improve product facet identification.
• We showed that subtopics exist within the discussion of product facets, and addressed this limitation of current summarization systems with our proposed clustering component.
• We extended the sentence semantic similarity measurement with sentiment information.
References
[Barzilay02] Barzilay, R., Elhadad, N., & McKeown, K. (2002). Inferring strategies for sentence ordering in multidocument news summarization. Journal of Artificial Intelligence Research, 17, 35–55.
[Car98b] Carbonell, J., & Goldstein, J. (1998). The use of MMR, Diversity-based Re-ranking for Reordering Documents and Producing Summaries. Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, 335–336.
[Ding08] Ding, X., Liu, B., & Yu, P. S. (2008). A Holistic Lexicon-based Approach to Opinion Mining. Proceedings of the international conference on Web search and web data mining (WSDM).
[Hat01] Hatzivassiloglou, V., Klavans, J. L., Holcombe, M. L., Barzilay, R., Kan, M. Y., & McKeown, K. R. (2001). SimFinder: A flexible clustering tool for summarization. Proceedings of the NAACL Workshop on Automatic Summarization, 41–49.
[Hat97] Hatzivassiloglou, V., & McKeown, K. R. (1997). Predicting the Semantic Orientation of Adjectives. Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, 174–181.
[Hovy01] Hovy, E. H. (2001). Automated text summarization. Handbook of computational linguistics. Oxford University Press, Oxford.
[Knight00] Knight, K., & Marcu, D. (2000). Statistics-based summarization-step one: Sentence compression. Proceedings of the National Conference on Artificial Intelligence, 703–710.
[Barzilay99] Barzilay, R., McKeown, K. R., & Elhadad, M. (1999). Information fusion in the context of multi-document summarization. Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, 550–557.
[Hu04b] Hu, M., & Liu, B. (2004b). Mining Opinion Features in Customer Reviews. Proceedings of the National Conference on Artificial Intelligence, 755–760.
[Hu05] Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer: Analyzing and comparing opinions on the web. Proceedings of the 14th international conference on World Wide Web.
[Kim06] Kim, S. M., & Hovy, E. (2006). Extracting Opinions, Opinion Holders, and Topics Expressed in Online News Media Text. Computational Linguistics.
[Li06] Li, Y., McLean, D., Bandar, Z. A., O'Shea, J. D., & Crockett, K. (2006). Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE Trans. on Knowledge and Data Engineering, 18(8), 1138–1150.
[Liu09] Liu, B. (2009). Sentiment Analysis and Subjectivity. Handbook of Natural Language Processing, 1–38.
[Popescu05] Popescu, A. M., & Etzioni, O. (2005). Extracting Product Features and Opinions from Reviews. Computational Linguistics, 339–346.
[Radev04] Radev, D., Jing, H., Styś, M., & Tam, D. (2004). Centroid-based summarization of multiple documents. Information Processing and Management, 40(6), 919–938.
[Turney02] Turney, P. D., & Littman, M. L. (2002). Unsupervised Learning of Semantic Orientation From a Hundred-Billion-Word Corpus.
[Wiebe99] Wiebe, J. M., Bruce, R. F., & O'Hara, T. P. (1999). Development and Use of a Gold-standard Data Set for Subjectivity Classifications. Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, 246–253.
[Ye05] Ye, S., Qiu, L., Chua, T., & Kan, M. Y. (2005). NUS at DUC 2005: Understanding Documents via Concept Links. Document Understanding Conference (DUC).
[Yu03] Yu, H., & Hatzivassiloglou, V. (2003). Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying the Polarity of Opinion Sentences. Proceedings of the conference on Empirical methods in natural language processing, 129–136.
Q & A
Thank you for your attention