Tell us the first step you would take to comprehend the passage below. Slumdog Millionaire, the latest megahit flick, tells the rags-to-riches story of a slum dweller. The movie, an adaptation of a novel, is based on the popular Indian version of the American contest show Who Wants to Be a Millionaire, which was well accepted by the masses. Freida Pinto is the heroine of the movie. She hails from Mumbai. Even though it was her debut movie, her exemplary performance has brought her offers for many Hollywood movies. Slumdog received numerous accolades from all over the world. Apart from the Oscar, some notable ones were the Toronto International Film Festival, Cannes, etc. Dept of CSE - IIT Bombay
Discourse Segmentation
CS 626 Course Seminar, Dept of CSE, IIT Bombay
Group 1: Sriraj (08305034), Dipak (08305901), Balamurali (08405401)
The way we go…
• Introduction
• Motivation
• TextTiling
• Context Vectors and Segmentation
• Lexical Chains and Segmentation
• Segmentation with LSA
• Conclusion
• References
INTRODUCTION
Discourse comes from the Latin word 'discursus'.
Discourse: A continuous stretch of (especially spoken) language larger than a sentence, often constituting a coherent unit such as a sermon, argument, joke, or narrative. (Crystal 1992)
Discourse ranges from novels to short conversations or even groans (cries).
Beaugrande's definition of discourse
• Cohesion - the grammatical relationships between parts of a sentence that are essential for its interpretation;
• Coherence - the order of statements relates them to one another by sense;
• Intentionality - the message has to be conveyed deliberately and consciously;
• Acceptability - the communicative product needs to be satisfactory, in that the audience approves it;
• Informativeness - some new information has to be included in the discourse;
• Situationality - the circumstances in which the remark is made are important;
• Intertextuality - reference to the world outside the text or to the interpreters' schemata.
DISCOURSE STRUCTURE - SALIENT FEATURES
• Existence of a hierarchy
• Segmentation at the semantic level
• Domain-specific knowledge
DISCOURSE SEGMENTATION
"Partition of full-length text into coherent multi-paragraph units" - Marti Hearst
MOTIVATION
• Text Summarization
• Question Answering
• Sentiment Analysis
• Topic Detection
TEXTTILING
Uses the TF-IDF idea within a single document.
Analogy - IR: Document : Entire Corpus :: NLP: Block : Entire Document
A term used more often inside a block weighs more.
Adjacent blocks containing many related terms are evidence of strong cohesion.
CONTD...
Algorithm:
• Divide the text into blocks (say, k sentences long).
• Compute cosine similarity between adjacent blocks:
  cos(b1, b2) = Σ_t w(t,b1) · w(t,b2) / √( Σ_t w(t,b1)² · Σ_t w(t,b2)² )
• Plot the smoothed, interpolated similarity against the sentence-gap number.
• The lowermost portions of the valleys are the boundaries.
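The algorithm above can be sketched in a few lines of Python. This is a minimal sketch with raw term frequencies as weights; Hearst's implementation additionally smooths the similarity curve and works over token sequences rather than whole sentences.

```python
# Minimal TextTiling-style sketch: fixed-size blocks, cosine similarity at
# each gap, and the deepest valleys taken as topic boundaries.
import math
from collections import Counter

def cosine(b1, b2):
    # cos(b1,b2) = sum_t w(t,b1)*w(t,b2) / sqrt(sum_t w(t,b1)^2 * sum_t w(t,b2)^2)
    num = sum(b1[t] * b2[t] for t in set(b1) & set(b2))
    den = math.sqrt(sum(v * v for v in b1.values()) *
                    sum(v * v for v in b2.values()))
    return num / den if den else 0.0

def texttile(sentences, k=2):
    # group sentences into blocks of k and count terms per block
    blocks = [Counter(w for s in sentences[i:i + k] for w in s.lower().split())
              for i in range(0, len(sentences), k)]
    # similarity at each gap between adjacent blocks
    sims = [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]
    # a valley: a gap whose similarity is below both of its neighbours
    return [i for i in range(1, len(sims) - 1)
            if sims[i] < sims[i - 1] and sims[i] < sims[i + 1]]
```

On a toy text whose vocabulary shifts from "cat/dog" sentences to "fish/sea" sentences, the single valley falls at the gap where the topic changes.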
CONTD...
[Figure: smoothed similarity plot with boundary valleys. Source: [1]]
Are we satisfied?
TEXTTILING - WHAT WENT WRONG?
• The same word need not be repeated, but a similar word could be.
• WSD was not performed - polysemy issues.
• Contextual information was not considered.
CONTEXT VECTORS & SEGMENTATION
Capture the contextual information in different blocks.
Steps:
• Encode contextual information - context vector creation.
• Create block vectors.
• Measure similarity - instead of TF-IDF weights, use context vectors:
  cos(v, w) = (v · w) / (|v| |w|)
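A minimal sketch of the idea (the helper names and window size are illustrative; Kaufmann's encoding differs in detail): a word's context vector counts its neighbours, so words used in similar contexts get similar vectors, and a block vector is the sum of the context vectors of its words.

```python
# Context-vector sketch: co-occurrence counts within a small window stand in
# for the encoded contextual information.
import math
from collections import Counter, defaultdict

def build_context_vectors(blocks, window=2):
    ctx = defaultdict(Counter)
    for block in blocks:
        for i, w in enumerate(block):
            for j in range(max(0, i - window), min(len(block), i + window + 1)):
                if j != i:
                    ctx[w][block[j]] += 1   # count co-occurring neighbours
    return ctx

def block_vector(block, ctx):
    v = Counter()
    for w in block:
        v.update(ctx[w])                    # block vector = sum of context vectors
    return v

def cosine(v, w):
    num = sum(v[t] * w[t] for t in set(v) & set(w))
    den = math.sqrt(sum(x * x for x in v.values()) *
                    sum(x * x for x in w.values()))
    return num / den if den else 0.0
```

Here "car" and "automobile" end up with identical context vectors whenever they appear with the same neighbours, which is exactly the signal plain term matching misses.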
DID IT DO THE TRICK?
Yes!
• Precision increased from 32% to 52%
• Recall increased from 40% to 51%
Let's try to improve a bit more!
LEXICAL CHAINS
• A technique for computing lexical cohesion.
• A sequence of related words in the text.
• Independent of the grammatical structure.
• Provides a context for disambiguation.
• Enables identification of the concept.
Different forms of lexical cohesion
• Repetition
• Repetition through synonymy (police, officers)
• Word association through:
  • specialization/generalization (murder weapon, knife)
  • part-whole/whole-part relationships (committee, members)
• Statistical association between words (Osama Bin Laden and the World Trade Center)
How
• Uses an auxiliary resource (WordNet) to cluster words into sets of related concepts.
• Areas of low cohesive strength are good indicators of topic boundaries.
• Process: tokenizer -> lexical chainer -> boundary detector
Process
• Tokenizer
  • POS tagging
  • Morphological analysis
• Lexical chainer
  • Finds relations between tokens
  • Single-pass clustering: the first token starts the first chain; each subsequent token is added to the most recently updated chain with which it shares the strongest relationship.
Process Contd...
• Boundary detection
  • A high concentration of chains beginning and ending between two adjacent textual units signals a boundary.
  • Boundary strength: w(n, n+1) = E * S, where
    E = number of lexical chains whose span ends at sentence n
    S = number of chains whose span begins at sentence n+1
  • Take the mean of all non-zero scores; this mean acts as the minimum allowable boundary strength.
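The boundary detector above can be sketched as follows, with each lexical chain reduced to a hypothetical (start_sentence, end_sentence) span (the chaining step itself is assumed done):

```python
# Boundary-strength sketch: w(n, n+1) = E * S per gap, with the mean of the
# non-zero scores used as the minimum allowable boundary strength.
def detect_boundaries(chains, n_sents):
    scores = {}
    for n in range(n_sents - 1):
        E = sum(1 for s, e in chains if e == n)      # chains ending at sentence n
        S = sum(1 for s, e in chains if s == n + 1)  # chains starting at n+1
        scores[n] = E * S
    nonzero = [v for v in scores.values() if v > 0]
    if not nonzero:
        return []
    threshold = sum(nonzero) / len(nonzero)          # mean of non-zero scores
    return [n for n, v in scores.items() if v >= threshold]
```

With two chains ending at sentence 2 and two chains starting at sentence 3, the gap (2, 3) scores 2 * 2 = 4 and is reported as a boundary.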
And the improvement is…
• Evaluation metrics
  • Precision
  • Recall
Problems with frequency-vector-based similarity: short passages
• The similarity estimate is inaccurate for short passages.
• An additional occurrence of a common word (reflected in the numerator) causes a disproportionate increase in sim(x, y) unless the denominator is large.
Problems with frequency-vector-based similarity, cont'd (2): the term-matching problem
• car; automobile
• car; petrol
• Similar or related but distinct words are counted as negative evidence.
• Solutions:
  • Stemming
  • Thesaurus- or WordNet-based similarity measures
  • Latent Semantic Analysis
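The term-matching problem can be seen directly with a toy pair of made-up sentences: two passages about the same topic share only their function words, so term-vector cosine barely registers the similarity.

```python
# Toy demo: "car"/"automobile" and "petrol"/"fuel" contribute nothing to the
# cosine numerator, so only the shared function words score.
import math
from collections import Counter

def cosine(x, y):
    num = sum(x[t] * y[t] for t in set(x) & set(y))
    den = math.sqrt(sum(v * v for v in x.values()) *
                    sum(v * v for v in y.values()))
    return num / den if den else 0.0

x = Counter("the car needs petrol".split())
y = Counter("the automobile needs fuel".split())
print(cosine(x, y))  # 0.5: only "the" and "needs" overlap
```

Stemming cannot help here (car and automobile share no stem); it takes a thesaurus, WordNet, or LSA to recover the link.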
Introduction to LSA
• LSA stems from work in IR.
• Represents word and passage meaning as high-dimensional vectors in a semantic space.
• Does not use humanly constructed dictionaries, knowledge bases, semantic networks, etc.
• Meaning of a word: the average of the meanings of all passages in which it appears.
• Meaning of a passage: the average of the meanings of all the words it contains.
Training LSA
• Input: a set of texts.
• Vocabulary: used to build a word-by-text frequency matrix (cell (i, j) = frequency of word i in text j).
Training LSA, cont'd (2)
• The cell values are scaled according to a general form of inverse document frequency.
• Dimensionality reduction using SVD.
Training LSA, cont'd (3)
• SVD factors the matrix as A ≈ U_k Σ_k V_kᵀ; the rows of U_k Σ_k form the k-dimensional LSA space, and row i is the LSA feature vector for word w_i.
• Benefits of applying SVD:
  • A concise representation: storage and the complexity of the similarity matrix are reduced.
  • Captures the major structural associations between words and documents.
  • Noise is removed simply by omitting the less salient dimensions of U.
Applying LSA
• A sentence s_i is represented by its term-frequency vector f_i, where f_ij is the frequency of term j in s_i.
• Meaning of s_i: the frequency-weighted sum of the LSA word vectors, Σ_j f_ij · u_j (i.e., the average of the meanings of its words).
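The training and application steps can be sketched with NumPy on a toy corpus. The corpus, the choice of k, and the helper names are all illustrative, and the raw-frequency weighting here omits the IDF-style scaling mentioned earlier.

```python
# LSA sketch: term-document matrix -> truncated SVD -> word and sentence vectors.
import numpy as np

docs = ["car engine road", "automobile engine fuel", "fish sea boat", "boat sea net"]
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# term-document matrix A: A[i, j] = frequency of word i in document j
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[idx[w], j] += 1

# truncated SVD: A ~ U_k S_k V_k^T; rows of U_k * S_k are the LSA word vectors
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]

def sentence_vec(sentence):
    # meaning of a sentence: frequency-weighted sum of its LSA word vectors
    f = np.zeros(len(vocab))
    for w in sentence.split():
        f[idx[w]] += 1
    return f @ word_vecs

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

On this corpus, "car road" and "automobile fuel" come out highly similar even though they share no words, because "car" and "automobile" both co-occur with "engine"; plain term-vector cosine would score them zero.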
Significance of k
• Finding the optimal dimensionality is an important step in LSA.
• Hypothetically, the optimal space for the reconstruction has the same dimensionality as the source that generates the discourse.
• The source generates passages by choosing words from a k-dimensional space in such a way that words in the same paragraph tend to be selected from nearby locations.
LSA results
• LSA is twice as accurate as the word-similarity-based co-occurrence vector (error reduced from 22% to 11%).
• LSA values become less accurate as more dimensions are incorporated into the feature vectors.
Conclusion
• TextTiling, context-vector-based similarity, lexical chaining and LSA are all bag-of-words approaches.
• Bag-of-words approaches are sufficient... to some extent.
"LSA makes no use of word order, thus of syntactic relations or logic, or of morphology. Remarkably, it manages to extract reflections of passage and word meanings quite well without these aids, but it must still be suspected of resulting incompleteness or likely error on some occasions" - excerpt from [5].
Contd..
• LSA is purely statistical, whereas the other approaches use some form of external knowledge base in addition to statistical techniques.
• Role of external knowledge.
• To move to the next level we need some linguistics.
• We need the right mix of statistical and linguistic approaches to move forward.
References
[1]. Hearst, M. A. 1993. TextTiling: A Quantitative Approach to Discourse Segmentation. Technical Report S2K-93-24, University of California at Berkeley.
[2]. Kaufmann, S. 1999. Cohesion and Collocation: Using Context Vectors in Text Segmentation. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 99-107.
[3]. Landauer, T. K., Foltz, P. W., and Laham, D. 1998. An Introduction to Latent Semantic Analysis. Discourse Processes, 25, pages 259-284.
[4]. Barzilay, Regina and Michael Elhadad. 1997. Using Lexical Chains for Text Summarization. In Proceedings of the Intelligent Scalable Text Summarization Workshop (ISTS-97), Madrid, Spain.
[4]. Choi, Freddy Y. Y. 2000. Advances in Domain-Independent Linear Text Segmentation. In Proceedings of NAACL, pages 26-33.
[5]. Choi, Freddy Y. Y., Peter Wiemer-Hastings, and Johanna Moore. 2001. Latent Semantic Analysis for Text Segmentation. In Proceedings of EMNLP, pages 109-117.
[6]. Stokes, N., Carthy, J., and Smeaton, A. F. 2002. Segmenting Broadcast News Streams Using Lexical Chains. In Proceedings of the 1st Starting AI Researchers Symposium (STAIRS 2002), volume 1, pages 145-154.
Contd..
[7]. http://www.freewebs.com/hsalhi/Discourse%20Analysis%20Handout.doc
[8]. http://ilze.org/semio/005.htm
[9]. http://www.dfki.de/etai/SpecialIssues/Dia99/denecke/ETAI-00/node11.html
[10]. http://www.csi.ucd.ie/staff/jcarthy/home/Lex.html