Conversa Overview
Conversa System Pipeline
[Pipeline diagram. Processing stages: Tokenization → Polygram Analysis → Collocation Discovery → Co-Occurrence Matrix → Disambiguation & Splitting → Term Clustering → Automatic Annotation → Text Synthesis. Intermediate data: Raw text, Tokens/Types, Centers (n=1), Surrounds, Occurrence Counts, Centers (n>1), Expressions, Co-occurrence Vectors, Splits, Term Clusters.]
Tokenization
[Pipeline diagram, Tokenization stage highlighted.] Example raw text: He acted awful mysterious like, and finally he asks me if I'd like to own half of a big nugget of gold. I told him I certainly would." "And then?" asked Sam, as the old miser paused to take a bite of bread and meat.
Polygram Analysis
[Pipeline diagram, Polygram Analysis stage highlighted; each slide highlights one single-word center (n=1) and its surround in the tokenized example text.] Examples: center "he" with surround <start>_acted; center "acted" with surround he_awful; center "awful" with surround acted_mysterious; center "bite" with surround a_of.
Collocation Discovery
[Pipeline diagram, Collocation Discovery stage highlighted.] Multi-word centers (n>1) discovered from the example text: "big nugget of gold" with surround a_. and "bread and meat" with surround of_.
Constructing the Co-Occurrence Matrix
[Pipeline diagram, Co-Occurrence Matrix stage highlighted.]
Disambiguation and Split
[Pipeline diagram, Disambiguation & Splitting stage highlighted.] Example targets "days" and "part": each target's co-occurrence vector over surrounds (there_be, first_of, of_things, three_of, a_of, great_to, early_of, she_no, the_of, …) is partitioned into associative features, non-associative features, and zero counts, and each target is split accordingly.
Collocation Discovery
• Collocation: a multi-word expression that corresponds to some conventional way of saying things.
  • Non-Compositionality
  • Non-Substitutionality
  • Non-Modifiability
• Current Methods
  • Word Counts on Span
  • Word-to-Word Comparison
  • Assumption of Independence
Collocation Discovery using Stop Words
• High-frequency stop words carry very little semantic content but indicate grammatical relationships with other words.
• They can be used to delimit collocations. Example: the frequently occurring stop words {a, in} can detect noun phrases:
"Start the buzz-tail," said Cap'n Bill, with a tremble in his voice.
There was now a promise of snow in the air, and a few days later the ground was covered to the depth of an inch or more.
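A minimal sketch of this stop-word-delimited extraction, assuming a single {a, in} stop-word pair and whitespace tokens (the function name and the max-span limit are illustrative, not part of Conversa):

```python
def spans_between_stop_words(tokens, left="a", right="in", max_len=5):
    """Yield token spans enclosed by the stop-word pair (left, right)."""
    for i, tok in enumerate(tokens):
        if tok != left:
            continue
        # look ahead for the closing stop word within max_len tokens
        for j in range(i + 2, min(i + 2 + max_len, len(tokens))):
            if tokens[j] == right:
                yield tokens[i + 1:j]
                break

sentence = "there was now a promise of snow in the air".split()
print(list(spans_between_stop_words(sentence)))  # [['promise', 'of', 'snow']]
```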
Definitions
Triplet UVW: the combined predecessor, center, and successor of three or more words contained within a sentence.
Surrounds: any observed pairing of predecessor and successor words enclosing one or more centers.
Collocation Discovery Step 1: discover surrounds from the corpus, count their occurrences, and rank order them from highest to lowest occurrence count
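A sketch of Step 1 under the triplet UVW definition above: enumerate the (predecessor, successor) pair around every center within each sentence, count occurrences, and rank-order them (the function and parameter names are illustrative):

```python
from collections import Counter

def discover_surrounds(sentences, max_center_len=3):
    """Count (predecessor, successor) surrounds enclosing centers of
    1..max_center_len words, staying inside sentence boundaries."""
    counts = Counter()
    for sent in sentences:                      # sent: list of tokens, e.g. ["<start>", "he", ...]
        for n in range(1, max_center_len + 1):  # center length
            for i in range(1, len(sent) - n):   # center starts at index i
                counts[(sent[i - 1], sent[i + n])] += 1
    return counts.most_common()                 # rank-ordered, highest count first
```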
Step 2: Select Surrounds
Select the top k surrounds from the rank-ordered list satisfying the surrounds total proportionality criterion.
Example: with υ = 25%, the top 1,848 surrounds are selected for collocation candidate extraction.
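One plausible reading of the proportionality criterion is to take the smallest prefix of the ranked list whose cumulative occurrence count reaches the fraction υ of all surround occurrences; the sketch below assumes that reading (the exact criterion used by Conversa may differ):

```python
def select_top_surrounds(ranked, upsilon=0.25):
    """Select the top surrounds whose cumulative count reaches upsilon of
    the total surround occurrences. 'ranked' is [(surround, count), ...]
    sorted from highest to lowest count."""
    total = sum(count for _, count in ranked)
    selected, cumulative = [], 0
    for surround, count in ranked:
        if cumulative >= upsilon * total:
            break
        selected.append(surround)
        cumulative += count
    return selected
```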
Step 3: Extract Collocation Candidates
Given the surround {a, in}, discover collocation candidates:
Here it leaped [..], just as a wild beast in captivity paces angrily [..] .
Candidate: "wild beast"
Step 3: Extract Collocation Candidates
It was the one that seemed to have had a hole bored in it and then plugged up again .
Candidate: "hole bored" – which is not really a good collocation!
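A sketch of candidate extraction given the selected surrounds: every multi-word center enclosed by a selected (predecessor, successor) pair becomes a candidate, which is why spurious spans like "hole bored" also show up at this stage (names and limits are illustrative):

```python
from collections import Counter

def extract_candidates(sentences, surrounds, max_center_len=4):
    """Collect the multi-word centers enclosed by any selected surround
    (pred, succ) as collocation candidates, with occurrence counts."""
    surround_set = set(surrounds)
    candidates = Counter()
    for sent in sentences:                       # sent: list of tokens
        for n in range(2, max_center_len + 1):   # multi-word centers only
            for i in range(1, len(sent) - n):
                if (sent[i - 1], sent[i + n]) in surround_set:
                    candidates[tuple(sent[i:i + n])] += 1
    return candidates
```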
Step 4: Select Collocations
Apply a non-parametric variation of a frequently applied method to determine which collocation candidates co-occur significantly more often than chance. The null hypothesis assumes words are selected independently at random, so the probability of a collocation candidate V is the same as the product of the probabilities of its individual words: P(V) = P(w1) · P(w2) · … · P(wn).
Step 4: Select Collocations
• Under the null hypothesis, selection of a word is essentially a Bernoulli trial with parameter P(V), with mean µ = P(V) and sample variance s² = P(V)(1 − P(V)).
• Since P(V) << 1.0, s² ≈ P(V).
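A sketch of the parametric form of this test (the frequently applied independence test for collocations); the slides adapt it non-parametrically via the empirical CDF below, so this is illustrative only, with assumed names:

```python
import math

def collocation_z_score(candidate_count, word_counts, total_tokens):
    """Score a candidate against the independence null hypothesis.
    mu = product of unigram probabilities (the null P(V)); observed
    p_hat = count / N; s^2 ~= P(V) since P(V) << 1 (Bernoulli trials)."""
    n = total_tokens
    mu = math.prod(c / n for c in word_counts)   # null probability of the phrase
    p_hat = candidate_count / n
    return (p_hat - mu) / math.sqrt(mu / n)      # large positive value => reject H0
```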
Use of the Empirical CDF
The empirical CDF approximates the true, but unknown, distribution F:
F̂_N(x) = (1/N) · #{ x_i ≤ x }, i = 1, …, N
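A minimal sketch of the empirical CDF as a step function over the observed sample (an illustrative helper, not Conversa code):

```python
import numpy as np

def empirical_cdf(samples):
    """Return F_hat(x) = (1/N) * #{x_i <= x}, a step-function
    approximation of the true but unknown distribution."""
    xs = np.sort(np.asarray(samples, dtype=float))
    n = len(xs)
    return lambda x: np.searchsorted(xs, x, side="right") / n
```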
Confidence Bounds for the Empirical CDF
The Dvoretzky–Kiefer–Wolfowitz (DKW) inequality provides a method of computing the upper and lower confidence bounds for an empirical CDF given a type-I error probability α and the total number of instances within the sample, N:
ε = sqrt( ln(2/α) / (2N) )
The upper and lower bounds of the empirical CDF can then be calculated:
F̂_N(x) + ε and F̂_N(x) − ε
The Gutenberg Youth Corpus, with N = 9,743,797 and a selected α = 0.05, provides a very tight uncertainty bound of 95% ± 0.04% and a critical value of 2.51.
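The standard DKW band half-width reproduces the ±0.04% figure quoted above for N = 9,743,797 and α = 0.05; a small sketch:

```python
import math

def dkw_epsilon(n, alpha=0.05):
    """DKW half-width: eps = sqrt(ln(2/alpha) / (2N)). The band
    [F_hat(x) - eps, F_hat(x) + eps] covers the true CDF with
    probability at least 1 - alpha."""
    return math.sqrt(math.log(2 / alpha) / (2 * n))

eps = dkw_epsilon(9_743_797, alpha=0.05)   # ~0.00044, i.e. roughly +/- 0.04%
```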
Splitting
Problem: solve a multi-membership clustering problem where:
• Targets, t, belong to one or more classes, C
• Each target has a feature vector, f_t, of N occurrence counts in [0, ∞)
• Class membership is indicated uniquely by one or more features
• Feature vectors are noisy – random counts may occur that are false indicators of actual class membership
Objective: cluster targets by class membership, such that each class forms a distinct, homogeneous sub-tree and each target is placed in the class-clusters representing its complete class membership (i.e., a target must appear in one or more class-clusters).
Example
Given five words (quickly, wise, test, run, eat), generate a clustering by POS class membership using surround (words before and after) feature counts.
[Diagram: the five words grouped into overlapping POS clusters; several words (e.g. "test") appear in more than one cluster.]
Splitting Feature Vectors
The fundamental measure of distance is the Pearson product-moment correlation coefficient:
r_AB = Σ_i z_A,i · z_B,i / (N − 1), where z_A,i = (f_A,i − f̄_A) / s_A and z_B,i = (f_B,i − f̄_B) / s_B
When most features have counts near 0, if both z_A,i and z_B,i are > 0, then feature f_i strengthens the correlation of A and B.
Define f_i a correlative feature for targets A and B if z_A,i, z_B,i > 0; otherwise, f_i is defined non-correlative.
Split f into two vectors of length N: correlative f(a) and non-correlative f(n): f(a)_i = f_i if f_i is correlative, and 0 otherwise; f(n)_i = f_i if f_i is non-correlative, and 0 otherwise.
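A small sketch of the z-score computation and the correlative/non-correlative split for two targets' feature vectors (NumPy arrays; array and function names are illustrative):

```python
import numpy as np

def split_correlative(f_a, f_b):
    """Split feature vectors into correlative and non-correlative parts:
    feature i is correlative for targets A and B when both z-scores are > 0."""
    z_a = (f_a - f_a.mean()) / f_a.std(ddof=1)
    z_b = (f_b - f_b.mean()) / f_b.std(ddof=1)
    correlative = (z_a > 0) & (z_b > 0)
    f_a_corr, f_a_non = np.where(correlative, f_a, 0), np.where(correlative, 0, f_a)
    f_b_corr, f_b_non = np.where(correlative, f_b, 0), np.where(correlative, 0, f_b)
    return (f_a_corr, f_a_non), (f_b_corr, f_b_non)
```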
Dealing with Noisy Features
We need a statistical test to separate noisy correlative features from non-correlative ones. Assume a random process is uniformly inserting counts that do not indicate class membership.
[Diagram: target pairs A, B shown as non-correlative vs. correlative.]
Perform the test on each f_i given a type-I error probability α. Assume z_A,i, z_B,i ~ N(0,1) [the assumption of normality seems weak, since the distribution of z is so highly skewed – perhaps applying a geometric estimate would be more appropriate].
Null hypothesis H0: f_i is non-correlative (z_A,i = 0 or z_B,i = 0)
Alternative hypothesis H_alt: f_i is correlative (z_A,i > 0 and z_B,i > 0)
H0 rejection region: z_A,i > z_α and z_B,i > z_α
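A sketch of this per-feature test, flagging a feature as correlative only when both z-scores exceed the one-sided critical value z_α (the normality assumption carries over from the slide, with its stated caveat; names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def correlative_features(f_a, f_b, alpha=0.20):
    """Return a boolean mask of features whose z-scores for both targets
    exceed the one-sided critical value z_alpha (reject H0)."""
    z_crit = norm.ppf(1 - alpha)                 # e.g. ~0.84 for alpha = 0.20
    z_a = (f_a - f_a.mean()) / f_a.std(ddof=1)
    z_b = (f_b - f_b.mean()) / f_b.std(ddof=1)
    return (z_a > z_crit) & (z_b > z_crit)
```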
Test Example
Four distinct classes with correlative features, and three targets with counts from each class [tables omitted]. Up to 10% additional noise is distributed uniformly over all N = 1000 features.
Correlative Features Detected at α = 20%
Features f7 and f8 from class CC, and f1, f2, and f3 from class CA, are detected associating targets t1 and t3.
Features f10 and f12 from class CD are detected associating targets t1 and t2.
No association is detected between targets t2 and t3.
Bibliography
Baldwin, T., Kordoni, V., Villavicencio, A. (2009). Prepositions in Applications: A Survey and Introduction to the Special Issue. Computational Linguistics, Volume 38, Number 2.
Entity Extraction
04053148: High-Performance Unsupervised Relation Extraction from Large Corpora, Rosenfeld, Feldman, ICDM'06
• Unsupervised relation discovery based on clustering
• Used a small number of existing relationship patterns
05743647: Extracting Descriptive Noun Phrases From Conversational Speech, BBN Technologies, 2002
• Used the BBN Statistically-derived Information from Text (SIFT) tool to extract noun phrases, combined with speech recognition, using the Switchboard I tagged corpus
05340924: Named-Entity Techniques for Terrorism Event Extraction and Classification, 2009
• Thai language; features derived from the terrorism gazetteer, terrorism ontology, and terrorism grammar rules; TF-IDF distance for some standard machine learning algorithms (k-nearest neighbors, SVM, decision tree)
05484737: Text Analysis and Entity Extraction in Asymmetric Threat Response and Prediction
• Uses a named-entity lexicon and fixed bigrams to extract entities – refers to the NIST Automatic Content Extraction (ACE) program
Bibliography
05484763: Unsupervised Multilingual Concept Discovery from Daily Online News Extracts, 2010
• Applies left and right context to extract multi-word key terms, then applies hierarchical clustering for concept discovery
• The news corpus was obtained using an RSS feed application called TheYolk
0548765: Entity Refinement using Latent Semantic Indexing, Agilex, 2010
• Starts with "state-of-the-art" commercial entity extraction software, creates an LSI representation of multi-word text blocks, and queries the LSI space for a ranked list of major terms
047-058: Evaluation of Named Event Extraction Systems
• Java entity extractors: Annie, LingPipe
• The 10 evaluated systems indicate problems extracting noun phrases
• Addresses deficiencies with NERC conferences: CoNLL, MUC, NIST ACE / Text Analysis Conference: http://www.nist.gov/tac/2012/KBP/
2002-coling-names-pub: Unsupervised Learning of Generalized Names, 2002
• Uses patterns from seed terms to learn new terms (but apparently not the surrounds)
• Seeks to identify generalized names from medical corpora, like "mad cow disease"
Bibliography
qatar-bhomick: Rich Entity Type Recognition in Text, 2010
• Word-based, context-based tagger using a perceptron-trained HMM
Collocation Extraction beyond Independence Assumption, ACL 2010
• Extracts collocations using PMI and aggregate Markov models
• German collocation gold standard
• Very low precision results reported
Automatically Extracting and Representing Collocations for Language Generation (1998), Smadja, McKeown
• Stock market collocations
An Extensive Empirical Study of Collocation Extraction Methods, ACL Student Research Workshop, 2005
• 87 features of similarity for collocations
• Performance measured in precision and recall