A Sentence Boundary Detection System Student: Wendy Chen Faculty Advisor: Douglas Campbell
Introduction • People end sentences with ., ?, and ! • End-of-sentence marks are overloaded.
Introduction • The period is the most ambiguous mark: it also appears in decimals, e-mail addresses, abbreviations, initials in names, and honorific titles. • For example: U.S. Dist. Judge Charles L. Powell denied all motions made by defense attorneys Monday in Portland's insurance fraud trial. Of the handful of painters that Austria has produced in the 20th century, only one, Oskar Kokoschka, is widely known in the U.S. This state of unawareness may not last much longer.
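A minimal illustration (not part of the system described in the talk) of why the period is ambiguous: a naive splitter that breaks after every `.`, `?`, or `!` shreds the example sentence above into fragments.

```python
import re

# Naive splitter: break after any ., ?, or ! followed by whitespace.
def naive_split(text):
    return re.split(r"(?<=[.?!])\s+", text)

text = ("U.S. Dist. Judge Charles L. Powell denied all motions made by "
        "defense attorneys Monday in Portland's insurance fraud trial.")

# One real sentence comes back as four fragments:
# "U.S.", "Dist.", "Judge Charles L.", "Powell denied ... trial."
fragments = naive_split(text)
```

The abbreviation "Dist." and the initial "L." each trigger a spurious break, which is exactly the ambiguity the sentenizer's rules must resolve.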
Introduction • Sentence boundary detection by humans is tedious, slow, error-prone, and extremely difficult to codify. • Algorithmic syntactic sentence boundary detection is a necessity.
Five Applications I. Part-of-speech tagging • Examples of part-of-speech include nouns, verbs, adverbs, prepositions, conjunctions, and interjections. • John [noun] Smith [noun], the [determiner] president [noun] of [preposition] IBM [noun] announced [verb] his [pronoun] resignation [noun] yesterday [noun].
Five Applications II. Natural language parsing • Identify the hierarchical constituent structure in a sentence. • [Slide diagram: a parse tree for "John Smith, the president of IBM, announced his resignation yesterday" with phrase labels S, NP, PP and part-of-speech tags NNP, DT, NN, IN, VBD, PRP$.]
Five Applications III. Reading level of a document • Readability measures such as the Bormuth Grade Level and the Flesch Reading Ease use information about the sentences in a document.
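As a sketch of how sentence counts feed a readability measure, here is the standard Flesch Reading Ease formula; the vowel-group syllable counter is a crude assumption of this sketch, not anything from the talk.

```python
import re

def count_syllables(word):
    # Crude vowel-group heuristic (an assumption, not from the talk).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(sentences):
    # sentences: one string per sentence, as a sentenizer would emit.
    words = [w for s in sentences for w in re.findall(r"[A-Za-z']+", s)]
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

score = flesch_reading_ease(["The cat sat on the mat.", "It was warm."])
```

Because the formula divides by the number of sentences, a wrong sentence count directly skews the reported reading level.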
Five Applications IV. Text editors • The command to move to the end of a sentence. V. Plagiarism detection
Related Work • As of 1997: “identifying sentences has not received as much attention as it deserves.” [Reynar and Ratnaparkhi 1997] • “Although sentence boundary disambiguation is essential . . ., it is rarely addressed in the literature and there are few public-domain programs for performing the segmentation task.” [Palmer and Hearst 1997] • Two approaches: • Rule-based approach • Machine-learning-based approach
Related Work I. Rule-based approach • Regular expressions [Cutting 1991] • Mark Wasson converted a grammar into a finite automaton with 1419 states and 18002 transitions. • Lexical endings of words: [Müller 1980] uses a large word list.
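In the spirit of the rule-based approach (a hypothetical sketch, not Cutting's or Wasson's actual system): treat a period followed by a capitalized word as a boundary unless the preceding token is a listed abbreviation or a single initial. The abbreviation list here is illustrative only.

```python
import re

# Illustrative abbreviation list; a real system's list would be larger.
ABBREVIATIONS = {"dr", "mr", "mrs", "dist", "rd", "u.s"}

def rule_based_split(text):
    sentences, start = [], 0
    # Candidate boundary: ., ?, or ! followed by space and a capital.
    for m in re.finditer(r"[.?!]\s+(?=[A-Z])", text):
        tokens = text[start:m.end()].split()
        token = tokens[-1].rstrip(".?!").lower() if tokens else ""
        if token in ABBREVIATIONS or len(token) == 1:
            continue  # abbreviation or initial: not a real boundary
        sentences.append(text[start:m.start() + 1])
        start = m.end()
    sentences.append(text[start:])
    return sentences
```

The appeal of this style of rule is visible even in a sketch: adding one entry to the abbreviation list changes the system's behavior without touching the engine.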
Related Work II. Machine-learning-based approach • [Riley 1989] uses regression trees. • [Palmer and Hearst 1997] use decision trees and neural networks.
Our Approach • Punctuation rules to disambiguate end-of-sentence punctuation. • A punctuation rule-based model is simple in design and easy to modify.
Our Reference Corpus • A “sentence” reference corpus is a corpus with each sentence put on its own line. • We manipulated the Brown Corpus to create a 51590-sentence reference corpus. • Two sections: training text and final run text.
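The one-sentence-per-line format and the two-section split can be sketched as follows; the function name and the 50/50 split fraction are assumptions of this sketch, since the talk does not give the actual proportions.

```python
# Reference corpus: one sentence per line; the first section is
# training text, the remainder is final-run text.
def split_reference_corpus(lines, train_fraction=0.5):
    sentences = [line.strip() for line in lines if line.strip()]
    cut = int(len(sentences) * train_fraction)
    return sentences[:cut], sentences[cut:]

train, final_run = split_reference_corpus(
    ["One.\n", "Two.\n", "Three.\n", "Four.\n"])
```

Keeping each sentence on its own line makes the later diff-based analysis trivial: any disagreement with the sentenizer's output shows up as a changed line.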
High Level Architecture • [Slide diagram: a Text Document feeds the Sentenizer Module (Rules + Sentence Recognizer), which emits Sentences; the Analysis Module compares them against the Reference Corpus.]
Our Sentenizer Module • Our sentenizer module has two parts: • A set of end-of-sentence punctuation rules. • An engine to apply the rules.
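One way the two-part design can be sketched (hypothetical; the talk does not give the engine's actual interface): each rule is a predicate that may decide a candidate boundary or abstain, and the engine applies the rules in order until one decides.

```python
# Engine: try each rule in order; a rule returns True/False to decide,
# or None to abstain and pass the candidate to the next rule.
def make_engine(rules):
    def is_boundary(context):
        for rule in rules:
            verdict = rule(context)
            if verdict is not None:  # rule fired: accept its decision
                return verdict
        return True                  # default: treat the mark as a boundary
    return is_boundary

# Example rules over a context with 'prev_token' and 'next_word'
# (field names are assumptions of this sketch).
not_after_initial = lambda c: False if len(c["prev_token"].rstrip(".")) == 1 else None
next_lowercase    = lambda c: False if c["next_word"][:1].islower() else None

engine = make_engine([not_after_initial, next_lowercase])
```

This separation is what makes the model easy to modify: adding a rule means appending one predicate, with no change to the engine.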
Our Analysis Module • [Slide diagram: the Analysis Module diffs the Sentenizer output against the Reference Corpus, producing per-rule <rule>.txt files and a rules_summary.]
Our Analysis Module • Sample <rule>.txt output:
The Japanese want to increase exports to the U.S. |||| While they have been curbing shipments, they have watched Hong Kong step in and capture an expanding share of the big U.S. market.
The Hartsfield home is at 637 E. Pelham Rd. @@PE@@ NE.
But what came in was piling up. |||| @@PE@@ The nearest undisrupted end of track from Boston was at Concord, N. H.
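The comparison the analysis module performs can be sketched as set arithmetic over boundary positions (a hypothetical simplification of the diff step; the talk uses `diff` on line-per-sentence files):

```python
# predicted and reference: sets of boundary positions, e.g. character
# offsets at which a sentence is taken to end.
def score_boundaries(predicted, reference):
    fp = len(predicted - reference)      # breaks the sentenizer invented
    fn = len(reference - predicted)      # breaks the sentenizer missed
    correct = len(predicted & reference)
    return fp, fn, correct
```

This split into false positives and false negatives is exactly how the evaluation on the testing corpus reports its 120 errors.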
Overview of Experiment Results
 Run     | Key description                           | % of correctly marked sentences
---------|-------------------------------------------|--------------------------------
* Run 1  | All marks                                 | 84.35%
* Run 2  | Mark at token end                         | 89.03%
  Run 3  | Correction of text                        | 88.31%
* Run 4  | Double punctuation endings                | 89.01%
* Run 5  | Check next word capitalization            | 89.53%
  Run 6  | Correction of text                        | 89.55%
  Run 7  | Modify capitalization function            | 91.41%
  Run 8  | Correction of text                        | 91.35%
  Run 9  | Modify capitalization function            | 91.35%
  Run 10 | Correction of text                        | 90.40%
  Run 11 | Correction of text                        | 90.58%
* Run 12 | Add abbreviation list                     | 95.60%
  Run 13 | Check single initials                     | 98.94%
* Run 14 | Form black chunk of token                 | 99.12%
  Run 15 | Check numbering lists and double initials | 99.85%
* Run 16 | Reduce abbreviation list                  | 98.90%
  Run 17 | Check special abbreviations               | 99.83%
  Run 18 | Confidence ratings                        | 99.83%
* Run 19 | Check sentences with ellipsis points      | 99.83%
* Run 20 | Check sentences with parenthesis marks    | 99.83%
Evaluation on Testing Corpus • Testing corpus: 26647 sentences • Sentenizer output: 26613 sentences • 120 errors in total • 43 false positives • 77 false negatives • 99.84% accuracy
Contributions • Highly accurate • 99.8% accuracy rate • Comparable to or better than existing systems • Highly efficient • About 50 double-spaced papers per second • About 1000 sentences per second • Easily modifiable • A rule-based model