DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES. Tunga Güngör, Boğaziçi University, Computer Engineering Dept., Istanbul, Turkey (Visiting Professor at TALP Research Center, UPC)
OUTLINE • INTRODUCTION • LITERATURE SURVEY • Search Engines and Query Types • Automatic Analysis of Documents • Automatic Summarization • OVERVIEW OF METHODOLOGY • System Architecture • Implementation • Data Collection • STRUCTURAL PROCESSING • Rule-based Approach • Machine Learning Approach • SUMMARY EXTRACTION • DISCUSSION • FUTURE RESEARCH
Introduction • Rapid growth of information sources • World Wide Web • “information overload” • 50% of documents viewed in search engine results • not relevant (Jansen and Spink, 2005) • Users are interested in different types of search • rather than queries with commonplace answers • e.g. capital city of Sweden • specific and complex queries • e.g. best countries for retirement • tasks such as background search • e.g. literature survey on Mexican air pollution
Introduction (cont.) • Available search engines • results in response to a user query • each presented with a short ‘summary’ • 2-3 line extracts • document fragments containing query words • fail to reveal their context within the whole document • The users • scroll down the results • click those that seem relevant to their real information need • inadequate summaries • missing relevant documents • spending time with irrelevant documents • not feasible to open each link
Introduction (cont.) • Automatic summarization • as successful as humans • long-term research direction (Sparck Jones, 1999) • improve effectiveness of other tasks • e.g. information retrieval • Traditionally, automatic summarization research: • general-purpose summaries • e.g. the “abstract page” of a report • But, need to bias towards user queries • in an information retrieval paradigm • a document is seen as a flat sequence of sentences • ignoring the inherent structure • But, Web documents • complex organization of content • sections and subsections with different topics and formatting
Research Goals • a novel summarization approach for Web search • combining these two aspects • Document structure • Query-biased techniques • not investigated together in previous studies • Intuition • providing the context of searched terms • preserving the structure of the document • Sectional hierarchy and heading structure • may help the users to determine the relevancy of results better • Two-stage approach • Structural processing • Summary extraction
Research Goals (cont.) • Web documents • no domain restriction • typically heterogeneous • images, text in different formats, forms, menus, etc. • diverse content • with sections on different topics, advertisements, etc. • Structural and semantic analysis of Web documents • Heading-based sectional hierarchy • Use of this structural and semantic information • during summarization process • in the output summaries • query-biased techniques
Search Engines • Information retrieval (IR) • storage, retrieval and maintenance of information • differences on the Web • distributed architecture • the heterogeneity of the available information • its size and growth rate, etc. • Search engine • allows the user to enter search terms (queries) • run against a database • retrieves Web pages that match the search terms
Query Types • Boolean search • keywords separated by (implicit or explicit) Boolean operators • Phrase search • a set of contiguous words • Proximity search • Range searching • Field searching • Natural language search • Thesaurus search • Fuzzy search
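As a rough illustration of the first three query types, the sketch below matches a tokenized document against a Boolean AND query, a phrase query and a proximity query. The function names, the tokenizer and the sample sentence are assumptions made for this example; they are not taken from any particular search engine.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def boolean_and(tokens, terms):
    """Boolean AND: every query term occurs somewhere in the document."""
    present = set(tokens)
    return all(t.lower() in present for t in terms)

def phrase_match(tokens, phrase):
    """Phrase search: the query words occur contiguously and in order."""
    words = tokenize(phrase)
    n = len(words)
    return n > 0 and any(tokens[i:i + n] == words
                         for i in range(len(tokens) - n + 1))

def proximity_match(tokens, a, b, window):
    """Proximity search: a and b occur at most `window` tokens apart."""
    pos_a = [i for i, t in enumerate(tokens) if t == a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == b.lower()]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

# Hypothetical document text used only for the demonstration.
doc = tokenize("Air pollution in Mexico City: a literature survey on air quality.")
```

With this sample, `phrase_match(doc, "air pollution")` holds while `phrase_match(doc, "pollution air")` does not, which is the difference between phrase search and a plain Boolean AND over the same words.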
Information Needs of Users • Categorization (Ingwersen & Järvelin, 2005) • intentionality or goal of the searcher • the kind of knowledge currently known by the searcher • the quality of what is known • well-defined knowledge of the user • specific information sources are searched • in ill-defined (muddled) cases • the search process is exploratory • Types of information need in Web search (White et al., 2003) • search for a fact • search for a number of items • decision search • background search
General Document Analysis • physical components • paragraphs, words, figures, etc. • logical components • titles, authors, sections, etc. • as a syntactic analysis problem • physical and logical components of a document • ordered tree • transformation-based learning • generalized n-gram model • probabilistic grammars • incremental parsing • syntactic parsing (Collins and Roark, 2004) • generating table-of-contents for a long document (Branavan et al., 2007)
Web Document Analysis • Web documents • HTML (Hypertext Markup Language) • presentation of content • semi-structured documents • Motivations • to filter important content • to convert HTML documents into semantically-rich XML documents • obtaining a hierarchical structure for the documents • display content on small-screen devices such as PDAs • more intelligent retrieval of information, summarization, etc. • Approaches • HTML tags and DOM tree • rule-based or machine learning-based • certain domain or domain-independent
Web Document Analysis (cont.) • Different from most previous work • section and subsection headings • HTML • Markup tags, attributes and attribute values • e.g. <font size = 3> • Two types of HTML tags • container tags (e.g. <table>, <td>, <tr>, etc.) • contain other HTML tags or text • format tags (e.g. <b>, <font>, <h1>, <h2>, etc.) • usually concerned with the formatting of text • DOM (Document Object Model) • provides an interface as a tree
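To make the container/format distinction and the DOM-as-tree view concrete, here is a minimal sketch built on Python's standard `html.parser`. It is not the Cobra parser used in the project; the `Node` and `DomBuilder` classes, and the three tag sets, are assumptions introduced for illustration and are far from exhaustive.

```python
from html.parser import HTMLParser

# Assumed, non-exhaustive tag sets based on the examples in the slide.
CONTAINER_TAGS = {"html", "body", "div", "table", "tr", "td", "ul", "li", "p"}
FORMAT_TAGS = {"b", "i", "u", "font", "h1", "h2", "h3", "h4", "h5", "h6"}
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # never have children

class Node:
    def __init__(self, tag, attrs=None, text=None, parent=None):
        self.tag, self.attrs, self.text, self.parent = tag, attrs or {}, text, parent
        self.children = []

class DomBuilder(HTMLParser):
    """Builds a simple tree of Node objects while the HTML is parsed."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs), parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(
                Node("#text", text=data.strip(), parent=self.current))

def kind(tag):
    if tag == "#text":
        return "text"
    if tag in CONTAINER_TAGS:
        return "container"
    if tag in FORMAT_TAGS:
        return "format"
    return "other"

def outline(node, depth=0):
    """List (depth, tag, kind) for every node below `node`, in document order."""
    rows = []
    for child in node.children:
        rows.append((depth, child.tag, kind(child.tag)))
        rows.extend(outline(child, depth + 1))
    return rows

builder = DomBuilder()
builder.feed('<table><tr><td><font size="3"><b>Introduction</b></font></td></tr></table>')
builder.close()
rows = outline(builder.root)
```

In this sample the three outer elements are container tags, while `font` and `b` only describe how the text "Introduction" is formatted.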
Automatic Summarization • Process of distilling the most important information • from a source (or sources) to produce a shortened version • for particular users and tasks • Uses • as an aid for browsing • single large documents or sets of documents • in sifting process • to locate useful documents in a large collection • as an aid for report writers • by providing abstracts • related to and influenced by • information retrieval • information extraction • text mining
Automatic Summarization (cont.) • Types of summaries • “Extract” vs “abstract” • “Generic” vs “query-relevant” • “Single-document” vs “multi-document” • “Indicative” vs “informative” • Phases of summarization • Analysis of input text • Transformation into a summary representation • Synthesis of output summary
Automatic Summarization (cont.) • Approaches • Surface-level approaches • use shallow features to identify important information in the text • thematic features, location, background, cue words and phrases, etc. • Entity-level approaches • build an internal representation of the text • by modeling text entities and their relationships • e.g. using graph topology • Discourse-level approaches • global structure of the text and its relation to communicative goals • Hybrid approaches • Evaluation • intrinsic • the summary itself is evaluated • extrinsic • i.e. task-based evaluation
Recent Work on Summarization • Mostly generic summaries • based on sentence weighting • Tombros & Sanderson, 1998 • query-biased summaries in information retrieval • Google, AltaVista • White et al., 2003 • longer query-biased summaries • summary window • Alam et al., 2003 • structured and generic summaries • “table of contents”-like hierarchy of sections and subsections
Recent Work on Summarization (cont.) • Yang & Wang, 2008 • fractal summarization • hierarchical structure of document • levels, chapters, sections, subsections, paragraphs, sentences and terms • generic summaries • Varadarajan & Hristidis, 2005 • adding structure • document is divided into fragments (paragraphs) • connecting related fragments as a graph (implicit structure) • query-biased • In this research, combining • explicit document structure and query-biased techniques
Structural Processing • Rule-based and machine learning-based approaches • Input • a Web document in HTML format • Output • a tree representing the sectional hierarchy of the document • intermediate nodes: headings and subheadings, • leaves: other text units
Summarization • Using the output of structural processing • document tree • indicative summaries • extractive approach • longer summaries • in a separate frame
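The idea of a query-biased, structure-preserving extract can be sketched as follows. The `Section` class, the term-overlap score and the sample document are assumptions made for this illustration; the slides do not specify how sentences are actually scored or how the summary is laid out.

```python
from dataclasses import dataclass, field

@dataclass
class Section:
    heading: str
    sentences: list = field(default_factory=list)
    children: list = field(default_factory=list)

def score(sentence, query_terms):
    """Assumed scoring: number of distinct query terms found in the sentence."""
    words = {w.strip(".,;:!?").lower() for w in sentence.split()}
    return len(words & query_terms)

def summarize(section, query_terms, depth=0):
    """Return indented summary lines, or [] if nothing in the section matches."""
    hits = [s for s in section.sentences if score(s, query_terms) > 0]
    child_lines = []
    for child in section.children:
        child_lines.extend(summarize(child, query_terms, depth + 1))
    if not hits and not child_lines:
        return []  # drop sections with no query-relevant content
    indent = "  " * depth
    lines = [indent + section.heading]          # keep the heading as context
    lines += [indent + "  - " + s for s in hits]
    return lines + child_lines

# Hypothetical document tree, as structural processing might produce it.
doc = Section("Retirement Guide", ["This guide compares countries."], [
    Section("Costs", ["Portugal has low living costs.", "Housing varies."]),
    Section("Weather", ["Winters are mild."]),
])
summary = summarize(doc, {"portugal", "costs"})
```

The matching sentence is shown under its heading path ("Retirement Guide", then "Costs"), and the "Weather" section is dropped because nothing in it matches the query.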
Implementation • GATE (A General Architecture for Text Engineering) • open source project using component-based technology in Java • commonly used natural language functionalities • Tokeniser, Sentence Splitter, Stemmer, etc. • Cobra Java HTML Renderer and Parser • open source project • supports HTML 4, JavaScript and Cascading Style Sheets (CSS) • Implemented modules • Structural analysis of HTML documents • Summarization engine
Data Collection • Users • mostly Boolean queries with 2-3 words • Current search interests • various domains • English Collection • Turkish Collection • Extended English Collection • English queries • Turkish queries
The Method • A heuristic approach based on DOM processing • Heading-based sectional hierarchy identification • nontrivial task • heterogeneity of Web documents • the underlying HTML format • Three steps • DOM tree processing • Heading identification • Hierarchy restructuring
Step 1: DOM Tree Processing • Semantically related parts • same or neighboring container tags • Traverse DOM tree in a breadth-first way • Sentence boundaries • Format tags such as <font> are passed as features • Output: a simplified version of the original tree
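A minimal sketch of this step, under several assumptions: the DOM is given as nested dictionaries, format tags are dissolved into features attached to the text they enclose, and the output is a flat list of text units rather than the simplified tree the slide describes. Sentence splitting is left out. Because a breadth-first visit does not return text in reading order, each unit records its DOM address (the child indices on the path from the root), and sorting by address restores document order.

```python
from collections import deque

FORMAT_TAGS = {"b", "i", "u", "em", "strong", "font"}  # assumed subset

def text_units(dom):
    """Breadth-first walk that turns format tags into features of text units."""
    units = []
    queue = deque([(dom, (), {})])  # (node, DOM address, inherited features)
    while queue:
        node, address, feats = queue.popleft()
        if node["tag"] == "#text":
            units.append({"address": address, "text": node["text"],
                          "features": feats})
            continue
        if node["tag"] in FORMAT_TAGS:
            # Pass the format tag down as a feature instead of keeping the node.
            feats = {**feats, node["tag"]: node.get("attrs", {})}
        for i, child in enumerate(node.get("children", [])):
            queue.append((child, address + (i,), feats))
    # Addresses compare lexicographically, which matches document order.
    return sorted(units, key=lambda u: u["address"])

# Hypothetical input: <td><b>Overview</b>This page describes the project.</td>
dom = {"tag": "td", "children": [
    {"tag": "b", "children": [{"tag": "#text", "text": "Overview"}]},
    {"tag": "#text", "text": "This page describes the project."},
]}
units = text_units(dom)
```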
Step 2: Heading Identification • Heading tags in HTML • <h1> through <h6> • rarely used for this purpose • Headings • formed by formatting them differently from surrounding text • more emphasized than following content • Heuristics • if-then rules
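The slides do not list the actual rules, so the following is only a guess at what such if-then heuristics could look like. The emphasis weights, the word limit and the sample units are all assumptions; the one idea taken from the slide is that a heading is more emphasized than the content that follows it.

```python
def emphasis(unit):
    """Crude emphasis score: font size plus a bonus for bold (assumed weights)."""
    return unit.get("font_size", 3) + (1 if unit.get("bold") else 0)

def is_heading(unit, next_unit, max_words=10):
    text = unit["text"].strip()
    if not text or next_unit is None:
        return False                      # a heading needs content after it
    if len(text.split()) > max_words:
        return False                      # headings are assumed to be short
    if text.endswith((".", "!", "?")):
        return False                      # full sentences are not headings
    return emphasis(unit) > emphasis(next_unit)

def find_headings(units):
    following = units[1:] + [None]
    return [u["text"] for u, nxt in zip(units, following) if is_heading(u, nxt)]

# Hypothetical text units with formatting features.
units = [
    {"text": "Admissions", "font_size": 5, "bold": True},
    {"text": "Applications open in May.", "font_size": 3},
    {"text": "Deadlines", "bold": True},
    {"text": "Forms are due by the end of June.", "font_size": 3},
    {"text": "Contact us", "font_size": 3},
]
headings = find_headings(units)
```

Neither heading in this sample uses an `<h1>` to `<h6>` tag; both are found only because they are formatted more prominently than the text below them.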
Step 3: Hierarchy Restructuring • Headings + feature set • to differentiate different levels of heading • Restructure the document tree • bottom-up approach
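One way to turn headings with different emphasis levels into a tree is sketched below. Note that this is a left-to-right, stack-based pass chosen for brevity, not the bottom-up restructuring the slide refers to. It also assumes that every unit already carries a numeric `level` (0 for body text, larger values for stronger headings), which in the real system would have to be derived from the feature set.

```python
def build_hierarchy(units):
    """Attach each unit to the nearest preceding heading of a higher level."""
    root = {"text": "ROOT", "level": float("inf"), "children": []}
    stack = [root]  # currently open headings, strongest at the bottom
    for unit in units:
        node = {"text": unit["text"], "level": unit["level"], "children": []}
        # Close headings that are not stronger than the incoming unit.
        while stack[-1]["level"] <= node["level"]:
            stack.pop()
        stack[-1]["children"].append(node)
        if node["level"] > 0:
            stack.append(node)  # only headings can take children
    return root

def outline(node, depth=0):
    lines = []
    for child in node["children"]:
        lines.append("  " * depth + child["text"])
        lines.extend(outline(child, depth + 1))
    return lines

# Hypothetical units: level 2 = main heading, 1 = subheading, 0 = body text.
units = [
    {"text": "Admissions", "level": 2},
    {"text": "Intro text", "level": 0},
    {"text": "Deadlines", "level": 1},
    {"text": "Due in June", "level": 0},
    {"text": "Fees", "level": 1},
    {"text": "Fee table", "level": 0},
    {"text": "Contact", "level": 2},
    {"text": "Email us", "level": 0},
]
tree = build_hierarchy(units)
```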
Performance Measures • Heading Extraction • Hierarchy Extraction • Parent-child relationships in the document tree • Heading-subheading • Heading-underlying text
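The slides do not spell out the formulas, but the results that follow are reported as precision and recall for heading extraction and as accuracy for hierarchy extraction. Assuming the standard definitions, and assuming units are identified by integer ids, the measures could be computed as in this sketch.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted heading ids against gold heading ids."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

def hierarchy_accuracy(predicted_parent, gold_parent):
    """Share of units whose predicted parent equals the gold parent."""
    if not gold_parent:
        return 0.0
    correct = sum(1 for unit, parent in gold_parent.items()
                  if predicted_parent.get(unit) == parent)
    return correct / len(gold_parent)

# Hypothetical example: four predicted headings, three gold headings.
p, r = precision_recall({1, 4, 7, 9}, {1, 4, 8})
acc = hierarchy_accuracy({1: 0, 2: 1, 3: 0, 4: 0}, {1: 0, 2: 1, 3: 1, 4: 0})
```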
English Collection • Heading extraction • Baseline • using only heading tags <h1> through <h6> • High value for heading recall • Precision is lower • cluttered organization in Web documents
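The tag-only baseline is simple enough to sketch with Python's standard `html.parser`. The class name and the sample HTML are assumptions for this example; the original system was built in Java on the Cobra parser.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class HeadingTagBaseline(HTMLParser):
    """Collects (level, text) for every <h1>..<h6> element and nothing else."""
    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # level of the heading currently open, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            text = " ".join("".join(self._buffer).split())  # normalize spaces
            if text:
                self.headings.append((self._level, text))
            self._level = None

baseline = HeadingTagBaseline()
baseline.feed("<h1>Campus <b>Guide</b></h1><p><b>Library</b></p>"
              "<p>Open daily.</p><h2>Dining</h2>")
baseline.close()
```

In this sample, "Library" acts as a heading but is marked only with `<b>`, so a baseline that looks at heading tags alone cannot find it.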
English Collection (cont.) • Hierarchy extraction • a significant improvement in accuracy • compared to the baseline
Turkish Collection • Heading extraction • Hierarchy extraction • Baseline method failed • no <h> tags used • Additional analysis • 50 documents on boun.edu.tr domain • 71% accuracy
The Approach • Machine learning • can be more flexible • by combining several features using a training corpus • rather than predefined rules • Extraction of sectional hierarchy of a Web document • A tree-based learning approach needed • as in syntactic parsing • exponential search space • incremental algorithm • making a sequence of locally optimal choices • to approximate a globally optimal solution • Document • as a sequence of text units
Heading Extraction Model • Binary classification • As a sequence of text units • Headings: positive examples • Non-headings: negative examples
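The slide frames heading extraction as binary classification over text units but does not name the learner. As one possible instance, the sketch below trains a plain perceptron on feature dictionaries; the choice of perceptron, the feature names and the four training examples are all assumptions made for illustration.

```python
def dot(weights, feats):
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def train_perceptron(examples, epochs=10):
    """examples: list of (feature dict, label), label +1 = heading, -1 = not."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            if label * dot(weights, feats) <= 0:      # misclassified: update
                for name, value in feats.items():
                    weights[name] = weights.get(name, 0.0) + label * value
    return weights

def predict(weights, feats):
    return 1 if dot(weights, feats) > 0 else -1

# Hypothetical training units described by hand-picked binary features.
train = [
    ({"bias": 1, "bold": 1, "short": 1}, 1),
    ({"bias": 1, "ends_period": 1}, -1),
    ({"bias": 1, "short": 1, "ends_period": 1}, -1),
    ({"bias": 1, "bold": 1}, 1),
]
weights = train_perceptron(train)
```

On this toy data the weights settle after two passes, with a positive weight on `bold` and a negative weight on `ends_period`.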
Hierarchy Extraction Model • Learn a mapping from X (a set of documents) to Y (a set of possible sectional hierarchies of documents) • Training examples (x_i, y_i) for i = 1…n • A function GEN(x) enumerating a set of possible outputs for an input x • A representation Φ mapping each (x_i, y_i) to a feature vector Φ(x_i, y_i) • A parameter vector α • Estimate α such that it will give highest scores to correct outputs:
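The formula itself did not survive in this transcript. The components listed (GEN, Φ, α) match the linear-model framework of Collins and Roark (2004), cited earlier for syntactic parsing. Assuming that framework is the one intended, the decoding rule and the condition that training aims for would read as follows; this is a reconstruction, not a copy of the slide.

```latex
% Decoding: choose the highest-scoring hierarchy among the candidates for x.
F(x) = \operatorname*{arg\,max}_{y \in \mathrm{GEN}(x)} \; \Phi(x, y) \cdot \alpha

% Training goal: the correct hierarchy y_i scores at least as high as any
% other candidate for the same document x_i.
\Phi(x_i, y_i) \cdot \alpha \;\ge\; \Phi(x_i, y) \cdot \alpha
\qquad \text{for all } y \in \mathrm{GEN}(x_i),\; i = 1, \dots, n
```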
Features • Unit features • Formatting features • e.g. font size, boldness, color, etc. • DOM tree features • e.g. DOM address, DOM path, etc. • Content features • e.g. cue words / phrases, number of characters, punctuation mark, etc. • Other features • Visual position in the rendered Web document • Contextual features • composite features of two units in context • distance and difference between features • u_ij: the unit i levels above a unit u and j units to its left • Global features • e.g. the depth of sectional hierarchy
Incremental Learning Approach • Document graph • left to right based on the order of appearance • Positive and negative examples • Parent-child relationships (based on gold standard hierarchy) • Two constraints • Document order • Projectivity rule • “When searching for the parent of a unit u_j, consider only the previous unit (u_(j-1)), the parent of u_(j-1), that unit’s parent, and so on to the root of the tree.”
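The projectivity rule restricts the candidate parents of a unit to the chain that runs from the previous unit up to the root. A minimal sketch, assuming units are numbered 1..n, the root is 0, and `parent` maps each already attached unit to its parent:

```python
ROOT = 0

def candidate_parents(parent, j):
    """Allowed parents for unit j: unit j-1, its parent, and so on up to ROOT."""
    node = j - 1
    chain = [node]
    while node != ROOT:
        node = parent[node]
        chain.append(node)
    return chain

# Hypothetical partial tree: 1 under ROOT, 2 under 1, 3 under 2, 4 under 1.
parent = {1: 0, 2: 1, 3: 2, 4: 1}
allowed_for_5 = candidate_parents(parent, 5)
```

For unit 5 the allowed parents are 4, 1 and the root. Units 2 and 3 are excluded because they are no longer on the path from the previous unit to the root, which keeps the number of attachment decisions per unit small.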
Incremental Learning Approach (cont.) • Training set • Web documents and corresponding gold standard hierarchies • Algorithm • works on units sequentially
Testing Approach • Beam search • Set of partial trees • Beam width • Two operations • ADV (i.e. Advance) • potential attachments of current unit to partial trees • FILTER • to prevent exponential growth of the set
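The two operations can be sketched as follows. A partial tree is represented as a tuple in which position u-1 holds the parent of unit u (root = 0), and candidate attachments follow the projectivity rule. The scoring function here is a toy stand-in based on made-up heading levels; in the model described above, the score would presumably come from the feature vector Φ and the parameters α.

```python
ROOT = 0

def candidate_parents(parents, j):
    """Projectivity rule: unit j-1, its parent, and so on up to ROOT."""
    node = j - 1
    chain = [node]
    while node != ROOT:
        node = parents[node - 1]
        chain.append(node)
    return chain

def advance(hypotheses, j, score_attachment):
    """ADV: extend every partial tree with each allowed attachment of unit j."""
    extended = []
    for score, parents in hypotheses:
        for p in candidate_parents(parents, j):
            extended.append((score + score_attachment(j, p, parents),
                             parents + (p,)))
    return extended

def filter_beam(hypotheses, beam_width):
    """FILTER: keep only the beam_width highest-scoring partial trees."""
    return sorted(hypotheses, key=lambda h: h[0], reverse=True)[:beam_width]

def beam_search(n_units, score_attachment, beam_width=3):
    hypotheses = [(0.0, ())]  # start with one empty partial tree
    for j in range(1, n_units + 1):
        hypotheses = filter_beam(advance(hypotheses, j, score_attachment),
                                 beam_width)
    return hypotheses[0]

# Toy scorer (an assumption): prefer a parent exactly one level above the unit.
# Index 0 is the root; levels of units 1..5 follow.
levels = [3, 2, 1, 0, 1, 0]

def toy_score(j, p, parents):
    return -abs(levels[p] - levels[j] - 1)

best_score, best_parents = beam_search(5, toy_score, beam_width=3)
```

Without FILTER the set of partial trees can grow exponentially with the number of units; with it, each step scores at most `beam_width` times the chain length many attachments.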