DOCUMENT SUMMARIZATION. SHINTA P.
Introduction • What is the first thing you read in a novel? A summary of the retrieved web pages related to a user's query is equally useful. • Automatic summarization engines are needed: human-generated summaries are expensive.
Information: Headline news. SIGIR'99 Tutorial: Automated Text Summarization, August 15, 1999, Berkeley, CA
TV guides: decision making
Abstracts of papers: saving time
Graphical maps: orientation
Example: create a summary of the article below for each of the following readers: 1. An event organizer 2. Teenage fans
TRIBUNNEWS.COM, BANDA ACEH - The nine members of the girl band Cherrybelle performed at the Hotel Hermes Palace, Banda Aceh, on Tuesday night (30/4/2013). In the concert they sang songs from their first album as well as their latest album, Diam-diam Suka. When they opened with the song Best Friend Forever, Cherrybelle's fans greeted them hysterically. On stage, the nine performers wore outfits different from their usual ones. Unlike their appearances in other regions, they set aside their skimpy costumes; instead, long-sleeved blouses in calm colors covered their bodies. And their hairstyles? Cherrybelle appeared plain, without any accessories in their hair: just ponytails and a few hairpins. Nor did they wear headscarves or hijabs, as is customary for Acehnese women. It was only during the preparations before last night's show that the Cherrybelle members set aside time to answer questions from Serambinews.com (Tribunnews.com Network). Only during the interview and photo session did they put on white headscarves. In the brief interview they said that Banda Aceh was the first city they visited on this roadshow. They were taken with Banda Aceh from the moment they arrived at Sultan Iskandar Muda Airport in Blangbintang, Aceh Besar. The Aceh concert is part of the Cherrybelle Beat Indonesia roadshow covering 33 provinces in 31 days. "We have been taken with Banda Aceh since we set foot in the airport. The airport is really beautiful, quite different from airports in other cities. Here the roof is shaped like a mosque dome; it's gorgeous," they said in unison in the exclusive interview with Serambinews.com.
Purpose • Indicative vs. Informative • Indicative: indicates types of information ("alerts") • "The work of Consumer Advice Centres is examined…" • Informative • "The work of Consumer Advice Centres was found to be a waste of resources due to low availability…" • Critical / Evaluative • Evaluates the content of the document
Form • Abstract (IE) • Extract (IR)
Dimension • Single Doc • Multi Doc
Context • Query-independent • Query-specific
Approach • Shallow approach • Works only at the surface of the document • Output is a sentence extraction • Can be out of context • Deep method • Output is an abstract
Computational Approach: Basics • Bottom-Up: • I'm dead curious: what's in the text? • The user wants all the important information. • The system needs data on which items are important, to guide the search. • Top-Down: • I know what I want! Don't confuse me with drivel! • The user wants only certain types of information. • The system needs particular criteria of interest, used to focus its search.
Top-Down: Info. Extraction (IE) • IE task: Given a form and a text, find all the information relevant to each slot of the form and fill it in. • Summ-IE task: Given a query, select the best form, fill it in, and generate the contents. • Questions: • 1. IE works only for very particular forms; can it scale up? • 2. What about info that doesn't fit into any form? Is this a generic limitation of IE?
Bottom-Up: Info. Retrieval (IR) • IR task: Given a query, find the relevant document(s) from a large set of documents. • Summ-IR task: Given a query, find the relevant passage(s) from a set of passages (i.e., from one or more documents). • Questions: • 1. IR techniques work on large volumes of data; can they scale down accurately enough? • 2. IR works on words; do abstracts require abstract representations?
Paradigms: IE vs. IR • IR: • Approach: operate at the word level, using word frequency, collocation counts, etc. • Need: large amounts of text. • Strengths: robust; good for query-oriented summaries. • Weaknesses: lower quality; inability to manipulate information at abstract levels. • IE: • Approach: try to 'understand' the text, transforming content into a 'deeper' notation and then manipulating that. • Need: rules for text analysis and manipulation, at all levels. • Strengths: higher quality; supports abstracting. • Weaknesses: speed; still needs to scale up to robust open-domain summarization.
The Optimal Solution... Combine strengths of both paradigms… ...use IE/NLP when you have suitable form(s), ...use IR when you don't… …but how exactly to do it?
A Summarization Machine [Figure: a summarization machine takes a DOC or MULTIDOCS plus a QUERY and produces EXTRACTS and ABSTRACTS at varying lengths (Headline 10%, Very Brief, Brief 50%, Long 100%) and of various kinds: Extract vs. Abstract, Indicative vs. Informative, Generic vs. Query-oriented, "Just the news" vs. Background. Intermediate representations include index terms, clause fragments, case frames, templates, core concepts, core events, and relationships.]
The Modules of the Summarization Machine [Figure: the machine's modules form a pipeline: EXTRACTION produces extracts from the input DOC, with FILTERING for multi-document input; INTERPRETATION maps extracts to intermediate representations (index terms, clause fragments, case frames, templates, core concepts, core events, relationships); GENERATION produces the final abstracts.]
Characteristics of a Summary • 1. Measurement • Compression rate = summary length / original document length • 2. Informativeness • Trust in the source: biased or not, especially for evaluative summaries • 3. Well-formedness • Repair dangling references, disconnected sentences, and anaphora (unclear references)
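The compression rate above is a one-line computation; a minimal sketch, assuming word-based lengths (character counts would work equally well):

```python
def compression_rate(summary: str, document: str) -> float:
    # Ratio of summary length to original document length, counted in words.
    return len(summary.split()) / len(document.split())

# A 4-word summary of a 16-word document gives a rate of 0.25.
```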
Steps: • From the document collection, select sentences: Zipf, TF*IDF, etc. • Extract: order the chosen sentences by their location in the original, apply smoothing, and rewrite them as well-formed sentences • Result: a coherent summary that can be read and understood
Typical 3 Stages of Summarization 1. Topic Identification: find/extract the most important material 2. Topic Interpretation: compress it 3. Summary Generation: say it in your own words …as easy as that!
Some Definitions • Language: • Syntax = grammar, sentence structure • "sleep colorless furiously ideas green": no syntax • Semantics = meaning • "colorless green ideas sleep furiously": no semantics • Evaluation: • Recall = of the things you should have found/done, how many did you actually find/do? • Precision = of those you actually found/did, how many were correct?
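The recall and precision definitions can be made concrete; a minimal sketch over sets of item IDs (the sample sets below are made up for illustration):

```python
def precision_recall(found, relevant):
    """Precision and recall for a set of retrieved items.

    found: items the system returned; relevant: items it should have returned.
    """
    found, relevant = set(found), set(relevant)
    correct = found & relevant
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall

# System extracts sentences {1, 2, 5}; the gold summary is {1, 2, 3, 4}:
# precision = 2/3 (two of the three extracted sentences are correct),
# recall = 2/4 (two of the four gold sentences were found).
p, r = precision_recall({1, 2, 5}, {1, 2, 3, 4})
```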
Evaluation Methods • Intrinsic • Test the summary itself against certain criteria: • Coherence: easy to read and understand • Informativeness: conveys information about the original document • Extrinsic • Test the system in the context of another task, by asking other people to evaluate it.
SUMMARIZER 1 • The main steps of SUMMARIZER 1 are: • For each sentence Si in S, compute the relevance measure between Si and D: inner product, cosine similarity, or Jaccard coefficient. • Select the sentence Sk with the highest relevance score and add it to the summary. • Delete Sk from S, and eliminate all the terms contained in Sk from the document vector and the sentence vectors. Re-compute the weighted term-frequency vectors (D and all Si). • If the number of sentences in the summary reaches the predefined value, terminate; otherwise go to step 1. A. Bellaachia
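A rough sketch of SUMMARIZER 1's loop in Python, using cosine similarity and plain term counts (the whitespace tokenization and the exact term-elimination bookkeeping are assumptions; the slide leaves them open):

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def summarize1(sentences, k):
    """Greedy selection: repeatedly pick the sentence most similar to the
    document vector D, then delete its terms from D and from the remaining
    sentence vectors before re-scoring."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    doc = sum(vectors, Counter())            # the document vector D
    remaining = list(range(len(sentences)))
    summary = []
    while remaining and len(summary) < k:
        best = max(remaining, key=lambda i: cosine(vectors[i], doc))
        summary.append(best)
        remaining.remove(best)
        # Eliminate the chosen sentence's terms, so the next pick is
        # scored against content not yet covered by the summary.
        for term in list(vectors[best]):
            del doc[term]
            for i in remaining:
                vectors[i].pop(term, None)
    return [sentences[i] for i in sorted(summary)]
```

Deleting the selected sentence's terms is what makes step 3 matter: it pushes later picks toward material the summary does not yet cover.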
SUMMARIZER 2 • This summarizer is the simplest of the proposed techniques. • It uses the TF*IDF weighting scheme to select sentences. • It works as follows: • Create the weighted term-frequency vector Si for each sentence i in S using TF*IDF (term frequency * inverse document frequency). • Sum up the TF*IDF scores for each sentence and rank the sentences. • Select the predefined number of sentences for the summary from S.
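A minimal sketch of SUMMARIZER 2; computing IDF over the sentences themselves is an assumption, since the slide does not say which collection defines the document frequency:

```python
import math
from collections import Counter

def summarize2(sentences, k):
    """Rank sentences by the sum of the TF*IDF weights of their terms
    and keep the top k, in original document order."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(sentences)
    # Document frequency: in how many sentences does each term occur?
    df = Counter(t for toks in tokenized for t in set(toks))

    def score(toks):
        tf = Counter(toks)
        return sum(c * math.log(n / df[t]) for t, c in tf.items())

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```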
SUMMARIZER 3 • This summarizer uses the popular k-means clustering algorithm, where k is the size of the summary. • K-means: • Start with random positions for the K centroids. • Iterate until the centroids are stable: • Assign points to the nearest centroid • Move each centroid to the center of its assigned points (figure: iteration = 0)
(The same slide repeats, stepping the k-means figure through iterations 1, 2, and 3.)
SUMMARIZER 3 (Cont'd) • This summarizer works as follows: • Create the weighted term-frequency vector Ai for each sentence Si using TF*IDF. • Form a sentences-by-terms matrix and feed it to the k-means clustering algorithm to generate k clusters. • Sum up the TF*IDF scores for each sentence in each cluster. • Pick the sentence with the highest TF*IDF score from within each cluster and add it to the summary.
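Putting the steps together; a self-contained sketch with a plain k-means, where the random seeding, Euclidean distance, and sentence-level IDF are all assumptions not fixed by the slides:

```python
import math
import random
from collections import Counter

def tfidf_matrix(sentences):
    """Sentences-by-terms TF*IDF matrix (IDF computed over the sentences)."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(sentences)
    df = Counter(t for toks in tokenized for t in set(toks))
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        rows.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return rows

def kmeans(rows, k, iters=20, seed=0):
    """Plain k-means: random initial centroids, then alternate assignment
    and centroid updates until the assignment is stable."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    assign = None
    for _ in range(iters):
        new = [min(range(k),
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(r, centroids[c])))
               for r in rows]
        if new == assign:
            break
        assign = new
        for c in range(k):
            members = [rows[i] for i, a in enumerate(assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assign

def summarize3(sentences, k):
    """Cluster sentences into k groups; from each cluster keep the sentence
    with the highest total TF*IDF score."""
    rows = tfidf_matrix(sentences)
    assign = kmeans(rows, k)
    summary = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            summary.append(max(members, key=lambda i: sum(rows[i])))
    return [sentences[i] for i in sorted(summary)]
```

Clustering first and then picking one representative per cluster trades a little score for coverage: unlike SUMMARIZER 2, two near-duplicate high-scoring sentences cannot both enter the summary if they land in the same cluster.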