
Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems

Georg Rehm 1, Marina Santini 2, Alexander Mehler 3, Pavel Braslavski 4, Rüdiger Gleim 3, Andrea Stubbe 5, Svetlana Symonenko 6, Mirko Tavosanis 7, Vedrana Vidulin 8

Presentation Transcript


  1. Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems • Georg Rehm1, Marina Santini2, Alexander Mehler3, Pavel Braslavski4, Rüdiger Gleim3, Andrea Stubbe5, Svetlana Symonenko6, Mirko Tavosanis7, Vedrana Vidulin8 • Language Resources and Evaluation Conference – LREC 2008

  2. Introduction • Genres are specific types of text. • Genres have, roughly speaking, three characteristic properties: • Content: topic • Form: layout, design, text structure etc. • Function: communicative purpose etc. • Genres are socially specified sets of rules and conventions. • Genres are recognised by particular discourse communities. • Genres usually have established names.

  3. Examples of Traditional Genres • Guidebook • Cookbook • Almanac • Dictionary • Textbook • Novel

  4. Scope of this Talk • There are not only hundreds (Dimter, 1981), but thousands (Adamzik, 1995) of genres: • Shopping list • Love letter • Flyer • Weather forecast • CV • PhD thesis • … • This talk is not about traditional, paper-based genres. • This talk is about web genres.

  5. Web Genres • Studies have shown that genres also exist on the web, e.g.: • Personal homepage • FAQ • Blog • Search engine • Encyclopedia • Web shop • Web genres are more complex than traditional genres: • The web is a hypertext system • Interactive features • Multimedia

  6. Automatic Web Genre Identification • If we were able to identify web genres automatically, we could exploit this information in search engines. Find: • textbook web pages that contain “language resource” • PhD thesis web pages that contain “RCG parsing” • About 20 different approaches have been published in this area (incl. the identification of traditional genres). They mainly use • Machine learning methods • Hand-crafted genre detection rules
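The search scenario on this slide can be sketched in a few lines. This is a minimal illustration, not a system described in the talk: the `Page` record, the `search` function, and the sample pages are all hypothetical, and the genre labels are assumed to have been assigned beforehand by some genre identification step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str    # hypothetical example URL
    genre: str  # genre label, assumed to be assigned beforehand
    text: str


def search(pages, phrase, genre):
    """Return pages of the given genre whose text contains the phrase."""
    phrase = phrase.lower()
    return [p for p in pages
            if p.genre == genre and phrase in p.text.lower()]


PAGES = [
    Page("http://example.org/a", "Textbook",
         "An introduction to language resource creation."),
    Page("http://example.org/b", "Thesis",
         "This thesis studies RCG parsing."),
    Page("http://example.org/c", "Blog",
         "Yesterday I read about RCG parsing."),
]

hits = search(PAGES, "RCG parsing", genre="Thesis")
print([p.url for p in hits])
```

The blog post mentions the same phrase but is filtered out by its genre label, which is the benefit the slide describes.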

  7. Automatic Web Genre Identification • All approaches have some characteristics in common. • Nearly every group of researchers • have their own personal definition of “web genre”, • create their own document collection, • create their own set of web genre labels, • annotate their corpora with these web genre labels.

  8. Automatic Web Genre Identification It’s impossible to compare such isolated approaches.

  9. Towards a Reference Corpus of Web Genres • A reference corpus of web genres enables comparative evaluation.

  10. Towards a Reference Corpus of Web Genres • Shared genre category set or sets • Reference collection of web documents • Annotation tool

  11. Towards a Reference Corpus of Web Genres • Shared genre category set or sets • Reference collection of web documents • Annotation tool

  12. Assigning Genre Labels to Web Pages • The construction of a genre corpus involves the task of assigning genre labels to web documents by a group of annotators. • Previous studies have shown that this is a very hard task. • [Figure: a web page is tagged with a genre category taken from a set of genre categories]

  13. Preliminary Study • We conducted a survey amongst the group of authors: • Goal: to measure the agreement of genre labels assigned to a random sample of 50 web documents by persons who are engaged in genre-related research. • Seven of the nine authors participated. • Result: the categories assigned by the participants contain a very high number of disparate terms at various levels of abstraction. • Conclusion: the task of assigning genre labels to web documents is – even for linguists who work on genres – very hard.

  14. Assigning Genre Labels to Web Pages • Consistency: High • Participant 1: News article • Participant 2: Article/commentary • Participant 3: Article • Participant 4: Feature • Participant 5: A newsletter article • Participant 6: News article • Participant 7: Journalistic

  15. Assigning Genre Labels to Web Pages • Consistency: Low • P1: Entry page of the website of a research journal • P2: Table of contents with snippets • P3: Portal, link collection • P4: Bibliography/List of Articles • P5: A homepage of a subscription-based academic journal • P6: Homepage • P7: Index, Content Delivery
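One simple way to quantify the two examples above is the share of participant pairs that chose literally the same label. The slides do not say which agreement measure the survey used, so the exact-match measure below is an assumption chosen for illustration; the labels are copied from slides 14 and 15.

```python
from itertools import combinations


def pairwise_agreement(labels):
    """Share of annotator pairs whose labels match exactly (ignoring case)."""
    normalised = [label.strip().lower() for label in labels]
    pairs = list(combinations(normalised, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


# Labels from slide 14 (consistency rated "High")
HIGH = ["News article", "Article/commentary", "Article", "Feature",
        "A newsletter article", "News article", "Journalistic"]

# Labels from slide 15 (consistency rated "Low")
LOW = ["Entry page of the website of a research journal",
       "Table of contents with snippets",
       "Portal, link collection",
       "Bibliography/List of Articles",
       "A homepage of a subscription-based academic journal",
       "Homepage",
       "Index, Content Delivery"]

print(f"high: {pairwise_agreement(HIGH):.3f}")  # 1 of 21 pairs match
print(f"low:  {pairwise_agreement(LOW):.3f}")   # no pair matches
```

Even the page rated as highly consistent scores low on exact matching, because only participants 1 and 6 chose the same string. Free-text labels that mean roughly the same thing are hard to compare automatically, which is one argument for a shared category set.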

  16. Genre Category Sets in Previous Approaches • Almost all category sets used in previous approaches are • limited in size and scope and • contain categories that cannot be considered genres:

  17. Shared Genre Category Sets • A set of genre categories is needed so that we can assign web genre labels to web documents. • Requirements for this shared category set: • It should be precise, scalable, as unambiguous as possible, and reflect the genre reality as it presents itself on the web. • The majority of researchers in this field should agree upon the category set or sets. • We used a wiki to come up with an initial proposal of 78 web genre categories.

  18. Our Proposal for a Shared Genre Category Set 1. About Page 2. Abstract 3. Agenda (Schedule, Calendar) 4. Announcement 5. Application 6. Bibliography 7. Biography 8. Chronicle 9. Code Listings 10. Column / Editorial / Lead Article 11. Comic 12. Contact Form 13. Contract / Disclaimer / Terms and Conditions 14. Corporate Blog 15. Curriculum Vitae / CV / Resume 16. Data / Statistics / Data Sheet 17. Diary, Blog 18. Dictionary 19. Directory of Persons or Organisations 20. Discussion Group / Newsgroup 21. Download 22. Drama / Play 23. Encyclopedia 24. Errata 25. Error Message / Empty Page / Under Construction Page 26. Essay 27. Exercises (Problems) 28. FAQ 29. Feature Story / News Reportage 30. Game (Quiz, Puzzle) 31. Glossary 32. Guestbook 33. Homepage / Front Page / Entry Page 34. Horoscope 35. Index 36. Instruction 37. Interview 38. Invitation 39. Job Listing 40. Joke 41. Law / Regulation / Rule / Proclamation 42. Letter / Mail / E-Mail 43. Letter to the Editor 44. Linkfarm 45. Link Collection / Hotlist 46. List of Products 47. List of Projects 48. Login Page 49. Media (Images, videos, music, sound) 50. Meeting minutes 51. News Article 52. News Collection / Newsletter / Digest 53. Obituary 54. Official Report 55. Ordering Form / Booking Form 56. Pamphlet 57. Petition 58. Promotional / Advertisement 59. Poem / Poetry / Lyrics 60. Pornographic 61. Prose Fiction 62. Quotation 63. Reportage 64. Research Report 65. Review (Testimonial) 66. Script (Manuscript) 67. Search Form 68. Sermon 69. Shop 70. Specification 71. Speech 72. Splash Page / Gateway / Welcome Page 73. Strategic Plans 74. Survey 75. Table of contents / Sitemap / Navigation 76. Thesis 77. Travel Guide 78. Tutorial

  19. Tagging HTML Documents with Genre Categories • 1) Tag HTML documents (the most common approach) • 2) Tag websites • 3) Tag page segments

  20. Towards a Reference Corpus of Web Genres • Shared genre category set or sets • Reference collection of web documents • Annotation tool

  21. Reference Collection of Web Documents • We plan to build the reference corpus in two stages: • First, we will apply our shared set of genre categories to existing collections as a proof of concept. This is an initial step towards an objective evaluation and integrative compatibility of individual approaches. • Second, we will use a crawler to gather more recent as well as more diverse sets of documents.

  22. Reference Collection of Web Genres (Selection) • Web Corpus for English (Santini, 2007): editorial, biography, do-it-yourself guide, feature article (20 web pages each). • German corpus (Mehler et al., 2007, 2008): conference website (50 sites), personal academic homepage (68 sites), project website (52 sites), city website (180 sites). • Hierarchical Web Genre Collection (Stubbe and Ringlstetter, 2007): 32 genre classes, 40 HTML files per class, English. • Corpus of 400 blog posts, Italian (Tavosanis, 2007). • English (65,177 pages) and Russian (29,650 pages) corpora (Sharoff, 2007).
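Applying the shared category set to an existing collection amounts to mapping that collection's labels onto the shared ones. The sketch below does this for the four labels of the English web corpus listed above. The mapping itself is an illustrative guess, not one proposed in the talk, and the `relabel` helper is hypothetical. Labels without an obvious counterpart are deliberately left unmapped so that they can be reviewed by hand.

```python
# Hypothetical mapping from one collection's labels to the shared category set.
# The right-hand names are taken from the proposed list of 78 categories;
# which source label maps to which category is an assumption.
SHARED_CATEGORY = {
    "editorial": "Column / Editorial / Lead Article",
    "biography": "Biography",
    "feature article": "Feature Story / News Reportage",
}


def relabel(source_labels, mapping):
    """Split labels into (mapped, unmapped) under the given mapping."""
    mapped = {}
    unmapped = []
    for label in source_labels:
        key = label.strip().lower()
        if key in mapping:
            mapped[label] = mapping[key]
        else:
            unmapped.append(label)
    return mapped, unmapped


labels = ["editorial", "biography", "do-it-yourself guide", "feature article"]
mapped, unmapped = relabel(labels, SHARED_CATEGORY)
print(mapped)
print(unmapped)  # labels that still need a manual decision
```

Here "do-it-yourself guide" stays unmapped because it could plausibly go to either "Instruction" or "Tutorial"; a decision like that would belong in the annotation guidelines.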

  23. Towards a Reference Corpus of Web Genres • Shared genre category set or sets • Reference collection of web documents • Annotation tool

  24. Corpus Management and Annotation Tools • Construction of the reference corpus requires tools that support • compiling a document collection and • annotating HTML documents. • We use the HyGraph toolbox: • Supports researchers in the process of corpus compilation, annotation and analysis • Annotate at various levels • Assign confidence values • Support for multiple tag sets and category systems • Uses stand-off annotation
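Stand-off annotation keeps the annotations in a separate structure that points at the document instead of modifying the HTML. The following sketch illustrates the idea with the features listed above (annotation levels, confidence values, multiple tag sets). It is not HyGraph's actual data model: every class, field, identifier, and value in it is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Annotation:
    target: str        # ID of the annotated document or site (hypothetical)
    level: str         # "website", "page" or "segment"
    tag_set: str       # name of the category system in use
    category: str      # genre label from that tag set
    confidence: float  # annotator's confidence, assumed to lie in [0, 1]
    annotator: str
    span: Optional[Tuple[int, int]] = None  # character offsets, segments only

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if (self.level == "segment") != (self.span is not None):
            raise ValueError("a span is required for segments, and only for them")


# The source document is stored once and never modified.
html = "<html><body><h1>FAQ</h1><p>Q: What is a genre?</p></body></html>"

# Annotations live next to the document and refer to it by ID and offsets.
annotations = [
    Annotation("doc-1", "page", "shared-78", "FAQ", 0.9, "P1"),
    Annotation("doc-1", "page", "shared-78", "Instruction", 0.4, "P2"),
    Annotation("doc-1", "segment", "shared-78", "FAQ", 0.8, "P1", span=(12, 24)),
]

confident = [a for a in annotations
             if a.level == "page" and a.confidence >= 0.5]
print([a.category for a in confident])

start, end = annotations[2].span
print(html[start:end])  # the annotated segment, recovered via offsets
```

Because the annotations are separate records, several annotators and several tag sets can describe the same page without conflicting edits to the HTML.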

  25. Towards a Reference Corpus of Web Genres • Shared genre category set or sets • Reference collection of web documents • Annotation tool • Result: Reference Corpus of Web Genres

  26. Summary and Future Work • We construct a reference corpus of web genres. • Provide a shared resource for researchers who work on web genre identification and the evaluation of these systems. • Future work includes the further realisation of this resource: • Apply a set of genre categories to existing corpora. • Collect a large set of new documents that will be categorised based on annotation guidelines using HyGraph. • Assign genre labels to single web documents first and to page segments as well as complete websites later.

  27. Q/A Thanks for your attention! Please get in touch if you (plan to) work in the field of automatic web genre identification or a related area: georg.rehm@uni-tuebingen.de http://129.70.40.20/WebGenreWiki/ A mailing list will be available soon.
