
Corpus Statistics


emmy


Presentation Transcript


  1. Corpus Statistics
     • ACE2005/ACE2007 English EDR
     • Chars: 1.5M; Words: 257K
     • Entities: 18K (PER 9.7K, ORG 3K, GPE 3K, FAC 1K, LOC 897, WEA 579, VEH 571)
     • Mentions: 55K (PRO 20K, NAM 18K, NOM 17K)
     • CDC Entities (PER, ORG, LOC, GPE)
       • IDC Entities: 7,129 (entities with at least one name)
       • CDC Entities: 3,660 (after manual linking)
       • 2,390 singleton entities
     • CDC Annotation Effort
       • Approximately 2 staff weeks
       • Annotated after automatic pre-linking of entities that shared at least one identical (case-sensitive) name string
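
The automatic pre-linking step on slide 1 (grouping entities that share at least one identical, case-sensitive name string) could be sketched roughly as below. This is only an illustration, not the tool's actual code: the function name `prelink`, the entity ids, and the input layout are hypothetical. The sketch also assumes that links are transitive and that entity types are ignored, neither of which the slides state.

```python
from collections import defaultdict


def prelink(entities):
    """Group entities that share at least one identical name string.

    `entities` maps a (hypothetical) entity id to the set of name strings
    used for that entity. Matching is exact and case-sensitive. Links are
    treated as transitive here, which is an assumption of this sketch.
    """
    parent = {eid: eid for eid in entities}

    def find(x):
        # Follow parent pointers to the cluster root, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first entity seen with each name; later entities that
    # carry the same name are merged into that entity's cluster.
    first_seen = {}
    for eid, names in entities.items():
        for name in names:
            if name in first_seen:
                union(first_seen[name], eid)
            else:
                first_seen[name] = eid

    clusters = defaultdict(set)
    for eid in entities:
        clusters[find(eid)].add(eid)
    # Sort clusters by their members so the result order is deterministic.
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Made-up example entities, for illustration only.
entities = {
    "doc1-E1": {"George W. Bush", "Bush"},
    "doc2-E4": {"Bush"},
    "doc3-E2": {"bush"},   # differs in case, so it is not linked
    "doc3-E7": {"Iraq"},
}
clusters = prelink(entities)
```

In this toy input, `doc1-E1` and `doc2-E4` end up in one cluster because both carry the string `Bush`, while `doc3-E2` and `doc3-E7` stay as singletons. Pre-linking of this kind only proposes candidate links; the slide indicates that annotators still reviewed them manually.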

  2. Cross-Document Entity Mention Count Histogram

     Rank  MFreq  Entity Name
     1     259    US
     2     182    Iraq
     3     96     Baghdad
     4     93     George W. Bush
     5     89     Saddam Hussein
     6     83     CNN
     …

  3. Total Mentions Covered by Frequency-Sorted Entities
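
The plot for slide 3 is not preserved in the transcript. One plausible reading of its title is a cumulative coverage curve: sort entities by mention count, then track what fraction of all mentions the top-k entities account for. The sketch below follows that reading; the function name `coverage_curve` is hypothetical. It uses only the six counts listed on slide 2, so the fractions it produces are relative to those six entities, not to the full corpus.

```python
from itertools import accumulate


def coverage_curve(mention_counts):
    """Return cumulative fraction of mentions covered by the top-k entities.

    Element k-1 of the result is the share of all supplied mentions that
    belong to the k most frequently mentioned entities.
    """
    counts = sorted(mention_counts, reverse=True)
    total = sum(counts)
    if total == 0:
        return []
    return [running / total for running in accumulate(counts)]


# Mention counts of the six entities listed on slide 2 (US, Iraq, Baghdad,
# George W. Bush, Saddam Hussein, CNN). The full distribution is elided in
# the source, so this is an illustration rather than the real curve.
top_six = [259, 182, 96, 93, 89, 83]
curve = coverage_curve(top_six)
```

With the complete list of per-entity mention counts, the same function would produce the curve over the whole corpus.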

  4. Callisto/EDNA
     • Entity Disambiguation and Normalization Annotation (EDNA) tool
     • A plug-in for the Callisto client
     • Multiple annotators supported with a single Tomcat server (with document locking)
     • Document set indexed by an APF-customized Lucene search engine
     • Assumes documents are annotated for ACE EDR (entity mentions and intra-document coreference)

  5. Logging onto the Server

  6. File Selection, Locking & Status

  7. Highlighted Mentions and ACE Annotations (panels: Source document; ACE Annotations)

  8. Default and Customizable Entity Search (panels: Entity-based Search Criteria; Search Results; Selected Entity Details)

  9. Color Coding Entity Status & Type

  10. Reviewing Target Link Target in Context of Source Document

  11. Type Restrictions in Search Can Be Relaxed

  12. Annotator Comments Can Be Added and Retained
