
Kyoungryol Kim



  1. Meeting Information Extraction from Meeting Announcement in Korean • Kyoungryol Kim

  2. Table of Contents • Introduction • Motivation • Goal • Problem Definition • Contribution • The Proposed Method • Finding • Discussion

  3. Introduction

  4. Motivation (1/3) : Necessity
     • Every day we receive many meeting announcements.
       • Conference, Seminar, Workshop, Meeting, Appointment…
       • Meeting announcements account for about 17% (30,201 out of 183,022) of the emails in the Enron Email Dataset.*
     • Smartphone era
       • Many people manage their schedules with an online calendar on a smartphone, e.g. Google Calendar.
       • However, typing on a touch-screen keyboard is difficult and error-prone.
     * Enron Email Dataset, August 21, 2009 version, http://www.cs.cmu.edu/~enron/

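  5. Goal
     • Extract schedule information from a meeting announcement and update the calendar with it automatically.
     • Example meeting announcement: 무더운 날씨가 본격적으로 시작되는 즈음하여 유니브캐스트의 상반기 평가와 하반기 운영을 위한 정기팀장회의를 개최합니다. 날짜 : 7월 19일(토) 오후 2시 장소 : 민들레영토 민들레영토 오는길 지도와 같이 명동역 8번 출구로 나오셔서 쭉 상가 끼고 걸어가시면 저기 YMCA빌딩 1층에 있습니다.
       (English: As the hot weather sets in, we are holding the regular team leaders' meeting to review the first half of the year at 유니브캐스트 and plan operations for the second half. Date: July 19 (Sat), 2 p.m. Venue: 민들레영토. Directions to 민들레영토: as on the map, leave 명동역 by Exit 8 and walk straight along the shops; it is on the 1st floor of the YMCA building.)
     • Figure flow: Meeting Announcement → Extract → Update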

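  6. Problem Definition
     To find the Meeting Location, the problem is divided into 3 parts:
     • Finding locations for each type of complexity.
     • Named entity disambiguation on the found locations.
     Example meeting announcement: 무더운 날씨가 본격적으로 시작되는 즈음하여 유니브캐스트의 상반기 평가와 하반기 운영을 위한 정기팀장회의를 개최합니다. 날짜 : 7월 19일(토) 오후 2시 장소 : 민들레영토 기본 안건 - 제작지원비 지급 지연에 대한 설명 - 기금 조정 운영안 - 가을 워크샵준비위 구성 - 기타(기타 안건으로 상정할 것이 있으면 각 팀장들은 제안해 주시기 바랍니다) 민들레영토 오는길 지도와 같이 명동역8번 츨구로 나오셔서 쭉 상가 끼고 걸어가시면 저기 YMCA빌딩 1층에 있습니다. 참고하세요
     (English: As the hot weather sets in, we are holding the regular team leaders' meeting to review the first half of the year at 유니브캐스트 and plan operations for the second half. Date: July 19 (Sat), 2 p.m. Venue: 민들레영토. Main agenda: - explanation of the delay in paying production support funds - plan for adjusting and operating the fund - forming the fall workshop preparation committee - other (team leaders, please propose any further items for the agenda). Directions to 민들레영토: as on the map, leave 명동역 by Exit 8 and walk straight along the shops; it is on the 1st floor of the YMCA building. Please take note.)
     Figure labels: 1. Finding Locations (Location-type NER) • 2. NE Disambiguation • Start/End Time Extraction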

  7. 3. Normalization & Co-reference

  8. Definition
     • Definition 1. Location Named Entity: a particular point or place in physical space (Wiktionary).
       • [Cyber Space] As an exception, if a cyber space is used as a place where people gather, it can be a location. e.g. MSN에서 9시에 모입니다. (We gather on MSN at 9 o'clock.)
       • [Road, Street, Transportation] cannot be a location unless it points to a particular place or is needed to describe the location. e.g. 진천 I/C, 왼쪽에 석촌지하차도가 보임 (진천 interchange; the 석촌 underpass is visible on the left)
       • [Bridge] can be a location. e.g. 납안교, 한강대교
       • [Train/Subway Station, Bus-stop] can be a location. e.g. 도곡역1번출구 (도곡역 Exit 1), 뱅뱅사거리 (뱅뱅 intersection)
       • [Address] A full or partial address can be a location. e.g. 전북 무주군 설천면 심곡리 43-15
       • [Organization, Company, Heritage, Building] can be a location if it is used to represent the location.
       • [Parenthesis] If the location becomes ambiguous once the parenthesized string is removed or split off, the parenthesized string, including the parentheses, is part of the location. e.g. COEX 컨퍼런스센터4층 (402호), 건국대학교(서울) 의생명연구동 강당, 경인교육대학교 (경기캠퍼스), 부산벡스코(BEXCO) 컨벤션홀201호, 생명과학관(녹지) 139호
       • [Enumeration] Different representations of the same location are recognized separately. e.g. 장소 ? 가야 레스토랑. 전화/215-654-8900, 주소/1002 Skippack Pike, Blue Bell, PA 19422 (장소 = venue, 전화 = phone, 주소 = address); 전주 화산체육관 (전북 전주시 완산구 중화산동 1가 45번지); 2. 장소 : 늘푸름(오산시 은계동91-8)
     • Definition 2. Meeting Location: the Location where the meeting will be held.
     • Definition 3. Location Landmark: a Location that can be used as a landmark on the way to the meeting location.

  9. Complexity of the problems

  10. The Proposed Method 1) Location Named Entity Recognition 2) Relation Type Classification 3) Co-reference 4) Normalization

  11. Overall Architecture
     Figure: the example announcement from the Problem Definition slide passes through the pipeline Input Document → Named Entity Recognition (Location) → Relation Type Classification → Co-reference → Normalization → OUTPUT. In the last copy of the announcement shown in the figure, the directions line begins with 명동 민들레영토 instead of 민들레영토.

  12. 1) Location Named Entity Recognition

  13. Architecture of Location NER
     Figure with two pipelines:
     • Training the system (supervised learning)
     • Testing the system (actual use and evaluation)
     Components shown in the figure: Training Corpus • Input: Morpheme-level tokenized sentence list • Web • Feature Extraction • Gazetteer Extraction • Tokenization • TF-IDF Calculation • Boundary Marking (IOB2) • Boundary Tagging (IOB2) by CRFs Model • Gazetteer • Feature Extraction • Boundary Merging • TF-IDF Score Data • CRFs Learning • CRFs Model • Output: NE Annotated Email Document

  14. NER - Boundary Detection
     • Boundary Tagset : IOB2
     • Features
       • Linguistic
         • {-2,-1,0,1,2} POS-level word, {-2,-1,0,1,2} POS-tag, POS-tag + length of the word
       • Orthographic : 18 types of the word
         • isKorean, isAlpha, isAlnum, 2DigitNum, ...
       • Gazetteer :
         • Person/Location Pronoun dictionary (ETRI 99)
         • from Training corpus : Heading words, Surrounding words, NE words
         • External resources :
           • Person : Chosun/Joins.com Person DB (64,042)
           • Location : Nate Local DB 35,335, Sigaji.com 8,193, Ofood 43,390, BusStop 19,431, Address, B/D 23,365, Subway 1,288, Hotel (Auction accommodation, hotelnjoy) 884, Country/Place name 11,946, School (Elementary~University) 21,957
       • Syntactic :
         • Position of the POS-level word in the chunk (relative: S/C/E, absolute)
         • Position of the chunk in the sentence (relative: S/SC/CE/E, absolute)
         • Position of the sentence in the document (relative: S/SC/CE/E, absolute)
       • TF-IDF
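The IOB2 boundary marking and the {-2,-1,0,1,2} window features listed above can be sketched roughly as follows. This is a minimal illustration, not the system's actual feature extractor: the morpheme split of 민들레영토, the POS tags, the `<PAD>` symbol, and the reading of isAlpha/isAlnum as ASCII-only checks are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple

PAD = "<PAD>"  # hypothetical padding symbol for window positions outside the sentence


def to_iob2(n_tokens: int, spans: Sequence[Tuple[int, int]], label: str = "Location") -> List[str]:
    """IOB2 marking: B- on the first token of each span, I- on the rest, O elsewhere.
    Spans are (start, end) token indices with an exclusive end."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def is_korean(token: str) -> bool:
    """True if every character is a precomposed Hangul syllable (U+AC00..U+D7A3)."""
    return bool(token) and all("\uac00" <= ch <= "\ud7a3" for ch in token)


def token_features(words: Sequence[str], pos_tags: Sequence[str], i: int) -> Dict[str, object]:
    """Window, POS+length, and a few orthographic features for token i."""
    feats: Dict[str, object] = {}
    for off in (-2, -1, 0, 1, 2):
        j = i + off
        inside = 0 <= j < len(words)
        feats[f"w[{off}]"] = words[j] if inside else PAD
        feats[f"pos[{off}]"] = pos_tags[j] if inside else PAD
    feats["pos+len"] = f"{pos_tags[i]}_{len(words[i])}"
    feats["isKorean"] = is_korean(words[i])
    # Assumption: isAlpha / isAlnum refer to ASCII letters and digits.
    feats["isAlpha"] = words[i].isascii() and words[i].isalpha()
    feats["isAlnum"] = words[i].isascii() and words[i].isalnum()
    feats["2DigitNum"] = len(words[i]) == 2 and words[i].isascii() and words[i].isdigit()
    return feats


# Hypothetical morpheme-level tokens and POS tags for "장소 : 민들레영토".
words = ["장소", ":", "민들레", "영토"]
pos_tags = ["NNG", "SP", "NNG", "NNG"]
tags = to_iob2(len(words), [(2, 4)])
features = token_features(words, pos_tags, 2)
```

Each token's feature dictionary, paired with its IOB2 tag, is the kind of input a CRF toolkit would be trained on; gazetteer, syntactic, and TF-IDF features would be added to the same dictionary.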

  15. Features : Gazetteer data
     • Location :
       • Shop Name (80,436)
         • Nate Local DB (3~10 chars.) (http://localinfo.nate.com)
         • Sigaji.com Shop DB (3~10 chars.) (http://sigaji.com/location/)
         • oFood (http://ofood.co.kr)
       • Hotel Name (884)
         • Auction Accommodation (http://accommodations.auction.co.kr)
         • Hotelnjoy (http://www.hotelnjoy.com)
       • Public Transportation (20,719)
         • Subway stations
         • Bus-Stop names
       • Address (from Zipcode DB) (23,365)
         • Si/do, Gu/gun, Dong/myun/ri, B/D names

  16. Evaluation Result (1/2) Baseline
     • Boundary Detection
       • Target : 13,076 sentences in 1,011 documents.
       • CRFs Model, 10-fold cross validation, 3-order, Exact Matching
       • The baseline applies only the Word and POS-tag features.
     Result table rows: B-Location, I-Location
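Exact matching means a predicted entity counts as correct only if both of its boundaries agree with the gold annotation. The sketch below shows one common way to score IOB2 output this way (span-level precision, recall, and F1); the function names and the toy sequences are illustrative assumptions, and the slides do not state which metric or script was actually used.

```python
from typing import List, Sequence, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def extract_spans(tags: Sequence[str]) -> Set[Span]:
    """Collect entity spans from a well-formed IOB2 tag sequence."""
    spans: Set[Span] = set()
    start, label = None, ""
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if start is not None and tag != f"I-{label}":
            spans.add((start, i, label))
            start = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def exact_match_prf(gold: List[Sequence[str]], pred: List[Sequence[str]]) -> Tuple[float, float, float]:
    """Span-level precision, recall, F1; a span is correct only if both boundaries match."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = extract_spans(g_tags), extract_spans(p_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: the first prediction truncates the entity, so it does not count.
gold = [["O", "O", "B-Location", "I-Location"], ["B-Location", "O"]]
pred = [["O", "O", "B-Location", "O"], ["B-Location", "O"]]
scores = exact_match_prf(gold, pred)
```

In the toy data the truncated span earns no credit even though it overlaps the gold span, which is what distinguishes exact matching from partial-overlap scoring.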

  17. 2) Relation Type Classification

  18. Architecture of Relation Type Classifier
     Figure with two pipelines:
     • Training the system (supervised learning)
     • Testing the system (actual use and evaluation)
     Components shown in the figure: Training Corpus • Input: Location NE-tagged Document • Web • Feature Extraction • Gazetteer Extraction • Tokenization • Gazetteer • Relation Type Classification by SVMs Model • Feature Extraction • Template Generation • SVMs Learning • SVMs Model • Output: Extracted NE with Meeting-NE Relation Type

  19. Statistics of Relation Types
     • Document-Location Relation Type Classification
       • Target : 1,844 Location-type Terms
         • 848 isHeldAt (45.99%)
         • 161 locationLandmark (8.73%)
         • 835 generalLocation (45.28%)

  20. Features
     • Linguistic
     • Gazetteer
       • Named Entity Dictionary
         • Nate Local DB 35,335, Sigaji.com 8,193, Ofood 43,390, BusStop 19,431, Address, B/D 23,365, Subway 1,288, Country/Place name 11,946
       • from Training Corpus :
         • Heading words in the current sentence
         • Heading words in the previous sentence
         • Words that make up the NE
     • Lexical Pattern
       • POS-tag feature just before and after the NE
       • Is this NE the first location NE after a colon?
       • Is this term inside parentheses?
       • Is a parenthesis opened and closed right after the NE?
       • Is a direction word right after the NE?
     • Syntactic
       • Is the NE the first or the last Location-type NE in the sentence?
       • Ratio of location NEs in the current sentence to those in the document
       • Relative position of the NEs in the sentence
       • Is the NE the longest location NE in the sentence?

  21. Experiment : Features (1/3)
     • Gazetteer
       • Named Entity Dictionary
         • Collected from the web
         • Check whether each morpheme, eojeol, or term matches a word in the dictionary.
         • Nate Local DB, Sigaji.com, Ofood
         • Address, Building name
         • Bus-stop, Subway station
         • Country name
       • Location-related Vocabulary
         • from Training Corpus :
           • Heading words in the current sentence.
           • Heading words in the previous sentence. A heading word is the word before the colon in the sentence, e.g. 장소 :피오레웨딩컨벤션(봉계동 여수 세무서 옆) (Venue: 피오레웨딩컨벤션 (next to the 여수 tax office, 봉계동)); here the heading word is 장소.
           • Eojeol-level words that make up the NE
     Feature set labels: Feature 1A • Feature 1 (A+B)
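The heading-word features can be illustrated with a small sketch that takes the word immediately before the first colon. The function names and the one-entry `LOCATION_HEADINGS` vocabulary are assumptions for this example (only 장소 is attested in the slides); the real vocabulary is drawn from the training corpus.

```python
from typing import Dict, Optional, Set

# Hypothetical location-related heading vocabulary; only 장소 appears in the slides.
LOCATION_HEADINGS: Set[str] = {"장소"}


def heading_word(sentence: str) -> Optional[str]:
    """Return the word directly before the first colon, or None if there is none."""
    head, colon, _ = sentence.partition(":")
    words = head.split()
    if not colon or not words:
        return None
    return words[-1]


def heading_features(current: str, previous: str = "") -> Dict[str, object]:
    """Heading words of the current and previous sentence, plus vocabulary hits."""
    cur, prev = heading_word(current), heading_word(previous)
    return {
        "heading": cur,
        "prev_heading": prev,
        "heading_is_location": cur in LOCATION_HEADINGS,
        "prev_heading_is_location": prev in LOCATION_HEADINGS,
    }


feats = heading_features(
    "장소 :피오레웨딩컨벤션(봉계동 여수 세무서 옆)",
    "날짜 : 7월 19일(토) 오후 2시",
)
```

A plain split on the first colon is a simplification: a time such as 2:00 earlier in the sentence would be misread as a heading, so a real extractor would need extra checks.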

  22. Experiment : Features (2/3)
     • Lexical Patterns
       • POS-tag feature just before and after the NE. e.g. 장소 : 피오레웨딩컨벤션 (봉계동 여수 세무서옆)
       • Is this NE the first location NE after a colon? e.g. 장소 : 피오레웨딩컨벤션(봉계동 여수 세무서 옆)
       • Is this NE inside parentheses? e.g. 장소 : 피오레웨딩컨벤션(봉계동 여수 세무서 옆)
       • Is a parenthesis opened and closed right after the NE? e.g. 장소 : 피오레웨딩컨벤션(봉계동 여수 세무서 옆)
       • Is a direction word right after the NE?
         • 34 direction words : 위, 아래, 밑, 옆, 앞, 내, 외, … e.g. 장소 : 피오레웨딩컨벤션(봉계동 여수 세무서 옆)
       • Does a unit of length appear within the next 3 eojeols after the NE?
         • [0-9]+(m|km|ft|yd|mile|미터|킬로미터|피트|야드|마일|리|초|분|시간)
       • Does the eojeol to the left contain a transportation word?
     Feature set label: Feature 1+2 (A~G)
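Several of these lexical patterns reduce to simple string checks around the NE. The sketch below works on raw text with character offsets, which is an assumption made for brevity (the slides describe morpheme- and eojeol-level processing). Only the seven direction words visible on the slide are included, the function and key names are hypothetical, and the second example sentence is invented to exercise the unit pattern.

```python
import re
from typing import Dict, Set, Tuple

# The slide lists 34 direction words; only the seven shown there are used here.
DIRECTION_WORDS: Set[str] = {"위", "아래", "밑", "옆", "앞", "내", "외"}
# Unit pattern copied from the slide.
UNIT_RE = re.compile(r"[0-9]+(m|km|ft|yd|mile|미터|킬로미터|피트|야드|마일|리|초|분|시간)")


def lexical_features(sentence: str, start: int, end: int) -> Dict[str, bool]:
    """Lexical pattern features for the NE at sentence[start:end]."""
    before, after = sentence[:start], sentence[end:]
    following = after.split()  # whitespace-separated eojeols after the NE
    return {
        "first_after_colon": before.rstrip().endswith(":"),
        "in_parenthesis": before.count("(") > before.count(")"),
        "paren_follows": after.lstrip().startswith("(") and ")" in after,
        "direction_next": bool(following) and following[0].strip("()[].,") in DIRECTION_WORDS,
        "unit_in_next_3": any(UNIT_RE.search(tok) for tok in following[:3]),
    }


def span_of(sentence: str, mention: str) -> Tuple[int, int]:
    """Character span of the first occurrence of mention."""
    start = sentence.index(mention)
    return start, start + len(mention)


sentence = "장소 : 피오레웨딩컨벤션(봉계동 여수 세무서 옆)"
venue = lexical_features(sentence, *span_of(sentence, "피오레웨딩컨벤션"))
landmark = lexical_features(sentence, *span_of(sentence, "여수 세무서"))

# Invented sentence (not from the slides) to exercise the unit-of-length pattern.
directions = "명동역 8번 출구에서 200m 직진"
station = lexical_features(directions, *span_of(directions, "명동역"))
```

On the slide's example the venue name fires the colon and following-parenthesis patterns, while the tax office inside the parentheses fires the in-parenthesis and direction-word patterns, which is the kind of contrast the relation type classifier can exploit.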

  23. Experiment : Features (3/3)
     • Syntactic Features
       • Is the NE the first or the last Location-type NE in the sentence? e.g. (1호선에서 갈아탈 경우 동묘역에서6호선을 갈아타고 봉화산방향으로 타고 오시면 2번째 정거장이 보문역입니다. (If you transfer from Line 1, change to Line 6 at 동묘역 and ride toward 봉화산; the second stop is 보문역.)
       • Ratio of NEs in the current sentence to those in the document (<25%, <50%, <75%, <100%, =100%)
       • Relative position of the sentence in the document (S / SC / CE / E)
       • Relative position of the eojeol in the sentence (S / SC / CE / E)
       • Relative position of the NEs in the sentence (S / SC / CE / E). e.g. (1호선에서 갈아탈 경우 동묘역에서6호선을 갈아타고 봉화산방향으로 타고 오시면 2번째 정거장이 보문역입니다. (figure labels on the NEs: S, CE, E)
       • Is the NE the longest location NE in the sentence?
       • Is this the only location NE in the sentence?
       • Is there another NE immediately before/after this NE in the sentence?
       • Is there an NE of the same type in the previous/next sentence?
       • Is there a phone number to the left or right of the NE?
       • Surrounding word (on the right side of the NE, n=1, pos=etc,p,j,m,x)
       • Is a colon included in the current/previous/next sentence?
       • Does the sentence start/end with the NE?
       • Number of chunks in the NE (max: 99)
       • Length of the NE (max: 300)
       • Is a location-related word included in the heading?
       • Is a location-related word included in the heading of the previous sentence?
       • Is an NE that appears more than 2 times next to this NE?
       • Does the NE appear more than 2 times?
       • Transport Dic Feature (left / right)
       • Order of the NE in the sentence
       • Does the sentence start with a special character?
     Feature set label: Feature 1+2+3 (G,H)
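The bucketed position and ratio features can be sketched as below. The slides do not define S / SC / CE / E, so this example assumes they stand for four equal-width buckets (start, start-centre, centre-end, end) over the relative position; the ratio buckets follow the thresholds given on the slide. Function names are hypothetical.

```python
def relative_position(index: int, total: int) -> str:
    """Bucket a 0-based index among `total` items into S / SC / CE / E.
    Assumption: the four labels are equal-width quarters of the range."""
    if total <= 0 or not 0 <= index < total:
        raise ValueError("index must satisfy 0 <= index < total")
    fraction = index / total
    if fraction < 0.25:
        return "S"
    if fraction < 0.50:
        return "SC"
    if fraction < 0.75:
        return "CE"
    return "E"


def ratio_bucket(count_in_sentence: int, count_in_document: int) -> str:
    """Bucket the share of the document's NEs found in the current sentence."""
    if count_in_document <= 0:
        raise ValueError("the document must contain at least one NE")
    ratio = count_in_sentence / count_in_document
    if ratio >= 1.0:
        return "=100%"
    for limit, name in ((0.25, "<25%"), (0.50, "<50%"), (0.75, "<75%")):
        if ratio < limit:
            return name
    return "<100%"


# Third sentence of a ten-sentence document, holding 2 of the document's 5 location NEs.
sentence_position = relative_position(2, 10)
share = ratio_bucket(2, 5)
```

Bucketing turns raw positions and ratios into a small set of categorical values, which suits the sparse binary feature templates typically fed to an SVM.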

  24. Experiment: Relation Type Classification
     • Meeting-Location Relation Type Classification
       • Target : 1,844 Location-type NEs
       • 3-class (multi-class) SVMs classifier
       • Total Accuracy : 82.23%

  25. 3) Normalization & Co-reference

  26. Architecture of Normalizer
     • Address Format (11 slots): 1 Country | 2 State/City | 3 City/Gu/Gun | 4 Dong/Eup/Myeon | 5 Ri | 6 House no. | 7 Org. | 8 B/D | 9 Floor | 10 Shop Name | 11 Room no.
     • Subway Format (4 slots): 1 City | 2 Line no. | 3 Station Name | 4 Gate no.
     • Stages shown in the figure: Input Document • Verifying Elements • Expansion • Combine • OUTPUT
     • Resources shown in the figure: Subway • Addr. Pattern • Open Map Services (Google Maps, Yahoo! Maps, Daum Map, Naver Map)
     • Example elements found in the input: 민들레영토; YMCA빌딩(A8) / 1층(A9); 명동역(S3) / 8번출구(S4)
     • After expansion, the subway elements become: 서울시(S1) 4호선(S2) 명동역(S3) / 8번출구(S4)
     • Example output: A1 : 대한민국, A2 : 서울시, A3 : 중구, A4 : 명동, A5 : -, A6 : -, A7 : -, A8 : YMCA빌딩, A9 : 1층, A10 : 민들레영토, A11 : -
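The slot-based address record can be sketched as a dictionary keyed by slot id (A1..A11), with partial records merged in a combine step. This is only an illustration under stated assumptions: `combine` and `render` are hypothetical helpers, and the expansion values are hard-coded from the slide example rather than looked up in a map service.

```python
from typing import Dict, List

# Slot names copied from the slide; keys are A1..A11 and S1..S4.
ADDRESS_SLOTS = ["Country", "State/City", "City/Gu/Gun", "Dong/Eup/Myeon", "Ri",
                 "House no.", "Org.", "B/D", "Floor", "Shop Name", "Room no."]
SUBWAY_SLOTS = ["City", "Line no.", "Station Name", "Gate no."]


def combine(*partials: Dict[str, str]) -> Dict[str, str]:
    """Merge partial slot records; the earliest record wins when a slot repeats."""
    merged: Dict[str, str] = {}
    for partial in partials:
        for slot, value in partial.items():
            merged.setdefault(slot, value)
    return merged


def render(record: Dict[str, str], prefix: str, n_slots: int) -> List[str]:
    """Format every slot in order, using '-' for slots that stay empty."""
    return [f"{prefix}{i} : {record.get(f'{prefix}{i}', '-')}" for i in range(1, n_slots + 1)]


# Elements verified in the announcement text.
meeting_location = {"A10": "민들레영토"}
landmark = {"A8": "YMCA빌딩", "A9": "1층"}
# Assumption: values an expansion step might return, hard-coded from the slide.
expanded = {"A1": "대한민국", "A2": "서울시", "A3": "중구", "A4": "명동"}
subway = {"S1": "서울시", "S2": "4호선", "S3": "명동역", "S4": "8번출구"}

address_lines = render(combine(meeting_location, landmark, expanded), "A", len(ADDRESS_SLOTS))
subway_lines = render(subway, "S", len(SUBWAY_SLOTS))
```

Keeping each slot separate lets the output be checked against a map service field by field and leaves unknown slots explicit as '-', as in the slide's example output.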

  27. Discussion 1) Limitations 2) Applications

  28. Limitations
     • Performance
       • Both systems need further refinement and more thorough experiments.
     • Scaling Up
       • Our corpus consists of only 1,011 emails; a method for covering more real-world data still needs to be addressed.
     • Feature Selection
       • We use a gazetteer of more than 165,000 words, and many of the resulting features are always zero in the training data. To save memory and maximize performance, these unsupported features need to be removed.

  29. Applications
     • Smartphone application
       • Extract the start/end time and location from an email and add them to Google Calendar.
     • Contribution to the OpenStreetMap community
       • Automatically upload the locations found to openstreetmap.com
