LOD2 KOREA: Towards Publishing Korean Linked Data on the Web • Key-Sun Choi • Joint work with Martin Rezk, Jungyeul Park, Yoon Yongun, Kyungtae Lim, YoungGyun Hahm
Key-Sun Choi - Personal History • NEC C&C Lab. – PIVOT Japanese-Korean Machine Translation • Korean Part-of-Speech Tagset, Corpus, Dictionary • CoreNet (Korean-Chinese-Japanese) Semantic Wordnet (2004) • KORTERM: Korea Terminology Research Center for Language and Knowledge Engineering (1998-2007), Research Center of Ministry of Culture • KAIST Research Grand Award (1998) • ISO/TC37/SC4 Founding member (Language Resource Management Standards) • ISWC 2007 PC Co-Chair (International Semantic Web Conference) • AFNLP President (2009-2010) • DBpedia Korea http://ko.dbpedia.org/ • http://lod2.eu/ partner (EU FP7)
NLP2RDF • Triple in Natural Language • Subject • Object • Predicate • Extract from Sentences • 野生種의 장미는 主로 北半球의 溫帶와 寒帶 地方에 分布한다. • Wild roses are distributed mainly in the temperate and frigid zones of the northern hemisphere. • Subject : 장미 (rose) • Object : 북반구의 온대지방, 한대 지방 (temperate and frigid zones of the northern hemisphere) • Predicate : 分布 (isDistributedAt) Key-Sun Choi - LOD2 Korea
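The subject/predicate/object reading of the rose sentence can be written down as a small data structure. The sketch below is only an illustration in Python: the `Triple` type and the `to_text` helper are hypothetical names, and the values are copied by hand from the slide rather than produced by an extractor.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object statement taken from a sentence."""
    subject: str
    predicate: str
    obj: str


# Values copied by hand from the slide's example; no extraction happens here.
rose = Triple(
    subject="장미",
    predicate="isDistributedAt",
    obj="북반구의 온대지방, 한대 지방",
)


def to_text(triple: Triple) -> str:
    """Render a triple as a readable arrow notation."""
    return f"({triple.subject}) --{triple.predicate}--> ({triple.obj})"


print(to_text(rose))
```

Keeping the three parts as separate fields is what later lets each of them be replaced by a URI.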
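[Figure: concept graph of the embedded-software domain. Node labels (translated from Korean): Microsoft, Wind River, VxWorks, WinCE, pSOS, VRTX, RTOS, real-time embedded operating system, non-real-time embedded operating system, embedded operating system, operating system, embedded software, embedded system, software, system, platform, middleware, communication middleware, browser, media player, application program, development environment, home appliance, DVD player, set-top box, digital camera, MP3 player, manufacturing company. Relation labels: consists_of, manufacturer, reside_on]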
NLP2RDF • <Conceptual Layer> <DBpedia> (based on DBpedia Ontology) • Barack Obama URI = dbpedia12415 (conceptually unique) • <Career> President • <Nationality> United States • <Party> Democrats • ... • LOD algorithm • Sentence: 'Barack Obama is the President of the United States' • The output of NLP tools, "KNIF" Wrapper • Barack Obama URI = sen1word1 (unique within the document) • <POStag> NNG </POStag> • ...
For this work • For RDF Mapping • Triples and URIs • Ontology • String Ontology • Structured Sentence Ontology • NIF and the Korean language • For LOD Mapping • URI for each DBpedia entity • Mapping words in text to DBpedia
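Parse tree to summary • 물체의 낙하 거리는 시간의 제곱에 비례한다 (The falling distance of an object is proportional to the square of the time) • <Triple> • Subject • 물체의 낙하거리 (falling distance of an object) • Predicate • 비례한다 (is proportional to) • Contents • 시간의 제곱 (square of the time)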
Why NLP? Why Syntax and Semantics? • Advanced technology on the higher-level layers
NLP Layer Cake
Semantic Web vs. NLP layer cake
How to develop a parser and semantic classifier creatively? • Open Source NLP tools • Rich open tools/resources for English and Japanese • A few Korean tools • How to adapt Korean tools to the already developed tools • Already developed Korean language resources • KAIST tools/resources • KAIST open source on SourceForge and the web • Cambridge University Press: NLP Textbook (in progress) • Linked Data – http://lod2.eu/ partner
Background • The idea of linking data from different sources is not new: • Network Database Model: 1970s • Linked Data: today • The goal is to facilitate sharing and reusing information. • Linked Data aims to extend the Web with a data commons by creating typed links between data from different sources
Background • These links are usually modeled using the Resource Description Framework (RDF) • Each piece of data is identified with a URI • The first task towards linking data is to identify which resources and which properties we want to describe
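As a rough illustration of URI-identified data and a typed link between two sources, the Python sketch below serialises two statements in N-Triples syntax. The `example.org` resource URIs and the helper names (`iri`, `literal`, `n_triple`) are assumptions made for this example; only the `owl:sameAs` and `rdfs:label` property URIs are standard vocabulary.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical namespaces standing in for two independent data sources.
KO = "http://example.org/ko/resource/"
EN = "http://example.org/en/resource/"


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str, lang: str = "") -> str:
    """Quote a string literal, optionally adding a language tag."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}' if lang else f'"{escaped}"'


def n_triple(subject: str, predicate: str, obj: str) -> str:
    """Serialise one statement; the arguments are already formatted terms."""
    return f"{subject} {predicate} {obj} ."


statements = [
    # A property of the resource itself.
    n_triple(iri(KO + "장미"), iri(RDFS_LABEL), literal("장미", "ko")),
    # A typed link from one source to another.
    n_triple(iri(KO + "장미"), iri(OWL_SAME_AS), iri(EN + "Rose")),
]
print("\n".join(statements))
```

The second statement is the "typed link": its predicate says what kind of relation holds between the two resources, which a plain hyperlink cannot express.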
Introduction • NLP2RDF is a LOD2 community project that is developing the NLP Interchange Format (NIF) • NIF aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations • The output of NLP tools can be converted into RDF and used in the LOD2 Stack • http://nlp2rdf.org • NIF... • Is based on RDF/OWL • Enables users to annotate several languages in a uniform way • Enables users to query text documents with SPARQL (e.g. http://semanticweb.kaist.ac.kr/nlp2rdf/) • Sentence: 다크나이트는 미국의 영화이다. • The Dark Knight is an American film.
NIF Wrapping • NLP Interchange Format (NIF) is an RDF/OWL-based format that allows several Natural Language Processing (NLP) tools to be combined and chained in a flexible, lightweight way. • Source: Sebastian Hellmann, AKSW, Universität Leipzig, NLP Interchange Format (NIF)
Structure of NLP2RDF • NLP Layer • Interchange Layer • Data Layer
Example of NLP Layer: English NLP • Input sentence • Tokenization • CFG Parser • Dependency Parser
How to create RDF from NLP output • Process • Raw texts • NLP tools output • NIF Wrapper (StanfordWrapper.Java) • RDF • Example: My dog also likes eating sausage.
Example of NLP2RDF in English • http://nlp2rdf.lod2.eu/demo.php • Sentence: Obama is the president of USA.
<http://prefix.given.by/theClient#offset_0_5>
    sso:oliaLink <http://purl.org/olia/penn.owl#NNP> ;
    sso:posTag "NNP" ;
    sso:lemma "Obama" ;
    str:referenceContext <http://prefix.given.by/theClient#offset_0_30> ;
    str:anchorOf "Obama" ;
    rdf:type sso:Word , str:String .
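The `offset_0_5` and `offset_0_30` identifiers in the Obama example are character offsets into the sentence. The Python sketch below rebuilds them under that reading; the regex tokenisation and the function names are assumptions for illustration and are not the demo's wrapper code.

```python
import re

PREFIX = "http://prefix.given.by/theClient#"  # prefix taken from the slide
SENTENCE = "Obama is the president of USA."


def offset_uri(prefix: str, begin: int, end: int) -> str:
    """Build an offset-based identifier for the span text[begin:end]."""
    return f"{prefix}offset_{begin}_{end}"


def annotate_tokens(prefix: str, text: str):
    """Yield (uri, anchor_text, context_uri) for each token in text."""
    context = offset_uri(prefix, 0, len(text))
    # Naive tokenisation: a run of word characters, or one punctuation mark.
    for found in re.finditer(r"\w+|[^\w\s]", text):
        uri = offset_uri(prefix, found.start(), found.end())
        yield uri, found.group(), context


tokens = list(annotate_tokens(PREFIX, SENTENCE))
for uri, anchor, context in tokens:
    print(f'<{uri}> str:anchorOf "{anchor}" ; str:referenceContext <{context}> .')
```

The first token comes out as `offset_0_5` inside the context `offset_0_30`, matching the identifiers on the slide.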
Korean NLP2RDF • Resources: morphemes, words (eojeols) and sentences in Korean • Properties: POS, grammatical roles, etc. • Problems to solve: • Linguistic Modeling (OLiA) • Processing Korean Text (NLP) • How to Produce and Query RDF
Linguistic Modeling (1) • We use OLiA (Ontologies of Linguistic Annotation) to link the Sejong tagset with language-independent reference concepts. • The Sejong tagset is the de facto standard for Korean • OLiA consists of three different ontologies: • the OLiA reference model (language-independent), • the OLiA annotation model (depends on the tagset), • the OLiA linking model (depends on the tagset). • We developed a fragment of the last two ontologies for Korean, that is, for the Sejong tagset.
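The three-model split can be pictured as a lookup from tagset-specific classes to language-independent concepts. The Python sketch below is a guess at the shape of such a linking fragment, not the ontology the authors built: the `SEJONG` namespace is hypothetical, the OLiA reference namespace and concept names are assumed, and the five tag pairs are chosen for illustration.

```python
from typing import Optional, Tuple

OLIA = "http://purl.org/olia/olia.owl#"         # reference model namespace (assumed)
SEJONG = "http://example.org/olia/sejong.owl#"  # hypothetical annotation model namespace

# Illustrative linking fragment: Sejong tag -> OLiA reference concept.
LINKING = {
    "NNG": "CommonNoun",   # general noun
    "NNP": "ProperNoun",   # proper noun
    "VV": "Verb",          # verb
    "VA": "Adjective",     # adjective
    "MAG": "Adverb",       # general adverb
}


def link(tag: str) -> Optional[Tuple[str, str, str]]:
    """Return a linking triple for a Sejong tag, or None if it is unmapped."""
    concept = LINKING.get(tag)
    if concept is None:
        return None
    return (SEJONG + tag, "rdfs:subClassOf", OLIA + concept)


for tag in ("NNG", "NNP", "VV"):
    print(link(tag))
```

Because only the linking table depends on the tagset, a query written against the reference concepts would in principle apply to Penn-tagged English and Sejong-tagged Korean alike.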
Linguistic Modeling (2) • We use NIF (NLP Interchange Format) to • standardize the input/output of the different tools so that they are easier to connect, and to • uniquely identify (parts of) text, entities and relationships. • NIF provides two URI schemes to identify resources • Offset-based • Hash-based • In our application we opt for the hash-based scheme
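One practical difference between the two schemes is what happens when text is inserted before an annotated span. The Python sketch below uses a simplified hash recipe (MD5 over the span plus a few characters of context), which is an assumption for illustration and should not be read as the exact NIF definition; the prefix and function names are hypothetical.

```python
import hashlib
from urllib.parse import quote

PREFIX = "http://example.org/doc#"  # hypothetical document prefix


def offset_uri(prefix: str, begin: int, end: int) -> str:
    """Offset-based identifier: depends on the absolute position."""
    return f"{prefix}offset_{begin}_{end}"


def hash_uri(prefix: str, text: str, begin: int, end: int,
             context_len: int = 4, preview_len: int = 20) -> str:
    """Simplified hash-based identifier: depends on the span and its
    immediate left/right context, not on the absolute position."""
    left = text[max(0, begin - context_len):begin]
    right = text[end:end + context_len]
    span = text[begin:end]
    digest = hashlib.md5(f"{left}({span}){right}".encode("utf-8")).hexdigest()
    return f"{prefix}hash_{context_len}_{len(span)}_{digest}_{quote(span[:preview_len])}"


def span_of(text: str, word: str):
    """Locate the first occurrence of word and return (begin, end)."""
    begin = text.index(word)
    return begin, begin + len(word)


original = "이탈리아에서 공부하고 온 마틴은 한국을 사랑합니다."
edited = "안녕하세요. " + original  # text inserted before the span

b1, e1 = span_of(original, "한국을")
b2, e2 = span_of(edited, "한국을")

print(offset_uri(PREFIX, b1, e1), offset_uri(PREFIX, b2, e2))
print(hash_uri(PREFIX, original, b1, e1) == hash_uri(PREFIX, edited, b2, e2))
```

In this sketch the offset-based identifier changes after the edit while the hash-based one does not, which is one plausible reason to prefer the hash-based scheme for text that may be revised.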
Korean NLP2RDF Platform • RAW Text • Morpheme Analyzer: HanNanum • Korean open source morpheme analyzer • Developed by SWRC, KAIST • Parser: Korean Berkeley Parser • Training set: Modified Sejong Treebank (DongHyun Choi, Jungyeul Park, Key-Sun Choi, Korean Treebank Transformation for Parser Training, ACL - SPMRL 2012) • F1-score: 82.12% • Wrapper • Produces triples • Uses OLiA (Ontologies of Linguistic Annotation) to link the Korean tagsets with language-independent reference concepts • The OLiA annotation model and the OLiA linking model produce triples using the Sejong tagset • NIF output
[Figure: architecture of the Korean NLP2RDF framework] • Korean NLP (Korean language information, Korean grammar framework): input Korean sentence, morphological analyzer, CFG parser, dependency parser, parsed result • URI and tag database • RDF generator (OnTop framework): mappings, ontologies, RDF triples • SPARQL query handler: SPARQL query, RDF triples
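To give a feel for what the query side of such an architecture does, the Python sketch below matches triple patterns over a hand-written list, roughly in the spirit of the SPARQL pattern `?m sso:posTag "NNP" ; str:anchorOf ?a`. It is a stand-in, not the framework's query handler: the `ex:` identifiers are hypothetical, and the morphemes and Sejong tags for 다크나이트는 미국의 영화이다 were assigned by hand.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hand-written triples; identifiers and tags are illustrative, not analyzer output.
TRIPLES: List[Triple] = [
    ("ex:m1", "str:anchorOf", "다크나이트"), ("ex:m1", "sso:posTag", "NNP"),
    ("ex:m2", "str:anchorOf", "는"), ("ex:m2", "sso:posTag", "JX"),
    ("ex:m3", "str:anchorOf", "미국"), ("ex:m3", "sso:posTag", "NNP"),
    ("ex:m4", "str:anchorOf", "의"), ("ex:m4", "sso:posTag", "JKG"),
    ("ex:m5", "str:anchorOf", "영화"), ("ex:m5", "sso:posTag", "NNG"),
]


def match(triples: List[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return the triples that agree with every non-None pattern position."""
    pattern = (s, p, o)
    return [
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    ]


def anchors_with_tag(triples: List[Triple], tag: str) -> List[str]:
    """Join two patterns: find subjects with the tag, then their surface strings."""
    subjects = {t[0] for t in match(triples, p="sso:posTag", o=tag)}
    return [t[2] for t in match(triples, p="str:anchorOf") if t[0] in subjects]


print(anchors_with_tag(TRIPLES, "NNP"))
```

A real SPARQL engine generalises this join over arbitrary graph patterns; the point here is only that POS tags and surface strings become queryable once they are triples.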
NIF Output • Each piece of data is identified with a URI (hash-based) • Resources: morphemes, words (eojeols), sentences in Korean • Properties: POS tag, grammatical roles, etc. • [Figure: some produced triples and parsing results] • Demo site: http://semanticweb.kaist.ac.kr/nlp2rdf
NIF Output • 이탈리아에서 공부하고 온 마틴은 한국을 사랑합니다. • Martin, who came back after studying in Italy, loves Korea.
Specific Issues of Korean • Korean Tagset • Linking with OLiA Ontology • [Figure: NLP2RDF produces triples from parser output and emits RDF output. Labels: String Ontology (String); Structured Sentence Ontology (SSO) (Word, Sentence, Phrase, ...); OLiA (Tag, ...); Penn; Sejong Tag Set]
Conclusions: • We presented a framework that allows • processing Korean text, • efficiently producing RDF triples, and • querying the output of the NLP tools • The RDF output of our framework is compliant with NIF (NLP Interchange Format) and the OLiA ontologies, which makes it easier to combine with other NLP tools • Future: • complete the development of the language-dependent part of the OLiA ontologies, • include the missing features required by NIF, • allow richer SPARQL queries, and • disambiguate the different entities in the text and link them with Wikipedia articles.
Issues • DBpedia • How to link produced triples with DBpedia triples • Josa (postposition case marker) • A grammatical feature specific to Korean • Sentence : 다크나이트는 미국의 영화이다. • Sentence : The Dark Knight is an American movie.
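The josa issue can be seen on the example sentence: the entity names 다크나이트 and 미국 appear with particles attached (는, 의), so an eojeol cannot be looked up in DBpedia as-is. The Python sketch below strips a particle from a short hand-picked list. It is a toy under stated assumptions, not how a morpheme analyzer such as HanNanum works, and plain suffix stripping will wrongly split nouns that merely end in one of these syllables.

```python
from typing import Optional, Tuple

# A small, hand-picked subset of josa; a real inventory is much larger.
JOSA = ("에서", "은", "는", "이", "가", "을", "를", "의", "에")


def split_josa(eojeol: str) -> Tuple[str, Optional[str]]:
    """Split a trailing josa off an eojeol, trying longer particles first."""
    for josa in sorted(JOSA, key=len, reverse=True):
        if eojeol.endswith(josa) and len(eojeol) > len(josa):
            return eojeol[: -len(josa)], josa
    return eojeol, None


sentence = "다크나이트는 미국의 영화이다."
for eojeol in sentence.rstrip(".").split():
    print(eojeol, "->", split_josa(eojeol))
```

Only the stem returned by such a split (or, better, by a real analyzer) is a sensible key for matching against a DBpedia label.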
Source • OnTop • https://babbage.inf.unibz.it/trac/obdapublic/wiki/ObdalibPluginIntro • Demo site: for Korean • http://semanticweb.kaist.ac.kr/nlp2rdf • Demo site: for English • http://nlp2rdf.lod2.eu/demo.php • NLP2RDF • http://nlp2rdf.org
Key-Sun Choi, Mun-Yong Yi, In-Young Koh, Younghee Lee (CS/WebST, Knowledge Service Eng., CS/WebST, CS) • Tony Veale (Invited Professor, Computational Creativity) • Yoon, Yong-Un (research professor, NLP+DB) • Martin Rezk (postdoctoral researcher, Logic) • Park, Jung-Yeol (researcher, parser) • Lee, Jae-Sung (Professor, morphology and word) • Graduate Students: Soon-Gil Hong, Young-Gyun Hahm, Kyungtae Lim, Se-Mi Jang, Youngho Jeong, … • http://ko.dbpedia.org/ • http://semanticweb.kaist.ac.kr • kschoi@kaist.ac.kr