Semi-Automatically Generating Data-Extraction Ontology Yihong Ding March 6, 2001
Extract information from Web document

-------------------------------------------------------------------------
-- Cars Application Ontology
--
-- $Revision: 1.2 $
--
-- $Log: cars.osm,v $
-- Revision 1.2  1998/02/20 00:15:55  liddl
-- Cleaned up header
--
-- Revision 1.1  1998/02/20 00:14:14  liddl
-- Initial revision
--
Car [-> object];
Car [0:1] has Year [1:*];
Year matches [4] constant
  { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d[^,\dkK]"; substitute "^" -> "19"; },
  { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d,[^\d]"; substitute "^" -> "19"; },
  { extract "\d{2}"; context "\b'[4-9]\d\b"; substitute "^" -> "19"; },
  { extract "\d{2}"; context "([^\$\d]|^)0\d[^,\dkK]"; substitute "^" -> "20"; },
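The Year rules above pair an extract pattern with a context pattern and a substitution. The following Python sketch shows one plausible reading of those three parts: find each context match, pull the extract pattern out of it, then apply the substitution. The names `YEAR_RULES` and `apply_rules` are hypothetical, and the matching semantics are an assumption, not the documented behavior of the original extraction engine.

```python
import re

# Two of the Year rules from the slide, transcribed as Python dictionaries.
# The dictionary layout is an assumption made for this sketch.
YEAR_RULES = [
    {"extract": r"\d{2}",
     "context": r"([^\$\d]|^)[4-9]\d[^,\dkK]",
     "substitute": ("^", "19")},
    {"extract": r"\d{2}",
     "context": r"([^\$\d]|^)0\d[^,\dkK]",
     "substitute": ("^", "20")},
]


def apply_rules(text, rules):
    """Return the normalized values produced by every rule, in rule order."""
    values = []
    for rule in rules:
        # Assumed semantics: the context pattern locates a candidate region,
        # and the extract pattern selects the value inside that region.
        for ctx in re.finditer(rule["context"], text):
            hit = re.search(rule["extract"], ctx.group(0))
            if hit is None:
                continue
            value = hit.group(0)
            if "substitute" in rule:
                pattern, replacement = rule["substitute"]
                # "^" -> "19" prepends the century to a two-digit year.
                value = re.sub(pattern, replacement, value)
            values.append(value)
    return values


years = apply_rules("95 Honda Accord, 120k miles, $4500", YEAR_RULES)
```

In this sketch the context pattern keeps the price and the mileage from being read as years, because digits preceded by `$` or followed by `k` never satisfy it.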
Ontology

Car [-> object];
Car [0:1] has Make [1:*];
Make matches [10] constant
  { extract "\baudi\b"; };
end;
Car [0:1] has Model [1:*];
Model matches [25] constant
  { extract "80"; context "\baudi\S*\s*80\b"; };
end;
Car [0:1] has Mileage [1:*];
Mileage matches [8] constant
  { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; };
end;
Car [0:1] has Price [1:*];
Price matches [8] constant
  { extract "[1-9]\d{3,6}"; context "\$[1-9]\d{3,6}"; };
end;

• a computational entity, a resource containing knowledge about what “concepts” exist in the world and how they relate to one another
• Components
  • Concepts
    • Domain dependent
      • Context free
      • Context sensitive
    • Domain independent
      • Context free
      • Context sensitive
  • Relationship (relational schema between the concepts)
  • Constraints
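To make the "Relationship" and "Constraints" components concrete, here is a small Python sketch that reads relationship declarations such as `Car [0:1] has Make [1:*];` into records holding both object sets and their participation constraints. The `Relationship` class, the `REL` pattern, and `parse_relationships` are hypothetical names for illustration; the real ontology language has a fuller grammar than this regular expression covers.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    subject: str        # e.g. "Car"
    subject_card: str   # participation constraint, e.g. "0:1"
    verb: str           # e.g. "has"
    obj: str            # e.g. "Make"
    object_card: str    # e.g. "1:*"


# Matches lines shaped like: Name [min:max] verb Name [min:max];
REL = re.compile(r"(\w+)\s*\[([^\]]+)\]\s+(\w+)\s+(\w+)\s*\[([^\]]+)\]\s*;")


def parse_relationships(text):
    """Collect every binary relationship declaration found in the text."""
    return [Relationship(*m.groups()) for m in REL.finditer(text)]


SOURCE = (
    'Car [-> object]; '
    'Car [0:1] has Make [1:*]; '
    'Make matches [10] constant { extract "\\baudi\\b"; }; end; '
    'Car [0:1] has Model [1:*];'
)
relationships = parse_relationships(SOURCE)
```

Lines that declare the object (`Car [-> object];`) or a data frame (`Make matches ...`) do not fit the pattern, so only the two `has` relationships are returned.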
My work
• Pre-assumptions
  • Given an information knowledge base that already contains domain-dependent and domain-independent concepts
    • Pre-defined ontologies
      • Mikrokosmos, Gene, our ontologies, etc.
    • Component recognizers
      • date, time, price, phone number, etc.
  • Given sample training Web documents
• Semi-automatically generate the ontology
Architecture
[Figure: architecture diagram. Recoverable labels: Training Web documents; Information knowledge base; Classify related concepts for the sample documents; Partial completed ontology; Pattern learning & updating; Raw completed ontology; User Control Interface; Need modification; Satisfied; Output final ontology]
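One possible reading of the architecture is a feedback loop: build a raw ontology, show it to the user, and update it until the user is satisfied. The Python sketch below expresses only that control flow. The function `refine_until_satisfied` and its three callback parameters are hypothetical, and the `max_rounds` limit is an added safeguard that does not appear in the slides.

```python
def refine_until_satisfied(classify, review, update, max_rounds=10):
    """Run the assumed classify / review / update loop.

    classify() -> initial (raw completed) ontology
    review(ontology) -> None when the user is satisfied, else a modification
    update(ontology, modification) -> revised ontology
    """
    ontology = classify()
    for _ in range(max_rounds):
        modification = review(ontology)
        if modification is None:
            # "Satisfied": this ontology becomes the final output.
            return ontology
        # "Need modification": apply the change and show the result again.
        ontology = update(ontology, modification)
    return ontology
```

The callbacks stand in for the knowledge-base classification step, the user control interface, and the pattern learning and updating step, respectively.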
Example: CIA Factbook
• Country: China
• Location: Eastern Asia, bordering the East China Sea, Korea Bay, Yellow Sea, and South China Sea, between North Korea and Vietnam
• Geographic coordinates: 35 00 N, 105 00 E
• Map references: Asia
• Area:
  • total: 9,596,960 sq km
  • land: 9,326,410 sq km
  • water: 270,550 sq km
Partial completed ontology

CountryName matches [30] constant
  { extract "\bChina\b"; },
  { extract "\bUnited States\b"; };
  …
end;
Location matches [50] constant
  { extract "\bAsia\b"; },
  { extract "\bEurope\b"; },
  …
  { extract "\bYellow Sea\b"; },
  …
end;
Latitude matches [10] constant
  { extract "\b[1-9]\d{0,2}\b[1-9]\d{0,1}(N|S)"; };
end;
Longitude matches [10] constant
  { extract "\b[1-9]\d{0,2}\b[1-9]\d{0,1}(E|W)"; };
end;
Number matches [6] constant
  { extract "[1-9]\d{0,5}"; },
  { extract "[1-9]\d{0,2},\d{3}"; };
end;
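The step from the knowledge base to the partial completed ontology can be pictured as running every known pattern over the sample document and keeping the concepts that match. The Python sketch below does this for a few of the patterns shown above. `KNOWLEDGE_BASE`, `SAMPLE`, and `classify` are hypothetical names, and treating "at least one match" as the selection criterion is an assumption.

```python
import re

# A few patterns transcribed from the slide; the dictionary layout is
# an assumption made for this sketch.
KNOWLEDGE_BASE = {
    "CountryName": [r"\bChina\b", r"\bUnited States\b"],
    "Location": [r"\bAsia\b", r"\bEurope\b", r"\bYellow Sea\b"],
}

SAMPLE = (
    "Country: China\n"
    "Location: Eastern Asia, bordering the East China Sea, Korea Bay, "
    "Yellow Sea, and South China Sea, between North Korea and Vietnam\n"
    "Map references: Asia\n"
)


def classify(text, knowledge_base):
    """Map each concept with at least one match to its matched strings."""
    related = {}
    for concept, patterns in knowledge_base.items():
        hits = [m.group(0) for p in patterns for m in re.finditer(p, text)]
        if hits:
            related[concept] = hits
    return related


related = classify(SAMPLE, KNOWLEDGE_BASE)
```

On this sample, `\bChina\b` matches three times, because it also fires inside "East China Sea" and "South China Sea". Over-matching of this kind is one reason the user needs a way to correct the result.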
Raw completed ontology

Country [-> object];
Country [0:1] has CountryName [1:1];
Country [0:1] has Location1 [1:*];
...
Country [0:1] has Location8 [1:*];
Country [0:1] has Latitude [1:*];
Country [0:1] has Longitude [1:*];
Country [0:1] has Number1 [1:*];
Country [0:1] has Number2 [1:*];
Country [0:1] has Number3 [1:*];

-- ** Generalization/Specializations
Location1 : Location
...
Location8 : Location
Number1 : Number
Number2 : Number
Number3 : Number
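The raw completed ontology above gives each occurrence of a repeated concept its own numbered attribute and ties it back to the general concept. A minimal Python sketch of that naming scheme follows. `instance_names` and `declarations` are hypothetical helpers, and the sketch uses `[1:*]` for every attribute, whereas the slide gives CountryName the tighter constraint `[1:1]`.

```python
def instance_names(counts):
    """Number a concept's occurrences only when it occurs more than once."""
    names = {}
    for concept, n in counts.items():
        if n == 1:
            names[concept] = [concept]
        else:
            names[concept] = [f"{concept}{i}" for i in range(1, n + 1)]
    return names


def declarations(root, names):
    """Emit relationship lines, then generalization/specialization lines."""
    lines = [f"{root} [-> object];"]
    for instances in names.values():
        for inst in instances:
            # Assumption: a uniform [1:*] constraint on every attribute.
            lines.append(f"{root} [0:1] has {inst} [1:*];")
    for concept, instances in names.items():
        for inst in instances:
            if inst != concept:
                lines.append(f"{inst} : {concept}")
    return lines


COUNTS = {"CountryName": 1, "Location": 8, "Latitude": 1,
          "Longitude": 1, "Number": 3}
lines = declarations("Country", instance_names(COUNTS))
```

With the occurrence counts from the CIA Factbook example, this yields 14 attribute declarations and 11 specialization lines after the object declaration.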
User control interface

• Country: China {CountryName}
• Location: Eastern Asia {Location1}, bordering the East China Sea {Location2}, Korea Bay {Location3}, Yellow Sea {Location4}, and South China Sea {Location5}, between North Korea {Location6}, and Vietnam {Location7}
• Geographic coordinates: 35 00 N {Latitude}, 105 00 E {Longitude}
• Map references: Asia {Location8}
• Area:
  • total: 9,596,960 {Number1} sq km
  • land: 9,326,410 {Number2} sq km
  • water: 270,550 {Number3} sq km

• Country: China {CountryName}
• Location: Eastern Asia {Location1}, bordering the East China Sea {Location2}, Korea Bay {Location3}, Yellow Sea {Location4}, and South China Sea {Location5}, between North Korea {Location6}, and Vietnam {Location7}
• Geographic coordinates: 35 00 N {Latitude}, 105 00 E {Longitude}
• Map references: Asia {MapReference}
• Area:
  • total: 9,596,960 {TotalArea} sq km
  • land: 9,326,410 {LandArea} sq km
  • water: 270,550 {WaterArea} sq km

• Country: China {CountryName}
• Location: Eastern Asia, bordering the East China Sea, Korea Bay, Yellow Sea, and South China Sea, between North Korea, and Vietnam {Location}
• Geographic coordinates: 35 00 N {Latitude}, 105 00 E {Longitude}
• Map references: Asia {MapReference}
• Area:
  • total: 9,596,960 {TotalArea} sq km
  • land: 9,326,410 {LandArea} sq km
  • water: 270,550 {WaterArea} sq km

• Output to user
  • raw completed ontology
  • tagged training web pages
  • the query results
• User may
  • modify attribute name
  • combine attributes
  • delete useless attributes
  • change relationships
  • add new attributes, new relations, and constraints
  • …
• When satisfied, output the final ontology
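Three of the user operations listed above (modify an attribute name, combine attributes, delete an attribute) can be sketched over tagged character spans. In the Python sketch below, the `Span` class and the `span_of`, `rename`, `combine`, and `delete` helpers are all hypothetical. Combining is assumed to produce one span running from the first member's start to the last member's end, which mirrors how the whole Location phrase ends up under a single tag in the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Span:
    start: int   # offset of the first tagged character
    end: int     # offset one past the last tagged character
    tag: str     # attribute name, e.g. "Location1"


def span_of(text, phrase, tag, start=0):
    """Tag the first occurrence of phrase at or after start."""
    i = text.index(phrase, start)
    return Span(i, i + len(phrase), tag)


def rename(spans, old, new):
    return [replace(s, tag=new) if s.tag == old else s for s in spans]


def combine(spans, olds, new):
    members = [s for s in spans if s.tag in olds]
    if not members:
        return list(spans)
    rest = [s for s in spans if s.tag not in olds]
    merged = Span(min(s.start for s in members),
                  max(s.end for s in members), new)
    return sorted(rest + [merged], key=lambda s: s.start)


def delete(spans, tag):
    return [s for s in spans if s.tag != tag]


TEXT = ("Location: Eastern Asia, bordering the East China Sea, Korea Bay. "
        "Map references: Asia")
SPANS = [
    span_of(TEXT, "Eastern Asia", "Location1"),
    span_of(TEXT, "East China Sea", "Location2"),
    span_of(TEXT, "Korea Bay", "Location3"),
    span_of(TEXT, "Asia", "Location8", start=TEXT.index("Map")),
]
edited = combine(rename(SPANS, "Location8", "MapReference"),
                 {"Location1", "Location2", "Location3"}, "Location")
```

Working with offsets rather than copied strings keeps the intervening words ("bordering the") inside the combined attribute without retyping them.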
Problems
• Obtain the knowledge base
• Classify related concepts for the sample documents
• Refine
• Tag the document based on the raw completed ontology
• User interface design and control
• Strategy for updating the raw completed ontology based on user modifications
Contribution
• Exploit existing knowledge
• Semi-automatically generate an extraction ontology