Data Extraction and Integration from Imprecise Web Sources
Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti
Università degli Studi Roma Tre
(Creative Commons License, see last slide)
Data-intensive websites (figure labels: target, Website, Database, Template1, Template2, Template3)
Flint goal (figure labels: StockQuote with attributes Last, Min, Max, …, Volume, 52high, Open)
System architecture (figure labels: Flint; Web Search [WIDM08]; Data Extraction; Data Integration; The Web)
Novel contribution
Data Extraction
• Unsupervised
• Automatic
• Scalable
• No knowledge available
Data Integration
• Unsupervised
• Automatic
• Scalable
• Uncertain Data
• No labels available
• No corpus available
Related systems: WebTables [Vldb08], Cimple [Vldb07], MetaQuerier [Cidr05], PayGo [Cidr07], RoadRunner [Vldb01], ExAlg [Sigmod03], TurboWrapper [Vldb07]
Data Extraction
AAPL, GOOG, MSFT, INTC, …
128.09, 439.54, 34.89, 112.37, …
127.81, 439.25, 32.13, 111.01, …
132.43, 443.82, 33.67, 114.32, …
0.50%, -0.38%, 1.23%, 3.92%, -1.65%, …
Add AAPL to Your Portfolio, Add GOOG to Your Portfolio, Add MSFT to Your Portfolio, Add INTC to Your Portfolio, …
…
Data Extraction
HTML fragments taken from two pages belonging to the same website:
Rule                            Page 1       Page 2
/html/body/table/tr[1]/td[2]    1,132,228    1,735,857
/html/body/table/tr[2]/td[2]    $20.66       $414.58
/html/body/table/tr[3]/td[2]    $11.70       $247.30
/html/body/table/tr[4]/td[2]    $20.72       $414.06
/html/body/table/tr[5]/td[2]    $0.02        99,494,200
/html/body/table/tr[6]/td[2]    4,732,600    null
Extraction error! ? (tr[5]/td[2] returns $0.02 on one page and 99,494,200 on the other; tr[6]/td[2] returns null on the second page)
Data Integration (animated figure; only labels and values survive)
Attributes shown: (min) 4 25 10; (max) 10 33 16; (stock) AA GO MS; in later frames also (price) 6 26 12.
Matching thresholds: t=0.5 on every mapping in the first frames; the last frames show t=0.7 on two mappings and t=0.5 on the other two.
Match scores shown: 1.0 for the columns identical to (min), (max), and (stock); 0.6 for the (price) column, which one frame flags with "?".
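The slides above compare value columns from different sources against a matching threshold t. A minimal sketch of that idea follows, assuming the columns are already aligned on the same stocks. The `similarity` function and the `best_match` helper are hypothetical illustrations, not Flint's actual scoring, so the scores they produce differ from the ones shown on the slides.

```python
# Hypothetical sketch of threshold-based, instance-level attribute matching.
# The similarity function is an assumption; the slides do not define it.

def similarity(a, b):
    """Average per-value agreement in [0, 1] for two aligned columns."""
    if not a or len(a) != len(b):
        raise ValueError("columns must be non-empty and aligned")
    total = 0.0
    for x, y in zip(a, b):
        if x == y:
            total += 1.0
        elif isinstance(x, (int, float)) and isinstance(y, (int, float)):
            # Relative closeness for unequal numbers, clamped at 0.
            total += max(0.0, 1.0 - abs(x - y) / max(abs(x), abs(y)))
        # Unequal non-numeric values contribute 0.
    return total / len(a)


def best_match(column, mappings, threshold):
    """Name of the best-scoring mapping at or above threshold, else None."""
    name, score = max(
        ((n, similarity(column, values)) for n, values in mappings.items()),
        key=lambda pair: pair[1],
    )
    return name if score >= threshold else None


# Values taken from the slides.
MAPPINGS = {
    "min": [4, 25, 10],
    "max": [10, 33, 16],
    "stock": ["AA", "GO", "MS"],
}
PRICE = [6, 26, 12]

print(best_match([4, 25, 10], MAPPINGS, 0.5))  # identical column joins "min"
print(best_match(PRICE, MAPPINGS, 0.5))        # a low threshold accepts a near match
print(best_match(PRICE, MAPPINGS, 0.9))        # a stricter threshold rejects it
```

With this toy score, a low threshold lets the (price) column join an existing mapping, and a stricter one leaves it unmatched. That is the kind of effect a per-mapping threshold controls.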
Wrapper Refinement (figure: thresholds t=0.7, t=0.7, t=0.5, t=0.5; mappings (max) 10 33 16, (min) 4 25 10, (price) 6 26 12, (stock) AA GO MS; a source column (min/max) with values 10 null 10, flagged with "?"; scores shown: 0.0, 0.0, 0.3 (weak), 0.3 (weak))
Wrapper Refinement
//td[contains(text(),'Open')]/../td[2]
//td[contains(text(),'Open')]/../../tr[5]/td[1]
//td[contains(text(),'Open')]/../../tr[5]/td[2]
//td[contains(text(),'High')]/../td[2]
…
(figure labels: matching value; nearby template tokens)
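The rules above anchor on a template token such as 'Open' instead of a fixed row position. A small sketch of why that helps when a page omits a row is shown below. The two page fragments, the `positional` and `anchored` helpers, and the row labels are all made up for illustration. The sketch uses Python's standard `xml.etree.ElementTree`, which has no `contains()`, so the label test is written by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: PAGE_B lacks the "Change" row present in PAGE_A.
PAGE_A = ("<html><body><table>"
          "<tr><td>Open</td><td>$20.72</td></tr>"
          "<tr><td>Change</td><td>$0.02</td></tr>"
          "<tr><td>Volume</td><td>4,732,600</td></tr>"
          "</table></body></html>")
PAGE_B = ("<html><body><table>"
          "<tr><td>Open</td><td>$414.06</td></tr>"
          "<tr><td>Volume</td><td>99,494,200</td></tr>"
          "</table></body></html>")


def positional(root, row):
    """Equivalent of /html/body/table/tr[row]/td[2]; None if the row is absent."""
    cell = root.find(f"./body/table/tr[{row}]/td[2]")
    return cell.text if cell is not None else None


def anchored(root, label):
    """Rough equivalent of //td[contains(text(),label)]/../td[2]."""
    for tr in root.iter("tr"):
        cells = tr.findall("td")
        if len(cells) >= 2 and any(label in (c.text or "") for c in cells):
            return cells[1].text
    return None


a, b = ET.fromstring(PAGE_A), ET.fromstring(PAGE_B)

# The positional rule for "Volume" (row 3) breaks on the shorter page.
print(positional(a, 3), positional(b, 3))
# Row 2 silently mixes two attributes across the pages.
print(positional(a, 2), positional(b, 2))
# The label-anchored rule extracts the volume from both pages.
print(anchored(a, "Volume"), anchored(b, "Volume"))
```

The positional rule reproduces the kind of misalignment seen in the earlier Data Extraction slide, where a missing row shifts the values and the last rule returns null.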
Wrapper Refinement (figure: the refined rules //td[contains(text(),'Max')]/../td[2] and //td[contains(text(),'Min')]/../td[2] extract 10 33 16 (max) and 4 25 10 (min) from the source whose earlier column was 10 null 10 (min/max); both new columns score 1.0, and in the last frame they appear under the (max) and (min) mappings)
Experimental Results (100 websites for each domain)

Soccer domain (45,714 pages)
Attribute       |m|
Name            90
Birth Date      61
Height          54
Nationality     48
Club            43
Position        43
Weight          34
League          14

Videogame domain (49,262 pages)
Attribute       |m|
Title           86
Publisher       59
Developer       45
Genre           28
ESRB rating     40
Release Date    9
Platform        9
# Players       6

Finance domain (57,623 pages)
Attribute       |m|
Stock Symbol    84
Price Change    73
% Change        73
Volume          52
Day Low         43
Day High        41
Last Price      29
Open Price      24
Demo • Found Websites • Integrated Data
the end! http://flint.dia.uniroma3.it
License • This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.