Record Location and Reconfiguration in Unstructured Multiple-Record Web Documents
David W. Embley and Li Xu
Department of Computer Science, Brigham Young University
Overview
• Context: the Larger Project
• Record Location and Reconfiguration
• The Problem
• Record-Recognition Heuristic
• RLR Algorithm
• Experiments and Results
• Conclusions
Objective: Convert Unstructured Multiple-Record Web Documents into Structured DB Tables
[Figure: two target tables with column headers Car, Year, Make, Model, Price, PhoneNr, . . . and Car, Feature, . . .]
Framework
[Figure: framework diagram. Labels: Application Ontology, Web Page, Ontology Parser, Record Extractor, Constant/Keyword Matching Rules, Record-Level Objects, Relationships, and Constraints, Database Scheme, Constant/Keyword Recognizer, Unstructured Record Documents, Database-Instance Generator, Populated Database, Data-Record Table]

Application ontology (excerpt):
Car [-> object];
Car [0:0.98:1] has Year [1:*];
Car [0:0.93:1] has Make [1:*];
. . .
PhoneNr [1:*] is for Car [0:1.15:*];
Year matches [4] constant {extract “\d{2}”; context “\b’[4-9]\d\b”; . . .

Sample unstructured record:
`89 Subaru SW. Auto, AC, $1900 OBO. Call 835-8579

Data-record table for the sample record:
Descriptor  String    Start  End
Year        `89           1    3
Make        Subaru        5   10
Model       SW           12   13
Feature     Auto         16   19
Feature     AC           22   23
Price       $1900        26   30
PhoneNr     835-8579     42   49
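The data-record table on this slide can be reproduced with a small constant recognizer. The sketch below is only an illustration: the regular expressions in RULES and the function name data_record_table are assumptions made for this example, not the matching rules of the actual application ontology. Positions are 1-indexed and inclusive, which matches the numbers shown for the sample ad.

```python
import re

AD = "`89 Subaru SW. Auto, AC, $1900 OBO. Call 835-8579"

# Hypothetical matching rules, written only for this one sample ad.
RULES = [
    ("Year", r"[`'’]\d{2}\b"),
    ("Make", r"\b(?:Subaru|Chevy|Ford|Buick)\b"),
    ("Model", r"\bSW\b"),
    ("Feature", r"\b(?:Auto|AC)\b"),
    ("Price", r"\$\d+"),
    ("PhoneNr", r"\b\d{3}-\d{4}\b"),
]

def data_record_table(text, rules):
    """Return (descriptor, string, start, end) rows sorted by start position."""
    rows = []
    for descriptor, pattern in rules:
        for m in re.finditer(pattern, text):
            # m.start() is 0-indexed and m.end() is exclusive, so adding 1 to
            # the start and keeping the end gives 1-indexed inclusive positions.
            rows.append((descriptor, m.group(), m.start() + 1, m.end()))
    rows.sort(key=lambda row: row[2])
    return rows

for row in data_record_table(AD, RULES):
    print(*row)
```

A real recognizer would draw its patterns from the ontology's constant and keyword declarations rather than from a hand-written list.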
Record Location and Reconfiguration
[Figure: the framework diagram repeated, with the same labels as on the Framework slide]
Problems Encountered
• factored
• split
• joined
• interspersed
• off-page
Proposed Solution
• Maximize a Record-Recognition Measure
• Improvements:
  • Split joined records
  • Distribute factored values
  • Link off-page information
  • Join split records
  • Discard interspersed records
Record-Recognition Heuristic: Vector Space Modeling (VSM)
• VSM measures: cosine and vector length
• Vectors: DV (document vector) and OV (ontology vector)
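For reference, the two VSM measures named on this slide can be written out as follows. The cosine is the standard vector-space definition; the length ratio is the record-count estimator that the deck uses on the initial-records slide. The symbol \hat{n} is introduced here only as a label for that estimate.

```latex
c = \cos\theta = \frac{DV \cdot OV}{|DV|\,|OV|},
\qquad
\hat{n} = \frac{|DV|}{|OV|}
```

A cosine near 1 means the field counts in a chunk of text are distributed the way the ontology expects for car ads, while the length ratio indicates roughly how many records the chunk holds.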
RLR-A: Initial Records (OK)

1989 Subaru SW. Auto, AC, $1900 OBO. Call (336)835-8579.
‘53 Chevy Bel Aire. All original, looks like new. Serious inquiries only. $8500. Call (336)468-8924 after 4 pm.
‘85 Buick Park Avenue. $500. Head may be cracked. Will run. Body good condition. Call (336)526-2768.
‘95 Ford Thunderbird. Loaded, V-8, 45K, $6995. Call S&J Motors at (336)874-3403.
‘96 Mercury Tracer. 4 door, 5 speed, 34K, $4995. Call S&J Motors at (336)874-3403.
‘88 Firebird. V8, 5.0, fuel injected, T-tops, 109,000 miles, red, runs great. $1880. Call (336)526-1164 anytime.
‘95 Chevy Astro, V6, w/ac & fully equipped utility shelves. $9400. Call (336)526-2675 & leave message.

Field      OV     DV
Year       0.98    7
Make       0.93    6
Model      0.91    6
Mileage    0.45    3
Price      0.80    7
Feature    2.10    7
PhoneNr    1.15    7

Cosine: c = 0.95
Estimator for Number of Records: |DV| / |OV| = 5.4
Number of <hr> tags: 6
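The two measures on this slide can be recomputed from the OV and DV columns. This is a minimal sketch using the seven values as displayed; the function names are chosen here for illustration. With these inputs it gives a cosine of about 0.94 and a length ratio of about 5.5, close to the slide's 0.95 and 5.4; the small gap may come from rounding in the displayed values.

```python
import math

FIELDS = ["Year", "Make", "Model", "Mileage", "Price", "Feature", "PhoneNr"]
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]  # ontology vector from the slide
DV = [7, 6, 6, 3, 7, 7, 7]                       # document vector from the slide

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def cosine(dv, ov):
    """Cosine of the angle between the document and ontology vectors."""
    dot = sum(d * o for d, o in zip(dv, ov))
    return dot / (norm(dv) * norm(ov))

def estimated_record_count(dv, ov):
    """Length ratio |DV| / |OV|, used as an estimate of the number of records."""
    return norm(dv) / norm(ov)

print(f"c = {cosine(DV, OV):.2f}")
print(f"|DV| / |OV| = {estimated_record_count(DV, OV):.2f}")
```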
RLR-A: Initial Records (not OK)
• separated by lines
• separated by white space
RLR-A: Join Split Records

CHEV Beretta, 2 door PW. PL, tilt, Sun roof, new engine W/only 13,000 miles runs GREAT! $12,290

Cosine values shown on the slide: c = 0.80, c = 0.85, c = 0.29
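One way to read this slide is that two adjacent fragments are joined when the combined text matches the ontology better than either fragment alone. The sketch below illustrates that idea only: the greedy merge rule and the fragment vectors are assumptions, not the RLR-A procedure, which the slides do not spell out. The OV values are the ones from the initial-records slide.

```python
import math

# Field order: Year, Make, Model, Mileage, Price, Feature, PhoneNr
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]

def cosine(dv, ov):
    """Cosine between a document vector and the ontology vector."""
    dot = sum(d * o for d, o in zip(dv, ov))
    length = math.sqrt(sum(d * d for d in dv)) * math.sqrt(sum(o * o for o in ov))
    return dot / length if length else 0.0

def join_split(chunks, ov):
    """Greedy left-to-right pass (hypothetical rule): merge a chunk into the
    previous one when the merged vector scores higher than both parts."""
    result = []
    for dv in chunks:
        if result:
            merged = [a + b for a, b in zip(result[-1], dv)]
            if cosine(merged, ov) > max(cosine(result[-1], ov), cosine(dv, ov)):
                result[-1] = merged
                continue
        result.append(list(dv))
    return result

# Hypothetical field counts for the two halves of the split Beretta ad.
first_half = [0, 1, 1, 0, 0, 4, 0]   # make, model, four features
second_half = [0, 0, 0, 1, 1, 1, 0]  # mileage, price, one feature

print(join_split([first_half, second_half], OV))
```

With these made-up counts the halves score roughly 0.79 and 0.64 on their own and about 0.83 once merged, so the sketch joins them.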
RLR-A: Link Off-Page Information

c = 0.38
JERRY SEINER MIDVALE, 566-3800 (see list)

Linked list:
BUICK Regal Sedan, highest rating! GM guy price! $13,999.
CHEV Lumina, 6 pass., super deal! $11,977.
CHEV Beretta, 2 door PW. PL, tilt, Sun roof, new engine W/only 13,000 miles runs GREAT! $12,290
MASDA 626 LX, auto, leather, sunrf. Must see. $15,000
c = 0.90
RLR-A: Split Joined Records

Joined block (c = 0.56, |DV| / |OV| = 1.74):
DODGE Neon $6,995
DODGE Stratus $10,295
DODGE Avenger $9,295
KenGarff.com 526-1700

Measures shown for the pieces after splitting:
• c = 0.40, |DV| / |OV| = 0.47
• c = 0.40, |DV| / |OV| = 0.47
• c = 0.54, |DV| / |OV| = 0.57
RLR-A: Distribute Factored Values

Before distribution:
DODGE Neon $6,995
DODGE Stratus $10,295
DODGE Avenger $9,295
KenGarff.com 526-1700
Measures shown: c = 0.40, |DV| / |OV| = 0.47 (twice); c = 0.54, |DV| / |OV| = 0.57

After distribution:
DODGE Neon $6,995 526-1700
DODGE Stratus $10,295 526-1700
DODGE Avenger $9,295 526-1700
Measures shown: c = 0.54, |DV| / |OV| = 0.57 (three times)
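The distribution step on this slide amounts to copying a value that was factored out of a group of ads back into each ad. A minimal sketch, assuming the factored value has already been identified; the function name distribute_factored is hypothetical, and deciding which value is factored is the harder part that this sketch leaves out.

```python
def distribute_factored(records, factored_value):
    """Append factored_value to every record that does not already contain it."""
    return [
        rec if factored_value in rec else f"{rec} {factored_value}"
        for rec in records
    ]

ads = ["DODGE Neon $6,995", "DODGE Stratus $10,295", "DODGE Avenger $9,295"]
for ad in distribute_factored(ads, "526-1700"):
    print(ad)
```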
RLR-A: Discard Interspersed Records

c = 0.94 > threshold:
1998 CHEV Cavalier, 4 door, auto., air, cass., very clean, $9,995, 571-7214, DL1497.

c = 0.32 < threshold:
1998 Next to new – must sell washer and dryer, Whirlpool, electric, $750
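The discard step can be sketched as a filter on each record's cosine score. The scores below are the two values from the slide; the threshold of 0.5 is an assumption chosen only so the example runs, since the slide does not state the value used.

```python
def discard_interspersed(scored_records, threshold):
    """Keep only the records whose cosine score exceeds the threshold."""
    return [text for text, c in scored_records if c > threshold]

scored = [
    ("1998 CHEV Cavalier, 4 door, auto., air, cass., very clean, $9,995", 0.94),
    ("Next to new, must sell washer and dryer, Whirlpool, electric, $750", 0.32),
]

HYPOTHETICAL_THRESHOLD = 0.5  # assumed value, not taken from the slides
print(discard_interspersed(scored, HYPOTHETICAL_THRESHOLD))
```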
Test Set Characteristics
• 30 pre-selected documents
• Characteristics
  • 8 contained only regular car ads.
  • 13 contained inside-boundary joined car ads, all with inside-boundary factored values.
  • 1 contained outside-boundary factored values.
  • 13 contained interspersed non-car-ads.
  • None contained off-page or split ads.
Results
• Correctly reconfigured 91% (of 304)
  • 36 false drops
    • 11 car ads improperly discarded (value-recognition problem)
    • 25 car ads improperly reconfigured
      • 20 ads with identical phone numbers on every 5th
      • 5 inside-boundary ads not split (missing years & makes; not all models recognized)
• Correctly discarded 94% (of 47)
  • 3 false positives
    • all snowmobile ads
• Correctly produced 97% (of 1,077)
Conclusions
• Record location and reconfiguration is possible.
• Key idea: maximize VSM measure
  • Establish expectations.
  • Reconfigure recognized patterns with respect to expectations.

www.deg.byu.edu