
Record Location and Reconfiguration in Unstructured Multiple-Record Web Documents

Record Location and Reconfiguration in Unstructured Multiple-Record Web Documents. David W. Embley and Li Xu Department of Computer Science Brigham Young University. Overview. Context: the Larger Project Record Location and Reconfiguration The Problem Record-Recognition Heuristic


Presentation Transcript


  1. Record Location and Reconfiguration in Unstructured Multiple-Record Web Documents. David W. Embley and Li Xu, Department of Computer Science, Brigham Young University

  2. Overview • Context: the Larger Project • Record Location and Reconfiguration • The Problem • Record-Recognition Heuristic • RLR Algorithm • Experiments and Results • Conclusions

  3. Objective: Convert Unstructured Multiple-Record Web Documents into Structured DB Tables. Target tables shown on the slide: one with columns Car, Year, Make, Model, Price, PhoneNr (. . .) and one with columns Car, Feature (. . .)

  4. Framework
     Application ontology (excerpt): Car [-> object]; Car [0:0.98:1] has Year [1:*]; Car [0:0.93:1] has Make [1:*]; . . . PhoneNr [1:*] is for Car [0:1.15:*]; Year matches [4] constant { extract "\d{2}"; context "\b'[4-9]\d\b"; . . .
     Unstructured record: '89 Subaru SW. Auto, AC, $1900 OBO. Call 835-8579
     Data-record table (descriptor, string, start position, end position):
       Year      '89        1   3
       Make      Subaru     5  10
       Model     SW        12  13
       Feature   Auto      16  19
       Feature   AC        22  23
       Price     $1900     26  30
       PhoneNr   835-8579  42  49
     Diagram components: Application Ontology; Web Page; Ontology Parser; Record Extractor; Constant/Keyword Matching Rules; Record-Level Objects, Relationships, and Constraints; Database Scheme; Constant/Keyword Recognizer; Unstructured Record Documents; Database-Instance Generator; Populated Database; Data-Record Table
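The start and end columns of the data-record table are 1-based, inclusive character positions in the record. The following is a minimal sketch of how such positions can be computed, not the authors' recognizer: the function name `locate` and the left-to-right search are assumptions made for illustration, and the strings to find are supplied by hand rather than matched by ontology rules.

```python
def locate(record, items):
    """Return (descriptor, string, start, end) rows with 1-based inclusive positions.

    Hypothetical helper: the real system finds the strings with constant/keyword
    matching rules; here they are given explicitly.
    """
    rows = []
    cursor = 0  # search left to right so later items are found after earlier ones
    for descriptor, text in items:
        idx = record.find(text, cursor)
        if idx < 0:
            raise ValueError(f"{text!r} not found after position {cursor}")
        rows.append((descriptor, text, idx + 1, idx + len(text)))
        cursor = idx + len(text)
    return rows


RECORD = "'89 Subaru SW. Auto, AC, $1900 OBO. Call 835-8579"
ITEMS = [
    ("Year", "'89"),
    ("Make", "Subaru"),
    ("Model", "SW"),
    ("Feature", "Auto"),
    ("Feature", "AC"),
    ("Price", "$1900"),
    ("PhoneNr", "835-8579"),
]
TABLE = locate(RECORD, ITEMS)

for row in TABLE:
    print(*row)
```

Run on the sample ad, the positions agree with the table on the slide.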

  5. Record Location and Reconfiguration (the framework diagram from slide 4 is shown again)

  6. Problems Encountered (figure labels): factored, split, joined, interspersed, off-page

  7. Proposed Solution • Maximize a Record-Recognition Measure • Improvements: • Split joined records • Distribute factored values • Link off-page information • Join split records • Discard interspersed records

  8. Record-Recognition Heuristic: Vector Space Modeling (VSM). VSM measures: cosine and vector length, computed over a document vector (DV) and an ontology vector (OV)
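Written out, the two measures take the standard vector-space form below. This is the textbook cosine definition, stated under the assumption that DV and OV each have one component per object set i; the ratio of vector lengths is the record-count estimator that appears on the following slide.

```latex
\[
c \;=\; \cos(DV, OV)
  \;=\; \frac{DV \cdot OV}{\lvert DV \rvert \,\lvert OV \rvert}
  \;=\; \frac{\sum_i DV_i \, OV_i}{\sqrt{\sum_i DV_i^{2}} \; \sqrt{\sum_i OV_i^{2}}},
\qquad
\text{estimated number of records} \;=\; \frac{\lvert DV \rvert}{\lvert OV \rvert}
\]
```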

  9. RLR-A: Initial Records (OK)
     Sample records:
     • 1989 Subaru SW. Auto, AC, $1900 OBO. Call (336)835-8579.
     • '53 Chevy Bel Aire. All original, looks like new. Serious inquiries only. $8500. Call (336)468-8924 after 4 pm.
     • '85 Buick Park Avenue. $500. Head may be cracked. Will run. Body good condition. Call (336)526-2768.
     • '95 Ford Thunderbird. Loaded, V-8, 45K, $6995. Call S&J Motors at (336)874-3403.
     • '96 Mercury Tracer. 4 door, 5 speed, 34K, $4995. Call S&J Motors at (336)874-3403.
     • '88 Firebird. V8, 5.0, fuel injected, T-tops, 109,000 miles, red, runs great. $1880. Call (336)526-1164 anytime.
     • '95 Chevy Astro, V6, w/ac & fully equipped utility shelves. $9400. Call (336)526-2675 & leave message.
     Vectors:
       Object set   OV     DV
       Year         0.98   7
       Make         0.93   6
       Model        0.91   6
       Mileage      0.45   3
       Price        0.80   7
       Feature      2.10   7
       PhoneNr      1.15   7
     Cosine: c = 0.95
     Estimator for number of records: |DV| / |OV| = 5.4
     Number of <hr> tags: 6
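The cosine and the record-count estimator can be recomputed from the OV and DV columns above. The sketch below uses the standard cosine formula; the function names are illustrative and nothing here is taken from the authors' code.

```python
import math

OBJECT_SETS = ["Year", "Make", "Model", "Mileage", "Price", "Feature", "PhoneNr"]
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]  # expected values per record (slide)
DV = [7, 6, 6, 3, 7, 7, 7]                       # counts found in the document (slide)


def length(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    """Cosine of the angle between two vectors of equal dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (length(a) * length(b))


c = cosine(DV, OV)
estimate = length(DV) / length(OV)
print(f"cosine = {c:.2f}, estimated records = {estimate:.1f}")
```

With the values as transcribed, this gives a cosine of about 0.94 and an estimate of about 5.5, close to but not identical with the 0.95 and 5.4 printed on the slide; the gap may come from rounding or from counts that did not survive in the transcript.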

  10. RLR-A: Initial Records (not OK). Examples shown: records separated by lines; records separated by white space

  11. RLR-A: Join Split Records. Example ad: CHEV Beretta, 2 door PW. PL, tilt, Sun roof, new engine W/only 13,000 miles runs GREAT! $12,290. Cosine values shown on the slide: c = 0.80, c = 0.85, c = 0.29

  12. RLR-A: Link Off-Page Information. On-page entry (shown with c = 0.38): JERRY SEINER MIDVALE, 566-3800 (see list). Linked list (shown with c = 0.90): BUICK Regal Sedan, highest rating! GM guy price! $13,999. CHEV Lumina, 6 pass., super deal! $11,977. CHEV Beretta, 2 door PW. PL, tilt, Sun roof, new engine W/only 13,000 miles runs GREAT! $12,290. MASDA 626 LX, auto, leather, sunrf. Must see. $15,000

  13. RLR-A: Split Joined Records. Example block: DODGE Neon $6,995 DODGE Stratus $10,295 DODGE Avenger $9,295 KenGarff.com 526-1700. Measures shown on the slide: c = 0.56, |DV| / |OV| = 1.74; c = 0.40, |DV| / |OV| = 0.47 (twice); c = 0.54, |DV| / |OV| = 0.57

  14. RLR-A: Distribute Factored Values. Before: DODGE Neon $6,995 DODGE Stratus $10,295 DODGE Avenger $9,295 KenGarff.com 526-1700. After: DODGE Neon $6,995 526-1700; DODGE Stratus $10,295 526-1700; DODGE Avenger $9,295 526-1700. Measures shown on the slide: c = 0.40, |DV| / |OV| = 0.47; c = 0.54, |DV| / |OV| = 0.57
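The split and distribute steps on the Dodge example can be illustrated with a short sketch. It is a simplification under loud assumptions, not the RLR algorithm itself: boundaries are placed at a hand-supplied make keyword, the phone and price patterns are hypothetical regular expressions, and each record is cut off after its price. The actual method chooses reconfigurations by maximizing the VSM measure.

```python
import re

PHONE = re.compile(r"\b\d{3}-\d{4}\b")  # hypothetical 7-digit phone pattern
PRICE = re.compile(r"\$\d[\d,]*")       # hypothetical price pattern


def split_joined(block, makes):
    """Split a block at each make keyword (assumed boundary rule).

    Text before the first keyword, if any, is dropped in this sketch.
    """
    pattern = r"\b(?:%s)\b" % "|".join(re.escape(m) for m in makes)
    starts = [m.start() for m in re.finditer(pattern, block)]
    if not starts:
        return [block.strip()]
    bounds = starts + [len(block)]
    return [block[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def distribute_factored(pieces):
    """Copy a phone number that occurs once in the block to every piece."""
    found = {m.group() for p in pieces for m in PHONE.finditer(p)}
    if len(found) != 1:
        return pieces  # nothing unambiguous to distribute
    phone = found.pop()
    result = []
    for piece in pieces:
        price = PRICE.search(piece)
        core = piece[:price.end()] if price else piece  # drop trailing factored text
        result.append(core if PHONE.search(core) else f"{core} {phone}")
    return result


BLOCK = ("DODGE Neon $6,995 DODGE Stratus $10,295 "
         "DODGE Avenger $9,295 KenGarff.com 526-1700")
PIECES = split_joined(BLOCK, ["DODGE"])
RESULT = distribute_factored(PIECES)
for record in RESULT:
    print(record)
```

On this block the sketch yields the three records shown on the slide, each carrying the factored phone number.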

  15. RLR-A: Discard Interspersed Records. Kept (c = 0.94 > threshold): 1998 CHEV Cavalier, 4 door, auto., air, cass., very clean, $9,995, 571-7214, DL1497. Discarded (c = 0.32 < threshold): 1998 Next to new – must sell washer and dryer, Whirlpool, electric, $750
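The discard step amounts to comparing each record's cosine with a threshold. The slide does not state the threshold, so the value 0.5 below is purely a placeholder, and the cosine values are copied from the slide rather than computed.

```python
THRESHOLD = 0.5  # hypothetical; the slide only says "threshold"

# (record text, cosine from the slide)
SCORED = [
    ("1998 CHEV Cavalier, 4 door, auto., air, cass., very clean, "
     "$9,995, 571-7214, DL1497.", 0.94),
    ("Next to new, must sell washer and dryer, Whirlpool, electric, $750", 0.32),
]


def partition(scored, threshold):
    """Separate records whose cosine exceeds the threshold from the rest."""
    kept = [text for text, c in scored if c > threshold]
    discarded = [text for text, c in scored if c <= threshold]
    return kept, discarded


KEPT, DISCARDED = partition(SCORED, THRESHOLD)
print("kept:", KEPT)
print("discarded:", DISCARDED)
```

Any threshold between 0.32 and 0.94 separates these two records; the results slide suggests that in practice some car ads fall below the threshold and some non-car ads rise above it.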

  16. Test Set Characteristics • 30 pre-selected documents • Characteristics • 8 contained only regular car ads. • 13 contained inside-boundary joined car ads all with inside-boundary factored values. • 1 contained outside-boundary factored values. • 13 contained interspersed non-car-ads. • None contained off-page or split ads.

  17. Results • Correctly reconfigured 91% (of 304) • 36 false drops: 11 car ads improperly discarded (value-recognition problem) and 25 car ads improperly reconfigured (20 ads with identical phone numbers on every 5th; 5 inside-boundary ads not split: missing years & makes, not all models recognized) • Correctly discarded 94% (of 47): 3 false positives, all snowmobile ads • Correctly produced 97% (of 1,077)

  18. Conclusions • Record location and reconfiguration is possible. • Key idea: maximize VSM measure • Establish expectations. • Reconfigure recognized patterns with respect to expectations. www.deg.byu.edu
