Information Extraction on Real Estate Rental Classifieds • Eddy Hartanto, Ryohei Takahashi
Overview • We want to extract 10 fields: • Security deposit • Square footage • Number of bathrooms • Contact person’s name • Contact phone number • Nearby landmarks • Cost of parking • Date available • Building style / architecture • Number of units in building • These fields can’t easily be served by keyword search
Approach • Hand-labeled test set as the basis for computing precision and recall • Pattern-matching approach with Rapier • Statistical approach using HMMs with different structures
Hidden Markov Models • We consider three different HMM structures • We train one HMM per field • Words in postings are the output symbols of the HMM • In the structure diagrams, hexagons represent target states, which emit the words relevant to that field
Training Data • We use a randomly selected set of 110 postings as the training data • We manually label which words in each posting are relevant to each of the 10 fields
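As an illustration of what this word-level labeling might look like for one field, here is a minimal sketch; the posting text, the span indices, and the field name are hypothetical, not the project's actual annotation format.

```python
# Hypothetical hand-labeled posting for one field ("security deposit").
# Each posting is a list of words; the labels mark which words are
# emitted by the target states of that field's HMM.

posting = "Spacious 1BR apartment , security deposit $ 1500 , call Joe".split()

# Indices of the words that form the target span -- illustrative only.
target_indices = {6, 7}   # "$ 1500"

# Per-word binary labels: True if the word belongs to the target span.
labels = [i in target_indices for i in range(len(posting))]

for word, is_target in zip(posting, labels):
    print(f"{word:12s} {'TARGET' if is_target else '-'}")
```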
HMM Structure #1 • A single prefix state and single suffix state • Prefixes and suffixes can be of arbitrary length
HMM Structure #2 • Varying numbers of prefix, suffix, and target states
HMM Structure #3 • Varying numbers of prefix, suffix, and target states • Prefixes and suffixes are fixed in length
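To make the structure descriptions concrete, the sketch below lays out states as background → prefix chain → target block → suffix chain and builds an initial transition matrix in the spirit of structure #3 (fixed-length prefixes and suffixes, a varying number of target states). The state layout and the transition probabilities are our assumptions for illustration, not the exact structures used in the project.

```python
import numpy as np

def build_structure3(n_prefix, n_target, n_suffix):
    """Sketch of HMM structure #3: one background state, a fixed-length
    chain of prefix states, a block of target states, and a fixed-length
    chain of suffix states leading back to the background state."""
    n_states = 1 + n_prefix + n_target + n_suffix   # state 0 = background
    A = np.zeros((n_states, n_states))

    bg = 0
    prefix = list(range(1, 1 + n_prefix))
    target = list(range(1 + n_prefix, 1 + n_prefix + n_target))
    suffix = list(range(1 + n_prefix + n_target, n_states))

    # The background state loops on itself or enters the first prefix state.
    A[bg, bg] = 0.9
    A[bg, prefix[0]] = 0.1

    # Prefix states form a fixed-length chain ending at the target block.
    for i, s in enumerate(prefix):
        nxt = prefix[i + 1] if i + 1 < n_prefix else target[0]
        A[s, nxt] = 1.0

    # Target states self-loop, advance to the next target state,
    # or exit into the first suffix state.
    for i, s in enumerate(target):
        A[s, s] = 0.5
        if i + 1 < n_target:
            A[s, target[i + 1]] = 0.3
            A[s, suffix[0]] = 0.2
        else:
            A[s, suffix[0]] = 0.5

    # Suffix states form a fixed-length chain returning to the background.
    for i, s in enumerate(suffix):
        nxt = suffix[i + 1] if i + 1 < n_suffix else bg
        A[s, nxt] = 1.0

    return A

A = build_structure3(n_prefix=2, n_target=3, n_suffix=2)
print(A.shape)   # (8, 8) transition matrix for this configuration
```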
Cross-Validation • We use cross-validation to find the optimal number of prefix, suffix, and target states
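A sketch of how that search over state counts could be organized with k-fold cross-validation follows; the fold count, the grid of state counts, and the helpers `train_hmm` and `evaluate_f1` are hypothetical stand-ins for the actual training and scoring code.

```python
from itertools import product

def cross_validate(postings, labels, k=5):
    """Pick the (prefix, target, suffix) state counts that maximize the
    average F-measure across k folds.  train_hmm and evaluate_f1 are
    hypothetical helpers standing in for the real training/scoring code."""
    grid = product(range(1, 4), range(1, 4), range(1, 4))
    folds = [set(range(i, len(postings), k)) for i in range(k)]
    best_score, best_config = -1.0, None

    for n_prefix, n_target, n_suffix in grid:
        scores = []
        for held_out in folds:
            train_idx = [i for i in range(len(postings)) if i not in held_out]
            model = train_hmm([postings[i] for i in train_idx],
                              [labels[i] for i in train_idx],
                              n_prefix, n_target, n_suffix)
            scores.append(evaluate_f1(model,
                                      [postings[i] for i in held_out],
                                      [labels[i] for i in held_out]))
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_score, best_config = avg, (n_prefix, n_target, n_suffix)
    return best_config, best_score
```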
Preventing Underflow • Postings are hundreds of words long • Forward and backward probabilities become incredibly small => underflow • To avoid underflow, we normalize the forward probabilities at each step: α̂_t(i) = α_t(i) / Σ_j α_t(j) instead of using the raw α_t(i)
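A minimal NumPy sketch of the scaled forward pass this normalization describes, assuming a transition matrix `A`, an emission matrix `B`, an initial distribution `pi`, and an observation sequence of symbol indices; these names and shapes are our assumptions.

```python
import numpy as np

def scaled_forward(pi, A, B, obs):
    """Forward pass with per-step normalization to avoid underflow.
    pi : (S,) initial state distribution
    A  : (S, S) transition probabilities, A[i, j] = P(j | i)
    B  : (S, V) emission probabilities, B[i, o] = P(o | i)
    obs: sequence of observation symbol indices
    Returns the scaled alphas and the log-likelihood of the sequence."""
    S, T = len(pi), len(obs)
    alpha = np.zeros((T, S))

    # Initialization, then normalize so the alphas sum to 1 at each step.
    alpha[0] = pi * B[:, obs[0]]
    norm = alpha[0].sum()
    alpha[0] /= norm
    log_likelihood = np.log(norm)

    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        norm = alpha[t].sum()
        alpha[t] /= norm
        # Accumulating the log of the normalizers recovers log P(obs).
        log_likelihood += np.log(norm)

    return alpha, log_likelihood
```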
Smoothing • We perform add-one smoothing for the emission probabilities: P(o | s) = (count(s, o) + 1) / (count(s) + |V|), where |V| is the vocabulary size
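A small sketch of add-one (Laplace) smoothing applied to emission counts; the count-matrix layout is an assumption.

```python
import numpy as np

def smooth_emissions(counts):
    """Add-one smoothing of emission probabilities.
    counts : (S, V) matrix of how often each state emitted each symbol.
    Returns an (S, V) matrix of smoothed probabilities P(symbol | state)."""
    smoothed = counts + 1.0                              # add one to every count
    return smoothed / smoothed.sum(axis=1, keepdims=True)  # rows sum to 1

# Example: a state that never emitted symbol 2 still gets nonzero probability.
counts = np.array([[3.0, 1.0, 0.0]])
print(smooth_emissions(counts))   # [[0.571..., 0.285..., 0.142...]]
```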
Rapier • Rapier automatically learns rules to extract fields from training examples • We use the same 110 training postings as for the HMMs
Data Preparation • Sentence Splitter (Cognitive Computation Group at UIUC, http://l2r.cs.uiuc.edu/~cogcomp/tools.php): puts one sentence on each line • Stanford Tagger (Stanford NLP Group, http://nlp.stanford.edu/software/tagger.shtml): tags each word with its part of speech • We then manually create a template file for each posting, with the information for the 10 fields filled in
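For illustration only, here is a sketch of an equivalent preprocessing step using NLTK's sentence tokenizer and POS tagger as stand-ins for the UIUC Sentence Splitter and the Stanford Tagger; it is not the tool chain the project actually used.

```python
import nltk

# One-time downloads for the tokenizer and tagger models.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

def preprocess(posting_text):
    """Split a posting into sentences (one per line) and tag each word
    with its part of speech, mirroring the splitter + tagger pipeline
    described above."""
    lines = []
    for sentence in nltk.sent_tokenize(posting_text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        lines.append(" ".join(f"{word}/{tag}" for word, tag in tagged))
    return "\n".join(lines)

print(preprocess("Sunny 2BR near downtown. Call Joe at 555-1234."))
```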
Test Data • We use a randomly selected set of 100 postings as the test data • We manually label these 100 postings with the 10 fields
Rapier Results • We use Rapier’s “test2” program to evaluate performance on the labeled postings • Training Set • Precision: 0.990099 • Recall: 0.408998 • F-measure: 0.578871 • Test Set • Precision: 0.747126 • Recall: 0.151869 • F-measure: 0.252427
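As a quick check on the numbers above, the F-measure is the harmonic mean of precision and recall; the short sketch below reproduces the reported values.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.990099, 0.408998))   # ~0.578871 (training set)
print(f_measure(0.747126, 0.151869))   # ~0.252427 (test set)
```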
Insights • Relatively good performance with Rapier • Weaker performance with the HMMs, largely due to the lack of training data (only 0.67%, or 100 postings, randomly sampled from 15,000, while the test data is 10%, or 1,500 postings sampled from the same 15,000) • Automatic spelling correction remains limited, even when augmented with California town, city, and county names and common first names • A richer ontology would help, as WordNet is somewhat limited for recognizing entities such as SJSU, Albertson, and street names