290 likes | 429 Views
MINING REAL ESTATE LISTINGS USING ORACLE DATA WAREHOUSING AND PREDICTIVE REGRESSION METHOD. By Wuri Wedyawati. Agenda. Introduction Data Warehousing MasterDW Data Warehouse Predictive Regression Real Estate Price Prediction Conclusion. Introduction.
E N D
MINING REAL ESTATE LISTINGS USING ORACLE DATA WAREHOUSING AND PREDICTIVE REGRESSION METHOD By Wuri Wedyawati
Agenda • Introduction • Data Warehousing • MasterDW Data Warehouse • Predictive Regression • Real Estate Price Prediction • Conclusion
Introduction • The object of this project is to develop a knowledge discovery system for prospective real estate sellers and buyers to determine their properties price, based on available sold listings in their areas. • The prediction of continuous values, such as properties selling price, is modeled by a statistical technique called predictive regression. • The suitable data warehouse design for this project is a star schema with one large fact table surrounded by many dimension tables. • This project uses Visual Basic .NET to create user interface. The communication between Oracle and .NET framework is established by adding Oracle Provider for OLE DB (OraOLEDB) component as reference in the project.
Data Warehouse • Data warehouse is a collection of data gathered and organized so that it can easily by analyzed, extracted, synthesized, and otherwise be used for the purposes of further understanding the data. • Building a data warehouse is the first step to do as the preparation for data mining.
Data Warehouse (cont.) • Several processes in building a data warehouse: 1. Extraction 2. Transformation and cleansing 3. Modeling 4. Transport
1. Extraction • Document the sources of data • Identify the databases and files containing the data of interest • Analyze and document the business meaning of the data, data relationships and business rules • Determine how much data needs to be extracted • Extract all of subset of the data in the source system • Use unload utility • Use data manipulation language statement • Extract the changes made to the source data • Use a recovery log • Use a database trigger
2. Transformation and Cleansing • Check the integrity of the source data to verify that it conforms to the business rules and relationships identified in extraction process. • Check the accuracy of the source data. • Identify the tasks and products required for data cleansing. • Document the rules and identify the products required for transforming and integrating the data into the format required by the target system.
3. Modeling • Fact is a numeric (or other type of) data element by which an organization measures aspects of its business. Examples: dollar amounts (budget, expenditure, encumbrance, revenue, value of order), counts (headcount, credit hours, number of items) • Dimension is a set of attributes, usually hierarchical, that is used to describe an organization’s business by constraining and grouping facts. Examples: time, students, faculty, organization, funds, product.
4. Transport • Identify the products and techniques to be used for loading the data into the target system • SQL*Loader utility (for flat file data) • Transportable tablespaces (for Oracle database) • Evaluate the need for data compression and encryption if captured or transformed data is to be transported across a network
RESI.TXT (Data Source) Transformation and Cleansing RESSOLDLOG.TXT (Log File) Update RES.TXT Transformation and Cleansing 2 OFCSRC.TXT AGTSRC.TXT Duplicate Detection OFFICE.TXT AGENT.TXT RESIDENTIAL.TXT AREA.TXT Load Load Load OFFICES TABLE AGENTS TABLE RESIDENTIAL TABLE AREA TABLE MasterDW Data Warehouse
MasterDW Extraction • The operational data source is extracted from Sacramento, El Dorado, Placer, and Yolo Counties Multiple Listings Services (MLS) database. • The data source is from the residential MLS database backup that is scheduled in January 9, 2004. It captures all the residential data in the source system since January 1, 1998 until January 9, 2004. • The source data is in the “|” delimited flat file and contains of 191 fields and 295787 rows (“RESI.TXT”).
MasterDW Transformation and Cleansing There are four steps of transformation and cleansing in this project: • Transformation and cleansing 1 • Update process for the result of transformation and cleansing 1 • Transformation and cleansing 2 • Duplication detection for office and agent records
1. Transformation and Cleansing 1 • Listing Price Check If intLP <= 0 Or intLP > 99999999 Then LPCheck = strMLSNo & " : INVALID LP = " & Str(intLP) ElseIf intLP < 10000 Or intLP > 50000000 Then LPCheck = strMLSNo & " : LP EXCEEDS LIMIT = " & Str(intLP) End If • Square Footage Check If intSQFT = 0 Then SQFTCheck = strMLSNo & " : SQFT IS NULL = " & Str(intSQFT) ElseIf intSQFT > 10000 And intLP < 1000000 Then SQFTCheck = strMLSNo & " : SQFT EXCEEDS LIMIT = " & Str(intSQFT) End If
1. Transformation and Cleansing 1 (cont.) • Listing Date Check If strLD = "0000-00-00" Or Len(strLD) < 8 Then LDCheck = strMLSNo & " : INVALID LD = " & strLD ElseIf DateValue(strLD) < "1900-01-01" Then LDCheck = strMLSNo & " : LD EXCEEDS LIMIT = " & strLD End If • Number of Full Bathroom and Half Bathroom Check If intFull <= 0 Then BathCheck = strMLSNo & " : NO FULL BATHROOM = " & Str(intFull) & " AND " & Str(intHalf) End If
1. Transformation and Cleansing 1 (cont.) • Number of Bedroom Check If intBed <= 0 Then BedCheck = strMLSNo & " : NO BEDROOM = " & Str(intBed) End If • Year Built Check If Len(strYearBlt) = 0 Then YearBltCheck = strMLSNo & " : NO YEAR BUILT = " & strYearBlt ElseIf Val(strYearBlt) <= 1900 Then YearBltCheck = strMLSNo & " : INVALID YEAR BUILT = " & strYearBlt End If
1. Transformation and Cleansing 1 (cont.) • Pending Date Check If strSD = "0000-00-00" Or Len(strSD) < 8 Then SDCheck = strMLSNo & " : INVALID SD = " & strSD ElseIf Len(LDCheck(strMLSNo, strLD)) = 0 And DateValue(strSD) < DateValue(strLD) Then SDCheck = strMLSNo & " : SD / LD = " & strSD & " / " & strLD ElseIf Len(PDCheck(strMLSNo, strPD, strLD)) = 0 And DateValue(strSD) < DateValue(strPD) Then SDCheck = strMLSNo & " : SD / PD = " & strSD & " / " & strPD ElseIf DateValue(strSD) < "1990-01-01" Then SDCheck = strMLSNo & " : SD EXCEEDS LIMIT = " & strSD End If • Days on Market Check If intSP <= 0 Or intSP > 99999999 Then SPCheck = strMLSNo & " : INVALID SP = " & Str(intSP) ElseIf intSP < 10000 Or intSP > 50000000 Then SPCheck = strMLSNo & " : SP EXCEEDS LIMIT = " & Str(intSP) ElseIf Len(LPCheck(strMLSNo, intLP)) = 0 And (intSP < intLP / 1.5 Or intSP > intLP * 1.5) Then SPCheck = strMLSNo & " : LP / SP EXCEEDS NORM = " & intLP & " / " & intSP End If
1. Transformation and Cleansing 1 (cont.) • Sold Date Check If strPD = "0000-00-00" Or Len(strPD) < 8 Then PDCheck = strMLSNo & " : INVALID PD = " & strPD ElseIf Len(LDCheck(strMLSNo, strLD)) = 0 And DateValue(strPD) < DateValue(strLD) Then PDCheck = strMLSNo & " : PD IS LESS THAN LD => PD = " & strPD & " & LD = " & strLD ElseIf DateValue(strPD) < "1990-01-01" Then PDCheck = strMLSNo & " : PD EXCEEDS LIMIT = " & strPD End If • Sold Price Check If (LDCheck(strMLSNo, strLD)) = 0 And Len(PDCheck(strMLSNo, strPD, strLD)) = 0 And DateDiff(DateInterval.Day, DateValue(strPD), DateValue(strLD)) > 730 Then DOMCheck = strMLSNo & " : DOM TOO LARGE = " & DateDiff(DateInterval.Day, DateValue(strPD), DateValue(strLD)) End If
2. Update Process for the Result of Transformation and Cleansing 1 • 132110169 : LP EXCEEDS LIMIT = 132 (132000) • 30015346 : SQFT EXCEEDS LIMIT = 12700 (1270) • 30015611 : LD EXCEEDS LIMIT = 1920-05-07 (2000-05-07) • 30015755 : NO FULL BATHROOM = 0 AND 3 (3 AND 0) • 102100090 : INVALID YEAR BUILT = 96 (1996) • 30028591 : INVALID YEAR BUILT = 1056 (1956) • 102000035 : PD IS LESS THAN LD => PD = 2000-03-30 & LD = 2020-01-26 (2000-01-26) • 102000035 : SD / LD = 2000-05-31 / 2020-01-26 (2000-01-26) • 122003643 : SD / PD = 2000-11-01 / 2000-11-05 (2000-11-05 / 2000-11-01) • 132001727 : SP EXCEEDS LIMIT = 124 (124000) • 30016715 : LP / SP EXCEEDS NORM = 226000 / 22600000 (226000)
3. Transformation and Cleansing 2 • The agents table (“AGTSRC.TXT”) fields: Listing agent id, Listing agent name, Listing agent phone 1, Listing agent phone 2, Listing agent phone 3, Listing agent phone type 1, Listing agent phone type 2, Listing agent phone type 3, Listing office id, Listing co-agent id, Listing co-agent name, Listing co-agent phone, Listing co-office id, Selling agent id, Selling agent name, Selling agent phone, Selling office id, Selling co-agent id, Selling co-agent name, Selling co-office id. Example: “SREIDMAR|Marjorie Reid|916-485-5124|916-485- 5124||1|2||LYON01”
3.Transformation and Cleansing 2 (cont.) • The offices table (“OFCSRC.TXT”) fields: Listing office id, Listing office name, Listing office phone, Listing office address, Listing office zip, Listing co-office id, Listing co-office name, Listing co-office phone, Selling office id, Selling office name, Selling office phone, Selling co-office id, Selling co-office name. Example: “LYON01|Lyon Real Estate|916-481-3840|2580 Fair Oaks Blvd. #20 Sacramento, CA 95825|95825|Sacramento”
3. Transformation and Cleansing 2 (cont.) • The areas table (“AREA.TXT”) fields: Area number, area name, county. Example: “10819|East Sacramento & Vicinity|Sacramento County”
3. Transformation and Cleansing 2 (cont.) • The residential table (“RESIDENTIAL.TXT”) fields: Example: “15501835|2367|Glen Ellen|95822|Sacramento|10822 ||Thomas Bros. (PL,SA)|317 D-5|035-0132-012-0000|Residential|Sold|22-Jan-95|28-Jan-95|20-Apr-99|01-Jan-00|20-May-99|01-Jan-00|1549|1700 |764705.9|17.56|2|1|4|130000|159500|SGREENCA ||GCNA||130000|SSTANWIL||CLBA20||1959|3|FHA |Sacramento|Sacramento Unified|Sacramento Unified|Sacramento Unified”
4. Duplication and Detection for Agent and Office Records • “AGTSRC.TXT” contains duplicate records. An agent can be a selling agent, a buyer agent, or both in a listing. An agent can have more than one listing in “RES.TXT”. Example: “SAKBARIR|Rouhi N. Akbari|916-484-5456|916-223- 7647||1|C||LYON01” • “OFCSRC.TXT” contains duplicate records. An office can be a selling office, a buyer office, or both in a listing. An office can have more than one listing in “RES.TXT”. Example: “LYON01|Lyon Real Estate|916-481-3840|2580 Fair Oaks Blvd. #20 Sacramento, CA 95825|95825|Sacramento”
AGENTS Dimension Table OFFICES Dimension Table RESIDENTIAL Fact Table AREAS Dimension Table MasterDW Modeling
MasterDW Transport • Load “AREA.TXT” to AREAS dimension table c:\>sqlldr masterdw/masterdw control=area.ctl log=area.log • Load “OFFICE.TXT” to OFFICE dimension table c:\>sqlldr masterdw/masterdw control=office.ctl log=office.log • Load “AGENT.TXT” to AGENTS dimension table c:\>sqlldr masterdw/masterdw control=agent.ctl log=agent.log • Load “RESIDENTIAL.TXT” to RESIDENTIAL dimension table c:\>sqlldr masterdw/masterdw control=residential.ctl log=residential.log
Predictive Regression Predictiveregression is regression that uses continuous values in the data set to predict unknown or future values of other variables of interest. The objective of regression analysis is to determine the best model that can relate the output variable to various input variables. nn β = [ ∑ (xi – meanx) . (yi – meany) ] / [ ∑ (xi – meanx)2 ] i=1i=1 α = meany - β .meanx y = α + β.x
Conclusion • There are several things that can be learned from this project: • Real estate terms and transaction process • Building a data warehouse using Oracle data warehousing tools • Statistical data analysis (predictive regression method) • Visual Basic .NET programming • Oracle Provider for OLE DB (ORAOLEDB) • Future Work: • Enhance this project by making it online. • Determine what kind of house improvement that a real estate seller can do to increase his/her house price on the market.