190 likes | 336 Views
Komplexpraktikum Datenbank-Anwendungen. Organisatorisches. Wer? Claudio Hartmann claudio.hartmann@tu-dresden.de Ulrike Fischer ulrike.fischer@tu-dresden.de Für wen? Diplom PO 2004 Informatik, Medieninformatik Komplexpraktikum (Schein) Diplom PO 2010 Informatik Modul PM-FPA Bachelor
E N D
Organisatorisches • Wer? • Claudio Hartmann claudio.hartmann@tu-dresden.de • Ulrike Fischer ulrike.fischer@tu-dresden.de • Für wen? • Diplom PO 2004 Informatik, Medieninformatik • Komplexpraktikum (Schein) • Diplom PO 2010 Informatik • Modul PM-FPA • Bachelor • Informatik B-510, B-520 • Medieninformatik B-530, B-540 • Master • Informatik PM-FPA • Medieninformatik E-4 • Leistungsumfang • 4 SWS oder 8 SWS Komplexpraktikum Datenbank-Anwendungen
Organisatorisches • Ziele • Selbständig… • In Sachverhalte einarbeiten • Probleme erkennen und Lösungsansätze entwickeln • Eigene Ansätze und Ideen umsetzen und Evaluieren • Ablauf • Kick-Off in erster Vorlesungswoche (heute) • Sync-Treffen in regelmäßigen Abständen (Termin?) • Alle 1-2 Wochen • Abschlusspräsentation • Vorstellen der entworfenen Ansätze und Ergebnisse • Am Ende des Semesters • Kommunikation/Material • Diskussion - Auditorium (https://auditorium.inf.tu-dresden.de/courses/2154549) • Code & Testdaten - SVN (Zugang folgt) Komplexpraktikum Datenbank-Anwendungen
Scenario & Challenges • Monthly Report • Targets • Sneak Peak Reporting • Missing data: fill gaps • Further targets • Outlier Detection • Fraud Detection • Development Reports Updates Forecast Impute Refine Adjust Historie … ES CC MFD … Aggregate COOL PTV FIRA Estimation Komplexpraktikum Datenbank-Anwendungen
The data look like this… • Time column • period (monthly date stamp) • Measure columns • sales_units / _nc_ne / _nc_e / _c_ne • purchase_units • stock_new_units • Attribute columns • Some product group specific (id, price; e.g. color, energy_label, size, brand, …) • Some outlet specific (id, distributionfactor, extrapolationfactor, turnover_class, nuts1, channel, …) • Size (cooling) • 2.3 mio “item x outlet”-tuple on 5051 items, 1116 outlets and 36 periods outlet item FIRA / SIS
Common scenario • Model usage • Train a statistical model on historical data • Model seasonalandtrendeffects • Requiresequidistantvalues • Calculate forecast values 3rd year 2nd year 1st year Liebherr KT 1434 optimize Model Komplexpraktikum Datenbank-Anwendungen
Problem • Too short time series • Very sparse data on low aggregation levels • No statistical model available for some specific time series 3rd year 2nd year 1st year Liebherr KT 1434 Bosch KSl 20s53 SEG MS210 A … Komplexpraktikum Datenbank-Anwendungen
Solution • Cross-sectional forecasting • Assume similar behavior of some groups of time series • Use transitions over months from a set of time series • Train model on transitions of several time series • Use last known period as input to calculate forecasts 3rd year 2nd year 1st year Liebherr KT 1434 Bosch KSl 20s53 SEG MS210 A … Model Report calculation on previous period Komplexpraktikum Datenbank-Anwendungen
Attribute hierarchies • Many different ways to partition the data • Different forecast error on different forecast targets • Research goal • What is the best partition for which forecast target? outlet outlet outlet ESP - YES TSS - YES outlet item item item ESP - NO TSS - NO item channel X no_frost channel no_frost YES ESP TSS NO All x item FIRA / SIS
Parallel FIRA processes • Distribute each value of each attribute to an dedicated node • Parallel execution of FIRA-process for each value • Split into two phases system configuration Relation cooling Node 1 channel ESP Node 2 channel TS Node 3 nofrost YES Node 4 nofrost NO process configuration Updates Node 5 nofrost N.A. Node 6 brand AEG Node 7 brand SIEMENS Node 8 Channel x nofrost ESP & YES DB-Server FIRA … ES CC MFD Forecast Impute Refine Adjust Variant 2 … Historie Variant 3 COOL Node 9 Channel x nofrost ESP & NO … ES CC PTV MFD … configuration repository Estimation process … Aggregate database COOL PTV FIRA model exploitation / usage Estimation Self-Adjusting Imputation System Komplexpraktikum Datenbank-Anwendungen
Phases of prediction approach • Prediction phase • Fetch all necessary data for model training • Transitions of all time series covered by the attribute value • Train the model • Fetch data from the pre period • Calculate predictions • Evaluation phase • Fetch data of predicted period • Join with predicted data • Calculate forecast error Model Model error Komplexpraktikum Datenbank-Anwendungen
Prediction phase one attribute • 1st Map-Reduce • Map: • Create Model on item x attribute level • Calculate forecasts on item x attribute level • Modell training data: • Query workload • Fetch time series corresponding to nodes task • Once per time slice outlet item DB-Server Zielperiode: 27 SELECT … FROM (SELECT … FROM cooling WHERE nindexIN ( 15, 3 ) AND channel = ‘ELECTRICSP’ AND sales_units>0 group by itemid,nindex ) AS foo, (SELECT … FROM cooling WHERE nindexIN ( 14, 2 ) AND channel = ‘ELECTRICSP’ GROUP BY itemid,nindex ) AS bar WHERE foo.nindex= bar.nindex+1 AND foo.itemid=bar.itemid AND sales_units_1>0 ESP Node 1 channel ‘ELECTRICSP’ Komplexpraktikum Datenbank-Anwendungen
Prediction phase one attribute • 1st Map-Reduce • Map: • Create Model on item x attribute level • Calculate forecasts on item x attribute level • Modell input data: • Query workload • Fetch time series corresponding to nodes task • Once per time slice outlet item DB-Server Zielperiode: 27 SELECT * FROM ( SELECT … FROM cooling WHERE nindex = 26 AND channel = ‘ELECTRICSP’GROUP BY itemid )foo WHERE sales_units_1>0 AND stock_new_units_1>=0 ESP Node 1 channel ‘ELECTRICSP’ Komplexpraktikum Datenbank-Anwendungen
Prediction phase one attribute • 1st Map-Reduce • Map: • Create Model on item x attribute level • Calculate forecasts on item x attribute level • Modell trainingdata: • Query workload • Fetch time series corresponding to nodes task • Once per time slice outlet item DB-Server Zielperiode: 27 SELECT … FROM (SELECT … FROM cooling WHERE nindexIN ( 15, 3 ) AND channel = ‘TECSUPERST’ AND sales_units>0 group by itemid,nindex ) AS foo, (SELECT … FROM cooling WHERE nindexIN ( 14, 2 ) AND channel = ‘TECSUPERST’ GROUP BY itemid,nindex ) AS bar WHERE foo.nindex= bar.nindex+1 AND foo.itemid=bar.itemid AND sales_units_1>0 TSS Node 2 channel ‘TECSUPERST’ Komplexpraktikum Datenbank-Anwendungen
Prediction phase two attributes • 1st Map-Reduce • Map: • Create Model on item x attribute level • Calculate forecasts on item x attribute level • Modell trainingdata: • Query workload • Fetch time series corresponding to nodes task • Once per time slice outlet ESP - YES item DB-Server Zielperiode: 27 SELECT … FROM (SELECT … FROM cooling WHERE nindexIN ( 15, 3 ) AND channel = ‘ELECTRICSP’ AND nofrost = ‘YES’ AND … group by itemid,nindex ) AS foo, (SELECT … FROM cooling WHERE nindexIN ( 14, 2 ) AND channel = ‘ELECTRICSP’ AND nofrost= ‘YES’ GROUP BY itemid,nindex ) AS bar WHERE foo.nindex= bar.nindex+1 AND foo.itemid=bar.itemid AND sales_units_1>0 Node 8 Channel x nofrost ‘ELECTRICSP’ & ‘YES’ Komplexpraktikum Datenbank-Anwendungen
Evaluation phase • 2nd Map-Reduce: • Map • Get real data from database • Joinwithpredictions • Reduce: • Aggregate Data to demanded aggregation level • Calculate error • Query workload • Fetch all data for one task • to join with predictions and calculate errors • Only once DB-Server error SELECT … FROM ( SELECT nindex AS time, itemid AS itemid FROM cooling WHERE nindex>12 AND channel = ‘ELECTRICSP’ GROUP BY time, itemidHAVING sum(sales_units)>0 ) AS t1, ( SELECT … FROM cooling WHERE nindex>13 AND channel = ESP GROUP BY … ) AS t2 WHERE t1.time+1=t2.time AND t1.itemid=t2.itemid Node 1 channel ‘ELECTRICSP’ Komplexpraktikum Datenbank-Anwendungen
Parallel FIRA processes • Distribute each value of each attribute to an dedicated node • Parallel execution of FIRA-process for each value • Split into two phases system configuration Relation cooling Node 1 channel ESP Node 2 channel TS Node 3 nofrost YES Node 4 nofrost NO process configuration Updates Node 5 nofrost N.A. Node 6 brand AEG Node 7 brand SIEMENS Node 8 Channel x nofrost ESP & YES DB-Server FIRA … ES CC MFD Forecast Impute Refine Adjust Variant 2 … Historie Variant 3 COOL Node 9 Channel x nofrost ESP & NO … ES CC PTV MFD … configuration repository Estimation process … Aggregate database COOL PTV FIRA model exploitation / usage Estimation Self-Adjusting Imputation System Komplexpraktikum Datenbank-Anwendungen
Zielstellungen • Einarbeiten in notwendige Technologien • Hadoop, RDBMS, verteilte Datenbanken • Prognoseansatz (Workload) • Verkürzung der Ausführungszeit durch Optimierung des Datentransfers • Erarbeiten verschiedener Ansätze zur Datenhaltung • Hadoop-basierte Lösungen • RDBMS-basierte Lösungen • Andere Lösungsansätze? • Mglw. einschließlich angepasster Prognoseverarbeitung • Evaluation (Erweiterter Aufgabenbereich für 8 SWS) • Vergleich gegen SetUp mit zentraler Datenbank • Besondere Eignung einzelner Ansätze herausstellen und begründen Komplexpraktikum Datenbank-Anwendungen
Einstieg • Einarbeitung in Hadoop • Hadoop v.1.2.1 • http://hadoop.apache.org/ • Einrichten eines Single nodeclusters als erste Testumgebung • http://hadoop.apache.org/docs/stable/single_node_setup.html • http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ • Einrichten von R und RHadoop • https://github.com/RevolutionAnalytics/RHadoop/wiki Komplexpraktikum Datenbank-Anwendungen