Database Systems Group Research Overview 2010
OLAP Statistical Tests
Zhibo Chen, Advisor: Dr. Carlos Ordonez
• Goal: isolate the factors that cause significant changes in a measured value.
• Example: an increase in age causes an increase in the risk of heart disease.
• Combined OLAP with a parametric means-comparison test.
• The test is used to pair similar groups and determine whether they are significantly different; we want to reject the hypothesis that the two groups have the same mean.
• Developed a GUI that makes the technique easy to use.
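
The pairing of groups and the test of equal means can be sketched as follows. The slide does not say which parametric test was used, so this sketch assumes a two-sided large-sample z-test; the function name `means_comparison`, the `age_group` dimension, and the risk scores are all hypothetical.

```python
import math
from collections import defaultdict
from statistics import NormalDist, mean, variance

def means_comparison(group_a, group_b):
    """Two-sided z-test of the hypothesis that both groups have the same mean.

    Assumption: a large-sample normal approximation stands in for whichever
    parametric test the slide refers to.
    """
    # Standard error of the difference between the two sample means.
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    z = (mean(group_a) - mean(group_b)) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical fact table: (age_group, risk_score).
rows = [
    ("young", 21), ("young", 24), ("young", 19), ("young", 23), ("young", 22), ("young", 20),
    ("old", 48), ("old", 52), ("old", 50), ("old", 47), ("old", 53), ("old", 51),
]

# OLAP-style aggregation: group the measure by one dimension.
groups = defaultdict(list)
for age_group, risk_score in rows:
    groups[age_group].append(risk_score)

z, p_value = means_comparison(groups["young"], groups["old"])
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value leads to rejecting the hypothesis that the two groups share a mean. With only a handful of rows per group the normal approximation is rough, so a t-test would be the more careful choice on real data.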
OLAP Statistical Tests
Zhibo Chen, Advisor: Dr. Carlos Ordonez
• Association rules (AR): a technique used to detect patterns among the items of a data set, e.g. HighAge, HighCholesterol => HeartDisease.
• Compared the results from both techniques:
  • OLAP statistical tests discovered more rules than association rules.
  • The p-value is more reliable than confidence (it considers the pdf).
  • OLAP is affected less by the data distribution than AR.
  • AR is better when performance is the priority and the data is skewed.
  • OLAP statistical tests are better when the data is distributed.
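
For reference, the confidence of a rule such as HighAge, HighCholesterol => HeartDisease can be computed as below. This is a minimal sketch using the standard definitions of support and confidence; the function name `rule_metrics` and the five patient records are made up for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)
    # Confidence: fraction of records matching the antecedent that also match the consequent.
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical patient records, each a set of items.
transactions = [
    {"HighAge", "HighCholesterol", "HeartDisease"},
    {"HighAge", "HighCholesterol", "HeartDisease"},
    {"HighAge", "HighCholesterol"},
    {"HighAge"},
    {"HeartDisease"},
]

support, confidence = rule_metrics(
    transactions, {"HighAge", "HighCholesterol"}, {"HeartDisease"}
)
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```

Confidence depends only on these counts, whereas a p-value also takes the spread of the measured values into account.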
OLAP Statistical Test versus Association Rules
Zhibo Chen, Advisor: Dr. Carlos Ordonez
[Figure: blue and red points for the two groups, with a line at each group's average]
• The blue and red lines mark the averages of the two groups.
• The averages are fairly different from one another.
• Confidence says that the two groups are similar: many blue points and many red points are above 50, so confidence is low.
OLAP Exploration with UDF
Zhibo Chen, Advisor: Dr. Carlos Ordonez
• On-Line Analytical Processing (OLAP): a set of techniques that let users explore various aggregations of a data set.
• Example: for a data set with day, month, year, and sales, what were the average sales on Sundays? Solve by grouping on day and then extracting Sunday.
• Normally done outside the database or with OLAP servers.
• We want to study how to perform the same techniques inside the DBMS (SQL or UDF).
• Found that users can efficiently perform OLAP exploration using UDFs.
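
The Sunday example can be run inside a database engine in two ways, with plain SQL and with an aggregate UDF. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the DBMS; the `sales` table, its rows, and the `udf_mean` aggregate are hypothetical and are not the UDFs from the study.

```python
import sqlite3

class MeanUDF:
    """Aggregate UDF computing an average (hypothetical stand-in for a DBMS UDF)."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def step(self, value):
        # Called once per row of the group.
        self.count += 1
        self.total += value

    def finalize(self):
        # Called once per group to produce the aggregate value.
        return self.total / self.count if self.count else None

conn = sqlite3.connect(":memory:")
conn.create_aggregate("udf_mean", 1, MeanUDF)
conn.execute("CREATE TABLE sales (day TEXT, month INTEGER, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Sunday", 1, 2010, 100.0),
        ("Sunday", 2, 2010, 140.0),
        ("Monday", 1, 2010, 80.0),
        ("Monday", 2, 2010, 60.0),
    ],
)

# Group on day, then extract Sunday: once with built-in SQL, once with the UDF.
avg_by_day = dict(conn.execute("SELECT day, AVG(amount) FROM sales GROUP BY day"))
udf_by_day = dict(conn.execute("SELECT day, udf_mean(amount) FROM sales GROUP BY day"))
sunday_avg = avg_by_day["Sunday"]
print(sunday_avg, udf_by_day["Sunday"])
```

Both queries scan the table once and return one row per day, so the exploration stays inside the engine instead of exporting the data to an external OLAP tool.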
Digital Libraries in a DBMS
Carlos Garcia-Alvarado, Advisor: Dr. Carlos Ordonez
• Information retrieval techniques have traditionally been applied outside relational database systems because of storage overhead, the complexity of fitting them into the relational model, and the slower performance of SQL implementations.
• Searching and querying documents under information retrieval models in relational database systems can be performed with optimized SQL.
• We explore three phases:
  • Document preprocessing.
  • Document storage.
  • Document retrieval (VSM, OPM, DPLM).
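
The three phases can be illustrated with a deliberately simplified sketch. It assumes a term-frequency table `tf(doc_id, term, freq)` and ranks documents by the dot product between raw term frequencies and a binary query vector, a stripped-down vector space model without idf weighting or length normalization. The schema, the sample documents, and the `search` helper are hypothetical, with `sqlite3` standing in for the DBMS.

```python
import sqlite3
from collections import Counter

def preprocess(text):
    """Phase 1, preprocessing: lowercase, split on whitespace, count terms."""
    return Counter(text.lower().split())

# Hypothetical document collection.
documents = {
    1: "database systems store data",
    2: "information retrieval ranks documents",
    3: "retrieval inside database systems",
}

# Phase 2, storage: one row per (document, term) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tf (doc_id INTEGER, term TEXT, freq INTEGER)")
for doc_id, text in documents.items():
    conn.executemany(
        "INSERT INTO tf VALUES (?, ?, ?)",
        [(doc_id, term, freq) for term, freq in preprocess(text).items()],
    )

def search(query):
    """Phase 3, retrieval: rank documents by summed frequency of the query terms."""
    terms = sorted(preprocess(query))
    placeholders = ", ".join("?" for _ in terms)
    sql = (
        f"SELECT doc_id, SUM(freq) AS score FROM tf WHERE term IN ({placeholders}) "
        "GROUP BY doc_id ORDER BY score DESC, doc_id"
    )
    return conn.execute(sql, terms).fetchall()

ranking = search("database retrieval")
print(ranking)
```

A fuller VSM implementation would add an idf table and normalize by document length, both of which can also be expressed as joins and aggregations.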
Keyword Search Across Documents and Databases
Carlos Garcia-Alvarado, Advisor: Dr. Carlos Ordonez
• Sometimes the meaning and structure of a database are unknown.
• There are external semi-structured sources that can help to describe it.
• We found that we can link these two worlds to identify relationships between the structured data and the semi-structured data.
• We believe that the right approach is to do this inside the database.
• We implemented a prototype entirely in SQL.
Bayesian Statistics
Carlos Garcia-Alvarado, Advisor: Dr. Carlos Ordonez
• Latest trend in advanced statistics; very demanding in terms of CPU and large data sets.
• Applied to microarray data in the DBMS. The problem involves high-dimensional data with few samples.
• Variable selection is the first issue that we have been trying to solve. Searching for the best model is computationally expensive (2^d candidate models, where d is the number of dimensions).
• By applying SQL optimizations and data layout modifications, we obtain selections over more than 1 M dimensions in less than 3 seconds, but this is still not enough.
• Current work: Gibbs sampler variable selection.
PCA
Mario Navas, Advisor: Dr. Carlos Ordonez
• Black box
• Rotation of the input space
• Makes the representative components evident
• No covariance between attributes
• Variance represented by the eigenvalues
• Deals with high dimensionality
DB Implementation
• Summary matrices n, L, Q
• Correlation matrix
• Eigenvalue decomposition problem
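
The step from summary matrices to the correlation matrix can be sketched as follows. The slide does not define the matrices, so this assumes n is the row count, L the per-dimension sum of the points, and Q the sum of their outer products; the function names and data are illustrative, and the eigenvalue decomposition itself is left to a numerical library.

```python
import math

def summary_matrices(points):
    """One pass over the data: n (count), L (linear sum), Q (sum of outer products)."""
    d = len(points[0])
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in points:
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q

def correlation_matrix(n, L, Q):
    """Pearson correlation computed from n, L, Q alone, without rescanning the data."""
    d = len(L)
    # n*Q[a][a] - L[a]^2 equals n^2 times the population variance of dimension a.
    scale = [math.sqrt(n * Q[a][a] - L[a] ** 2) for a in range(d)]
    return [
        [(n * Q[a][b] - L[a] * L[b]) / (scale[a] * scale[b]) for b in range(d)]
        for a in range(d)
    ]

# Hypothetical data: the second dimension is exactly twice the first.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
n, L, Q = summary_matrices(points)
corr = correlation_matrix(n, L, Q)
print(corr)
```

Because n, L, and Q are plain sums, they can be accumulated in a single table scan and are small (d by d) even when the table has many rows; PCA then reduces to an eigenvalue problem on the d by d correlation matrix.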
Outlier detection in microarray data
• Deals with high dimensionality
• Redundancy minimized
• Finds distance-based outliers in a reduced space
[Figure panels: Distance-based Outliers [126] vs. PCA-based Outliers [2D]; Distance-based Outliers [7D] vs. PCA-based Outliers [2D]; matching top 10]
Bayesian Classification Based on Decomposition via Clustering
Sasi Kumar Pitchaimalai, Advisor: Dr. Carlos Ordonez
• An extension of Naïve Bayes.
• Class decomposition of the Gaussians using clustering (K-Means and EM).
• Scalability: query optimizations for computationally and memory-intensive computations.
• Incremental learning of the classifier.
Computing Distance & Sufficient Statistics Using SQL & UDFs
Sasi Kumar Pitchaimalai, Advisor: Dr. Carlos Ordonez
• Five different SQL optimizations and one User Defined Function (UDF) to compute Euclidean distance in K-Means.
• Sufficient statistics (count, linear sum, and quadratic sum) for multiple clusters and multiple classes are computed in a single scan of the data set, using SQL or a UDF.
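
The single-scan computation can be sketched in plain Python as follows. Each row carries a class label, is assigned to the nearest centroid of its class by squared Euclidean distance, and updates that cluster's count N, linear sum L, and quadratic sum Q. The function names, the class labels, and the centroids are hypothetical; the SQL and UDF versions from the slide are not reproduced here.

```python
from collections import defaultdict

def squared_euclidean(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def single_scan_statistics(rows, centroids_by_class):
    """One pass over (label, point) rows; returns {(label, cluster): {N, L, Q}}."""
    d = len(rows[0][1])
    stats = defaultdict(lambda: {"N": 0, "L": [0.0] * d, "Q": [0.0] * d})
    for label, x in rows:
        centroids = centroids_by_class[label]
        # Nearest centroid among the clusters of this row's class.
        j = min(range(len(centroids)), key=lambda c: squared_euclidean(x, centroids[c]))
        s = stats[(label, j)]
        s["N"] += 1
        for a in range(d):
            s["L"][a] += x[a]
            s["Q"][a] += x[a] ** 2
    return dict(stats)

# Hypothetical labeled points and per-class centroids.
rows = [
    ("sick", (1.0, 1.0)), ("sick", (1.0, 3.0)), ("sick", (9.0, 9.0)),
    ("healthy", (5.0, 5.0)), ("healthy", (7.0, 5.0)),
]
centroids_by_class = {
    "sick": [(0.0, 0.0), (10.0, 10.0)],
    "healthy": [(6.0, 5.0)],
}

stats = single_scan_statistics(rows, centroids_by_class)
for key, s in sorted(stats.items()):
    mean = [l / s["N"] for l in s["L"]]
    print(key, s["N"], mean)
```

From N, L, and Q the mean is L/N and the per-dimension variance is Q/N minus the squared mean, so no second scan is needed to update the model.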
Fast Bayesian Classifier Based on FREM
• The algorithm:
  • Initialization: randomly initialize k clusters per class from the data set.
  • E-step: compute the Mahalanobis distance, find the nearest cluster, and then compute sufficient statistics.
  • M-step: recompute the means, variances, and weights of the clusters per class. The mixture parameters are updated in this step.
  • SplitClusters: split heavy clusters to reach higher-quality solutions and reseed low-weight clusters.
• The E-step and M-step are iterated until the model converges.
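
The E-step and M-step loop for the clusters of a single class can be sketched as below. This is a simplified illustration, not FREM itself: it assumes diagonal covariances, takes the initial means as an argument instead of sampling them, treats convergence as "the means stopped changing", and omits the SplitClusters step. All names and data are hypothetical.

```python
def mahalanobis_sq(x, mean, var):
    """Squared Mahalanobis distance under a diagonal covariance (an assumption)."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))

def fit_clusters(points, init_means, max_iter=50, min_var=1e-6):
    """Iterate E-step and M-step for one class until the means stop changing."""
    k, d = len(init_means), len(points[0])
    means = [list(m) for m in init_means]
    variances = [[1.0] * d for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(max_iter):
        # E-step: assign each point to its nearest cluster, accumulate N, L, Q.
        N = [0] * k
        L = [[0.0] * d for _ in range(k)]
        Q = [[0.0] * d for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda c: mahalanobis_sq(x, means[c], variances[c]))
            N[j] += 1
            for a in range(d):
                L[j][a] += x[a]
                Q[j][a] += x[a] ** 2
        # M-step: recompute means, variances, and weights from the statistics.
        previous = [m[:] for m in means]
        for j in range(k):
            if N[j] == 0:
                continue  # empty cluster keeps its parameters; FREM would reseed it
            means[j] = [L[j][a] / N[j] for a in range(d)]
            variances[j] = [
                max(Q[j][a] / N[j] - means[j][a] ** 2, min_var) for a in range(d)
            ]
        weights = [N[j] / len(points) for j in range(k)]
        if means == previous:
            break
    return means, variances, weights

# Hypothetical points of one class, with two initial means.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
means, variances, weights = fit_clusters(points, [(0.0, 0.0), (10.0, 10.0)])
print(means, weights)
```

For a classifier, the same routine would be run once per class, and a new point would be scored against the resulting clusters of every class.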
Constrained Association Rules in SQL
Kai Zhao, Advisor: Dr. Carlos Ordonez
• Association rules are a data mining technique used to discover frequent patterns in a data set.
• Real-world applications of this technique are broad and include fields such as medicine and commerce.
• We can automatically generate efficient SQL queries for discovering association rules.
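
One way to generate such queries is to build a k-way self-join that counts how many transactions contain each k-itemset. The sketch below is a generic textbook construction, not the generator described on the slide, and it leaves out the constraints. The `transactions(tid, item)` schema, the `itemset_query` helper, and the data are assumptions, with `sqlite3` standing in for the DBMS.

```python
import sqlite3

def itemset_query(k, table="transactions"):
    """Build SQL that returns every k-itemset with its support count.

    Assumes a table with columns (tid, item) and no duplicate rows.
    """
    aliases = [f"t{i}" for i in range(1, k + 1)]
    items = ", ".join(f"{a}.item" for a in aliases)
    joins = f"{table} {aliases[0]}"
    for prev, cur in zip(aliases, aliases[1:]):
        # Same transaction, strictly increasing items: each itemset appears once.
        joins += f" JOIN {table} {cur} ON {cur}.tid = {prev}.tid AND {cur}.item > {prev}.item"
    return (
        f"SELECT {items}, COUNT(*) AS support FROM {joins} "
        f"GROUP BY {items} HAVING COUNT(*) >= ? ORDER BY {items}"
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (tid INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (2, "B"), (3, "A"), (3, "C"), (4, "B")],
)

min_support = 2
frequent_pairs = conn.execute(itemset_query(2), (min_support,)).fetchall()
print(frequent_pairs)
```

Rules are then derived from the frequent itemsets by dividing support counts, and constraints on items or rule size could be added as extra predicates in the generated WHERE clause.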
Comparison between CAR and DT
Kai Zhao, Advisor: Dr. Carlos Ordonez
• Constrained association rules (CAR) perform an exhaustive combinatorial search, whereas decision trees (DT) recursively partition the input attribute space.
• CAR aim to find all rules above the given thresholds, whereas DT find regions in space where most records belong to the same class.
• CAR analyze item combinations, whereas DT select only one input attribute at a time.
Frequent Subgraph Mining
Kai Zhao, Advisor: Dr. Carlos Ordonez
• Frequent subgraph: a (sub)graph is frequent if its support (occurrence frequency) in a given data set is no less than a minimum support threshold.
[Figure: graphs (A), (B), and (C), with their frequent patterns (minimum support is 2)]
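
The definition of support can be made concrete with a small sketch. To avoid a full subgraph-isomorphism search, it assumes that vertex labels are unique within each graph, so a pattern occurs in a graph exactly when all of its undirected edges do. The three graphs below are made up and are not the ones from the slide's figure.

```python
def normalize(edges):
    """Represent an undirected graph as a set of unordered vertex-label pairs."""
    return frozenset(frozenset(edge) for edge in edges)

def support(pattern, graphs):
    """Number of graphs in the data set that contain every edge of the pattern."""
    p = normalize(pattern)
    return sum(1 for g in graphs if p <= normalize(g))

def is_frequent(pattern, graphs, min_support):
    return support(pattern, graphs) >= min_support

# Hypothetical data set of three graphs with uniquely labeled vertices.
graphs = [
    [("a", "b"), ("b", "c"), ("c", "a")],
    [("a", "b"), ("b", "c")],
    [("a", "b"), ("c", "d")],
]

path = [("a", "b"), ("b", "c")]
print(support(path, graphs), is_frequent(path, graphs, min_support=2))
```

With repeated vertex labels the containment check would have to be replaced by a subgraph-isomorphism test, which is where most of the cost of frequent subgraph mining lies.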