Towards a Unified Index Scheme for Mobile Data and Customer Profiles in a Location-Based Service Environment
Vijay Atluri
Joint work with Nabil R. Adam and Mahmoud Youssef
Center for Information Management Integration and Connectivity (CIMIC), Rutgers University
Partially supported by a grant from NSF
Outline
• Introduction to the mobile commerce environment
• Research problem
• Our proposed solution: Unified Index Scheme
• Summary and future work
M-Commerce: Market Opportunities for Location-based Services
• The global subscriber base for mobile location services
  • Expected to exceed 680 million users by the end of 2006
  • About 50% of all mobile subscribers
  • More than 70% of mobile Internet users
• Location services revenue
  • $2 billion at the end of 2002
  • More than $18.5 billion by the end of 2006
• Forecast suggests
  • 31% will be generated in Western Europe
  • 22% in the USA
  • 47% in Japan and the rest of the world
Source: European Location Based Advertising (ELBA) (2003)
Location Based Service Environment: A Scenario
• Customers are moving in the proximity of a shopping mall
• A store in the mall wants to attract the customers at the point where they are likely to buy
• Store offers are personalized according to the customer profile (e.g. age, sex, salary, etc.) and preferences
• A Location Service (LS) tracks the customers' locations
• The store queries the LS to obtain a list of customers in the proximity
• The store sends offers to the customers who satisfy the profile and location criteria
• The customer receives the offer
Source: ELBA Yellow Map
Personalization
• Mobile commerce applications use mass personalization by:
  • tracking customer profiles
  • querying the LS about their location and projected path
• However, personalization raises privacy concerns
Addressing Privacy Concerns
The problem of customer privacy has to be studied in various contexts:
• Customers' interests
  • Enjoy m-commerce convenience
  • Protect their private data
• Businesses' interests
  • Ideally would like to reach the customers only when they are likely to buy
• Federal laws
  • Guarantee customers the right to protect their data
  • Guarantee businesses the right to collect legitimate data
The Problem with the Current Environment and a Solution
• The problem: the customer has to trust too many merchants with her profile
• A proposed solution: the customer trusts only one third party
• It is then plausible to have the LS act as that third party
Objective
• Storing profiles in the Location Service
  • reduces privacy invasion risks, yet
  • the amount of data would be much larger and the queries would be more complex
• Our goal is to enhance both the performance and the accuracy of query processing
Location Data
• Customers are modeled as moving objects
• Location data is continuously updated as customers move
• Traditional databases cannot handle such data for the following reasons:
  • The rate of update is unprecedented
  • Queries submitted are spatio-temporal in nature
  • These queries may address future locations
• A new paradigm, "Moving Objects Databases", has emerged to handle such data
Queries on Moving Objects
• A point-in-time query with a spatial window
  • retrieve customers who are currently in the shopping mall
• A time interval query with a spatial window (future query)
  • retrieve customers who will pass by the motel in the next 30 minutes
• A continuous query
  • retrieve all the customers who are within 300 feet of the store
• All these queries are modeled as a time interval and a spatial window
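The last bullet can be made concrete with a small sketch. Here the three query types reduce to one record holding a rectangle and a time interval. The class and helper names (`WindowQuery`, `point_in_time`, `future_interval`, `within_distance`) are hypothetical, and treating the continuous query as a bounding box re-issued at each evaluation time is an assumption, not something the slides specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowQuery:
    """A spatial window (rectangle) combined with a time interval."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    t_start: float
    t_end: float


def point_in_time(x_min, y_min, x_max, y_max, t):
    # A point-in-time query is a window with a zero-length interval.
    return WindowQuery(x_min, y_min, x_max, y_max, t, t)


def future_interval(x_min, y_min, x_max, y_max, now, horizon):
    # "In the next `horizon` time units" becomes the interval [now, now + horizon].
    return WindowQuery(x_min, y_min, x_max, y_max, now, now + horizon)


def within_distance(cx, cy, radius, t):
    # Assumption: a continuous "within radius" query is approximated by the
    # bounding box of the circle at time t and re-issued as time advances.
    # Candidates would still need an exact distance check afterwards.
    return WindowQuery(cx - radius, cy - radius, cx + radius, cy + radius, t, t)
```

For example, `future_interval(0, 0, 10, 10, now=100, horizon=30)` describes the motel query with a 30-unit look-ahead.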
Data Modeling of Moving Objects
• The Moving Objects Spatio-Temporal Model (MOST) [Sistla et al. 97]
  • Location is a linear function of time
  • No need for update unless the motion parameters change or a threshold is reached
  • This technique turns the problem into indexing of line segments (motion lines)
• Buckets Approach [Song and Roussopoulos 99]
  • The Hash Index is updated only when the object changes its sub-area
[Figure: the region being tracked, divided into sub-areas numbered 1 to 10, alongside the corresponding Hash Index]
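A minimal sketch of the "location is a linear function of time" idea follows. The names `MovingObject` and `needs_update` are hypothetical, and reading "a threshold is reached" as "the deviation between predicted and reported position exceeds a bound" is an assumption about the slide's wording.

```python
import math
from dataclasses import dataclass


@dataclass
class MovingObject:
    """Motion parameters: reference position, velocity, and reference time."""
    x0: float
    y0: float
    vx: float
    vy: float
    t0: float

    def position(self, t):
        # Linear motion: position at time t is extrapolated from the reference point.
        dt = t - self.t0
        return (self.x0 + self.vx * dt, self.y0 + self.vy * dt)


def needs_update(obj, actual_xy, t, threshold):
    # Assumption: an update is sent only when the predicted position has
    # drifted from the actual one by more than `threshold`.
    px, py = obj.position(t)
    return math.hypot(actual_xy[0] - px, actual_xy[1] - py) > threshold
```

As long as `needs_update` stays false, the stored motion line remains valid and the database does not have to be touched.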
Customer Profiles
• A Customer Profile is "a hierarchical collection of personal information" (OPS 97)
• Example:
  • Personal Contact (Name, Address, Telephone, ID, Email, Language)
  • Demographic (Date of birth, Gender, Marital status, Income, Education)
  • Business (Profession, Title, Industry, Company details)
  • …
Queries on Moving Objects and Customer Profiles
• A point-in-time query
  • retrieve customers who are currently in the shopping mall with age = [18-23], sex = female
• A time interval query
  • retrieve customers who will pass by the motel in the next 60 minutes with state = PA
• A continuous query
  • retrieve all the customers who are within 300 feet of the store with kids_4_to_8 = T, Salary < $20K
• Many queries may include a large number of attributes
Queries on Two Databases
• With two databases, the best query plan would take three steps:
  • a multidimensional query on the profiles database (t1)
  • a spatio-temporal query on the moving objects database (t2)
  • a join operation (t3)
The Unified Index Scheme
• Pros
  • The query is performed in one pass
• Cons
  • The dimensionality of the unified index is higher (K+L)
• The number of records in the unified index (M) is less than the number of records in the profiles database (N)
Current Work: Index Techniques for Moving Objects
• Hash-based indexes
  • Produce a high level of false positives
  • Significant work: Song and Roussopoulos 1999
• Tree-based indexes
  • Suffer from the curse of dimensionality
  • Significant work: Elbassioni et al. 2002; Saltenis et al. 2000; Kollios et al. 1999; Tayeb et al. 1997
• Indexes for non-moving objects, such as the X-Tree (BKK96) and the Pyramid Tree (BBK98), have their own limitations
Some Simple Solutions that do not Work
• Experimental study on Euclidean distance and cosine distance as a hash function
  • The distance is calculated from the origin
  • The incoming query is transformed into a distance interval [Dmin, Dmax]
  • The problem with this approach is the dead space around the query window
• Improving that approach by adding a cosine distance yielded unsatisfactory results
  • The recalled area was bounded by [Dmin, Dmax] and [Amax, Amin]
  • Obtaining the exact Amin and Amax in high-dimensional space is very difficult (non-convex optimization)
• The geometric relations that hold in low dimensionality do not hold in high dimensionality
  • Almost all the records are recalled even at a few dimensions
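The dead-space problem can be illustrated with a short sketch of distance-from-origin hashing. The function names are hypothetical and the axis-aligned query window is an assumption; the slides do not give the exact construction used in the experiments.

```python
import math


def distance_interval(lows, highs):
    """Map an axis-aligned query window to [Dmin, Dmax] measured from the origin."""
    d_min_sq = 0.0
    d_max_sq = 0.0
    for lo, hi in zip(lows, highs):
        # Nearest coordinate to 0 inside [lo, hi]; 0 if the range straddles the origin.
        if lo > 0:
            nearest = lo
        elif hi < 0:
            nearest = -hi
        else:
            nearest = 0.0
        farthest = max(abs(lo), abs(hi))
        d_min_sq += nearest * nearest
        d_max_sq += farthest * farthest
    return math.sqrt(d_min_sq), math.sqrt(d_max_sq)


def recalled(point, d_min, d_max):
    # A record is recalled whenever its distance from the origin falls in the interval.
    d = math.sqrt(sum(c * c for c in point))
    return d_min <= d <= d_max
```

With the window [3, 4] x [0, 1], the interval is [3, sqrt(17)]. The point (0, 3.5) lies far outside the window but is still recalled, because the interval describes a whole ring around the origin rather than the window itself.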
Some Observations
Investigating the possibility of using existing indexes showed that:
• Tree-based indexes do not scale up in dimensionality
  • Established in the literature (Otterman 1992, Berchtold et al. 1998, C. Aggarwal and Yu 2000)
• Hash-based indexes using Euclidean/cosine distances have very little room for improvement
  • Investigated experimentally
• A new indexing scheme that supports moving objects and multidimensional data is needed
  • The nature of the data, the types of queries, and the desired precision should be the design basis for an application-specific indexing scheme
  • Our experimental study and the literature on high-dimensional indexing in other domains (e.g., data mining, IR, image processing) show that a "one size fits all" approach is not a good solution
Characteristics of Profiles Data
• Very low rate of update (almost static)
• A large portion of the attributes are binary or categorical
• Continuous (interval-scaled) attributes are usually modeled as ordinal categorical attributes
• Considerable correlations exist among many of the attributes
Goals and Issues
• Performance
  • Profiles data seems to lend itself to clustering
  • A clustering-based approach can be much more efficient than other existing approaches
  • A clustering-based approach would be approximate
• Accuracy
  • Which, and how many, clusters to select
  • How to achieve the desired level of accuracy
Our Approach
• Cluster the customers based on their profiles (categorical clustering algorithm)
• Construct a TPR-tree for each cluster
[Figure: the profiles database is clustered into Cluster 1, Cluster 2, …, Cluster n; the location data corresponding to each cluster is indexed by its own tree, TPR-1, TPR-2, …, TPR-n]
The Effect of Breaking a TPR-tree into Multiple Coinciding Trees
[Figure: number of I/Os per query versus number of points in the tree]
• The single tree outperforms the 10 trees by a factor of about 2
The Different Steps
• Improving accuracy
  • Sort the attributes based on their clustering accuracy
  • Prune the categories of each attribute
  • Re-cluster using the pruned scheme, then re-build the scheme from the new clusters
• The classification process
• Query processing
Improving Accuracy
• Step 1: Sort the attributes based on their clustering accuracy
  • For each attribute, construct a pivot table: the number of customers in each category in each cluster
  • Compute the probability of each cell in the pivot table
  • Compute the classification factor for that attribute
  • Sort the attributes based on the classification factor
• Classification Factor = A/T, where T = sum(combined) and A = sum(assigned)

Attribute: Salary Group (frequencies)
Cluster     Category 1   Category 2   …
1           8,544        11,754       …
2           7,244        18,066       …
3           1,754        18,191       …
4           30,524       95           …
5           3,999        7,267        …
…
Combined    52,065       55,373
Assigned    30,524       36,257

Attribute: Salary Group (probabilities)
Cluster     Category 1   Category 2   …
1           0.0211       0.0291       …
2           0.0179       0.0447       …
3           0.0043       0.0450       …
4           0.0755       0.0002       …
5           0.0099       0.0179       …
…
Combined    0.1287       0.1548
Assigned    0.0755       0.0897
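The classification factor above can be computed as in the following sketch. It takes the Combined and Assigned rows as given, since the slide does not state the rule that produces the Assigned value. Only the two Salary Group categories visible on the slide are used, so the resulting number is illustrative rather than the attribute's real factor. The function names, and the choice to rank attributes from highest to lowest factor, are assumptions.

```python
def classification_factor(combined, assigned):
    """Classification Factor = A / T, with T = sum(combined) and A = sum(assigned)."""
    total = sum(combined)
    if total == 0:
        raise ValueError("combined probabilities sum to zero")
    return sum(assigned) / total


def rank_attributes(factors):
    # Assumption: attributes with a higher classification factor come first.
    return sorted(factors, key=factors.get, reverse=True)


# The two Salary Group categories shown on the slide (probability table).
salary_combined = [0.1287, 0.1548]
salary_assigned = [0.0755, 0.0897]
salary_factor = classification_factor(salary_combined, salary_assigned)
```

Here `salary_factor` is 0.1652 / 0.2835, roughly 0.58 for the two visible categories.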
Improving Accuracy
• Step 2: Prune the categories of each attribute to improve performance and reduce the memory requirement
  • Some attributes have values that contribute very little to the classification scheme
  • Removing these values reduces the amount of calculation during query processing
  • Set a pruning threshold (h)
  • A category x is pruned according to a condition on combined(x), h, and m (the formula is not preserved in the source), where combined(x) is the combined probability of category x and m is the number of categories

Attribute: Salary Group (probabilities)
Cluster     Category 1   Category 2   …   Category 8
1           0.0211       0.0291       …   0
2           0.0179       0.0447       …   0
3           0.0043       0.0450       …   0.00003
4           0.0755       0.0002       …   ~0
5           0.0099       0.0179       …   0
…
Combined    0.1287       0.1548       …   0.00003
Assigned    0.0755       0.0897       …   0.00003
Category 8 is marked "To Be Pruned".
At 1% Pruning Threshold, 16% of the categories are eliminated
The Classification Procedure
• A record can be guided to one or more clusters as follows:
  • Start with a classification array of zero probabilities
  • Follow the order of the sorted attributes
  • For each pivot table of an attribute:
    • Find that attribute in the record
    • Look up that attribute's value in the record
    • Find the column corresponding to that value in the pivot table
    • Add that column to the classification array
  • After the last pivot table is visited, the cluster(s) with the maximum probability are selected for classification
[Figure: a record's values (e.g. Age = 5, Salary = 1) select the column (P51 … P5k) of Pivot Table 1 and the column (P11 … P1k) of Pivot Table n; the selected columns are added to the zero-initialized classification array to give the final array (P1 … Pk), which points to the target cluster(s) among C1 … Cn]
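The procedure above can be sketched in a few lines. The Salary Group columns are the probabilities shown on the earlier pivot-table slide. The Age column is made up for illustration, and `classify`, `PIVOT_TABLES`, and the dictionary layout are hypothetical names and structures. Skipping a value that has no column (for example, a pruned category) is also an assumption.

```python
# Pivot tables in sorted attribute order: attribute -> {category: per-cluster column}.
PIVOT_TABLES = [
    ("age", {5: [0.0100, 0.0200, 0.0300, 0.0500, 0.0100]}),  # hypothetical values
    ("salary", {
        1: [0.0211, 0.0179, 0.0043, 0.0755, 0.0099],  # from the slide
        2: [0.0291, 0.0447, 0.0450, 0.0002, 0.0179],  # from the slide
    }),
]


def classify(record, pivot_tables, n_clusters):
    """Return (target clusters, final classification array); clusters are 1-based."""
    scores = [0.0] * n_clusters  # classification array of zero probabilities
    for attribute, table in pivot_tables:
        column = table.get(record.get(attribute))
        if column is None:
            continue  # assumption: a value without a column contributes nothing
        for i, probability in enumerate(column):
            scores[i] += probability
    best = max(scores)
    clusters = [i + 1 for i, score in enumerate(scores) if score == best]
    return clusters, scores


clusters, final_array = classify({"age": 5, "salary": 1}, PIVOT_TABLES, 5)
```

With these numbers the record is guided to cluster 4, whose accumulated score (0.0500 + 0.0755) is the largest.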
Improving Accuracy
• Step 3: Re-cluster the data using the pruned classification scheme and re-build the tables to eliminate mis-clustered records
  • Re-classify each record in the database using the pruned classification tables
  • The re-classification procedure is the same as the classification procedure
  • Rebuild the pivot tables using the frequencies obtained in the re-classification
  • Re-compute the probabilities
[Figure: profiles are run through the classification scheme to produce clusters, from which an updated classification scheme is built]
Query Processing
• Break a query into: (1) a profiles query and (2) a location query
• Process the profiles query
  • The answer to the profiles query is a pointer to one or more clusters
• Process the location query on the moving objects tree(s) corresponding to the selected cluster(s)
• The resulting list of IDs can be further processed to eliminate false positives
[Figure: the profiles database maps to Cluster 1, Cluster 2, …, Cluster n; the trees of the selected clusters (TPR 1 … TPR n) return a list of IDs]
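The flow above can be sketched as follows. A plain list of current positions per cluster stands in for each TPR-tree, which is an assumption made only to keep the example self-contained. A real deployment would query an actual TPR-tree. All function and parameter names are hypothetical.

```python
def process_query(selected_clusters, window, location_index):
    """Run the location query only on the trees of the selected clusters.

    `location_index` maps a cluster id to a list of (customer_id, x, y)
    tuples; the linear scan below is a stand-in for a TPR-tree search.
    """
    x_min, y_min, x_max, y_max = window
    ids = []
    for cluster in selected_clusters:
        for customer_id, x, y in location_index.get(cluster, []):
            if x_min <= x <= x_max and y_min <= y <= y_max:
                ids.append(customer_id)
    return ids


def drop_false_positives(ids, profiles, predicate):
    # Optional refinement: re-check each returned ID against the exact
    # profile condition, since cluster membership is only approximate.
    return [customer_id for customer_id in ids if predicate(profiles[customer_id])]
```

The profiles query contributes only `selected_clusters`, so the cost of the location step depends on how many clusters are chosen.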
Processing a Point Query on Profiles
• Query processing is closely similar to classification
• A point query is guided to one or more clusters as follows:
  • (1) Start with a Query Answer Array of zero probabilities
  • (2) Follow the order of the sorted attributes
  • (3) For each pivot table of an attribute (e.g. Age):
    • If that attribute is in the query:
      • Look up that attribute's value in the query (e.g., 5)
      • Find the column corresponding to that value in the pivot table
      • Add that column to the Query Answer Array
    • If it is not in the query, ignore it
  • (4) After the last pivot table is visited, the cluster(s) with the maximum probability are selected
[Figure: the query's values (e.g. Age = 5, Salary = 1) select the column (P51 … P5k) of Pivot Table 1 and the column (P11 … P1k) of Pivot Table n; the selected columns are added to the Query Answer Array to give the final array (P1 … Pk), which points to the target cluster(s) among C1 … Cn]
Processing a Range Query on Profiles
• After retrieving a query key:
  • If the query key is a single value, proceed as for a point query
  • If the query key is a range (e.g., Salary), add all the columns representing the categories in the range to the Query Answer Array

Search key: Age = 5, Sex = 1, Salary = 1-2, …

Query Answer Array (cumulative)
Cluster     1      2      3      4      5
            0.01   0.03   0.45   0.08   0.0

Attribute: Salary Group (pruned probability table)
Cluster     Category 1   Category 2   Category 6   …
1           0.0211       0.0291       0.0003       …
2           0.0179       0.0447       0.0086       …
3           0.0043       0.0450       0.0378       …
4           0.0755       0.0002       0.0720       …
5           0.0099       0.018        0.0003       …
…
Combined    0.0812       0.1278       0.14
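A sketch of how point and range keys might both be folded into the Query Answer Array is shown below. The Salary Group columns are taken from the pruned table on this slide. Representing a range as an inclusive `(low, high)` tuple of integer category codes is an assumption, as are the names `answer_array` and `SALARY_TABLE`.

```python
# Salary Group columns (per-cluster probabilities) from the pruned table.
SALARY_TABLE = {
    1: [0.0211, 0.0179, 0.0043, 0.0755, 0.0099],
    2: [0.0291, 0.0447, 0.0450, 0.0002, 0.018],
    6: [0.0003, 0.0086, 0.0378, 0.0720, 0.0003],
}


def answer_array(query, pivot_tables, n_clusters):
    """Accumulate pivot-table columns for every attribute present in the query."""
    scores = [0.0] * n_clusters
    for attribute, table in pivot_tables:
        if attribute not in query:
            continue  # attributes absent from the query are ignored
        key = query[attribute]
        # Assumption: a range is an inclusive (low, high) tuple of category codes.
        categories = range(key[0], key[1] + 1) if isinstance(key, tuple) else [key]
        for category in categories:
            column = table.get(category)
            if column is None:
                continue  # pruned or unknown category
            for i, probability in enumerate(column):
                scores[i] += probability
    return scores


salary_range_scores = answer_array({"salary": (1, 2)}, [("salary", SALARY_TABLE)], 5)
```

For the range Salary = 1-2 the array is the element-wise sum of the first two columns, and cluster 4 ends up with the highest score from this attribute alone.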
Achieving Desired Accuracy
• The final classification array includes the cumulative probabilities assigned by the attributes to each cluster
• There is a tradeoff between the number of clusters to be considered (accuracy) and the performance of the scheme
• We adopt the F-score as a measure of accuracy
• The question becomes: find the number of clusters k on which to process the query such that a condition on p_i and n_i holds (the formula is not preserved in the source), where p_i is the i-th score from the top and n_i is the number of records in the i-th cluster
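For reference, the F-score of a query answer can be computed as in the sketch below. It uses the common balanced form (the harmonic mean of precision and recall). The slides do not say which variant was used, so that choice and the function name `f_score` are assumptions. The sketch does not attempt to reproduce the cluster-selection condition itself.

```python
def f_score(retrieved, relevant):
    """Balanced F-score of a retrieved ID set against the exact answer set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)
```

Processing more clusters tends to raise recall at the cost of extra work, which is the tradeoff described in the second bullet.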
Insertion and Deletion
• The profile of a customer entering the service area of the Location Service must be inserted into the active database (from the reference profiles database)
• Similarly, a customer leaving that area should be deleted from the active database
• Both operations start by classifying the customer to a cluster based on her profile; the actual insertion or deletion is then performed on the corresponding TPR-tree
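A minimal sketch of these two operations follows. A dictionary per cluster stands in for each TPR-tree, and `ActiveDatabase` and its methods are hypothetical names. The classifier is passed in as a function, so any classification procedure that maps a profile to a cluster id can be plugged in.

```python
class ActiveDatabase:
    """Customers currently inside the service area, grouped by cluster."""

    def __init__(self, classify):
        self._classify = classify  # function: profile -> cluster id
        self._trees = {}           # cluster id -> {customer id: motion data}

    def insert(self, customer_id, profile, motion):
        # Classify first, then insert into that cluster's (stand-in) tree.
        cluster = self._classify(profile)
        self._trees.setdefault(cluster, {})[customer_id] = motion
        return cluster

    def delete(self, customer_id, profile):
        # Classify first, then delete from that cluster's (stand-in) tree.
        cluster = self._classify(profile)
        removed = self._trees.get(cluster, {}).pop(customer_id, None)
        return removed is not None
```

Because profiles are almost static, classifying the same profile on entry and on exit should lead to the same tree.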
Summary
• Studied the TPR-tree behavior and Euclidean/cosine distance hashing
• Presented a unified indexing scheme for the location-based service environment, based on clustering and classification
  • It serves the goal of preserving customer privacy
  • The scheme overcomes the performance problems facing high-dimensional indexes
  • The accuracy of the output is controlled through the number of clusters to be processed
• Conducted an experimental study on the accuracy of the proposed scheme
Future Work
• Study the effect of changing the index parameters (e.g., buffer size, page size) on the performance of the TPR-tree
• Enforce access control for selective exposure of profile information to merchants
• Enhance the unified index with access authorizations