Introduction to Research

Introduction to Research Data Management and Database http://www.cs.fsu.edu/~lifeifei lifeifei@cs.fsu.edu Feifei Li

Outline • Background • My Research Focus and Experience • Some Problems I have worked on • Current Interest and Activity • My Experience as a PhD student • Q&A

A Short History Class • Undergraduate study at Tsinghua University (China, 1997) and Nanyang Technological University (Singapore, 1998-2002) • B. Applied Science • PhD study at Boston University (2002-2007) • Computer Science, Research Area: Database • Now…

Research Focus • Data management in general, including (roughly in the order I worked on them): • Efficient indexing, querying and managing of large-scale or high-dimensional databases • Spatio-temporal databases and applications • Sensor and stream databases • Privacy preservation issues for data management • Query security for various types of data models • Uncertain databases and data cleaning

Experience • SDE intern at the Microsoft SQL Server group, summer 2005 (Redmond, WA) • Research intern at IBM T. J. Watson Research Center (Hawthorne, NY), database research group, summer 2006 • Visiting research student at AT&T Labs Research (Florham Park, NJ), database research group, winter 2006 and spring 2007 • Research intern at Microsoft Research (MSR), database research group, summer 2007

Outline • Background • My Research Focus and Experience • Some Problems I have worked on • Retrieving structured data from Web • Spatio-temporal databases • Sensor databases • Current Interest and Activity • My Experience as a PhD student • Q&A

The First Step • My FYP (Final Year Project), around 2000-2001 • Analyze and build structures of different websites • How to automate this process? • View a website as a tree structure and? • Given a group of similar websites, summarize a suitable schema… • Retrieve information from certain part(s) of a website as specified by the user • With the structure information obtained at the first step • Why bother? Information integration: if BBC favors Bush and CNN ‘hates’ him, then what is the response to event A? • Another issue: semi-structured data (HTML) to structured data (XML)

So What Happened

Possible Research Problems • Automatic Schema Identification • Given a collection of data sources, find a common schema that maximally describes the dataset. • Information retrieval & search from Web • IR techniques (IR is a separate field by itself, unstructured data) + database techniques (structured data): how to combine the two? • Google • Information Integration • Given data source A and data source B, both referring to the same schema but with (slightly) different instances, how to link/combine the two?

Then Boston • Quite a pleasant transition in the summer: Singapore (90+ year round) to Boston (80 in the summer) • Winter: • Anyway…

What to Do Now? • Spatio-temporal databases and applications • Why? My advisor was in this area and… • Examples: • Indexing higher dimensional data: • 1d: B+ tree; 2d, 3d, 4d, …? kd-tree, R-tree • Space partitioning vs. data partitioning • Queries • e.g., continuous nearest neighbor query: continuously find the closest gas station while I am driving from Boston to NY. • Moving objects • In Euclidean space • On a road network

Indexing High Dimensional Data: R-tree • e.g., with fanout 4: group nearby rectangles into parent MBRs (minimum bounding rectangles); each group -> disk page [Figure: rectangles A-J]

Example • F=4 [Figure: rectangles A-J grouped into parent MBRs P1-P4]

Example • F=4 [Figure: R-tree over rectangles A-J with parent MBRs P1-P4]

R-trees: Search [Figure: R-tree over rectangles A-J with parent MBRs P1-P4]
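The grouping and search idea on the R-tree slides can be sketched roughly as follows. This is a minimal illustration, not the slides' construction: leaves are packed by sorting on the x-center (a simplification of real R-tree insertion or bulk loading), and the rectangle coordinates, `build`, and `search` are hypothetical names and data chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def mbr(rects: List[Rect]) -> Rect:
    """Minimum bounding rectangle of a group of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    box: Rect
    children: list  # Node objects (internal node) or (name, Rect) pairs (leaf)
    leaf: bool

def build(entries, fanout=4):
    """Pack a non-empty list of (name, Rect) into a tree; one node = one page."""
    # Assumption: sort by x-center so that nearby rectangles share a page.
    entries = sorted(entries, key=lambda e: e[1][0] + e[1][2])
    level = [Node(mbr([r for _, r in entries[i:i + fanout]]),
                  entries[i:i + fanout], True)
             for i in range(0, len(entries), fanout)]
    while len(level) > 1:
        level = [Node(mbr([n.box for n in level[i:i + fanout]]),
                      level[i:i + fanout], False)
                 for i in range(0, len(level), fanout)]
    return level[0]

def search(node, query, stats):
    """Return names of rectangles intersecting query; count pages visited."""
    stats["pages"] += 1
    hits = []
    for child in node.children:
        if node.leaf:
            name, rect = child
            if intersects(rect, query):
                hits.append(name)
        elif intersects(child.box, query):  # prune subtrees whose MBR misses
            hits.extend(search(child, query, stats))
    return hits

# Hypothetical data: ten 5x5 rectangles A..J laid out along the x axis.
rects = [(name, (10.0 * i, 0.0, 10.0 * i + 5.0, 5.0))
         for i, name in enumerate("ABCDEFGHIJ")]
root = build(rects)
stats = {"pages": 0}
hits = search(root, (0.0, 0.0, 6.0, 6.0), stats)
```

Only the root and the one leaf whose MBR overlaps the query are read; the other leaves are pruned without being visited.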

Query in Spatio-Temporal Databases • Trip Planning Queries (TPQ): • Given a starting location, a destination, and arbitrary points of interest, find the best possible trip. • Example: • Minimize the total traveling time from Boston to Providence, while visiting a post office, a hardware store and a gas station.

Visual Example • We can minimize the total distance, time, etc. • We can have different categories of points of interest (gas stations, hotels, etc.). [Figure labels: Home, Work, Gas station]

The Nearest Neighbor Algorithm • Yields a $(2^{m+1} - 1)$-approximation, where m is the total number of categories. [Figure: S, D, and points of interest A1, A2, B1, B2, B3, C1, C2]
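The slide gives only the algorithm's name and its bound, so the following is a sketch under an assumption: the greedy rule is taken to be "from the current location, go to the nearest point belonging to any category not yet visited, then finish at the destination". Distances are Euclidean, and the coordinates and function names are made up for illustration.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def nearest_neighbor_trip(start, dest, categories):
    """categories maps a category name to its list of (x, y) points."""
    trip, current = [start], start
    remaining = dict(categories)
    while remaining:
        # Nearest point over all categories that are still unvisited.
        cat, point = min(((c, p) for c, pts in remaining.items() for p in pts),
                         key=lambda cp: dist(current, cp[1]))
        trip.append(point)
        current = point
        del remaining[cat]  # this category is now covered
    trip.append(dest)
    return trip

def trip_length(trip):
    return sum(dist(a, b) for a, b in zip(trip, trip[1:]))

# Hypothetical instance with m = 3 categories.
S, D = (0, 0), (10, 0)
pois = {"A": [(1, 0), (8, 3)], "B": [(2, 0), (9, -4)], "C": [(3, 0), (5, 5)]}
nn_trip = nearest_neighbor_trip(S, D, pois)
nn_length = trip_length(nn_trip)
```

The greedy choice ignores the destination until the end, which is one intuition for why the worst-case bound grows exponentially in m.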

The Minimum Distance Algorithm • Yields an m-approximation, where m is the total number of categories. [Figure: S, D, and points of interest A1, A2, B1, B2, B3, C1, C2]
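Again the slide states only the name and the bound. The sketch below assumes the following rule, which is my reading of the algorithm rather than something the slide spells out: for each category pick the point p minimizing dist(S, p) + dist(p, D), then visit the chosen points in nearest-neighbor order starting from S. All names and coordinates are hypothetical.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def minimum_distance_trip(start, dest, categories):
    """categories maps a category name to its list of (x, y) points."""
    # One representative per category: smallest detour between start and dest.
    chosen = [min(pts, key=lambda p: dist(start, p) + dist(p, dest))
              for pts in categories.values()]
    trip, current = [start], start
    while chosen:
        nxt = min(chosen, key=lambda p: dist(current, p))
        chosen.remove(nxt)
        trip.append(nxt)
        current = nxt
    trip.append(dest)
    return trip

# Hypothetical instance with m = 3 categories.
S, D = (0, 0), (10, 0)
pois = {"A": [(5, 1), (0, 6)], "B": [(2, -1), (9, 8)], "C": [(8, 0), (4, 7)]}
md_trip = minimum_distance_trip(S, D, pois)
```

Because each representative is chosen with both endpoints in mind, no single stop can cost more than the optimal trip, which is the intuition behind a bound linear in m.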

Search over R-Tree and Road Network • R-Tree: • Euclidean space: how to utilize the R-tree to speed up the search? • Road network: [Figure: road network with labels M, D, p, A, S]

Sensor Network Model • Large set of sensors distributed in a sensor field. • Communication via a wireless ad-hoc network. • Nodes and links are failure-prone. • Sensors are resource-constrained • Limited memory, battery-powered, messaging is costly.

Sensor Databases • Useful abstraction: • Treat sensor field as a distributed database • But: data is gathered, not stored nor saved. • Express query in SQL-like language • COUNT, SUM, AVG, MIN, GROUP-BY • Query processor distributes query and aggregates responses • Exemplified by systems like TAG (Berkeley/MIT) and Cougar (Cornell)

A Motivating Example • Each sensor has a single sensed value. • Sink initiates one-shot queries such as: What is the… • maximum value? • mean value? • Continuous queries are a natural extension. [Figure: sensor network with labeled nodes and their sensed values]

AVG Aggregation (no losses) • Build spanning tree • Aggregate in-network • Each node sends one summary packet • Summary has SUM and COUNT of sub-tree • Reliability problem when there are losses (common in sensor networks) [Figure: spanning tree in which each node sends a (SUM, COUNT) summary of its sub-tree; AVG=70/10=7]
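The in-network step can be sketched as a recursive fold over the spanning tree, where every node forwards one (SUM, COUNT) pair for its sub-tree. The tree shape and sensor values below are hypothetical and do not reproduce the figure.

```python
def aggregate(tree, values, node):
    """Return the (SUM, COUNT) summary a node sends to its parent."""
    total, count = values[node], 1
    for child in tree.get(node, []):
        child_sum, child_count = aggregate(tree, values, child)
        total += child_sum
        count += child_count
    return total, count

# Hypothetical spanning tree rooted at the sink: parent -> list of children.
tree = {"sink": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
values = {"sink": 7, "a": 2, "b": 10, "c": 6, "d": 3, "e": 14}

total, count = aggregate(tree, values, "sink")
avg = total / count
```

Each link carries exactly one constant-size packet, which is why a single lost packet drops an entire sub-tree from the result.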

AVG Aggregation (naive) • What if redundant copies of data are sent? • AVG is duplicate-sensitive • Duplicating data changes aggregate • Increases weight of duplicated data [Figure: a (12, 1) summary is sent along two paths and counted twice; AVG=82/11≠7]

AVG Aggregation (TAG++) • Can compensate for increased weight • Send halved SUM and COUNT instead • Does not change expectation! • Only reduces variance [Figure: the (12, 1) summary is sent as two (6, 0.5) halves; AVG=70/10=7]
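A small sketch of the difference between duplicating a summary and halving it, using the numbers visible on the slides: a node with summary (12, 1) has two parents, and the rest of the network contributes (58, 9). The function names and the `delivered` flag pair are assumptions made for illustration.

```python
def contribution(partial, mode, delivered):
    """What the sink receives from one node that sends to two parents.

    delivered is a pair of booleans, one per link (True = packet arrived).
    """
    s, c = partial
    if mode == "naive":        # a full copy on each link
        per_link = (s, c)
    elif mode == "halved":     # half of SUM and half of COUNT on each link
        per_link = (s / 2, c / 2)
    else:
        raise ValueError(mode)
    arrived = sum(delivered)   # number of links that delivered
    return per_link[0] * arrived, per_link[1] * arrived

def average(rest, partial, mode, delivered=(True, True)):
    s, c = contribution(partial, mode, delivered)
    return (rest[0] + s) / (rest[1] + c)

rest, partial = (58, 9), (12, 1)
naive_avg = average(rest, partial, "naive")                   # 82/11
halved_avg = average(rest, partial, "halved")                 # 70/10
one_loss_avg = average(rest, partial, "halved", (True, False))
```

With both links up, the halved scheme gives the exact average; if one link fails, half of the node's weight still reaches the sink instead of none, which is one way to read "only reduces variance".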

AVG Aggregation (LIST) • Can handle duplicates exactly with a list of <id, value> pairs • Transmitting this list is expensive! • Lower bound: linear space is necessary if we demand exact results. [Figure: each node forwards the <id, value> list of its sub-tree; AVG=70/10=7]

COUNT Sketches • Problem: Estimate the number of distinct item IDs in a data set with only one pass. • Constraints: • Small space relative to stream size. • Small per item processing overhead. • Union operator on sketch results. • Exact COUNT is impossible without linear space. • First approximate COUNT sketch in [FM’85]. • O(log N) space, O(1) processing time per item.

Counting Paintballs • Imagine the following scenario: • A bag of n paintballs is emptied at the top of a long stair-case. • At each step, each paintball either bursts and marks the step, or bounces to the next step. 50/50 chance either way. Looking only at the pattern of marked steps, what was n?

Counting Paintballs (cont) • What does the distribution of paintball bursts look like? • The number of bursts at each step follows a binomial distribution: $B(n, 1/2)$ at the 1st step, $B(n, 1/4)$ at the 2nd, and $B(n, 1/2^S)$ at the S-th. • The expected number of bursts drops geometrically. • Few bursts after $\log_2 n$ steps

Counting Paintballs (cont) • Many different estimator ideas [FM'85,AMS'96,GGR'03,DF'03,...] • Example: Let pos denote the position of the highest unmarked stair: $E(pos) \approx \log_2(0.775351\,n)$, $\sigma^2(pos) \approx 1.12127$ • Standard variance reduction methods apply • Either O(log n) or O(log log n) space
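The paintball picture maps onto a bitmap sketch roughly as follows. This is a minimal sketch under stated assumptions: each item's stair is derived from a SHA-256 hash (my choice, so that the same item always lands on the same stair), k sketches are averaged as a simple form of variance reduction, and the constant is the one quoted on the slide. The helper names are hypothetical.

```python
import hashlib

PHI = 0.775351  # correction constant as quoted on the slide

def step(item, salt):
    """Stair on which the item's paintball bursts (geometric, p = 1/2)."""
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    h = int.from_bytes(digest[:8], "big") | (1 << 63)  # cap the stair at 63
    return (h & -h).bit_length() - 1                   # trailing zero bits

def sketch(items, salt):
    """Bitmap of marked stairs; duplicates of an item mark the same stair."""
    bits = 0
    for item in items:
        bits |= 1 << step(item, salt)
    return bits

def first_unmarked(bits):
    pos = 0
    while bits & (1 << pos):
        pos += 1
    return pos

def estimate(sketches):
    """Average pos over independent sketches, then invert E(pos)."""
    mean_pos = sum(first_unmarked(b) for b in sketches) / len(sketches)
    return 2 ** mean_pos / PHI

K = 32
sketches = [sketch(range(2000), salt) for salt in range(K)]
approx_count = estimate(sketches)

# Two overlapping partial sketches merge with bitwise OR into the full one.
merged = sketch(range(1000), 0) | sketch(range(500, 2000), 0)
```

Because an item always marks the same stair, seeing it twice (or merging overlapping sketches) leaves the bitmap unchanged, which is the property the union operator relies on.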

Application to Sensornets • Each sensor computes k independent sketches of itself using its unique sensor ID. • Coming next: sensor computes sketches of its value. • Use a robust routing algorithm to route sketches up to the sink. • Aggregate the k sketches via in-network bitwise OR. • Union via OR is duplicate-insensitive. • The sink then estimates the count. • Similar to gossip and epidemic protocols. • How about SUM and other aggregates?

COUNT vs Link Loss (grid)

Outline • Background • My Research Focus and Experience • Some Problems I have worked on • Current Interest and Activity • Privacy Preservation • Query Security • My Experience as a PhD student • Q&A

Privacy Preservation • It is not legal to query an individual person's salary. However, we are interested in retrieving the average (which is often legal). What do you do? • Basic intuition: add independent and identically distributed (i.i.d.) random noise with zero mean • Perturb the data… How? Add random noise… in a particular way [Figure labels: Sum=$7,000; Sum=$0; Sum=$7,000]
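A minimal sketch of the perturbation idea: add zero-mean Gaussian noise to each salary so that individual values are hidden while the sum (and hence the average) is preserved. Two assumptions are mine: the salary values are invented so that they total $7,000, and the noise is re-centered so that it sums to exactly zero, matching the "Sum=$0" label; plain i.i.d. noise would only sum to zero in expectation.

```python
import random

def perturb(values, sigma, seed=0):
    """Return values plus zero-mean noise whose sum is (numerically) zero."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, sigma) for _ in values]
    mean_noise = sum(noise) / len(noise)
    noise = [e - mean_noise for e in noise]  # re-center: noise now sums to 0
    return [v + e for v, e in zip(values, noise)]

salaries = [1000.0, 1500.0, 2000.0, 2500.0]  # hypothetical, Sum = $7,000
published = perturb(salaries, sigma=300.0)

true_avg = sum(salaries) / len(salaries)
published_avg = sum(published) / len(published)
```

The published average matches the true one up to floating-point rounding, while each published salary differs from the real value.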

How about Multiple Attributes (multi-dimensional data)? • Does i.i.d. noise really preserve privacy?

Principal Component Analysis: PCA i.i.d Noise

Principal Component Analysis: PCA Correlated Noise

PCA Based Data Reconstruction • A: Original Data • A*: Perturbed Data • A~: Reconstructed Data [Figure labels: Principal Direction; Added Noise: Utility; σ²; Removed Noise; Projection Error; Remaining Noise; Privacy]

PCA Based Data Reconstruction • Correlated Noise! • A: Original Data • A*: Perturbed Data • A~: Reconstructed Data [Figure labels: Principal Direction; Added Noise: Utility; σ²; Projection Error; Remaining Noise; Privacy]

Data Perturbation: main idea • Observations • The amount of random noise controls the privacy/utility tradeoff • i.i.d. (independent and identically distributed) noise does not preserve privacy well enough! • Lesson learned • Noise should be correlated with the original data
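Why i.i.d. noise is not enough can be illustrated with a small 2-D reconstruction sketch: when the attributes are strongly correlated, projecting the perturbed points onto their principal direction removes the noise component orthogonal to it. The data (points on the line y = 2x), the noise level, and the closed-form 2-D principal-axis formula are choices made for this illustration only.

```python
import math
import random

def principal_direction(points):
    """Mean and unit principal axis of 2-D points (closed form for 2x2 covariance)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the major axis
    return (mx, my), (math.cos(theta), math.sin(theta))

def reconstruct(points):
    """Project every point onto the principal direction through the mean."""
    (mx, my), (ux, uy) = principal_direction(points)
    out = []
    for x, y in points:
        t = (x - mx) * ux + (y - my) * uy
        out.append((mx + t * ux, my + t * uy))
    return out

def mse(a, b):
    return sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
               for p, q in zip(a, b)) / len(a)

rng = random.Random(1)
original = [(x, 2 * x) for x in (rng.uniform(-10, 10) for _ in range(500))]
perturbed = [(x + rng.gauss(0, 1), y + rng.gauss(0, 1)) for x, y in original]
recovered = reconstruct(perturbed)

noise_before = mse(original, perturbed)  # noise added by the publisher
noise_after = mse(original, recovered)   # noise left after the PCA filter
```

The reconstruction sits closer to the original data than the published data does, so part of the intended privacy is lost; noise aligned with the principal direction would survive this projection.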

How about Streaming Data? • Streaming data: data arrives continuously and no global view of the data is available, so the global trends cannot be computed. [Figure panels: Online Correlated Noise; Correlated Noise; i.i.d. Noise]

Outline • Background • My Research Focus and Experience • Some Problems I have worked on • Current Interest and Activity • Privacy Preservation • Query Security • My Experience as a PhD student • Q&A

Example of Data Publishing www.informatik.uni-trier.de/~ley/db/indices/a-tree/h/Hadjieleftheriou:Marios.html www.sigmod.org/dblp/db/indices/a-tree/h/Hadjieleftheriou:Marios.html
