RFID Data Management. Kamlesh Laddhad (05329014), Karthik B. (05329021). Guide: Prof. Bernard Menezes.
Outline • Introduction to RFID Technology. • Issues with RFID Technology. • RFID Data Characteristics. • Data Warehousing. • Expressive Temporal Model: Dynamic Relationship ER Model • RFID-Cuboids. • Use of Bitmap Datatype. • Data Cleaning. • Extensible Sensor stream Processing (ESP). • Statistical sMoothing for Unreliable RFid data (SMURF). • Future Plans.
Introduction • Radio Frequency Identification: • It is an Automatic Identification and Data Capture technology. • Fast. • Needs no contact or line of sight. • Uses radio-frequency waves to transfer data. • Components • Tag: small, low-cost device that can hold a limited amount of data. • Associated with objects, such as pallets, cases, and even individual items. • Reader: recognizes the presence of a tag and reads the information stored on it. • A unique electronic product code (EPC) is associated with each tag. • By placing RFID tag readers at various locations, one can track the movement of objects through supply chain networks.
Applications and Adoptions • Supply Chain Management: real-time inventory tracking. • US Department of Defense: shipments to armed forces. • Retail: active shelves monitor product availability. • Wal-Mart, Albertsons: major retail stores. • Access control: toll collection, transportation. • Airline luggage management: • British Airways: 20 million bags a year. • Implemented to reduce lost/misplaced luggage. • Anti-counterfeiting and security: • Food and Drug Administration: to reduce counterfeiting in the pharmaceutical supply chain.
Prospects for RFID Research • The physics of building tags and readers: • Tags have few gates: apart from basic operation, they have very little computing power. • Radio-frequency signals have trouble operating in certain physical media. • The privacy and safety issues: • Complex encryption schemes are not possible on RFID tags. • Counterfeiting by means of either illegitimate readers or spoofed tags is possible. • Reader-tag communication is wireless: third parties can eavesdrop on signals. • Software architecture to collect, filter, organize, and answer online queries: • The number of tags is proportional to the number of items being serviced/tracked. • The number of readers is proportional to the number of traceable strategic locations/areas. • Each reader picks up tag signals on a continuous basis. • Data generated by RFID systems is enormous: • E.g. Wal-Mart is expected to generate 7 terabytes of RFID data per day. • Our Focus: Third Stream.
Data Management Challenges • Data Explosion: Example • A retailer with 3,000 stores, selling 10,000 items a day per store. • Each item moves 10 times on average before being sold. • Each movement is recorded as (EPC, location, second). • Data volume: 300 million tuples per day. • Example OLAP query: "Average time for items to move from warehouse to checkout counter in March 2006?" • Costly to answer when March 2006 alone holds billions of tuples.
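The volume figure on this slide is plain multiplication. A quick check, with variable names chosen here for illustration and the assumption that the daily rate holds for all 31 days of March:

```python
# Back-of-the-envelope check of the data-explosion example.
stores = 3_000
items_sold_per_store_per_day = 10_000
moves_per_item = 10  # average movements before an item is sold

tuples_per_day = stores * items_sold_per_store_per_day * moves_per_item
tuples_in_march = tuples_per_day * 31  # assumption: constant daily rate

print(f"{tuples_per_day:,} tuples per day")
print(f"{tuples_in_march:,} tuples in March")
```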
Data Characteristics • Temporal and history oriented • Applications dynamically generate observations (readings). • Object locations and containment relationships among objects change over time. • Need: Expressive data model. • Inaccurate data and implicit semantics • False positive: a non-existent tag is incorrectly read. • False negative: a reader misses a tag that was in its vicinity. • Noisy data and duplicate readings (redundancy): the same tag is read more than once. • Need: Automated data filtering and transformation. • Streaming and large volume • Objects stay in place for long durations while readers record them periodically, so large volumes of data keep being generated. • This data must be preserved for tracking and monitoring. • Need: Scalable storage scheme, compression techniques to reduce data. • Data Granularity • The data collection granularity needs to be decided. • Differs across applications.
Warehousing Helps!! • Lossless compression • Remove redundancy: (r1,l1,t1) (r1,l1,t2) ... (r1,l1,t10) => (r1,l1,t1,t10) • Group objects that move and stay together. • Data cleaning: Multi-reading, missed-reading, error-reading, bulky movement. • Data mining: Find trends, outliers, frequent, sequential, flow patterns. • Multi-dimensional summary: product, location, time, … • Store manager: Check item movements from the backroom to different shelves in his store • Region manager: Collapse intra-store movements and look at distribution centers, warehouses, and stores • Query Processing • Support for OLAP: roll-up, drill-down, slice, and dice • Path query: New to RFID-Warehouses, about the structure of paths • What products that go through quality control have shorter paths? • What locations are common to the paths of a set of defective auto-parts? • Identify containers at a port that have deviated from their historic paths
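The redundancy-removal step, (r1,l1,t1) ... (r1,l1,t10) => (r1,l1,t1,t10), can be sketched as follows. This is a minimal illustration, not the warehouse's actual implementation; the function name and the (epc, location, time) tuple layout are assumptions.

```python
from itertools import groupby

def compress_readings(readings):
    """Collapse runs of (epc, location, time) readings into stay records.

    Returns (epc, location, time_in, time_out) tuples.
    """
    stays = []
    ordered = sorted(readings, key=lambda r: (r[0], r[2]))  # by EPC, then time
    # groupby merges only adjacent equal keys, so a tag that leaves a
    # location and later returns produces two separate stay records.
    for (epc, loc), run in groupby(ordered, key=lambda r: (r[0], r[1])):
        times = [t for _, _, t in run]
        stays.append((epc, loc, times[0], times[-1]))
    return stays
```

Ten periodic readings of one tag at one location become a single record, which is where most of the size reduction comes from when objects sit still for long periods.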
Dynamic Relationship ER Model • Proposed by Wang and Liu from Siemens. • RFID entities are static and are not altered. • RFID relationships: dynamic and change all the time. • Two types of dynamic relationships added: • Event-based dynamic relationship. A timestamp attribute added to represent the occurring timestamp of the event. • State-based dynamic relationship. tstart and tend attributes added to represent the lifespan of a state.
Static entity tables • OBJECT (object_epc, name, description) • LOCATION (location_id, name, owner) • SENSOR (sensor_epc, name, description) • TRANSACTION (transaction_id, transaction_type) • Dynamic relationship tables • OBSERVATION (sensor_epc, value, timestamp) • OBJECTLOCATION (epc, location_id, tstart, tend) • TRANSACTIONITEM (transaction_id, epc, timestamp) • CONTAINMENT (epc, parent_epc, tstart, tend) • SENSORLOCATION (sensor_epc, location_id, position, tstart, tend)
Monitoring • Missing RFID Object Detection: • Find when and where the object holding EPC = 'MEPC' was lost. • select location_id, tstart, tend from objectlocation where epc = 'MEPC' and tstart = ( select max(o.tstart) from objectlocation o where o.epc = 'MEPC' ) • Check whether objects are missing at the current location C, knowing that all objects were present at the previous location L at time T. • select l.epc from objectlocation l where l.location_id = 'L' and l.tstart <= 'T' and l.tend >= 'T' and l.epc not in ( select c.epc from objectlocation c where c.location_id = 'C' )
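Both monitoring queries can be tried against an in-memory SQLite database. The sketch below assumes integer timestamps and made-up sample rows (E1 to E3); only the OBJECTLOCATION table from the model is created, and the helper names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objectlocation "
    "(epc TEXT, location_id TEXT, tstart INTEGER, tend INTEGER)"
)
conn.executemany(
    "INSERT INTO objectlocation VALUES (?, ?, ?, ?)",
    [
        ("E1", "L", 10, 20), ("E2", "L", 10, 20), ("E3", "L", 10, 20),
        ("E1", "C", 30, 40), ("E3", "C", 30, 40),  # E2 never arrives at C
    ],
)

def last_seen(epc):
    """Where and when the object was last observed (its latest tstart)."""
    return conn.execute(
        """SELECT location_id, tstart, tend FROM objectlocation
           WHERE epc = ? AND tstart =
                 (SELECT MAX(o.tstart) FROM objectlocation o WHERE o.epc = ?)""",
        (epc, epc),
    ).fetchone()

def missing_at(current, previous, t):
    """EPCs present at `previous` at time t that never show up at `current`."""
    rows = conn.execute(
        """SELECT l.epc FROM objectlocation l
           WHERE l.location_id = ? AND l.tstart <= ? AND l.tend >= ?
             AND l.epc NOT IN
                 (SELECT c.epc FROM objectlocation c WHERE c.location_id = ?)
           ORDER BY l.epc""",
        (previous, t, t, current),
    ).fetchall()
    return [r[0] for r in rows]
```

Calling `missing_at("C", "L", 15)` lists the objects lost between L and C, and `last_seen` then reports where each one was last recorded.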
Tracking • RFID Object Moving Time Inquiry: • How long does it take to supply 'OEPC' from location S to location E? • select (e.tstart - s.tstart) as supplying_time from objectlocation e, objectlocation s where e.epc = 'OEPC' and s.epc = 'OEPC' and s.location_id = 'S' and e.location_id = 'E'
Compression Idea • [Figure: supply-chain path from the factory through distribution centers and stores to shelves, with group sizes 10 pallets (1000 cases), 20 cases (1000 packs), 10 packs (12 sodas)] • Bulky object movements • Objects often move and stay together through the supply chain. • If 1000 packs of product P stay together at the distribution center, register a single record: • (GID, distribution center, time_in, time_out). • GID is a generalized identifier that represents the 1000 packs that stayed together at the distribution center. • Analysis usually takes place at a much higher level of abstraction than the one present in raw RFID data.
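The grouping idea can be illustrated with a small sketch that assigns one GID to all items sharing the same location and stay interval. This is a simplification under stated assumptions: the `g1`, `g2`, ... naming and the exact-match grouping rule are hypothetical, and the hierarchical linking of GIDs along a path is not modeled.

```python
from collections import defaultdict

def group_stays(stays):
    """Group (epc, location, time_in, time_out) records that coincide exactly.

    Returns (stay_table, gid_members): one stay record per group, plus a
    mapping from each GID to the EPCs it stands for.
    """
    groups = defaultdict(list)
    for epc, loc, t_in, t_out in stays:
        groups[(loc, t_in, t_out)].append(epc)

    stay_table, gid_members = [], {}
    for n, ((loc, t_in, t_out), epcs) in enumerate(sorted(groups.items()), start=1):
        gid = f"g{n}"  # hypothetical GID naming scheme
        stay_table.append((gid, loc, t_in, t_out))
        gid_members[gid] = sorted(epcs)
    return stay_table, gid_members
```

With 1000 packs that share one stay at the distribution center, the stay table holds a single row instead of 1000.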
RFID Cuboids • Fact Table: (EPC, location, time_in, time_out). • In a supply chain, items travel through a series of locations. • Query: what is the average time that product P stays at a store in location A? • Traditional cubes miss the path structure of the data. • Stay Table: (GIDs, location, time_in, time_out: measures): • Records information on items that stay together at a given location. • If transitions were recorded instead, queries would be difficult to answer and would need many intersections. • Map Table: (GID, <GID1,..,GIDn>) • Links together stages that belong to the same path; provides additional compression and query-processing efficiency. • A high-level GID points to lower-level GIDs. • If complete EPC lists were saved instead, retrieving long lists would incur high I/O costs and make query processing expensive. • Information Table: (EPC list, attribute 1,...,attribute n) • Records path-independent attributes of the items, e.g., color, manufacturer, price.
EPC Overview • Electronic product code • Standard naming scheme, proposed by Auto-Id Center. • An EPC uniquely identifies an item. • Format: <Header, Manager_No., Object Class, Serial No.> • Header: Identifies the length, type, structure, version and generation of EPC. • Manager Number: Identifies an organizational entity. • Object Class: Identifies a “class”, or type of thing. • Serial Number: Specific instance of the Object Class being tagged. • We will refer to • <Header, Manager No, Object Class>: Prefix • <Serial No.>: Suffix
Use of Bitmap Datatype • Observation: Items move together. • Groups of items in the same proximity - e.g. on a shelf, on a shipment • Groups of items with same property - e.g. Same product • Use a bitmap type for modeling a collection of EPCs that can occur in item tracking applications. • Instead of storing a tuple per item store a tuple for all the items having same prefix. • New extra fields instead of epc: • <Len, Suffix_length, Prefix, suffix_start, Suffix_end, bitmap>
Example: Product Inventory • [Figure: the product inventory table stored with EPC collections versus with epc_bitmaps]
Use of Bitmap Datatype • 64-bit EPC layout: Header (2 bits), EPC_Manager (21 bits), Object_Class (17 bits), Serial_Number (24 bits) • Example EPC range: 0x4AA890001F62C160 … 0x4AA890001FA0B38E
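Given the field widths on this slide (2, 21, 17 and 24 bits), an EPC can be split into prefix and suffix with shifts and masks. The sketch below is illustrative only; the function and constant names are assumptions.

```python
# Field widths of the 64-bit layout, most significant field first.
FIELDS = (("header", 2), ("epc_manager", 21),
          ("object_class", 17), ("serial_number", 24))
SUFFIX_BITS = 24  # the serial number is the suffix

def split_epc(epc):
    """Return (prefix, suffix): the top 40 bits and the 24-bit serial number."""
    return epc >> SUFFIX_BITS, epc & ((1 << SUFFIX_BITS) - 1)

def decode_epc(epc):
    """Unpack a 64-bit EPC into its four named fields."""
    fields, shift = {}, 64
    for name, width in FIELDS:
        shift -= width
        fields[name] = (epc >> shift) & ((1 << width) - 1)
    return fields
```

The two example EPCs on the slide share the prefix 0x4AA890001F and differ only in the suffix, which is exactly the situation the bitmap encoding exploits.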
Bitmap Operations • To use the bitmap datatype in SQL, we need operations on such bitmaps. • Conversion and counting operations: epc2Bmap, bmap2Epc and bmap2Count • Pairwise logical operations: bmapAnd, bmapOr, bmapMinus, and bmapXor • Maintenance operations: bmapInsert and bmapDelete • Membership testing operation: bmapExists • Comparison operation: bmapEqual
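The semantics of these operations can be illustrated with Python integers used as bitmaps, where bit k is set when the EPC with suffix k is present. This is a minimal sketch and not the database implementation: it assumes a 24-bit suffix, handles one prefix per bitmap, and ignores the suffix_start/suffix_end offsets; the snake_case names are stand-ins for the SQL operators.

```python
SUFFIX_BITS = 24
MASK = (1 << SUFFIX_BITS) - 1

def epc2bmap(epcs):
    """Pack EPCs that share one prefix into (prefix, bitmap)."""
    epcs = list(epcs)
    prefixes = {e >> SUFFIX_BITS for e in epcs}
    if len(prefixes) != 1:
        raise ValueError("this sketch handles exactly one prefix per bitmap")
    bits = 0
    for e in epcs:
        bits |= 1 << (e & MASK)
    return prefixes.pop(), bits

def bmap2epc(prefix, bits):
    """Expand a bitmap back into the sorted list of full EPCs."""
    out, k = [], 0
    while bits:
        if bits & 1:
            out.append((prefix << SUFFIX_BITS) | k)
        bits >>= 1
        k += 1
    return out

def bmap2count(bits):
    return bin(bits).count("1")

def bmap_and(a, b):
    return a & b

def bmap_or(a, b):
    return a | b

def bmap_minus(a, b):
    return a & ~b  # items in a that are not in b

def bmap_xor(a, b):
    return a ^ b

def bmap_exists(bits, epc):
    return bool((bits >> (epc & MASK)) & 1)
```

For example, the items added to a shelf between two snapshots are `bmap2epc(prefix, bmap_minus(later_bits, earlier_bits))`.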
Use of these operations in SQL • Items added to a given shelf between time t1 and t2: • SELECT bmap2Epc(bmapMinus(s2.item_bmap, s1.item_bmap)) FROM Shelf_Inventory s1, Shelf_Inventory s2 WHERE s1.shelf_id = <sid1> AND s1.shelf_id = s2.shelf_id AND s1.time = <t1> AND s2.time = <t2>; • A book store classifies its books into various categories. • The following query determines the shelves that currently hold books with both the 'Adventure' and 'Romance' properties. • SELECT s.shelf_id FROM Shelf_Inventory s WHERE bmap2Count(bmapAnd(s.item_bmap, (SELECT bmapAnd(p.Adventure, p.Romance) FROM Property_Inventory p))) > 0 AND s.time = <current_date>;
Road Ahead • Extension to the bitmap proposal: • The bitmap datatype is more appropriate for initial bulk loads and batch updates. • It performs badly for incremental updates. • A 'hybrid scheme' for incremental updates: • Maintain periodic inventory checkpoints using bitmaps. • For changes occurring between checkpoints, maintain a traditional item-level table. • Answer queries by merging the latest checkpoint bitmap with the item-level data for the corresponding period. • The epc_suffix values in a collection may not be contiguous: • The bitmap will be sparse, with many zeros. • Compress it using some encoding scheme. • Good for initial bulk loading and batch updates. • May reduce the efficiency of bitmap operations.
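The proposed hybrid scheme can be sketched as a checkpoint bitmap plus a list of item-level changes that is replayed at query time. Everything here is an assumption made for illustration: the function name, the (time, suffix, op) delta layout, and the use of a Python integer as the bitmap.

```python
def inventory_at(checkpoint_bits, deltas, t):
    """Merge a checkpoint bitmap with item-level changes up to time t.

    deltas: iterable of (time, suffix, op) with op "add" or "remove",
    all recorded after the checkpoint was taken.
    """
    bits = checkpoint_bits
    for time, suffix, op in sorted(deltas):
        if time > t:
            break  # later changes are not visible at time t
        if op == "add":
            bits |= 1 << suffix
        else:
            bits &= ~(1 << suffix)
    return bits
```

Incremental updates then only append rows to the delta table; the bitmap itself is rewritten once per checkpoint.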
Open Problems • Efficient methods for data mining problems: • Trend analysis • Outlier detection • Path clustering • We will try exploring data mining applications for RFID data.
Issues in Data Cleaning • Lack of Completeness • RFID readers capture only 60-70% of all tags that are in the vicinity • Smoothing of data is done to rectify the loss of intermediate messages • Temporal Nature of data or tag dynamics • RFID tags are in motion and that is what makes them more difficult to handle • But motion of a tag causes dropping of messages • RFID data streams are very fast and are huge in number • Hence filtering is important before sending them to database
Current Strategies • Temporal Granule: • Based on the fact that tag data do not differ much over a small time period • Data can be clubbed on a small time frame • Spatial Granule: • Similarly, data from physically close readers are also homogeneous
Stages of ESP • Point: operates over a single value in a sensor stream, filtered by a predicate in the WHERE clause • Smooth: granularity defined by applications to correct for missed readings temporally (over one input only); uses aggregate function over the input. • Merge: granularity specified by the application to correct for missed readings spatially; grouped by the specified spatial granule.
Stages of ESP (contd.) • Arbitrate: deals with conflicts between different spatial granules; grouped by spatial granule first and then uses HAVING construct to determine those conflicts • Virtualize: used for combining data streams from different sources, could also be different devices; join construct is used to combine the different data streams and then filtered using some predicate
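Three of the five stages can be mimicked on a list of raw readings to show how they compose. This is a loose sketch, not ESP's declarative query pipeline: readings are assumed to be (reader, tag, time) tuples, each reader is treated as its own spatial granule, and Merge and Virtualize are left out.

```python
from collections import Counter

def point(readings, known_tags):
    """Point: drop individual readings whose tag fails a predicate."""
    return [r for r in readings if r[1] in known_tags]

def smooth(readings, granule):
    """Smooth: count reads per (reader, tag) within each temporal granule."""
    counts = Counter()
    for reader, tag, t in readings:
        counts[(t // granule, reader, tag)] += 1
    return counts

def arbitrate(counts):
    """Arbitrate: per window and tag, keep the reader with the most reads."""
    best = {}
    for (window, reader, tag), n in sorted(counts.items()):
        key = (window, tag)
        if key not in best or n > best[key][1]:
            best[key] = (reader, n)
    return {key: reader for key, (reader, _) in best.items()}
```

A tag heard by two neighbouring shelves in the same window is attributed to the shelf that read it more often; ties go to the reader that sorts first.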
Smooth stage • False positives (erroneous readings): reporting objects that are not actually present. • False negatives (missed readings): not reporting objects that actually are present. • [Figure: false positives and false negatives [Jeff06]]
Tag List • The reader has an internal table called the Tag List. • An epoch is the smallest unit of interaction between the reader and the middleware. • Every epoch consists of certain number of Interrogation cycles • Interrogation Cycle is one run of the reader protocol to determine all tags • At every epoch the reader sends the tag list to the middleware.
SMURF – Per tag Cleaning • SMURF uses statistical methods to reduce the false negatives and false positives occurring in the RFID stream. • The goal is twofold: to determine the statistical window size, and to ensure that tag transitions are detected. • To determine the window size we fit a probability distribution to the sample size. • To detect the transition of a tag out of the reader's vicinity, we define a 98% confidence interval within that probability distribution on the sample size $|S_i|$.
SMURF – Per tag Cleaning (contd.) • Using the tag list, the per-epoch sampling probability $p_{i,t}$ is determined: $p_{i,t}$ = (number of times the tag was read in an epoch) / (interrogation cycles per epoch). • We average this over the sample $S_i$, of size $|S_i|$, to get the average read rate $p_i^{avg}$ for tag $i$. • If the same probability $p_i^{avg}$ is assumed for each epoch throughout the window, then each epoch is like a Bernoulli trial in which the tag is either observed or not.
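The read-rate computation on this slide can be sketched in a few lines. The function names are hypothetical, and the sample is assumed here to consist of the epochs in which the tag was read at least once.

```python
def per_epoch_rates(reads_per_epoch, cycles_per_epoch):
    """p_{i,t}: fraction of interrogation cycles in which the tag answered."""
    return [reads / cycles_per_epoch for reads in reads_per_epoch]

def average_read_rate(reads_per_epoch, cycles_per_epoch):
    """Average p_{i,t} over the sample, i.e. epochs with at least one read.

    Returns (average rate, sample size).
    """
    sample = [p for p in per_epoch_rates(reads_per_epoch, cycles_per_epoch) if p > 0]
    if not sample:
        return 0.0, 0
    return sum(sample) / len(sample), len(sample)
```

With 10 interrogation cycles per epoch and read counts 5, 0, 10, 5 over four epochs, the sample has three epochs and an average read rate of 2/3.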
SMURF – Per tag Cleaning (contd.) • So $|S_i|$ is a binomial random variable for a sample $S_i$, with mean $w_i\,p_i^{avg}$ and variance $w_i\,p_i^{avg}(1 - p_i^{avg})$. • Using this, the window size can be expressed as a limit. • If the current window size is less than the calculated one, the window size is adjusted accordingly. • Similarly, using the central limit theorem for transition detection, we get $\big|\,|S_i| - \mu\,\big| > 2\sigma$, where $\mu$ and $\sigma^2$ are the mean and variance above.
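The window-size expression itself is not reproduced on this slide. The sketch below therefore assumes one bound that follows from the binomial model: the chance of missing a tag in all $w$ epochs is $(1 - p)^w \le e^{-pw}$, so requiring this to be at most $\delta$ gives $w \ge \ln(1/\delta)/p$. The parameter `delta` and both function names are assumptions; the transition test is the $2\sigma$ rule stated above.

```python
import math

def required_window(p_avg, delta=0.05):
    """Epochs needed so that a present tag is missed in all of them
    with probability at most delta (assumed completeness criterion)."""
    return math.ceil(math.log(1 / delta) / p_avg)

def transition_detected(sample_size, window, p_avg):
    """True when |S_i| deviates from its binomial mean by more than 2 sigma."""
    mean = window * p_avg
    sigma = math.sqrt(window * p_avg * (1 - p_avg))
    return abs(sample_size - mean) > 2 * sigma
```

A low read rate pushes the window up to keep readings complete, while a sample far smaller than `window * p_avg` suggests the tag has left and the window should shrink.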
Normal Sliding window…. • Epoch based mid-point sliding window • Emits a reading with an epoch value corresponding to the middle of the window
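A fixed-size mid-point window can be sketched as follows: the tag is reported for epoch m whenever any raw read falls inside the window centred on m. The function name is hypothetical and an odd window size is assumed so that the window has a well-defined middle.

```python
def midpoint_smooth(read_epochs, window, last_epoch):
    """Epochs for which a reading is emitted by a mid-point sliding window.

    read_epochs: epochs in which the tag was actually read.
    window: odd number of epochs covered by the window.
    """
    reads = set(read_epochs)
    half = window // 2
    return [
        m for m in range(last_epoch + 1)
        if any(e in reads for e in range(m - half, m + half + 1))
    ]
```

Gaps shorter than the window are filled in (fewer false negatives), but the tag also keeps being reported for `half` epochs after its last real read.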
Ensuring Completeness • In the first window, piavg demands a larger window • Thus window size is increased
Transition Detection • In the first window the number of readings decreases significantly (in the statistical sense). • Thus a transition is likely to have occurred, so the window is halved. [Franklin06]
SMURF – Multi-tag aggregate Cleaning • Similar to per-tag cleaning, a window size is determined for multi-tag cleaning; here $p^{avg}$ is the average per-epoch sampling probability over all observed tags. • To detect a transition in the population count, we estimate the population count over two windows, $[t - w_i, t]$ and $[t - w_i/2, t]$, with true populations $N_w$ and $N_{w'}$. • A transition is flagged when the difference between the two estimates exceeds the limit $2(\sigma_w + \sigma_{w'})$.
SMURF – Multi-tag aggregate Cleaning (contd.) • To estimate the population count we use π-estimators; with $S_w$ the set of tags observed in the window, the estimated population count is $\hat{N}_w = \sum_{i \in S_w} 1/\pi_i$. • Similarly, by π-estimators and assuming independence across different tags, the variance of the estimate is estimated as $\hat{\sigma}_w^2 = \sum_{i \in S_w} (1 - \pi_i)/\pi_i^2$. • Here $\pi_i$ is the probability of reading tag $i$ at least once during the whole window, given by $\pi_i = 1 - (1 - p_i^{avg})^{w}$.
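A π-estimator of the tag count can be sketched with the standard Horvitz-Thompson form: each observed tag contributes $1/\pi_i$ to the count and $(1 - \pi_i)/\pi_i^2$ to the variance estimate. The function name and input layout are assumptions, and every observed tag is assumed to have a positive average read rate.

```python
def pi_estimate(avg_read_rates, window):
    """Estimate the tag population and the variance of that estimate.

    avg_read_rates: p_i^avg (> 0) for each tag seen at least once.
    window: number of epochs in the smoothing window.
    """
    n_hat = var_hat = 0.0
    for p in avg_read_rates:
        pi = 1 - (1 - p) ** window  # P(tag read at least once in the window)
        n_hat += 1 / pi
        var_hat += (1 - pi) / pi ** 2
    return n_hat, var_hat
```

Tags that are easy to read count as roughly one item each, while a tag with a low inclusion probability stands in for the similar tags that were probably missed.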
The Road ahead… • RFID applications do not tolerate delays in data delivery. • Data resides either in the cache or in the database; database access increases processing time, while the cache does not support SQL-like queries. • Anomaly detection is also an important part of object tracking. • Issues like untraceability, forward security, and database desynchronization are still not completely resolved. • Counterfeiting is another serious problem with RFID. • In the next stage we expect to look into some of these issues.
