Feature Selection on Time-Series Cab Data Yingkit (Keith) Chow
Contents • Introduction • Features Considered • FCBF (Filter-type feature selection) • FCBF-PCA (my variation) • Conclusion
All Features Considered • Each time sample consists of the following features: • Day of Week and Time of Day (the first two features) • taxis[t, 6:9], taxis[t-1, 6:9], …, taxis[t-5, 6:9] • [6:9] indexes the columns of the matrix taxis that hold four counts: cabs entering with meter off, cabs entering with meter on, cabs exiting with meter off, and cabs exiting with meter on • Not all of these features will be relevant to classifying whether a game is taking place.
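A minimal sketch of how one such feature vector could be assembled. It assumes the slide's 6:9 is a 1-based, inclusive range (four columns), which corresponds to slice(5, 9) in Python; the function name and the synthetic data are hypothetical and only illustrate the layout.

```python
def build_feature_vector(taxis, day_of_week, time_of_day, t,
                         n_lags=6, cols=slice(5, 9)):
    """Return [day, time, counts at t, counts at t-1, ..., counts at t-5]."""
    if t < n_lags - 1:
        raise ValueError("not enough history before sample t")
    features = [day_of_week[t], time_of_day[t]]
    for lag in range(n_lags):
        # Four cab counts (enter off, enter on, exit off, exit on) per lag.
        features.extend(taxis[t - lag][cols])
    return features


# Synthetic data: 8 samples, 9 columns; only the last 4 columns are used.
taxis = [[0] * 5 + [10 * t + k for k in range(4)] for t in range(8)]
day_of_week = [1] * 8
time_of_day = list(range(8))

vec = build_feature_vector(taxis, day_of_week, time_of_day, t=5)
print(len(vec))  # 2 + 6 * 4 = 26
```

Under these assumptions each sample has 26 features: 2 calendar features plus 4 counts for each of 6 time steps.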
Fast Correlation-Based Filter (FCBF) Algorithm: • Find the relevant features, i.e. those with SU(i, C) > threshold, • where SU is symmetric uncertainty (described on the next slide) and C is the class label • Remove redundant features by comparing the features that remain after the first step with each other • Remove feature j if SU(i, j) >= SU(j, C) for some remaining feature i that is at least as relevant, i.e. SU(i, C) >= SU(j, C)
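The two steps above can be sketched as follows. This is an illustrative reading of the algorithm, not the author's implementation: the SU values are passed in precomputed, and the names fcbf, su_class, and su_pair are hypothetical.

```python
def fcbf(su_class, su_pair, threshold):
    """Select feature indices with a Fast Correlation-Based Filter pass.

    su_class[i]   : SU between feature i and the class label
    su_pair(i, j) : SU between features i and j
    """
    # Step 1: keep relevant features, ordered by decreasing SU with the class.
    candidates = sorted(
        (i for i, su in enumerate(su_class) if su > threshold),
        key=lambda i: su_class[i],
        reverse=True,
    )
    selected = []
    # Step 2: accept the strongest remaining feature i, then drop every
    # weaker feature j that it makes redundant: SU(i, j) >= SU(j, C).
    while candidates:
        i = candidates.pop(0)
        selected.append(i)
        candidates = [j for j in candidates if su_pair(i, j) < su_class[j]]
    return selected


# Made-up SU values for four features.
su_class = [0.50, 0.40, 0.005, 0.30]
pair = {(0, 1): 0.45, (0, 3): 0.10, (1, 3): 0.20}
chosen = fcbf(su_class, lambda i, j: pair[tuple(sorted((i, j)))], threshold=0.01)
print(chosen)  # [0, 3]
```

In this toy example feature 2 fails the relevance threshold, and feature 1 is removed because it is more correlated with feature 0 (0.45) than with the class (0.40).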
Equations [1] • Information Gain (IG) • IG(X|Y) = H(X) - H(X|Y) • Symmetric Uncertainty (SU) • SU(X, Y) = 2 * IG(X|Y) / [H(X) + H(Y)] • SU is used instead of IG because it compensates for IG's bias toward features with more values and normalizes the result to the range [0, 1] [1]
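A minimal sketch of these two quantities for already-discretized data, assuming H is Shannon entropy in bits. The function names are illustrative, and the conditional entropy is computed through the identity H(X|Y) = H(X, Y) - H(Y).

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy H(X), in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X|Y) = H(X, Y) - H(Y)."""
    return entropy(list(zip(x, y))) - entropy(y)


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y))."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant: treat as uninformative
    ig = hx - conditional_entropy(x, y)
    return 2 * ig / (hx + hy)


print(symmetric_uncertainty([0, 0, 1, 1], [0, 0, 1, 1]))  # identical: 1.0
print(symmetric_uncertainty([0, 0, 1, 1], [0, 1, 0, 1]))  # independent: 0.0
```

The two calls show the extremes of the normalized range: 1 when one variable fully determines the other, 0 when they are independent.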
FCBF • Classifier: MATLAB classify (linear) • Number of bins = 96 • Threshold = 0.01 • Accuracy = 91.9%
Choice of Number of Bins • Num Bins = 96: results shown on the previous slide (red is the ground truth for games and blue is my classification) • Num Bins = 20 • Accuracy = 58.6% • Here the algorithm breaks down and selects only feature 2, the “time of day”. The blue trace is therefore periodic: the same time segment is classified as a game every day.
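The slides do not say how the features were discretized, so the following is only a sketch under the assumption of equal-width binning. It shows how a time-of-day variable with 96 distinct values keeps its full resolution at 96 bins but has neighbouring slots merged at 20 bins. The function name and the 0..95 encoding are hypothetical.

```python
def equal_width_bins(values, num_bins):
    """Map each value to a bin index in 0..num_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / num_bins
    # The maximum value would land in bin num_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


# Time of day with 96 discrete slots, encoded here as 0..95 (an assumption).
time_of_day = list(range(96))
print(len(set(equal_width_bins(time_of_day, 96))))  # 96: one bin per slot
print(len(set(equal_width_bins(time_of_day, 20))))  # 20: several slots per bin
```

Because symmetric uncertainty is computed on the binned values, a change in the number of bins changes every SU score and can therefore change which features pass the threshold.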
FCBF-PCA • FCBF compares individual features with each other • PCA can instead capture a group of features jointly. For example, one eigenvector might capture the shape of the rise in cabs entering with meters on before a game starts, or the increase in cabs entering with meters off shortly before a game ends • Example shown on the next slide
Cab Traffic Behavior • Before Start of Game • Cab On Enter and Cab Off Exit are high • Towards End of Game • Cab Off Enter and Cab On Exit are high
FCBF-PCA • Classifier: MATLAB classify (linear) • Number of bins = 20 • Threshold = 0.01 • Accuracy = 92.9% • Note: the features here are projections onto the eigenvectors, not the original feature dimensions
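To make "projections onto the eigenvectors" concrete, here is a dependency-free sketch that finds only the leading principal component by power iteration and projects a sample onto it. The slides do not specify how PCA was computed or which features were grouped, so every name here is hypothetical, and a library eigen-solver would normally replace the hand-written loop.

```python
def first_principal_component(rows, iterations=200):
    """Return (mean, unit eigenvector) of the leading principal component."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - mean[k] for k in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Non-uniform start vector; it must not be orthogonal to the answer.
    vec = [1.0 / (k + 1) for k in range(d)]
    for _ in range(iterations):
        # Power iteration: apply the covariance matrix, then renormalize.
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = sum(v * v for v in nxt) ** 0.5
        if norm == 0:
            break  # zero variance: no principal direction exists
        vec = [v / norm for v in nxt]
    return mean, vec


def project(row, mean, vec):
    """Coordinate of one sample along the principal direction."""
    return sum((row[k] - mean[k]) * vec[k] for k in range(len(row)))


# Synthetic data lying exactly on the direction (1, 2).
rows = [[1, 2], [2, 4], [3, 6], [4, 8]]
mean, vec = first_principal_component(rows)
score = project(rows[-1], mean, vec)
print(vec, score)
```

In the FCBF-PCA variant, scores like this one (one per eigenvector) would be binned and passed to FCBF in place of the raw features.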
Conclusions • The choice of the number of bins has an enormous impact on FCBF performance (possibly because the time-of-day variable takes 96 discrete values) • FCBF-PCA was less susceptible to the choice of numBins (numBins of 10, 20, and 100 all resulted in approximately 91% accuracy)
Future Work • The current labels are game or not game. • I’ll try to train one classifier to detect the first sample of a game and another to detect the last sample, since mid-game samples generally have entirely different characteristics from those at the beginning and end of a game. However, I might be limited by the number of samples.
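One way the first-sample and last-sample labels could be derived from the existing game / not-game labels is sketched below. The function name and the 0/1 encoding are assumptions for illustration.

```python
def onset_offset_labels(game):
    """Mark the first and last sample of each contiguous run of game samples."""
    n = len(game)
    first = [0] * n
    last = [0] * n
    for t, g in enumerate(game):
        if not g:
            continue
        if t == 0 or not game[t - 1]:
            first[t] = 1  # previous sample was not a game: a game starts here
        if t == n - 1 or not game[t + 1]:
            last[t] = 1   # next sample is not a game: a game ends here
    return first, last


game = [0, 1, 1, 1, 0, 0, 1, 1, 0]
first, last = onset_offset_labels(game)
print(first)  # [0, 1, 0, 0, 0, 0, 1, 0, 0]
print(last)   # [0, 0, 0, 1, 0, 0, 0, 1, 0]
```

Each game contributes only one positive sample per label, which illustrates the sample-size concern raised above.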
Questions • I’m not currently in NYC so please send questions or comments to: • yingkit.chow@gmail.com
Citations • “Feature Selection for High Dimensional Data: A Fast Correlation-Based Filter Solution”, by Lei Yu and Huan Liu, ICML (2003) • “Efficient Feature Selection via Analysis of Relevance and Redundancy”, by Lei Yu and Huan Liu, Journal of Machine Learning Research 5 (2004)