Mtech Projects 2002 Sunita Sarawagi
Sequence mining • Several real-life mining applications on sequence data • Classical applications • Speech, language, and handwriting are all complex sequences • Newer applications • Bio-informatics: DNA and proteins • Telecommunication: network alarms, network packet data • Retail data mining: customer behavior
Sequence mining: problems • Existing work scattered and application specific • Field in dire need of consolidated algorithms and software solutions • More technical details can be discussed after we finish this topic in class on March 3
Sensor databases and mining • Several distributed sensors that push data to centralized database servers • Example: Automatic Vehicle Location systems consisting of sensors at bus stops, with an entry logged at the server each time a bus passes a stop • Goal: Build a DBMS for managing this data and supporting queries like “When is the next bus to X going to arrive?”
Problems • Cross-disciplinary, covering several areas • A mining sub-problem: predicting arrival time based on • Previous arrival patterns of the same bus • Traffic conditions derived from other buses with common routes • A database query problem: • Approximate search based on spoken queries
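The mining sub-problem above can be made concrete with a minimal sketch. It assumes travel times per route segment are already extracted from the stop-passing log; the function name `predict_arrival`, its parameters, and the even blending weight are hypothetical choices for illustration, not the project's method.

```python
from statistics import mean


def predict_arrival(logged_at_prev_stop, own_history, other_buses_now=None, weight=0.5):
    """Estimate when a bus reaches the next stop (seconds since midnight).

    logged_at_prev_stop: time the bus was logged at the previous stop.
    own_history: past travel times (s) of this bus on the segment.
    other_buses_now: travel times (s) just observed for other buses that
        share the segment; used as a crude signal of current traffic.
    weight: assumed share given to current traffic versus history.
    """
    baseline = mean(own_history)
    if other_buses_now:
        # Blend the bus's own history with what other buses are seeing now.
        travel = (1 - weight) * baseline + weight * mean(other_buses_now)
    else:
        travel = baseline
    return logged_at_prev_stop + travel


# History says 300 s, but other buses currently need 500 s on the segment.
print(predict_arrival(1000, [300, 300], [500, 500]))
```

A real system would likely condition on time of day and weight recent observations more heavily; this sketch only shows how the two information sources named on the slide could be combined.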
Multi-relational data mining • Existing mining software assumes data in a single relation • Real-life data spans multiple relations • Existing tools rely on manual preprocessing before mining can begin, which is time-consuming and inaccurate • Design and implement mining algorithms for multi-relational data
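To illustrate the manual preprocessing step that single-relation tools require, here is a small sketch that flattens two relations into one table by aggregating the child relation per parent key. The relation names, columns, and the helper `flatten` are made up for the example.

```python
from collections import defaultdict


def flatten(customers, orders):
    """Join customers with per-customer aggregates of their orders."""
    totals = defaultdict(lambda: {"n_orders": 0, "total_spent": 0.0})
    for order in orders:
        agg = totals[order["customer_id"]]
        agg["n_orders"] += 1
        agg["total_spent"] += order["amount"]
    # Customers without orders receive zero-valued aggregates.
    return [{**c, **totals[c["id"]]} for c in customers]


customers = [{"id": 1, "city": "Pune"}, {"id": 2, "city": "Mumbai"}]
orders = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 5.0},
]
for row in flatten(customers, orders):
    print(row)
```

Choosing which aggregates to compute is a manual decision, and any detail not captured by them (for example, the order in which purchases happened) is lost before mining starts. That loss is one reason to mine the relations directly.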
Who should apply • Fascinated by data mining, databases, and machine learning • Want to get a flavor of cutting-edge research • Enjoyed the courses • Have a knack for algorithm design and implementation • Are very software savvy • Want to stretch their learning and knowledge rather than slide through with an “easy” project
Possible achievements • Understand one topic deeply, learn to innovate • Produce software that several people use • Write papers in really top-quality international conferences • Demo the software in leading international forums
Industries in the area • IBM IRL • Strand Genomics • GE Capital • TCS bio-informatics • PSPL • Startups like Vistaar • Outside India: several
Automatic segmentation of free text records, 2000 Batch • An HMM-based address segmenter • Software licensed by a Data Cleaning company • Paper in one of the two premium database conferences • ACM SIG on Management of Data (SIGMOD) 2001, Santa Barbara USA.
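The slide does not describe the segmenter's model, so the following is only a generic illustration of how an HMM can label address tokens: textbook Viterbi decoding over a toy model. The states, the probabilities, and the example address are all invented for the sketch.

```python
def viterbi(tokens, states, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most probable state sequence for the given tokens."""
    # best[s] = (probability, path) of the best path that ends in state s
    best = {s: (start_p[s] * emit_p[s].get(tokens[0], unseen), [s]) for s in states}
    for tok in tokens[1:]:
        nxt = {}
        for s in states:
            prob, path = max(
                ((best[p][0] * trans_p[p][s], best[p][1]) for p in states),
                key=lambda x: x[0],
            )
            nxt[s] = (prob * emit_p[s].get(tok, unseen), path + [s])
        best = nxt
    return max(best.values(), key=lambda x: x[0])[1]


# Toy model with hand-set (assumed) probabilities.
STATES = ["House", "Street", "City"]
START_P = {"House": 0.8, "Street": 0.15, "City": 0.05}
TRANS_P = {
    "House": {"House": 0.2, "Street": 0.7, "City": 0.1},
    "Street": {"House": 0.05, "Street": 0.5, "City": 0.45},
    "City": {"House": 0.05, "Street": 0.05, "City": 0.9},
}
EMIT_P = {
    "House": {"12": 0.5, "flat": 0.3},
    "Street": {"mg": 0.4, "road": 0.5},
    "City": {"pune": 0.6, "mumbai": 0.3},
}

tokens = ["12", "mg", "road", "pune"]
print(list(zip(tokens, viterbi(tokens, STATES, START_P, TRANS_P, EMIT_P))))
# labels: House, Street, Street, City
```

In practice the probabilities would be estimated from labeled addresses, and log-probabilities would replace the raw products to avoid underflow on long records.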
ICube – Intelligent Rollups • MTP work integrated in ICube, demoed at SIGMOD 2000 held in Texas, USA • ICube software adopted by a startup • Paper at the other premium database conference, VLDB 2001 held in Rome, Italy.
Data deduplication using active learning • Software likely to be transferred to National Informatics Corporation, Pune • Practical application of an interesting idea from machine learning • Paper at KDD 2002 conference held in Canada • Demos at VLDB 2002 Hong Kong, ICDE 2003 Bangalore
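The slide does not say which active-learning strategy the project used. As one common possibility, the sketch below shows uncertainty sampling: among candidate record pairs, ask a human to label the pair the current model is least sure about. The token-overlap score standing in for a learned duplicate probability, the helper names, and the records are assumptions made for illustration.

```python
def jaccard(a, b):
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def most_uncertain(pairs, score, k=1):
    """Pick the k pairs whose duplicate score is closest to 0.5."""
    return sorted(pairs, key=lambda p: abs(score(*p) - 0.5))[:k]


PAIRS = [
    ("12 mg road pune", "12 mg road pune"),         # clearly duplicates
    ("12 mg road pune", "12 m g road pune"),        # borderline
    ("12 mg road pune", "44 park street kolkata"),  # clearly distinct
]

# The borderline pair is the one worth sending to a human labeler.
print(most_uncertain(PAIRS, jaccard))
```

Labeling only such borderline pairs is what lets a deduplication model improve with far fewer labeled examples than labeling pairs at random.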