Projects CS698V: Data Mining
Areas • Web Mining • Bioinformatics • Multimedia Mining • Streaming Data Mining • General Data Mining Methodologies
Project 1: Mining the Web Graph • Web as a graph: each page is a node, each link is an edge • Task 1: Find the subgraph of the entire web consisting of the .ac.in domain • Task 1.1: Write crawlers to walk through the graph and remember the paths crawled, eventually building the full graph structure • Task 1.2: Suggest efficient structures for representing the (sparse) graph • Generate statistics about the graph: approximate number of nodes (about 857,000 according to Google), edges, and leaves
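For Task 1.2, one candidate structure is a compressed sparse row (CSR) layout, which stores all out-links in one flat array plus an offsets array. The sketch below is only an illustration under stated assumptions: the function names `build_csr` and `graph_stats` are hypothetical, the host names are made up, and a "leaf" is assumed to mean a page with no out-links inside the crawled set.

```python
from collections import defaultdict

def build_csr(edges):
    """Build a CSR layout from (source_url, target_url) pairs."""
    # Give every URL a dense integer id in order of first appearance.
    ids = {}
    for src, dst in edges:
        for url in (src, dst):
            ids.setdefault(url, len(ids))
    # Group targets by source id; a set removes duplicate links.
    out = defaultdict(set)
    for src, dst in edges:
        out[ids[src]].add(ids[dst])
    # targets[offsets[i]:offsets[i + 1]] holds the out-links of node i.
    offsets, targets = [0], []
    for node in range(len(ids)):
        targets.extend(sorted(out.get(node, ())))
        offsets.append(len(targets))
    return ids, offsets, targets

def graph_stats(offsets, targets):
    """Count nodes, edges, and leaves (assumed: nodes with no out-links)."""
    n = len(offsets) - 1
    leaves = sum(1 for i in range(n) if offsets[i] == offsets[i + 1])
    return {"nodes": n, "edges": len(targets), "leaves": leaves}

# Made-up crawl output; the repeated link is stored only once.
edges = [
    ("a.example.ac.in", "b.example.ac.in"),
    ("a.example.ac.in", "c.example.ac.in"),
    ("b.example.ac.in", "c.example.ac.in"),
    ("a.example.ac.in", "b.example.ac.in"),
]
ids, offsets, targets = build_csr(edges)
print(graph_stats(offsets, targets))
```

With integer ids, the memory cost is roughly one integer per edge plus one per node, which is why this kind of layout is a common starting point for sparse graphs.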
Task 2: Mine the .ac.in web graph • Cluster the nodes based on link structure • Identify Hubs and Authorities • Report interesting patterns of the cluster structure Benefits: • Statistics of the domain are not currently available and would be useful for optimization • Study of the evolution pattern (infancy, maturity, saturation) • Identify the likely “hidden web” in this domain • A search engine for the Indian academic and research network
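Hubs and authorities are usually computed with the HITS iteration: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The following is a minimal sketch, assuming the graph fits in memory as a dictionary; the function name `hits` and the toy page names are illustrative, not part of the project description.

```python
import math

def hits(graph, iterations=50):
    """graph maps each node to an iterable of nodes it links to."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: add up hub scores of in-linking pages.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: add up the new authority scores of linked pages.
        hub = {n: sum(auth[d] for d in graph.get(n, ())) for n in nodes}
        # Rescale both vectors to unit length so scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Toy graph: two directory-like pages pointing at two content pages.
graph = {"h1": ["a", "b"], "h2": ["a"], "a": [], "b": []}
hub, auth = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

On a graph the size of the .ac.in domain, the same updates would more likely be run as sparse matrix-vector products over a compact edge representation.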
Further Reading • Publications of the Stanford WebBase Group • UbiCrawler, WebGraph, University of Milano, Italy • The Chilean Web
Project 2: Metasearch • Combines the search results of several search engines • The combination strategy is open to research • Each search engine returns a set of pages ranked according to their relevance to the query • How to get a combined ranking (cranking)? • Tasks: Comparative study of different personalized and adaptive combination schemes. Propose a new scheme.
Further Reading • Cranking using conditional probabilistic models, Lebanon, Lafferty, ICML 2002 • Rank Aggregation Methods for the Web, Dwork, Ravi Kumar et al, 2001 • Learning to Order Things, Cohen, Schapire • Comparing top k lists, Fagin, Ravi Kumar, 2003
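As a baseline for the combined ranking in Project 2, the Borda count is a simple positional scheme: each engine awards points to a page according to its rank, and pages are ordered by total points. The sketch below is one possible baseline, not the cranking model from the reading list; the function name `borda_merge` is hypothetical, and pages missing from an engine's list are assumed to receive zero points from that engine.

```python
from collections import defaultdict

def borda_merge(rankings):
    """Merge ranked lists (best first) into one list by Borda score."""
    universe = set().union(*rankings)
    k = len(universe)
    scores = defaultdict(float)
    for ranking in rankings:
        for position, page in enumerate(ranking):
            # Top page earns k points, the next k - 1, and so on.
            scores[page] += k - position
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(universe, key=lambda page: (-scores[page], page))

# Three made-up engine outputs for the same query.
rankings = [["x", "y", "z"], ["y", "x"], ["y", "z", "w"]]
print(borda_merge(rankings))
```

Personalized or adaptive schemes could be compared against this baseline by weighting each engine's points differently per user or per query.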
Project 3: Intelligent Web Search Agents • WebMate: A web search agent/assistant which uses a proxy to record the user's browsing patterns and recommends sites for future visits Task: • Study different agent architectures • Use association rules and other data mining technologies to design more intelligent web agents
Further Reading • WebMate (CMU) • Calvin (U. Leipzig)
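To illustrate how association rules might drive recommendations in Project 3, the sketch below mines pairwise rules "visited A implies visited B" from browsing sessions using support and confidence thresholds. It is a simplified illustration, not WebMate's method: the function name `pair_rules`, the thresholds, and the session data are all assumptions, and only rules with a single item on each side are considered.

```python
from collections import Counter
from itertools import combinations

def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) tuples for site pairs."""
    n = len(sessions)
    item_count = Counter()
    pair_count = Counter()
    for session in sessions:
        sites = sorted(set(session))  # ignore repeat visits in a session
        item_count.update(sites)
        pair_count.update(combinations(sites, 2))
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # fraction of sessions containing both sites
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    # Most confident rules first; ties broken by site name.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))

# Made-up sessions as they might be recorded by a proxy.
sessions = [["news", "mail", "wiki"], ["news", "mail"], ["news", "wiki"], ["mail"]]
for lhs, rhs, support, confidence in pair_rules(sessions):
    print(f"{lhs} -> {rhs}  support={support:.2f} confidence={confidence:.2f}")
```

An agent could recommend the right-hand side of a rule whenever the user visits its left-hand side; longer itemsets would need a full frequent-itemset miner such as Apriori.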
Project 4: Hypertext/Text Categorization using Support Vector Machines • Task: Study of different SVM kernels for hypertext/text categorization for large collections • Task: Propose a new kernel which incorporates link information Further Reading: Composite kernels for hypertext categorization, Joachims, Cristianini, ICML 2001
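One simple way to incorporate link information is a convex combination of a text kernel and a link kernel, which is itself a valid kernel because a non-negative weighted sum of kernels remains positive semidefinite. The sketch below assumes cosine (normalized linear) kernels over sparse dictionaries and a hypothetical mixing parameter `weight`; the function names and feature names are illustrative and are not taken from the cited paper.

```python
def dot(u, v):
    """Sparse dot product between two dict-based feature vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(f, 0.0) for f, w in u.items())

def cosine_kernel(u, v):
    """Normalized linear kernel; defined as 0 if either vector is empty."""
    denom = (dot(u, u) * dot(v, v)) ** 0.5
    return dot(u, v) / denom if denom else 0.0

def composite_gram(text_vecs, link_vecs, weight=0.5):
    """Gram matrix of weight * K_text + (1 - weight) * K_link."""
    n = len(text_vecs)
    return [
        [
            weight * cosine_kernel(text_vecs[i], text_vecs[j])
            + (1.0 - weight) * cosine_kernel(link_vecs[i], link_vecs[j])
            for j in range(n)
        ]
        for i in range(n)
    ]

# Made-up documents: term counts and the pages that link to each document.
text_vecs = [{"svm": 1, "kernel": 1}, {"svm": 1}, {"gene": 1}]
link_vecs = [{"p1": 1}, {"p1": 1, "p2": 1}, {"p2": 1}]
for row in composite_gram(text_vecs, link_vecs):
    print([round(x, 3) for x in row])
```

A Gram matrix like this could be passed to any SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`, and `weight` tuned by cross-validation.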
Project 5: Mining Microarray Data • Critical Assessment of Microarray Data Analysis (CAMDA) • Tasks: • Identifying genes responsible for a disease • Gene clustering/association mining • Gene regulatory networks Further Reading: • Papers in CAMDA contest data site
Project 6: Mining Association Rules from Image Database • Perceptual Association Rules • Tesic, Newsam, Manjunath, SIAM data mining conf. 2003 • Image database: NASA Mars images, Corel image database • Task: Study different forms of generalized association rules that can be mined from images • Task: Innovative use of the rules in retrieval and event (e.g., cyclone) detection
Project 7: Privacy Preserving Data Mining • Watermarking relational data – Agrawal, 2003 • Privacy preserving data mining – Agrawal 2000 • Task: Study of different frameworks for preserving privacy, propose new watermarking techniques
Project 8: Mining for Alarming Incidents in Data Streams • Task: Study existing outlier detection algorithms • Task: Use algorithms for clustering data streams to detect outliers which are alarming incidents Further Reading: • MAIDS project, J. Han, UIUC • Clustering data streams, Mishra, Guha, Motwani, FOCS 2000
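The second task in Project 8 can be prototyped with a one-pass, leader-style clustering: each arriving point joins the nearest cluster if it lies within a fixed radius, otherwise it starts a new cluster, and points landing in very small clusters are flagged. This is a deliberately simple stand-in, not the k-median algorithm from the reading list; the class name `StreamClusterer` and the parameters `radius` and `min_size` are assumptions made for illustration.

```python
import math

class StreamClusterer:
    """One-pass clustering that flags points falling in sparse clusters."""

    def __init__(self, radius, min_size=3):
        self.radius = radius      # maximum distance to join a cluster
        self.min_size = min_size  # clusters smaller than this look suspicious
        self.clusters = []        # each cluster: {"center": [...], "count": int}

    def update(self, point):
        """Absorb one point; return True if it is flagged as an outlier."""
        best, best_dist = None, math.inf
        for cluster in self.clusters:
            d = math.dist(point, cluster["center"])
            if d < best_dist:
                best, best_dist = cluster, d
        if best is None or best_dist > self.radius:
            # Too far from everything seen so far: open a new cluster.
            self.clusters.append({"center": list(point), "count": 1})
            return self.min_size > 1
        # Incrementally move the centroid toward the new point.
        best["count"] += 1
        c = best["count"]
        best["center"] = [m + (x - m) / c for m, x in zip(best["center"], point)]
        return c < self.min_size

# Made-up 2-D stream with one far-away reading.
stream = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0), (0.05, 0.05)]
detector = StreamClusterer(radius=1.0, min_size=3)
print([detector.update(p) for p in stream])
```

Note that the first few points of any new cluster are also flagged until it reaches `min_size`, so a practical detector would need a warm-up period and some way to expire stale clusters.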
Project 9: Data Mining Standards • Task: A detailed report on different data mining standards. • Models • Interfaces • Drawbacks • Scope for contribution Further Reading: • Microsoft OLE DB for DM • Oracle PMML • CRISP-DM
Schedule • Project and group formation to be finalized by Monday, Feb. 16, 2004 • Project plan due by Feb. 23, 2004 • Midterm status check around March 23, 2004 • Final demonstration and documentation due by May 1, 2004