CS240A Project • Project proposal • Teams of 2-3 students • Interesting parallel application or system • Conference-quality paper as a leveraging point (what is the challenge?) • ACM/IEEE/SIAM/USENIX conferences (SuperComputing, SIGIR, WSDM, SIGMOD, etc.) • High performance and measurement are key: • Understanding performance, tuning, scaling, etc. • More important than the difficulty of the problem • Leverage • Research projects, Master's projects
Some Project Ideas • Examples • Parallel applications • Data mining. Ranking (parallel algorithms). • Duplicate detection • Secure search (encrypted data, slow to process) • Search engines/graph search, etc. • Matrix multiplication for similarity/recommendation • Systems • Speed up MapReduce I/O. Incremental computing • Integrate MPI with MapReduce • Parallel storage systems.
Timeline • Week of Feb 7: preliminary proposal • Meeting with me now or later • Select paper(s) for reviewing. • Week of Feb 15 (probably delayed by 1 week): Project progress presentation • Background reviewing and progress report. • Week of March 17: Final project presentation. 5-page report.
Datasets • Wikipedia data sets • http://www.search-engines-book.com/collections/ • 121K documents, 715MB. • NSF Research Award Abstracts 1990-2003 Data Set • (129,000 abstracts) http://archive.ics.uci.edu/ml/datasets/NSF+Research+Award+Abstracts+1990-2003 • Edmunds car reviews & Tripadvisor hotel reviews • from Tripadvisor (~259,000 reviews) and Edmunds (~42,230 reviews). • http://archive.ics.uci.edu/ml/datasets/OpinRank+Review+Dataset • Yahoo music • KDD Cup 2011 Yahoo music ratings (717M over 600K items). Columbia Million Song collection. • Yahoo answer dataset • 4.4M questions and their corresponding answers.
Processed Datasets • Microsoft Web page ranking data (LETOR) • Feature vectors extracted from query-url pairs along with relevance judgment training data (10K, 30K). • Bag-of-words datasets • musiXmatch (song lyrics, 237,662 tracks). • http://labrosa.ee.columbia.edu/millionsong/musixmatch • Enron emails (39861), PubMed abstracts (8.2M), NYTimes articles (0.3M). http://archive.ics.uci.edu/ml/datasets/Bag+of+Words • KDD 2012 datasets for ads click and social networking. • Yahoo front page click data (45M clicks)
Cache-aware Matrix Multiplication for Similarity Comparison • Similarity computation. • Two items are similar if the dot product of their vectors exceeds a threshold. • N-item similarity is a matrix multiplication problem. • Expected work: • Cache analysis and performance tuning with Hadoop C/C++ • Data: • document dataset • More details: • http://cs.ucsb.edu/~xtang/streaming.html • Contact: Maha@cs, xtang@cs
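The tiling idea behind cache-aware similarity comparison can be sketched as below. This is only an illustration under assumptions: the function name `similar_pairs`, the tile size `block`, and the toy vectors are hypothetical, and plain Python will not show real cache effects. The point is the loop structure, which a C/C++ version would tune so that two tiles of vectors fit in cache while they are compared.

```python
def similar_pairs(vectors, threshold, block=2):
    """Return (i, j, score) for i < j whose dot product exceeds threshold.

    Items are visited tile by tile so that the two groups of vectors
    being compared are reused while they are still resident in cache.
    """
    n = len(vectors)
    pairs = []
    for bi in range(0, n, block):          # tile of "row" items
        for bj in range(bi, n, block):     # tile of "column" items (upper triangle only)
            for i in range(bi, min(bi + block, n)):
                for j in range(max(bj, i + 1), min(bj + block, n)):
                    score = sum(a * b for a, b in zip(vectors[i], vectors[j]))
                    if score > threshold:
                        pairs.append((i, j, score))
    return pairs


# Hypothetical toy data: four items, three features each.
docs = [
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0],
]
result = similar_pairs(docs, threshold=1.0)
print(result)
```

Choosing `block` so that two tiles fit in the target cache level is one of the parameters the cache analysis would explore.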
Music Recommendation • Recommendation uses item similarity computation. • Two items are similar if the dot product of their vectors exceeds a threshold. • N-item similarity is a matrix multiplication problem. • Expected work: MapReduce Java programming. Data analysis • Data: • Yahoo music • Challenge: MapReduce programming. • Related paper: Fidel Cacheda, Víctor Carneiro, Diego Fernández, and Vreixo Formoso. Comparison of collaborative filtering algorithms: Limitations of current techniques and proposals for scalable, high-performance recommender systems. ACM Trans. Web, 5(1), February 2011. • http://kddcup.yahoo.com/workshop.php • Contact: maha@cs
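One way to picture item similarity as a MapReduce job is the in-memory sketch below. It is an assumption-laden illustration, not the method of the related paper: `map_user`, `reduce_pairs`, the toy ratings, and the threshold of 10 are all hypothetical. The map step emits a partial product for every pair of items one user rated, and the reduce step sums them into the item-item dot product.

```python
from collections import defaultdict
from itertools import combinations


def map_user(user_ratings):
    """Map step: for one user's ratings, emit ((item_a, item_b), product)."""
    for (a, ra), (b, rb) in combinations(sorted(user_ratings.items()), 2):
        yield (a, b), ra * rb


def reduce_pairs(emitted):
    """Reduce step: sum the partial products for each item pair."""
    sums = defaultdict(float)
    for key, value in emitted:
        sums[key] += value
    return dict(sums)


# Hypothetical toy ratings: user -> {item: rating}.
ratings = {
    "u1": {"songA": 5, "songB": 4},
    "u2": {"songA": 3, "songB": 1, "songC": 2},
}
emitted = [kv for user in ratings.values() for kv in map_user(user)]
scores = reduce_pairs(emitted)
similar = {pair for pair, s in scores.items() if s > 10}  # threshold is an assumption
print(scores, similar)
```

In a real Hadoop job the grouping by item pair would be done by the shuffle phase rather than by a dictionary in one process.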
Clustered Storage and Deduplication in Cloud Backup • Build a parallel cloud backup system with deduplication support under various constraints • Work involved: use a distributed file system and build a deduplication layer. • Contact: wei@cs • Paper: Wei Zhang et al. Multi-level Selective Deduplication for VM Snapshots in Cloud Storage. In Proc. of IEEE Cloud 2012.
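A deduplication layer can be illustrated with a minimal sketch that splits data into fixed-size chunks and stores each distinct chunk once, keyed by its fingerprint. Everything here is an assumption for illustration: the class name `DedupStore`, the tiny chunk size, and fixed-size chunking itself. It is not the multi-level selective scheme of the cited paper, and a real system would keep chunks in a distributed file system rather than a dictionary.

```python
import hashlib


class DedupStore:
    """Toy chunk store: identical chunks are kept only once."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}  # fingerprint -> chunk bytes

    def put(self, data):
        """Store data and return its recipe (ordered list of fingerprints)."""
        recipe = []
        for off in range(0, len(data), self.chunk_size):
            chunk = data[off:off + self.chunk_size]
            fp = hashlib.sha1(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)  # skip chunks already stored
            recipe.append(fp)
        return recipe

    def get(self, recipe):
        """Rebuild the original data from its recipe."""
        return b"".join(self.chunks[fp] for fp in recipe)


store = DedupStore(chunk_size=4)
r1 = store.put(b"AAAABBBBCCCC")  # first backup
r2 = store.put(b"AAAABBBBDDDD")  # second backup shares two chunks with the first
print(len(store.chunks))         # number of distinct chunks actually stored
```

The two backups contain six chunks in total, but only the distinct ones occupy space in the store.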
Computation and Communication Re-Scheduling for MapReduce • This project studies alternative task scheduling and communication methods for repetitive MapReduce jobs • Schedule mapper/reducer tasks when certain computation and communication patterns of the jobs are known. • Communication can also be re-arranged. • Paper: A platform for scalable one-pass analytics using MapReduce, SIGMOD 2011 • ParaTimer: a progress indicator for MapReduce DAGs, SIGMOD 2010
High Performance or Secure Inverted Index Search • Develop prototype inverted index search code on a multi-threaded multicore shared memory system • Input: inverted index for a set of words • Output: results matching a query • Expected work: algorithm understanding. C/C++ programming using pthreads. • If you are interested in security: • Secure search reference: Privacy-Preserving Multi-keyword Ranked Search over Encrypted Cloud Data. INFOCOM 2011 • If you are interested in algorithm performance: • Culpepper, J., Moffat, A.: Efficient set intersection for inverted indexing. ACM Transactions on Information Systems (TOIS) 29(1) (2010)
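The core of conjunctive query processing over an inverted index is the intersection of sorted posting lists, sketched below. The names `intersect` and `search` and the toy index are hypothetical, and this is the plain merge-based baseline rather than any of the faster algorithms in the set-intersection reference. Threading (for example, one query or one index partition per thread) is left out.

```python
def intersect(a, b):
    """Merge-based intersection of two sorted lists of document IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(index, query_terms):
    """Return IDs of documents that contain every query term."""
    # Start from the shortest posting list so intermediate results stay small.
    lists = sorted((index.get(t, []) for t in query_terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result


# Hypothetical toy index: term -> sorted document IDs.
index = {
    "parallel": [1, 3, 5, 8],
    "search": [2, 3, 8, 9],
    "index": [3, 4, 8],
}
hits = search(index, ["parallel", "search", "index"])
print(hits)
```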
Query Log Analysis • AOL has a leaked query log dataset • Compute query similarity from the results users have selected. • Expected work: MapReduce programming. Algorithm/data analysis. • Reference: J. Guo, et al. Intent-aware query similarity. CIKM 2011
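As a starting point, click-based query similarity can be sketched with a Jaccard overlap of clicked URLs. Jaccard is a simple stand-in chosen for illustration, not the intent-aware measure of the referenced paper, and the log format, function names, and example URLs are assumptions.

```python
from collections import defaultdict


def clicked_sets(log):
    """Group a (query, clicked_url) log into query -> set of clicked URLs."""
    clicks = defaultdict(set)
    for query, url in log:
        clicks[query].add(url)
    return clicks


def jaccard(a, b):
    """Size of the intersection divided by size of the union; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy log of (query, clicked URL) records.
log = [
    ("cheap flights", "a.example"),
    ("cheap flights", "b.example"),
    ("low cost airfare", "b.example"),
    ("low cost airfare", "c.example"),
    ("ucsb cs", "d.example"),
]
clicks = clicked_sets(log)
sim_related = jaccard(clicks["cheap flights"], clicks["low cost airfare"])
sim_unrelated = jaccard(clicks["cheap flights"], clicks["ucsb cs"])
print(sim_related, sim_unrelated)
```

On a full log, building the per-query click sets is the natural map/reduce grouping step, with the pairwise comparison as a second job.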
Parallel Online Search System • Set up searchable datasets with a search engine • For large datasets, set up data partitions and multiple services • Expected work: integrate open source search engine software and run it on multiple machines. • Partition data and build a search engine on a cluster. • Performance metrics: response time, throughput • Document Allocation Policies for Selective Searching of Distributed Indexes (Kulkarni and Callan) - Appears in the Proceedings of the 19th ACM Conference on Information and Knowledge Management, Oct 2010, Toronto, Canada. • Kai Shen, Tao Yang, Lingkun Chu, JoAnne L. Holliday, Doug Kuschner, and Huican Zhu, Neptune: Scalable Replica Management and Programming Support for Cluster-based Network Services. In Proc. of the 3rd USENIX Symposium on Internet Technologies and Systems (USITS'01), Pages 197-208, San Francisco CA
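The partition-and-merge structure of such a system can be sketched in a single process as below. This is a hedged illustration only: `search_partition`, `broker`, the term-frequency score, and the toy partitions are hypothetical, and a real deployment would run each partition as a separate service on its own machine with an open source search engine behind it.

```python
import heapq


def search_partition(partition, term):
    """Score each document in one partition by term frequency."""
    hits = []
    for doc_id, text in partition.items():
        tf = text.lower().split().count(term)
        if tf:
            hits.append((tf, doc_id))
    return hits


def broker(partitions, term, k=2):
    """Query every partition, then merge the partial results into a global top-k."""
    all_hits = [hit for p in partitions for hit in search_partition(p, term)]
    return heapq.nlargest(k, all_hits)


# Hypothetical toy collection split into two partitions.
partitions = [
    {"d1": "parallel search engine", "d2": "storage systems"},
    {"d3": "search search ranking", "d4": "parallel search"},
]
top = broker(partitions, "search", k=2)
print(top)
```

Response time in the real system is bounded by the slowest partition, which is why partitioning policy matters for the metrics listed above.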
Other parallel application papers • Parallel Boosted Regression Trees for Web Search Ranking Stephen Tyree, Kilian W. Weinberger, Kunal Agrawal and Jennifer Paykin. WWW 2011. • Hybrid MPI/OpenMP Parallel Linear Support Vector Machine Training. Journal of Machine Learning Research, 2009.