Open Cirrus Summit
Indranil Gupta, Roy Campbell, Michael Heath
Department of Computer Science, University of Illinois, Urbana-Champaign
June 8, 2009
http://cloud.cs.illinois.edu
Principal Investigator
• Michael Heath – parallel algorithms
Co-PIs and lead systems researchers
• Roy Campbell – O/S, file systems, security
• Indranil Gupta – distributed systems and protocols
Lead applications researchers
• Kevin Chang – search and query processing
• Jiawei Han – data mining
• Klara Nahrstedt – multimedia, QoS
• Dan Roth – machine learning, NLP
• Cheng Zhai – information retrieval
• Peter Bajcsy, Rob Kooper – NCSA
• 128 compute nodes = 64 + 64
• 500 TB of storage & 1000+ shared cores
Goal: Support both Systems Research and Applications Research in Data-Intensive Distributed Computing
Accessing and Using CCT: Systems Partition (64 nodes)
• CentOS machines, with sudo access
• Dedicated access to a subset of machines (similar to Emulab)
• User accounts
• A user requests a number of machines (<= 64) plus a storage quota (<= 30 TB)
• Machine allocation survives for 4 weeks, storage for 6 months (both extendible)
Accessing and Using CCT: Hadoop/Pig Partition and Service (64 nodes)
• Looks like a regular shared Hadoop cluster service (see the job sketch below)
• Users share the 64 nodes; individual nodes are not directly reachable
• 4 task slots per machine; several users report stable operation at 256 instances
• During Spring 2009, 10+ projects ran simultaneously
• User accounts: a user requests an account plus a storage quota (<= 30 TB)
• Storage survives for 6 months (extendible)
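To make the shared Hadoop partition concrete, here is a minimal sketch of the kind of job a CCT user might submit to it. It is essentially the canonical Hadoop WordCount program; the class name, input/output paths, and Hadoop API version are illustrative assumptions, not details from the deck.

```java
// Minimal sketch of a MapReduce job a CCT user might submit to the shared
// Hadoop partition. Class and path names are illustrative only.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1) per token
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));  // total count per word
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

On a shared partition like this, the user would package the class into a jar and submit it with the standard hadoop jar command, with input and output directories living under their quota-limited HDFS space.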
Some Services Running inside CCT
• ZFS: backend file system
• Zenoss: monitoring, shared with the department's other computing clusters
• Hadoop + HDFS
• Ability to make datasets publicly available (see the sketch below)
How users request an account: a two-stage process
• User account request – requires a background check
• Allocation request
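As an illustration of the "make datasets publicly available" item, the following is a hedged sketch, using the standard Hadoop FileSystem API, of how a dataset might be copied into HDFS and made readable by other users. The class name, paths, and permission choice are assumptions for illustration; the deck does not specify the actual publication procedure.

```java
// Hedged sketch (not the testbed's actual procedure): upload a dataset into
// HDFS and mark it world-readable so other CCT users can consume it.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class PublishDataset {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up the cluster config
    FileSystem fs = FileSystem.get(conf);

    Path local = new Path(args[0]);    // local file to publish (illustrative)
    Path shared = new Path(args[1]);   // target HDFS path (illustrative)

    fs.copyFromLocalFile(local, shared);                       // upload into HDFS
    fs.setPermission(shared,
        new FsPermission(FsAction.READ_WRITE,                  // owner: rw
                         FsAction.READ,                        // group: r
                         FsAction.READ));                      // others: r
    System.out.println("Published " + shared);
  }
}
```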
Internal UIUC Projects
• 10+ projects inside the Computer Science department, and the number is growing
• Includes 4 course projects in CS 525 (Advanced Distributed Systems)
• Research projects in multiple research groups
• Systems research primarily led by:
  – Indranil Gupta's group (DPRG: dprg.cs.uiuc.edu)
  – Roy Campbell's group (SRG: srg.cs.uiuc.edu)
• Several NCSA-driven projects
NSF-Funded External Projects
• Abadi (Yale), Madden (MIT), and Naughton (Wisc.): study trade-offs in performance and scalability between MapReduce and parallel DBMSs for large-scale data analysis
• Baru and Krishnan (SDSC): study the effectiveness of dynamic strategies for provisioning data-intensive applications, based on large topographic data sets from airborne LiDAR surveys
Project Timeline and Progress to Date
• Hardware received December 2008
• Cluster ready for user accounts in February 2009
• Yahoo conducted an initial training session for 70 users
• About 215 accounts on the cluster to date
• First two major external NSF-funded user groups now have accounts, and we expect more to follow
• About 50 TB of storage has been assigned thus far
• We run around 50 Hadoop jobs in a typical week
http://cloud.cs.illinois.edu
Access to Unstructured Information
• Goal: accessing information we want, when we want it, in forms we can understand
• Solution: understanding the meaning of information
• Key capabilities required:
  – Semantic parsing
  – Named entity recognition
  – Identifying relations between entities
  – Paraphrasing and entailment
  – Topic and sentiment analysis
Approach to Cloud Implementation
• Port NLP tools to the Cloud using MapReduce/Hadoop to enable large-scale NLP analysis (see the sketch below)
• Provide the research community access to deep analysis of a large portion of the Web: 1 billion pages placed on the Cloud, syntactically and semantically parsed, with named entity recognition
• Develop NLP-enabled applications:
  – Semantic search engine: entity and relation search
  – Vertical search services
  – Question answering
  – Information integration and summarization
  – Text mining and pattern discovery
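As a sketch of how an NLP tool might be ported to MapReduce as described above, the mapper/reducer pair below builds a simple entity-to-pages inverted index. It assumes the crawled pages are stored as (url, page text) records (e.g., in a SequenceFile), and it substitutes a naive capitalized-phrase regex for a real named entity recognizer; the class names and input layout are illustrative assumptions, not the project's actual code.

```java
// Hedged sketch of "NLP on MapReduce": scan each (url, page text) record,
// emit candidate named entities, and build an entity -> pages index.
// The regex below is a placeholder for a real NER tagger.
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class EntityIndex {

  public static class EntityMapper extends Mapper<Text, Text, Text, Text> {
    // Naive placeholder: runs of capitalized words stand in for an NER tagger.
    private static final Pattern CANDIDATE =
        Pattern.compile("([A-Z][a-z]+(?:\\s+[A-Z][a-z]+)+)");

    @Override
    public void map(Text url, Text pageText, Context context)
        throws IOException, InterruptedException {
      Matcher m = CANDIDATE.matcher(pageText.toString());
      while (m.find()) {
        // key = candidate entity, value = page it was seen on
        context.write(new Text(m.group(1)), url);
      }
    }
  }

  public static class UrlListReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text entity, Iterable<Text> urls, Context context)
        throws IOException, InterruptedException {
      StringBuilder sb = new StringBuilder();
      for (Text u : urls) {
        if (sb.length() > 0) sb.append(',');
        sb.append(u.toString());
      }
      // Inverted index entry: entity -> list of pages mentioning it
      context.write(entity, new Text(sb.toString()));
    }
  }
}
```

The same job structure would apply to the deeper analyses listed above (parsing, relation extraction), with the mapper calling the corresponding tool instead of the regex.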
Text Information Management
[Figure: layered view of text information management – Raw Text feeds Natural Language Content Analysis (search, extraction, categorization, clustering), which supports Information Access, Information Organization, and Knowledge Acquisition, which in turn drive applications such as search engines, analysis engines, summarization, visualization, filtering, and mining]
Next Generation Text Information Management
• Data-intensive computing will enable large-scale and intelligent text information management
• Today: search by query → Tomorrow: personalized intelligent information agents
• Today: document as a bag of words → Tomorrow: understanding of entities and relations in documents through large-scale semantic analysis
• Today: browsing supported only through preset hyperlinks → Tomorrow: browsing enabled through powerful navigation maps
Multi-stream 3D Tele-Immersive (3DTI) Environment
[Figure: UIUC (Urbana-Champaign) and UC Berkeley sites connected over an Internet2 networking infrastructure; each site combines a 3D camera array, multi-display 3D rendering, edge processors, and a service gateway (legend: G = service gateway, D = display, C = camera)]
Cloud Implementation of Tele-immersive Environment
• Store 3D multi-view videos in the Cloud
• Provide multi-dimensional search/query over various attributes, e.g., search for a patient's arm exercise (see the sketch below)
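A rough, hypothetical sketch of the attribute search idea: a map-only Hadoop job that scans (segment id, attribute string) metadata records for 3DTI session segments and keeps those matching a query such as activity=arm_exercise. The metadata layout, the tisearch.query configuration key, and the class name are all assumptions made for illustration, not part of the project described above.

```java
// Hedged sketch of attribute-based search over 3DTI session metadata.
// Input records are assumed to be (segmentId, "key=value;key=value;...") pairs.
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SegmentSearchMapper extends Mapper<Text, Text, Text, Text> {
  private String query;   // e.g. "activity=arm_exercise", passed via the job conf

  @Override
  protected void setup(Context context) {
    query = context.getConfiguration().get("tisearch.query", "");
  }

  @Override
  public void map(Text segmentId, Text attributes, Context context)
      throws IOException, InterruptedException {
    // Keep the segment if any of its key=value attributes matches the query.
    for (String attr : attributes.toString().split(";")) {
      if (attr.trim().equals(query)) {
        context.write(segmentId, attributes);   // matching segment
        break;
      }
    }
  }
}
```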
System-Level Research
• Automation of dynamic resource allocation, scheduling, management, and monitoring
• Partitioning and sharing of computation, network, and storage resources
• Analysis of distributed system, network, and application logs
• Scalability and fault tolerance of distributed file systems
• Characterization of cloud workloads
• Security and information assurance
• Multi-site issues: latency, scalability, etc.
Applications Research
• Understanding textual information through large-scale semantic analysis
• Intelligent browsing through navigation maps
• Crawling online social networks to understand their dynamic evolution
• Supporting 3D tele-immersive environments
• Implementing genetic algorithms via MapReduce
• Exploring GPUs in a cloud environment: K-means clustering, Black-Scholes option pricing, etc. (a MapReduce K-means sketch follows below)
• Breaking the MapReduce barrier through weaker consistency models
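To illustrate the iterative-algorithms-on-MapReduce theme above (genetic algorithms, K-means), here is a hedged sketch of a single K-means iteration expressed as a Hadoop mapper and reducer over 1-D points. The hard-coded centroids stand in for values a real job would read from HDFS or the distributed cache, the driver loop that re-runs the job until convergence is omitted, and none of this is any listed project's actual code.

```java
// Hedged sketch of one K-means iteration as a Hadoop job over 1-D points,
// one point per input line. Centroids are hard-coded for brevity.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

  // Illustrative centroids; a real job would load these from HDFS each round.
  private static final double[] CENTROIDS = {0.0, 5.0, 10.0};

  public static class AssignMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    public void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      double point = Double.parseDouble(line.toString().trim());
      int best = 0;
      for (int i = 1; i < CENTROIDS.length; i++) {
        if (Math.abs(point - CENTROIDS[i]) < Math.abs(point - CENTROIDS[best])) {
          best = i;   // nearest centroid wins
        }
      }
      context.write(new IntWritable(best), line);  // (cluster id, point)
    }
  }

  public static class RecomputeReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    public void reduce(IntWritable cluster, Iterable<Text> points, Context context)
        throws IOException, InterruptedException {
      double sum = 0;
      long count = 0;
      for (Text p : points) {
        sum += Double.parseDouble(p.toString().trim());
        count++;
      }
      // New centroid = mean of the points assigned to this cluster.
      context.write(cluster, new Text(Double.toString(sum / count)));
    }
  }
}
```

The mapper assigns each point to its nearest centroid and the reducer averages each cluster's points to produce the next set of centroids, which a driver would feed back into the next iteration.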