Student Workshop Readout: Big Data and Cloud. Read by: Parag Deshmukh
Schedule
Sample Research Areas in Big Data and Cloud
• Warehouse-scale computing
• Big Data Algorithms and Data Structures by Giridhar Nag Yasa
• Resource management at scale
• Cloud Resource Management using Machine Learning by P C Nagesh
• Issues in multi-tenant environments
• Security in Cloud by Srinivasan Narayanamurthy
• Reliable computing with unreliable components
• Reliability in Cloud by Ranjit Kumar
Brainstorming Outcome
Group 1 (Sai Susarla)
Sunil Kumar (IISc), Sandeep Kumar (IISc), Shashank Gupta (IIT Bombay)
• The lemma model used for index building has grown beyond memory size
• Tasks are uneven in their complexities
Problem: distribute work for even utilization of the cluster while handling a lemma model that is larger than memory
Group 2 (Vipul Mathur)
Vineet P (ATG), Lavanya T (IISc), B. Ramakrishna (IIT Delhi), Nikhil Krishnan (IISc), S. Sree Vivek (IIT Chennai)
• How do we secure inline-deduped uploads?
• A scheme for making sure a user actually has the data before deduplicating uploads.
• Data Redundancy: Dedup, Replication and Erasure Coding
• Can we find the appropriate level of redundancy to feed dedup vs. replication vs. erasure coding mechanisms?
• Accessing petabytes of data at small block granularity is inefficient.
• Can we learn the "appropriate" block size for a file using regression and change it dynamically?
Group 3 (Ajay Bakre)
Birenjith Sasidharan (IISc), Manjeet Dahiya (IIT Delhi), Priyanka Kumar (IIT Patna)
• "Aadhar" dedup problem
• What data structures can be used to avoid perturbations in the fingerprint store?
• What should the layout of the data store be, and/or how should the dedup algorithm change, so that the dedup algorithm has a deterministic response time irrespective of repository size?
Group 4 (Ameya Usgaonkar)
N. Prakash (IISc), V. Lalitha (IISc), Priyanka Singla (IISc)
• De-duplication and RAIDing
• Both operate at the level of 4K blocks; is there any advantage to designing them jointly?
Table arrangement for breakout session
Workshop Readout (Three Ideas)
Table 2
Students: Nikhil, Vivek, Lavanya, Ramakrishna
NetApp: Vineet, Vipul
A: Secure Deduped Uploads
• How do we secure inline-deduped uploads?
• A scheme for making sure a user actually has the data before deduplicating uploads.
• Insecure: the user sends H1(D) to the server, which matches and dedups. If a malicious person gets hold of H1(D), they can ask for D.
• Secure: the server generates a nonce r; the user sends H2(H1(D), r) to the server, which matches and dedups. H1(D) is never sent over the network (a minimal sketch of this exchange follows).
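A minimal sketch of the challenge-response exchange above. The slide does not specify H2; realizing it as HMAC-SHA256 keyed by the nonce r, and all function and variable names, are illustrative assumptions.

```python
import hashlib
import hmac
import os

def h1(data: bytes) -> bytes:
    """Content fingerprint the dedup index is keyed on (H1 in the slide)."""
    return hashlib.sha256(data).digest()

def h2(fingerprint: bytes, nonce: bytes) -> bytes:
    """H2(H1(D), r), assumed here to be HMAC-SHA256 keyed by the nonce r."""
    return hmac.new(nonce, fingerprint, hashlib.sha256).digest()

# --- protocol walk-through ---
data = b"some file contents"             # D
server_fingerprint = h1(data)            # server already stores H1(D) for this block

r = os.urandom(16)                       # 1. server -> client: fresh nonce r
client_proof = h2(h1(data), r)           # 2. client -> server: H2(H1(D), r)
dedup_ok = hmac.compare_digest(client_proof, h2(server_fingerprint, r))
print("dedup allowed:", dedup_ok)        # 3. server dedups only if the proof matches;
                                         #    H1(D) itself never crosses the network
```

Because r is fresh per upload, a captured proof cannot be replayed, which is what makes this stronger than sending H1(D) in the clear.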
B: Data Redundancy: Dedup, Replication and Erasure Coding
• Considerations:
• Dedup removes redundancy in data; replication for performance adds redundancy
• Replication for reliability vs. erasure coding
• Can we find the appropriate level of redundancy to feed dedup vs. replication vs. erasure coding mechanisms?
• Ideas:
• Learn/specify an activity level for m users
• No dedup, possible replication for active data
• Heavy dedup for cold data
• Erasure coding for reliability is not needed if performance replicas provide reliability too, or if non-deduped copies exist
• Summary: derive and use a function f(m) to select the appropriate redundancy level, taking into account dedup, replication and erasure coding (a toy sketch follows).
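A toy illustration of the f(m) idea in the summary bullet: map an activity level to a redundancy scheme, replicating hot data and heavily deduplicating cold data. The thresholds, replica counts, and the RedundancyPolicy structure are illustrative assumptions, not a prescribed design.

```python
from dataclasses import dataclass

@dataclass
class RedundancyPolicy:
    dedup: bool
    replicas: int          # extra full copies kept for performance/reliability
    erasure_coding: bool   # parity-based reliability instead of full copies

def f(activity: float) -> RedundancyPolicy:
    """Toy f(m): map an activity level (0 = cold, 1 = hot) to a redundancy scheme."""
    if activity > 0.7:
        # Hot data: no dedup, replicate for performance; replicas double as reliability.
        return RedundancyPolicy(dedup=False, replicas=3, erasure_coding=False)
    if activity > 0.3:
        # Warm data: dedup plus a single extra copy.
        return RedundancyPolicy(dedup=True, replicas=1, erasure_coding=False)
    # Cold data: heavy dedup, erasure coding for reliability at low space overhead.
    return RedundancyPolicy(dedup=True, replicas=0, erasure_coding=True)

for level in (0.9, 0.5, 0.1):
    print(level, f(level))
```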
C: Variable Block Sizes
• Accessing petabytes of data at small block granularity is inefficient.
• Can we learn the "appropriate" block size for a file using regression based on:
• Access patterns: sequential vs. random
• File sizes
• Duplication factor
• Track changes in patterns over time
• Vary block size to adapt
• Reliability methods affected: "block checksums"
• Considerations:
• Can a single file have variable block sizes?
• Is it possible to change block sizes over time?
• Use multiples of a single base block size.
• Start with a prediction based on the user's profile.
• Hot vs. cold data should have different block sizes (a toy regression sketch follows this list)
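A rough sketch of the regression idea, assuming ordinary least squares over the three features the slide names (access sequentiality, file size, duplication factor) and rounding the prediction to multiples of a 4K base block. The training data and coefficients are hypothetical placeholders.

```python
import numpy as np

BASE_BLOCK = 4096  # assumed base block size; predictions are rounded to multiples of it

# Hypothetical training data: [sequential_fraction, log2(file_size), duplication_factor]
# paired with a block size that worked well for that file.
X = np.array([
    [0.95, 30, 1.1],   # large, mostly sequential, little duplication
    [0.10, 14, 3.0],   # small, random, highly duplicated
    [0.60, 22, 1.5],
    [0.85, 27, 1.2],
])
y = np.array([64 * BASE_BLOCK, 1 * BASE_BLOCK, 8 * BASE_BLOCK, 32 * BASE_BLOCK])

# Ordinary least-squares fit with an intercept term.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_block_size(seq_frac: float, log_size: float, dup_factor: float) -> int:
    """Predict a block size and round to the nearest multiple of BASE_BLOCK,
    per the slide's suggestion of using multiples of a single base block size."""
    raw = np.dot(coef, [seq_frac, log_size, dup_factor, 1.0])
    multiple = max(1, int(round(raw / BASE_BLOCK)))
    return multiple * BASE_BLOCK

print(predict_block_size(0.9, 28, 1.2))   # hot, sequential, big file: expect larger blocks
print(predict_block_size(0.2, 15, 2.5))   # random, deduped, small file: expect smaller blocks
```

Refitting the model periodically would cover the slide's point about tracking changes in access patterns over time.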