WG2: Distributed Analysis Frameworks
Magda, Christos, Ion, Mike, Sharad, Badrish
Areas of Discussion
• Query optimization and UDFs
• In-memory analytics
• Machine learning & analytics
• Impact of networks on distributed analyses
• Inter-data center analytics
Query Optimization and UDFs
• State of the art
  • Static analysis, collecting statistics
  • SCOPE/Cosmos query optimization
  • Parallel DBMSs
• Unanswered questions
  • Application-specific cost functions
  • Static analysis of UDFs
  • Interface with the network layer
In-Memory Analytics
• State of the art
  • Spark, distributed streams
  • Big memory machines (vs. commodity hardware)
• Promising solutions
  • Cost of new memory and hardware
• Unanswered questions
  • Increase network load or reduce it?
  • Memory blow-up
  • What is the range of applications covered?
Machine Learning & Analytics
• Iterative processing vs. simple one-pass solutions
• Throw enough data at the problem and use simpler ML techniques
• Unanswered questions
  • Convergence
  • Quality of results
Impact of Networks
• Full bisection bandwidth and network allocation
• Unanswered questions
• Choose where to locate operators, then tell the network to allocate bandwidth
• Micro-tasks for load balancing
• Rebalancing load
• Handling stragglers
• Correlated vs. uncorrelated stragglers
• “Lose most important data” vs. “Random failures are okay”
• Handle data skew by allowing customized partitioning
• Important for small jobs: turn jobs below 10 minutes into interactive jobs
• No fault tolerance for small clusters
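The bullet on handling data skew through customized partitioning can be illustrated with a small sketch. It is not taken from the slides: the function names, the `hot_keys` set, and the round-robin spreading strategy are all assumptions chosen for illustration. The idea shown is that plain hash partitioning sends every record of a frequent key to one partition, while a user-supplied partitioner can spread known hot keys across all partitions.

```python
import zlib
from collections import Counter

def stable_hash(key: str) -> int:
    # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8"))

def hash_partition(key: str, num_partitions: int) -> int:
    # Default behaviour: every record with the same key lands in the same partition.
    return stable_hash(key) % num_partitions

def make_skew_aware_partitioner(hot_keys, num_partitions):
    # Hypothetical custom partitioner: records of known hot keys are spread
    # round-robin over all partitions; other keys use plain hash partitioning.
    seen = Counter()

    def partition(key: str) -> int:
        if key in hot_keys:
            seen[key] += 1
            return (stable_hash(key) + seen[key]) % num_partitions
        return hash_partition(key, num_partitions)

    return partition

def partition_sizes(keys, partition_fn, num_partitions):
    # Count how many records each partition receives.
    sizes = [0] * num_partitions
    for key in keys:
        sizes[partition_fn(key)] += 1
    return sizes

NUM_PARTITIONS = 4
# Illustrative skewed input: one key accounts for 90% of the records.
keys = ["hot"] * 90 + [f"user{i}" for i in range(10)]

default_sizes = partition_sizes(
    keys, lambda k: hash_partition(k, NUM_PARTITIONS), NUM_PARTITIONS
)
custom_sizes = partition_sizes(
    keys, make_skew_aware_partitioner({"hot"}, NUM_PARTITIONS), NUM_PARTITIONS
)
```

Spreading a hot key this way balances load, but any per-key aggregate then needs a second step that merges the partial results from each partition.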
Inter Data-Center Analytics
• Scope
• Analytics highly valuable when done across data sources in different data centers
• Users in one data center interact with users in other data centers
• Log collection -> centralization?
• Bandwidth spread
• Local vs. global interactions: sampling?
• Current solution: centralize the data/replicate
• Unanswered question: would cost factors kill this?
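The slide raises sampling as an open question rather than a chosen design. As one possible reading, each data center could ship a fixed-size uniform sample of its log instead of the full log, which bounds the bandwidth used for centralization. The sketch below uses standard reservoir sampling; the function name, the record format, and the sample size are assumptions for illustration only.

```python
import random

def reservoir_sample(records, k, seed=0):
    # Uniform sample of up to k records from a stream of unknown length,
    # made in a single pass with O(k) memory.
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                sample[j] = record
    return sample

# Hypothetical local log in one data center.
local_log = [f"dc1-event-{i}" for i in range(10_000)]

# Only these 100 records would cross the inter-data-center link.
shipped = reservoir_sample(local_log, k=100)
```

Whether a sample is good enough depends on the analysis: aggregate statistics usually tolerate it, while queries about rare events or specific cross-data-center user interactions may not.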
Defining Success
• End users should ideally not have to care about anything below query specification (e.g., number of map/reduce tasks)
• Better optimization techniques in the presence of UDFs
• Cross-layer optimization: networks should expose the right primitives and SLAs to enable optimizations
• In-memory analytics: feasibility, understanding locality, types of workloads supported