WG2: Distributed Analysis Frameworks

Presentation Transcript


  1. WG2: Distributed Analysis Frameworks Magda, Christos, Ion, Mike, Sharad, Badrish

  2. Areas of Discussion
     • Query optimization and UDFs
     • In-memory analytics
     • Machine learning & analytics
     • Impact of networks on distributed analyses
     • Inter-data center analytics

  3. Query Optimization and UDFs
     • State of the art
     • Static analysis, collecting statistics
     • SCOPE/Cosmos query optimization
     • Parallel DBMSs
     • Unanswered questions
     • Application-specific cost functions
     • Static analysis of UDFs
     • Interface with the network layer

  4. In-Memory Analytics
     • State of the art
     • Spark, distributed streams
     • Big memory machines (vs. commodity hardware)
     • Promising solutions
     • Cost of new memory and hardware
     • Unanswered questions
     • Increase network load or reduce it?
     • Memory blow-up
     • What is the range of applications covered?

  5. Machine Learning & Analytics
     • Iterative processing vs. simple one-pass solutions
     • Throw enough data at the problem → use simpler ML techniques
     • Unanswered questions:
     • Convergence
     • Quality of results
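
The contrast between one-pass and iterative processing on this slide can be illustrated with a small sketch. This is not from the presentation: it assumes a toy least-squares line fit, and the function names, learning rate, and tolerance are all hypothetical choices made for illustration. The one-pass version scans the data once and solves in closed form; the iterative version rescans the full data set on every step and has to decide when it has converged.

```python
def one_pass_fit(points):
    """Single scan: accumulate sufficient statistics, then solve in closed form."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in points:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


def iterative_fit(points, learning_rate=0.01, max_iters=10_000, tol=1e-12):
    """Gradient descent: every iteration is a full pass over the data."""
    slope = intercept = 0.0
    n = len(points)
    iteration = 0
    for iteration in range(1, max_iters + 1):
        # Gradient of the mean squared error with respect to slope and intercept.
        grad_s = sum(2 * (slope * x + intercept - y) * x for x, y in points) / n
        grad_i = sum(2 * (slope * x + intercept - y) for x, y in points) / n
        slope -= learning_rate * grad_s
        intercept -= learning_rate * grad_i
        # Convergence test (assumed): stop once the squared gradient norm is tiny.
        if grad_s * grad_s + grad_i * grad_i < tol:
            break
    return slope, intercept, iteration


if __name__ == "__main__":
    data = [(float(x), 3.0 * x + 1.0) for x in range(10)]
    print("one pass :", one_pass_fit(data))
    print("iterative:", iterative_fit(data))
```

In a distributed setting each full pass is a job over the whole data set, so the iteration count returned by `iterative_fit` is a rough proxy for cost. The closed-form route only exists for simple models, which is the trade-off the slide raises.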

  6. Impact of Networks
     • Full bisection bandwidth and network allocation
     • Unanswered questions
     • Choose where to locate operators, then tell network to allocate bandwidth
     • Micro-tasks for load balancing
     • Rebalancing load
     • Handling stragglers
     • Correlated vs. uncorrelated stragglers
     • “Lose most important data” vs. “Random failures are okay”
     • Handle data skew by allowing customized partitioning
     • Important for small jobs: turn jobs below 10 mins into interactive jobs
     • No fault tolerance for small clusters
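
One way to read "handle data skew by allowing customized partitioning" is key salting: records for a known hot key are spread over several partitions instead of one. The sketch below is an assumption about what such a partitioner could look like, not something the slides specify. The function names, the `hot_keys` set, and the `fanout` parameter are hypothetical.

```python
import zlib
from collections import Counter


def stable_hash(key):
    """Deterministic hash; Python's built-in hash() is randomized per process for strings."""
    return zlib.crc32(str(key).encode("utf-8"))


def plain_partition(key, num_partitions):
    """Default hash partitioning: every record with the same key lands together."""
    return stable_hash(key) % num_partitions


def skew_aware_partition(key, record_id, num_partitions, hot_keys, fanout):
    """Spread each hot key over `fanout` consecutive partitions (assumed scheme)."""
    base = stable_hash(key) % num_partitions
    if key in hot_keys:
        # The salt is derived from the record id, so hot-key records round-robin.
        return (base + record_id % fanout) % num_partitions
    return base


def max_load(assignments):
    """Size of the most heavily loaded partition."""
    return max(Counter(assignments).values())


if __name__ == "__main__":
    # Synthetic workload: one key with 1000 records, ten keys with 10 records each.
    records = [("hot", i) for i in range(1000)]
    records += [(f"k{j}", i) for j in range(10) for i in range(10)]
    plain = [plain_partition(k, 8) for k, _ in records]
    salted = [skew_aware_partition(k, i, 8, {"hot"}, 4) for k, i in records]
    print("max partition load, plain :", max_load(plain))
    print("max partition load, salted:", max_load(salted))
```

Salting trades balance for an extra step: any per-key aggregation over a salted key needs a second merge across the partitions that key was spread over.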

  7. Inter-Data-Center Analytics
     • Scope
     • Analytics highly valuable when done across data sources in different data centers
     • Users in one data center interact with users in other data centers
     • Log collection -> centralization?
     • Bandwidth spread
     • Local vs. global interactions – sampling?
     • Current solution: Centralize the data/replicate
     • Unanswered question: Would cost factors kill this?

  8. Defining Success
     • End users should ideally not have to care about anything below query specification (e.g., number of map/reduce tasks)
     • Better optimization techniques in the presence of UDFs
     • Cross-layer optimization: Networks should expose the right primitives and SLAs to enable optimizations
     • In-memory analytics – feasibility, understanding locality, types of workloads supported