High Performance Distributed Computing. Sophie Lemaitre Monterey - California July 2007. Database Streams. First Keynote. One of the most interesting talks Database streams http://www.cs.berkeley.edu/~franklin/Talks/HPDC07.ppt. Data Stream Processing Approach.
High Performance Distributed Computing • Sophie Lemaitre • Monterey, California • July 2007
First Keynote • One of the most interesting talks • Database streams • http://www.cs.berkeley.edu/~franklin/Talks/HPDC07.ppt
Data Stream Processing Approach (“Upside Down Approach”) • Flow: Live Data Streams → Data Stream Processor → Results (continuous visibility, alerts) • Always-on data analysis & alerts • RT monitor & replay to optimize • Consistent sub-second response
Traditional Database Approach • Flow: Data → Bulk Load → Data Warehouse; Queries → Results (static batch reports) • Batch ETL & load, query later • Poor RT monitoring, no replay • DB size affects query response
The “Jellybean” Argument • Conventional Wisdom: “Can I afford real-time?” Do the benefits justify the cost? • Reality: With stream query processing, real-time is cheaper than batch • Minimizes copies & query start-up overhead • Takes load off expensive back-end systems • Rapid application dev & maintenance
Example 2 - Stream/Table Join • Every 3 seconds, compute the average transaction value of high-volume trades on S&P 500 stocks, over a 5-second “sliding window”
SELECT T.symbol, AVG(T.price*T.volume)
FROM Trades T [RANGE '5 sec' SLIDE '3 sec'], SANDP500 S
WHERE T.symbol = S.symbol AND T.volume > 5000
GROUP BY T.symbol
• Trades is the stream, [RANGE … SLIDE …] is the window clause, SANDP500 is the table • Note: Output is also a Stream
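The windowed stream/table join above can be mimicked offline to see what each output tuple contains. This is only a minimal sketch, not the stream engine from the talk: the `Trade` record, the `sliding_avg` function and the half-open window convention (start, end] are assumptions made for illustration, and a real stream processor would evaluate the query incrementally rather than rescanning the trades.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Trade(NamedTuple):
    ts: float      # event time in seconds (hypothetical field)
    symbol: str
    price: float
    volume: int


def sliding_avg(
    trades: Iterable[Trade],
    sp500: Set[str],
    range_s: float = 5.0,     # RANGE '5 sec'
    slide_s: float = 3.0,     # SLIDE '3 sec'
    min_volume: int = 5000,   # T.volume > 5000
) -> List[Tuple[float, Dict[str, float]]]:
    """Return one (window_end, {symbol: avg(price*volume)}) pair per slide."""
    ordered = sorted(trades, key=lambda t: t.ts)
    results: List[Tuple[float, Dict[str, float]]] = []
    if not ordered:
        return results
    last_ts = ordered[-1].ts
    window_end = slide_s
    while True:
        window_start = window_end - range_s
        sums: Dict[str, float] = defaultdict(float)
        counts: Dict[str, int] = defaultdict(int)
        for t in ordered:
            in_window = window_start < t.ts <= window_end   # assumed (start, end]
            joined = t.symbol in sp500                      # join with the table
            if in_window and joined and t.volume > min_volume:
                sums[t.symbol] += t.price * t.volume
                counts[t.symbol] += 1
        results.append((window_end, {s: sums[s] / counts[s] for s in sums}))
        if window_end >= last_ts:
            break
        window_end += slide_s
    return results


if __name__ == "__main__":
    demo = [
        Trade(1.0, "AAA", 10.0, 6000),
        Trade(2.0, "AAA", 20.0, 6000),
        Trade(4.0, "AAA", 10.0, 10000),
        Trade(4.5, "ZZZ", 5.0, 9000),   # not in the table, dropped by the join
        Trade(5.0, "BBB", 1.0, 100),    # volume too low, dropped by the filter
    ]
    for end, averages in sliding_avg(demo, {"AAA", "BBB"}):
        print(end, averages)
```

Because the range (5 s) is longer than the slide (3 s), consecutive windows overlap, so a trade can contribute to more than one output tuple.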
Stream Processing + Grid? • On-the-fly stream processing required for high-volume data/event generators. • Real-time event detection for coordination of distributed observations. • Wide-area sensing in environmental macroscopes.
Industry session • Most interesting session • eBay • Same talk as at CERN • Huge number of transactions to deal with • Have to be 100% available • Had to write their own database interaction layer at some point to meet their needs • Not interested in Grids, because they want to control the whole infrastructure • Google • Disk crashes not correlated with temperature • High number of disk crashes when disks “burnt out” at the beginning of their life • Tony Cass - post C5: • “yes, but cooling is important for plugs and fuses”
Scheduling • Users currently have very limited ways to give priority to their jobs • “low”, “medium” or “high” • Utility functions • Economics applied to scheduling • Ex: if you go for lunch between 12:00 and 13:00 • Same satisfaction whether the job finishes at 12:01 or 12:55… • In the next talk • Hypothesis = “jobs are submitted completely randomly”
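The lunch example can be read as a step-shaped utility function: the user is equally satisfied by any finish time before they return. The sketch below is an illustration under that assumption only; the talks' actual utility functions are not given in these notes, and `Job`, `step_utility`, `total_utility` and `best_order` are hypothetical names. It brute-forces the job order on a single machine, which is only practical for a handful of jobs.

```python
from itertools import permutations
from typing import Callable, NamedTuple, Sequence, Tuple


class Job(NamedTuple):
    name: str
    runtime: float                      # minutes of compute time
    utility: Callable[[float], float]   # finish time (minutes) -> satisfaction


def step_utility(deadline: float, value: float = 1.0) -> Callable[[float], float]:
    """Full satisfaction for any finish time up to the deadline, none after.

    Assumed shape: it encodes "same satisfaction at 12:01 or 12:55".
    """
    def utility(finish: float) -> float:
        return value if finish <= deadline else 0.0
    return utility


def total_utility(order: Sequence[Job], start: float = 0.0) -> float:
    """Run jobs back to back on one machine and sum the users' satisfaction."""
    clock = start
    total = 0.0
    for job in order:
        clock += job.runtime
        total += job.utility(clock)
    return total


def best_order(jobs: Sequence[Job], start: float = 0.0) -> Tuple[Job, ...]:
    """Exhaustive search over orderings (illustrative, factorial cost)."""
    return max(permutations(jobs), key=lambda order: total_utility(order, start))


if __name__ == "__main__":
    # Times are minutes after 12:00; the lunch job only needs to finish by 13:00.
    lunch = Job("lunch", runtime=20, utility=step_utility(deadline=60))
    urgent = Job("urgent", runtime=40, utility=step_utility(deadline=40))
    print([job.name for job in best_order([lunch, urgent])])
```

With a three-level priority, the scheduler cannot tell that the lunch job can safely wait; with utility functions it can run the urgent job first and still satisfy both users.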
GridNFS & Direct-pNFS • GridNFS • “Integrates NFSv4 into the ecology of Grid middleware” • Globus GSI support • Name space construction and management • Fine-grained access control with foreign user support • High performance secure file system access • Andy Adamson was wondering how to integrate VOMS • DPM and dCache are using virtual ids • He is considering doing the same… • Contact: Andy Adamson (andros@umich.edu) • Direct-pNFS • Outperforms pNFS, PVFS • Particularly good performance for small I/O • Contact: Dean Hildebrand (dhildebz@eecs.umich.edu)
DPM with NFSv4.1 • NFSv4.1 and DPM have similar architectures • Separate metadata server • Direct access to physical files • Easy NFSv4.1 integration
Climate change? • Concerns about climate change • In several talks • A “solar panel computer” • A new plug to save energy lost as heat (Google)