1. HawkEye: A Monitoring and Management Tool for Distributed Systems
5. What does Condor have?
…lots of core technology for building a distributed system
…lots of core technology for monitoring the status of a machine
…lots of core technology for managing a workload of tasks
…lots of really, truly skilled developers and researchers, experienced at building distributed systems. Some of the best. Standout state employees. Honest.
Email for Wisconsin Gov Scott McCallum: wisgov@gov.state.wi.us
6. One day an avid Condor user asked:
8. Time to think…
We gathered up our experience with our own management tasks, looked at the mature Condor technology available to us, and the HawkEye effort was born.
Completely separate from Condor from the end user's perspective.
You can install HawkEye, or Condor, or both
9. First Component: MONITORING
Sysadmins first need information about what is happening on the machines they are responsible for.
Both current and past
Information must be consolidated and easily accessible
Information must be dynamic
10. Condor ClassAds
Technology for an entity to describe itself
Simple attribute-value pairs
11. Condor ClassAds, cont.
No fixed schema
Attributes can contain values or expressions
Ads can be serialized in XML
Open source libraries in C++ and Java to:
Manipulate Ads and Ad attributes
Store Ads
Query collections of Ads
Bindings for Perl and others on the way…
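The ClassAd idea on the two slides above (attribute-value pairs, no fixed schema, values or expressions, XML serialization) can be sketched in a few lines of Python. This is an illustration only and does not use the real ClassAd libraries: the `Expr` class, the `evaluate` and `to_xml` helpers, the attribute names, and the XML element names are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable


@dataclass
class Expr:
    """A hypothetical expression attribute: display text plus a callable."""
    text: str
    func: Callable[[dict], object]


def evaluate(ad, name):
    """Return an attribute's value, evaluating it against the ad if it is an Expr."""
    value = ad[name]
    return value.func(ad) if isinstance(value, Expr) else value


def to_xml(ad):
    """Serialize an ad to XML. Element and attribute names are made up here,
    they are not the real ClassAd XML format."""
    root = ET.Element("ad")
    for name, value in ad.items():
        is_expr = isinstance(value, Expr)
        node = ET.SubElement(root, "attr", name=name,
                             kind="expr" if is_expr else "value")
        node.text = value.text if is_expr else str(value)
    return ET.tostring(root, encoding="unicode")


# No fixed schema: an ad is just whatever attributes the entity chooses to publish.
machine_ad = {
    "Name": "node01",
    "OpSys": "Linux",
    "LoadAvg": 12.5,
    # An attribute defined as an expression over other attributes.
    "Overloaded": Expr("LoadAvg > 10", lambda ad: ad["LoadAvg"] > 10),
}

print(evaluate(machine_ad, "Overloaded"))
print(to_xml(machine_ad))
```

Modeling an ad as a plain dictionary keeps the "no fixed schema" property visible: adding a new attribute requires no change to the code that stores or serializes ads.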
12. HawkEye Monitoring Agent
15. Monitor Agent, cont.
Updates are sent periodically
Information does not get stale
Updates also serve as a heartbeat monitor
Know when a machine is down
Out of the box, the update ClassAd carries many machine attributes of interest to system administrators
Current prototype: 184 attributes
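The heartbeat idea above (periodic updates double as a liveness signal) might look like the following minimal sketch. The `HeartbeatTracker` class, the timeout value, and the machine names are assumptions made for illustration, not part of HawkEye.

```python
import time


class HeartbeatTracker:
    """Treats each incoming update as a heartbeat (hypothetical helper)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # machine name -> time of its most recent update

    def record_update(self, machine, now=None):
        """Note that an update arrived from this machine."""
        self.last_seen[machine] = time.time() if now is None else now

    def down_machines(self, now=None):
        """Machines whose last update is older than the timeout."""
        now = time.time() if now is None else now
        return sorted(
            machine
            for machine, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


# Explicit timestamps keep the example deterministic.
tracker = HeartbeatTracker(timeout_seconds=180)
tracker.record_update("node01", now=0)
tracker.record_update("node02", now=100)
print(tracker.down_machines(now=200))
```

At time 200 only `node01` has been silent for longer than the 180-second timeout, so it is the only machine reported as down.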
17. Custom Attributes
18. Role of HawkEye Manager
Store all incoming ClassAds in an indexed, memory-resident data structure
Fast response to client tool queries about current state
“Show me all machines with a load average > 10”
Periodically store ClassAd attributes into a Round Robin Database
Store information over time
“Show me a graph with the load average for this machine over the past week”
Speak to clients via CEDAR, HTTP
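The Manager's two storage roles above can be sketched as follows, under stated assumptions: a dictionary keyed by machine name stands in for the indexed resident structure, and a fixed-size `deque` stands in for the round-robin database (it only models the "oldest samples are overwritten" property, not consolidation or graphing). `ManagerStore` and its methods are hypothetical names.

```python
from collections import deque


class ManagerStore:
    """Current state plus bounded history for incoming ads (illustrative only)."""

    def __init__(self, history_size):
        self.current = {}   # machine name -> latest ad
        self.history = {}   # (machine name, attribute) -> recent samples
        self.history_size = history_size

    def update(self, ad):
        """Replace the stored ad for this machine with the newest one."""
        self.current[ad["Name"]] = ad

    def query(self, predicate):
        """Names of machines whose current ad satisfies the predicate."""
        return sorted(name for name, ad in self.current.items() if predicate(ad))

    def sample(self, attribute):
        """Append the current value of an attribute to each machine's history.
        A deque with maxlen drops the oldest sample once it is full."""
        for name, ad in self.current.items():
            if attribute in ad:
                series = self.history.setdefault(
                    (name, attribute), deque(maxlen=self.history_size)
                )
                series.append(ad[attribute])


store = ManagerStore(history_size=3)
store.update({"Name": "node01", "LoadAvg": 12.5})
store.update({"Name": "node02", "LoadAvg": 0.4})

# "Show me all machines with a load average > 10"
print(store.query(lambda ad: ad["LoadAvg"] > 10))
```

Calling `sample("LoadAvg")` on a timer would build up the per-machine series that a "load average over the past week" graph could be drawn from.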
19. Several different clients
Command-line, GUI, Web-based
20. But sysadmins also sometimes have to do work…
Task: copy a new library onto the local disk of each machine.
Just a script to copy via rcp/scp to every machine… or is it?
21. Running tasks on behalf of the sysadmin
Submit your sysadmin tasks to HawkEye
Tasks are stored in a persistent queue by the Manager
Tasks can leave the queue upon completion, or repeat after specified intervals
Tasks can have complex interdependencies via DAGMan
Records are kept on which task ran where
Sounds like Condor, eh?
Yes, but simpler…
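The queue behavior described on this slide (tasks leave on completion or repeat at intervals, with a record of which task ran where) can be illustrated with a small in-memory sketch. It is not persistent and does not model DAGMan dependencies; the `TaskQueue` class and its method names are hypothetical.

```python
import heapq
import itertools


class TaskQueue:
    """In-memory task queue with once-only or repeating tasks (illustrative only)."""

    def __init__(self):
        self._heap = []                    # entries ordered by next run time
        self._counter = itertools.count()  # tie-breaker so actions are never compared
        self.log = []                      # (task name, machine, time) records

    def __len__(self):
        return len(self._heap)

    def submit(self, name, action, run_at=0, repeat_every=None):
        """Queue a task. repeat_every=None means the task runs once and leaves."""
        if repeat_every is not None and repeat_every <= 0:
            raise ValueError("repeat_every must be positive")
        entry = (run_at, next(self._counter), name, action, repeat_every)
        heapq.heappush(self._heap, entry)

    def run_due(self, now, machine):
        """Run every task whose time has come, logging which task ran where."""
        while self._heap and self._heap[0][0] <= now:
            _, _, name, action, repeat_every = heapq.heappop(self._heap)
            action(machine)
            self.log.append((name, machine, now))
            if repeat_every is not None:
                # Repeating tasks go back into the queue for a later run.
                self.submit(name, action, now + repeat_every, repeat_every)


queue = TaskQueue()
queue.submit("copy-library", lambda machine: None)                 # once only
queue.submit("check-disk", lambda machine: None, repeat_every=60)  # repeating
queue.run_due(now=0, machine="node01")
print(len(queue), queue.log)
```

After the first pass the once-only task has left the queue while the repeating task remains, rescheduled for 60 time units later.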
22. Run Tasks in response to monitoring information
ClassAd “Requirements” attribute
Example: Send email if a machine is low on disk space or low on swap space
Submit an email task with an attribute:
Requirements = free_disk < 5 || free_swap < 5
Example w/ task interdependency: if the load average is high, OS=Linux, and the console is idle, submit a task which runs “top”; if top sees Netscape, submit a task to kill Netscape
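Both examples on this slide can be sketched in Python as predicates over machine ads. This is only an illustration of the matching idea, not the ClassAd evaluator: the attribute names other than `free_disk` and `free_swap`, the "high load" threshold of 10, and the function names are assumptions.

```python
def low_resources(ad):
    """Mirrors the slide's expression: free_disk < 5 || free_swap < 5."""
    return ad["free_disk"] < 5 or ad["free_swap"] < 5


def matching_machines(ads, requirements):
    """Names of machines whose ad satisfies a task's requirements."""
    return [ad["Name"] for ad in ads if requirements(ad)]


def follow_up_tasks(ad, top_output):
    """Interdependent tasks: run top first, then kill Netscape only if top saw it.
    The threshold and the ConsoleIdle attribute are assumed for this sketch."""
    tasks = []
    if ad["LoadAvg"] > 10 and ad["OpSys"] == "Linux" and ad["ConsoleIdle"]:
        tasks.append("run top")
        if "netscape" in top_output.lower():
            tasks.append("kill netscape")
    return tasks


ads = [
    {"Name": "node01", "free_disk": 3, "free_swap": 40,
     "LoadAvg": 14.0, "OpSys": "Linux", "ConsoleIdle": True},
    {"Name": "node02", "free_disk": 50, "free_swap": 60,
     "LoadAvg": 0.3, "OpSys": "Linux", "ConsoleIdle": True},
]

# The email task would be dispatched for every machine in this list.
print(matching_machines(ads, low_resources))
print(follow_up_tasks(ads[0], top_output="1234 user netscape 98.0"))
```

Expressing the trigger as data attached to the task, rather than as logic inside the monitor, is what lets the same queue serve both unconditional and monitoring-driven tasks.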
23. HawkEye Design Goals
Monitoring
Reliable presence
Get Data off the node in an extensible, consistent manner
Run Tasks
In response to probe information
Repeat or once-only semantics
Audit Log
Independent and self-contained
Cross-Platform
24. Current Status
Just beginning this project
Initial release early summer
Prototypes already running
Stop in and see the initial HawkEye work
Rm 3385 on Weds 9am – 12pm
25. Thank you!