Isis2 Runtime Parameters Cornell University Ken Birman
Parameters • Many features of Isis2 depend on parameters you can modify to “shape” the behavior of the platform. • They give you very fine control over behavior of Isis2 • There are three main categories of parameters • Those that determine how the system will start up • Those that determine how it sends messages • Those that control limits, timeouts and other bounds
Startup Parameters What happens when you call IsisSystem.Start()?
How IsisSystem.Start() works • The library initializes itself and determines the IP address of “local host.” If the host has several IP addresses, it picks the last of the IPv4 addresses • The system scans the environment variables to read the values of the parameters. These override the default values compiled into Isis2 • In Linux/bash, use “export” to set them, either in .bashrc or in a shell script. Or call setenv(3) • In Windows, use the “set” command, or call Environment.SetEnvironmentVariable("something", somevalue);
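As a minimal sketch of the Linux/bash route, the script below exports one parameter and confirms that a child process inherits it. ISIS_MUTE and its value "true" are taken from the Logging slide; the `sh -c` child is only a stand-in for your own application, which is not shown here.

```shell
#!/bin/sh
# Set the parameter in the environment BEFORE the application starts,
# so that IsisSystem.Start() can pick it up when it scans the environment.
export ISIS_MUTE=true

# A child process inherits exported variables. The `sh -c` below is a
# stand-in for launching your application (hypothetical, not an Isis2 tool).
child_view=$(sh -c 'printf "%s" "$ISIS_MUTE"')
echo "child process sees ISIS_MUTE=$child_view"
```

A variable that is assigned but not exported stays local to the script, so the launched application would not see it.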
How IsisSystem.Start() works • Next, the system decides which network interfaces it should use (all of them, unless you tell it otherwise by setting ISIS_NETWORK_INTERFACES) • Do this if you expect to run on machines that have a “production” network and a “management” network • Otherwise leave ISIS_NETWORK_INTERFACES alone • Having done this, it attempts to contact the ORACLE • If the ORACLE isn’t found, it restarts the ORACLE • Otherwise, it asks the ORACLE to let it join the ISISMEMBERS system group
Logging • Normally, upon restart, Isis2 creates a log file for messages printed by the library • You can inhibit this by setting ISIS_MUTE=true • You can also direct that messages be echoed to the Debug stream rather than the Console when calling IsisSystem.Start() • If you allow logging and want to write to the log, call IsisSystem.Write() or IsisSystem.WriteLine() • Output goes to the log and also to either the Console or the Debug stream
Fast start: But there can only be one… • For extreme speed, you can tell Isis2 not to hunt for the ORACLE (by specifying an argument to IsisSystem.Start) • It will restart instantly. But if you launch two instances this way, they won’t communicate with one another. • So… do this only in the first instance that you launch
Overwhelming the Membership Oracle • If processes start one by one, no issue… • But what if you try to start 50 at once, or 500? [Figure: many processes say “Hello?” to the Oracle at the same time; the Oracle answers each with “Welcome!”]
Master/Worker • If a system will be big, launching hundreds of members can overload the ORACLE. • Better performance: add many all at the same time • In this case use the Master/Worker pattern • Master starts first, collects a list of the workers • Workers start after the master and register with it • Then Master can add a batch of workers to the system, and to any groups that are desired
Master: Accumulates workers, tells them what to do

    static void beMaster(string[] args)
    {
        IsisSystem.Start();
        Semaphore waitForWorkers = new Semaphore(0, 1);
        bool fullyStaffed = false;
        List<Address> myWorkers = new List<Address>();

        // Accumulate workers. GOAL (the number of workers wanted) and
        // myGroup are assumed to be defined elsewhere in the application.
        IsisSystem.RegisterAsMaster((NewWorker)delegate(Address worker)
        {
            lock (myWorkers)
            {
                if (fullyStaffed)
                    IsisSystem.RejectWorker(worker);
                else
                {
                    myWorkers.Add(worker);
                    if (myWorkers.Count == GOAL)
                    {
                        fullyStaffed = true;
                        waitForWorkers.Release(1);
                    }
                }
            }
        });

        // Main thread waits until enough workers have connected,
        // then starts them all at once...
        waitForWorkers.WaitOne();
        IsisSystem.BatchStart(myWorkers);

        // This delays until they have all finished their batch start
        IsisSystem.WaitForWorkerSetup(myWorkers);

        // ... then adds them all to groups we may want to use
        Group.MultiJoin(myWorkers, new Group[] { myGroup });

        // In front of this next line do whatever you want this application to do
        IsisSystem.WaitForever();

        // If the master shuts down, its workers will too
        IsisSystem.Shutdown();
    }
RunAsWorker: Let Master run the show

    static void beWorker(string[] args)
    {
        // This next line assumes that argument 0 is the master's Address.
        // You can also use new Address(mastersHost, 0) if you know the host IP
        // address of the master but don't know the master's pid.
        IsisSystem.RunAsWorker(args[0]);

        // This line blocks until the master issues the BatchStart() call.
        // Notice that in this one special case we call it AFTER RunAsWorker!
        IsisSystem.Start();

        // Before calling this next line do whatever setup this worker must do:
        // create your group handles and register callbacks, but don't call Join.
        // For example, you might call g = new Group("something"), then call
        // g.ViewHandlers += myViewHandler; ... etc: anything needed to have the
        // group ready for a Join. But you call WorkerSetupDone() INSTEAD of g.Join().
        IsisSystem.WorkerSetupDone();

        // Now, for each group the Master created using a multijoin, you wait
        // for its first view to be reported. This is one way to do that:
        foreach (Group g in myGroups)
            while (!g.HasFirstView)
                Thread.Sleep(250);

        // WaitForever would freeze the main thread, but if the worker has joined
        // groups (or gets added to groups by the master using MultiJoin()), the
        // worker could be quite active, receiving messages, sending them, etc.
        IsisSystem.WaitForever();

        // If the master shuts down the worker will throw an
        // IsisException("master termination");
        // If this next line actually executes, this particular worker will exit.
        // In effect, this worker is a normal Isis application by now, except that
        // if the master terminates, it does too. In particular, it can
        // deliberately choose to leave the system if it wishes to do so.
        IsisSystem.Shutdown();
    }
Master/Worker Timeline [Timeline figure; labels: Worker, Master, Oracle. Steps in order:]
• Master: IsisSystem.Start(); . . . Accumulate workers
• Worker: IsisSystem.RunAsWorker(mAddress); IsisSystem.Start();
• Master (reached goal): IsisSystem.BatchStart(myWorkers); Group myGroup = new Group("myGroup"); . . . Attach handlers for myGroup, then myGroup.Join(); IsisSystem.WaitForWorkerSetup(myWorkers);
• Worker: Group g = new Group("myGroup"); . . . Attach handlers for g, but don’t call Join; IsisSystem.WorkerSetupDone();
• Master (setup done for all workers): Group.MultiJoin(myWorkers, new Group[] { myGroup }); IsisSystem.WaitForever();
• Worker (new view): foreach (Group g in myGroups) while (!g.HasFirstView) Thread.Sleep(250); IsisSystem.WaitForever();
Why does this help? • Workers only send one message to Master • Hence it experiences less load • It adds them all at once, first to the system, then to whatever groups the application will use • Hence only one group view needs to be sent, and it can be sent efficiently, using a broadcast • Overall load is much reduced
Messaging Parameters How to control what internet protocols Isis2 uses
IP multicast / ISIS_UNICAST_ONLY • Isis2 will broadcast to find the ORACLE unless you tell it not to do so. • Default: OK to use IP multicast, UDP, broadcast • ISIS_UNICAST_ONLY: don’t use IP multicast. Still requires UDP (older ISIS_TCP_ONLY feature was eliminated starting in Isis v2.1) • You must list the machines on which Isis2 ORACLE will run if you put the system in ISIS_UNICAST_ONLY mode. ISIS_HOSTS=“…”
Normal versus UNICAST_ONLY • With normal IP multicast, packets are still sent directly • With ISIS_UNICAST_ONLY, packets travel on a tree of point-to-point links and must be forwarded, perhaps log2(N) times [Figure: IP multicast compared with a unicast tree that has power-of-2 “reach”]
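To get a feel for the log2(N) figure, the sketch below counts how many doublings are needed before a tree with power-of-2 reach covers N members. The helper name `tree_hops` is hypothetical, and the real Isis2 tree may be shaped differently; this only illustrates the arithmetic.

```shell
# tree_hops N: smallest h such that 2^h >= N, i.e. ceil(log2(N)).
tree_hops() {
  n=$1
  h=0
  reach=1
  # Each extra forwarding level doubles the number of members reached
  while [ "$reach" -lt "$n" ]; do
    reach=$((reach * 2))
    h=$((h + 1))
  done
  echo "$h"
}

tree_hops 8     # prints 3
tree_hops 500   # prints 9
```

So even a 500-member system needs only about nine forwarding steps in the worst case, compared with a single send under IP multicast.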
ISIS_HOSTS • Idea is to list the places where the ORACLE can run ISIS_HOSTS=c1.cs.cornell.edu,c2.cs.cornell.edu … or ISIS_HOSTS=192.167.54.133,192.167.54.134 • Processes running on other machines can join the system but can’t restart it from scratch
ISIS_HOSTS: numerical is best! • We have seen bugs in the Linux DNS when accessed from Mono. Sometimes it hangs • To avoid this, use fully numerical IP addresses when you set the values in ISIS_HOSTS • Use the IPv4 addresses for the machines on which you want the ORACLE to run. In this case DNS never hangs • The “ping” and “traceroute” commands are examples of ways you can look these up. • On Windows, string names are fine. On Linux, they work, but don’t put the DNS under heavy load.
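One way to follow this advice is to refuse to launch when ISIS_HOSTS contains anything other than dotted-quad IPv4 addresses. The sketch below assumes the comma-separated format shown on the ISIS_HOSTS slide and that ISIS_UNICAST_ONLY accepts the value "true"; the helper `hosts_are_numeric` is hypothetical and is not part of Isis2.

```shell
# Succeeds only if every comma-separated entry is four dot-separated
# groups of 1-3 digits. It does not check that each group is <= 255.
hosts_are_numeric() {
  # grep -v selects entries that do NOT match; -q just reports whether any exist
  if printf '%s\n' "$1" | tr ',' '\n' | grep -Eqv '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'; then
    return 1
  fi
  return 0
}

# Example addresses copied from the ISIS_HOSTS slide
candidate_hosts=192.167.54.133,192.167.54.134

if hosts_are_numeric "$candidate_hosts"; then
  export ISIS_UNICAST_ONLY=true
  export ISIS_HOSTS="$candidate_hosts"
else
  echo "ISIS_HOSTS should list numeric IPv4 addresses only" >&2
fi
```

A list such as c1.cs.cornell.edu,c2.cs.cornell.edu fails the check, which is the case the slide warns about on Linux with Mono.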
ISIS_PORTp • The system uses two standard IP ports • ISIS_PORTp: for p2p messages • ISIS_PORTa: Set to ISIS_PORTp+1, for acks/nacks • These ports should not be blocked by your firewall • On Linux, also check iptables, which is like a firewall • If two instances of Isis2 use non-overlapping port ranges, they will not notice one another.
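The p and p+1 pairing can be derived in a launch script rather than typed twice. This is a sketch only: the base port 7000 is a hypothetical choice, not an Isis2 default, and it assumes both variables take plain decimal port numbers.

```shell
# Hypothetical base port; pick one that your firewall and iptables rules allow
ISIS_PORTp=7000

# The ack/nack port sits directly above the p2p port
ISIS_PORTa=$((ISIS_PORTp + 1))

export ISIS_PORTp ISIS_PORTa
echo "p2p=$ISIS_PORTp acks/nacks=$ISIS_PORTa"
```

Giving a second, independent deployment a different base port (so the two pairs do not overlap) keeps the two systems from noticing each other, as the slide describes.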
ISIS_MAXIPMCADDRS • When permitted to use IP multicast, Isis2 tries not to overuse that feature: • ISIS_MCRANGE_LOW: low end of the IPMC address range Isis2 should use. By default, CLASSD+5000, where CLASSD is 224.0.0.0/4 • ISIS_MCRANGE_HIGH: high end of the IPMC range • ISIS_MAXIPMCADDRS: limit on how many multicast addresses Isis2 can use, system-wide. It is perfectly reasonable to set this to a small number, like 5 or 10. The system should work if ISIS_MAXIPMCADDRS is at least 2. • If ISIS_UNICAST_ONLY is true, then no IPMC addresses are used at all.
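Reading “CLASSD+5000” as an address offset (an assumption; the slides do not spell this out), the default low end can be worked out by adding 5000 to 224.0.0.0, the first address of the class D multicast block. The helper `ip_plus` is hypothetical, and since the slides do not say what format ISIS_MCRANGE_LOW expects, the sketch only prints the result.

```shell
# ip_plus A.B.C.D OFFSET: add a small non-negative offset to an IPv4 address
ip_plus() {
  base=$1
  offset=$2
  old_ifs=$IFS
  IFS=.
  set -- $base          # split into four octets: $1 $2 $3 $4
  IFS=$old_ifs
  # Work on the low three octets and carry any overflow into the first one
  low=$(( $2 * 65536 + $3 * 256 + $4 + offset ))
  o1=$(( $1 + low / 16777216 ))
  low=$(( low % 16777216 ))
  echo "$o1.$(( low / 65536 )).$(( (low / 256) % 256 )).$(( low % 256 ))"
}

ip_plus 224.0.0.0 5000   # prints 224.0.19.136
```

The check is 19 × 256 + 136 = 5000, so the offset lands in the third and fourth octets.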
ISIS_TTL • Broadcast and multicast messages are automatically relayed by routers • Each “hop” causes the “time to live” field in the message to be decremented • If the TTL reaches zero, the router drops the packet • Isis2 initializes the TTL value using ISIS_TTL. • You can set this to 0 or 1 to confine the system to a single segment of your network.
ISIS_MAXMSGLEN • Automatically adjusted but you can provide a recommended value if you wish • Isis2 will override the value in some situations • Normally not something you would need to modify • If a message is too large, Isis2 will automatically fragment it and reassemble it prior to delivery
Other limits and timeouts These are less often changed
ISIS_DEFAULTTIMEOUT • Normally 45 seconds. OK to reduce if you wish. • Failure detection needs twice this long, hence 90 seconds. • This applies if you kill a process “suddenly” (e.g. ^C) or if the machine on which it was running crashes • 45 seconds is very slow, but on cloud computing systems long delays happen more often than you would expect! • On lightly loaded clusters, you can set ISIS_DEFAULTTIMEOUT much lower, but not less than 2 seconds. • If you design a failure sensing solution of your own, call Isis.ProcessFailed(who) to tell us if a process crashes.
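The relationship between the timeout and the failure-detection delay is simple enough to check before choosing a value. The sketch below works in whole seconds purely for the arithmetic; the slides do not state which unit the ISIS_DEFAULTTIMEOUT variable itself expects, so the hypothetical helper `failure_detection_secs` does not set it.

```shell
# failure_detection_secs T: print the expected detection delay (2 x T seconds),
# rejecting values below the 2-second floor given on the slide.
failure_detection_secs() {
  t=$1
  if [ "$t" -lt 2 ]; then
    echo "timeout must not be less than 2 seconds" >&2
    return 1
  fi
  echo $((2 * t))
}

failure_detection_secs 45   # prints 90
failure_detection_secs 5    # prints 10
```

Dropping the timeout from 45 to 5 seconds on a lightly loaded cluster would therefore cut detection time from about 90 to about 10 seconds.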
Help! I’ve been poisoned! • If a process throws this exception, it means that some other process thought it had failed • If a dead process reappears, live members send it a “you have been poisoned” message • Prevents system partitioning • Rule in Isis2: Only allow a single partition to remain alive at one time. If a partition forms, immediately shut one side down (the side lacking a majority)
Speeding up failure detection • If a process will exit (rather than crash), call IsisSystem.Shutdown() first. • This rapidly announces the departure and the process will immediately be removed from groups it belongs to • Like a fast failure notification – as if it said “bye!” • You can also eliminate a group rapidly (without killing its members) using g.Terminate()
Hints for EC2 users • On EC2 we recommend using ISIS_UNICAST_ONLY • EC2 gives you a “virtual cluster” with nodes numbered from IP address xxx.xxx.xxx.0. You can use this range to set ISIS_HOSTS even before launching your application • If you use the Master/Worker startup mode, you can tell the system the master is at: • new Address(xxx.xxx.xxx.0, 0); • This works because the master will run on node xxx.xxx.xxx.0 (due to ISIS_HOSTS) and the pid is ignored in the RunAsWorker call, so using 0 is fine.
Debugging Isis2 issues How can it be done?
Debugging is hard… • … and debugging distributed systems is even harder • Useful tools • Visual Studio. Keep in mind that even an exception thrown inside Isis2 could be caused by a mistake in your code. All those upcalls will be issued from Isis2 stacks! • You can call IsisSystem.GetState() to obtain a string representing the state of the Isis system itself. But you’ll need help from Cornell experts to understand this data. • You can call IsisSystem.RunTimeStatsState() to obtain a self-explanatory string with counts of messages sent and received. The data itself is in IsisSystem.RTS, and you can access this at runtime.
Suggestions • Isis2 is multithreaded. So write thread-safe code. • Don’t block during upcalls from Isis2 into your code. The library assumes that upcalls will complete quickly and could malfunction otherwise. • Isis2 has a lot of threads. Don’t let this worry you. • We gave you the source code. If you notice a bug, post it to isis2.codeplex.com on the “issues” page • Post questions on the CodePlex “discussions” page