Whither Roaming TeraGrid Quarterly Meeting Discussion Item Thursday, September 24, 2009 9:00 AM EDT John W. Cobb “Be always sure you are right - then go ahead.” – Davy Crockett
Welcome to East Tennessee • Fri SNS tour – See me • Emergency Exits • Conveniences • …. • If it starts raining again: we will begin loading in back of the hotel.
Issue: Roaming • Current Operational Definition: “A Roaming allocation gives you access to a large subset of TeraGrid compute resources as part of one TeraGrid allocation. Roaming allows a research team to take advantage of multiple compute resources via grid-based software and services, and is useful for porting code, evaluating different architectures, and conducting multi-site, multi-resource runs. For more, see the Roaming Allocations page in the TeraGrid User Support documentation.” • http://www.teragrid.org/userinfo/access/roaming.php (first hit on searching “roaming” on TG site)
Big issues versus Incremental Issues • Initial accounting discussion was a smaller issue about POPS documentation (D. Hart Comment) • “The … matter …is that the advertised concept of …“TeraGrid Roaming” does not live up to user expectations, and we expend lots of effort unwinding those expectations.” • “…We have to admit that we’re not really offering Roaming now and either (a) have all RPs buy in, … or (b) at least change the name to better communicate to users. …the least we can do is change the name from “TeraGrid-Wide Roaming Access” to something that reflects its true nature, e.g., “Ad Hoc Subset of TeraGrid Compute Resources.”” • Bigger issue: this is one decision, perhaps the final one, along a larger arc of marginalization of roaming. • Discussion of the former leads to a discussion of the latter. • The TGF discussion was initially overly narrow.
Wider Discussion • After the TGF call in early Sept., much more TG-wide discussion has occurred • Accounting WG discussion (prior to call) • Services WG discussion item and lively e-mail thread • AUS WG discussion item and lively e-mail thread • Gateway WG discussion • Agenda item for monthly Campus Champion call on 9/22 and ensuing e-mail
Stakeholder Analysis • This decision affects the entire TG community, not just RP’s and TGF and not just one or a few WG’s • “care abouts” include: • Users • RP’s • TG initiatives • Campus Champions • Gateways • Working Groups: • Allocations team • Accounting • Security • Services • AUS
RP Concerns • Explicitly stated • Roaming distorts allocation and impairs planning. • Some users may try to use roaming as backdoor access to “S” allocations • Roaming contributes to a large activated user pool at each RP, a security concern • Implicit concerns, perhaps • Desire to distinguish “my site” • Desire to function as a confederation rather than an integrated whole • Jockeying for position in future solicitations/proposals • If the shoe fits … wear it, if not fine. But if there is an elephant in the room, let’s not ignore it.
User Opinions • Roaming is always better than non-roaming. The question is how much more valuable. – Duh! • I have received 72 e-mail items of discussion/opinion since the TGF call on this issue. • Summary: Users, as represented by sampled input, prefer roaming and find it useful in some cases (exploratory and large distributed computing)
User Support Staff • Roaming is useful for startup and exploration of the most suitable resource for a given problem
Campus Champions • Roaming is key to the CC mission of outreach to new users
Large Allocations Users Using Multiple Resources • Not much feedback yet. • Presumptions (my assumptions) • Large projects • TG familiar • Comfortable with: • multi-platform deployment • Understanding strengths (and weaknesses) of various resources • Familiar with allocations, account provisioning, SU transfer requests. • Large enough effort that the “hurdle” of these administrative concerns is not stifling (even if they are occasionally a nuisance.)
Misconceptions • Users roam away, not to • Some RP’s have expressed concern that Roaming will be used as a backdoor “S” allocation for their resource, “by other means.” • Users report that their reason for roaming is to roam away from resources that are down or have long queues. • I.e., roaming tends to alleviate long queues, not exacerbate them. • Allocation Committee has a persistent misconception about roaming. “For whatever reason, the [LMT]RAC review committees of the last several years has gotten it into their collective brain that roaming is some evil pseudo-resource – and the TeraGrid staff has done little or nothing to dissuade them. Proposals ask for roaming, and the committee usually assigns time on one resource. So, of course, there are few success stories using roaming resources.”
Experience – We don’t roam far (Thanks Bill Barth and David Walling) • Looking at currently active projects listed as roaming in TGCDB who have usage records on 9/17 • 171 projects • 13 M SU’s used on 36K jobs • 82 used only 1 resource • 150 used a single resource for >50% of usage • Caveats: • TRAC redirects “R” to “S” • Large roamers request multiple “S”’s • Supports Hypothesis that named roaming allocations are exploratory
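For a sense of scale, the counts on this slide can be turned into shares with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and the SU and job totals are the rounded figures quoted above ("13 M" and "36K"), so the average is approximate.

```python
# Figures quoted on the slide (currently active roaming projects in TGCDB, 9/17).
total_projects = 171
single_resource_only = 82        # projects that used only 1 resource
majority_on_one_resource = 150   # projects with >50% of usage on a single resource
su_used = 13_000_000             # "13 M SU's" (rounded on the slide)
jobs = 36_000                    # "36K jobs" (rounded on the slide)


def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


pct_single = share(single_resource_only, total_projects)
pct_majority = share(majority_on_one_resource, total_projects)
avg_su_per_job = round(su_used / jobs)

print(pct_single)      # 48.0  -> about half of the projects never left one resource
print(pct_majority)    # 87.7  -> most projects concentrate usage on one resource
print(avg_su_per_job)  # 361   -> approximate, since both inputs are rounded
```

Roughly half of the listed roaming projects used a single resource, and close to nine in ten did most of their work on one, subject to the caveats listed on the slide.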
Situational Analysis • The course of HPC and NSF sponsored CI is a “journey” • There are dead-ends • ECL • Vectors • … • Where does the future path lead: is the NSF center of 1985 (and DOE LCF’s) the model of the future or the past? • Is the grid/cloud/utility model a dead-end or the future? • Can a confident prediction be made today?
Looking for Keys under the Lamppost • Large roaming successes may be absent because we are making it impossible for large roaming to succeed today. • We do not know empirically where natural user preferences lead because we have not yet conducted the unbiased test. • We do not know the potential of roaming because we have not promoted and facilitated it. We may be missing the boat. • We may or may not have the ability (or luxury) of looking beyond the lamppost
Non TG CI • There is a great deal of CI, HPC, and scientific computing occurring outside of TG (even within NSF) • We need to consider our place in that mix. • Is TG relevant? Today? In the future? • It is not at all uncommon to hear less than complimentary comments from scientists outside of the TG community about what TG offers – are we missing an opportunity? Are we debilitating future prospects?
Realities • Current TG Governance: Confederation. Roaming decisions are made locally on a resource by resource basis. TG-wide adherence is only achieved by goodwill, sunshine, shame and/or program office intervention • Activation on request proposals are deemed too complex and in practice have had too much time lag. Statements of TG policy and goals do not match user experiences • TeraGrid is entering a period (at least 12-18 months) where allocation requests will exceed resources by a factor of 3-5 or worse. • TeraGrid Resource pool will change significantly on 3/31/2010
Possible Outcomes • Eliminate roaming • Restore/require universal roaming for all • Middle options: • Allow a roaming startup • Automatically make startup roaming • Allow a special allocation type for Campus Champions • Allow (or automatically arrange) roaming • Explicitly use knowledge of lack of 100% utilization of startups and roaming to better estimate usage • Simplify allocations worksheet calculation of implied roaming • Investigate TG-wide queuing policy to prefer/defer roaming. • …
Decision Communication Pragmatics • Any decision needs to include consideration of impact and perception • Globally, and • Within TG sub-communities • Once TGF reaches a consensus: • Communicate effectively with users and other stakeholders
Possible Outcomes 2 (Personal Preference): • Automatic roaming access for Campus Champions across all TG resources (super-roaming) • Support roaming requests up to sizes of 1% (incl. startup). Roaming must be “chosen” - checked • For roaming-like use cases that request >1% of available resources, ask PI to pre-identify target resources. • Assist large roamers with easy transfers, … • Assist growing roamers to become large roamers and to be familiar and comfortable with suppl. requests and transfers. • Follow up any “abuse”, if it ever occurs, with personal communication. • Revisit allocation workflow for simplification
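The 1% rule proposed above could be expressed as a simple routing check. The sketch below is hypothetical: the function name, its parameters, and the returned labels are invented for illustration, and it assumes the 1% is measured against total available SUs, which the slide does not spell out.

```python
def route_request(requested_su, available_su, roaming_checked, threshold=0.01):
    """Classify an allocation request under the proposed 1% roaming rule.

    All names and return labels are hypothetical; only the 1% threshold and
    the "roaming must be chosen" condition come from the slide.
    """
    if not roaming_checked:
        # Roaming is opt-in: an unchecked request is handled as a normal one.
        return "standard"
    if requested_su <= threshold * available_su:
        # Small (including startup-sized) requests get roaming as asked.
        return "roaming"
    # Larger roaming-like requests: ask the PI to pre-identify target resources.
    return "pre-identify targets"


# Example with an assumed pool of 1,000,000 SUs (the 1% line is 10,000 SUs).
print(route_request(5_000, 1_000_000, roaming_checked=True))    # roaming
print(route_request(50_000, 1_000_000, roaming_checked=True))   # pre-identify targets
print(route_request(5_000, 1_000_000, roaming_checked=False))   # standard
```

Keeping the threshold as a parameter makes it easy to revisit the 1% figure if the resource pool changes.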