

  1. Solving an old problem: How do we get a stack trace in a lazy functional language? Simon Marlow

  2. Motivation

  3. General problem • Maintaining a link between the original source code and the final executable code • In such a way that we can still optimise • Important for profiling: not much point in profiling an unoptimised program • Important (but a bit less so) for debugging

  4. Stack traces • A stack trace is a great point in the design space • Abstraction of execution that is • cheap to maintain (sometimes even free) • gives great insight for debugging: “how did I get here?” • is ideal for profiling too: • divides up the profile as a tree • allows aggregation both bottom-up, and across the tree: “find the total cost of all calls to map and their children”

  5. But.. lazy evaluation? • In Haskell we’re envious of strict languages for their easy access to stack traces • and nothing else  • In Python, Erlang, etc. every exception prints out the stack trace by default • zero effort by the programmer, “always on” • Even if you’re just a visitor to a web site written in that language • Claim: • the stack traces you get for free in a strict functional language are not useful enough • interference from tail calls • functional abstractions lead to strange results • So in practice you want some explicitly-managed stack trace anyway

  6. More motivation • In GHC we had (at least) three users for this functionality: • Profiling • Debugging (in GHCi) • Coverage analysis • (maybe) Flat profiling: “where is the program counter?” • We had two forms of Core annotation • one for profiling (SCC) • one used by Coverage and the debugger (Tick), but it was wrong in the case of the debugger • Goal: use a unified infrastructure

  7. Constraints • Manageable overhead • e.g. 2x for profiling • less for flat profiling, more for coverage • results consistent with optimised compilation – i.e. the profile has the “same shape” • Sensible, predictable semantics • you get the stack trace you expect • simple refactorings don’t have surprising effects • all the usual Core-to-Core transformations apply and don’t mess up the results • Ideally: always-on, low usage barrier

  8. A source-language annotation: push L E • “push label L on the stack while evaluating E” • (precise semantics later) • Main point: this is a construct of the source language and the intermediate language (Core) • Compiler can add these automatically, or the user can add them • We get to choose how detailed we want to be: • exported functions only • top-level functions only • all functions (good for profiling) • call sites (good for debugging) • all sub-expressions (fine-grained debugging or profiling) • Similar to SCC in GHC now
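For comparison, a minimal sketch of how today's SCC annotation is written in GHC source (the label here is illustrative):

    -- A hand-written cost-centre annotation, playing roughly the role of `push "f" E`:
    f :: Int -> Int
    f x = {-# SCC "f" #-} x + x

    main :: IO ()
    main = print (f 21)

GHC can also insert such annotations automatically, e.g. -fprof-auto (all bindings) or -fprof-auto-exported (exported bindings only), corresponding to the granularity options above.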

  9. Another annotation: tick L E • “Count evaluations of E; assign the result to label L” • for coverage, we will use tick exclusively. • for debugging and space profiling, just use push. • for time profiling, we often want to count entries too, so we can use both push and tick, with “tick L E” meaning “bump the count of L pushed on the current stack”.

  10. Why separate tick and push? • They have different semantics, and therefore support different transformations.

  11. Semantics of tick • A valid transformation is one that maintains the counts.

  12. Tick is friendly to the optimiser • If we’re definitely going to tick L later, we can do it now. • Allows the optimiser to bring together subexpressions for • beta-reduction • case-of-known-constructor
    (tick L f) x            ->  tick L (f x)
    case tick L E of { .. } ->  tick L (case E of { .. })

  13. But tick is difficult to get rid of • Cannot reduce tick L x • Cannot do anything with \x. tick L \y. E • Unlike push, tick “sticks” and cannot be completely optimised away, even if it surrounds code with zero cost. • Another reason to separate tick from push: don’t use tick if you don’t have to • We added an option to GHC’s profiler to switch off tick

  14. Changing the number of evals • Does GHC ever change the number of times an expression is evaluated, compared to the standard lazy semantics? • 0 -> 1 • 1 -> many • many -> 1 • 1 -> 0 (oh no!)

  15. 0 -> 1 and 1 -> many • Cost must be bounded & small • eta expansion is often beneficial • cost of doing “a +# b” is far less than building the closure for “\y. case ..” and applying it:
    λ x. case a +# b of c -> λ y. ...
      ->  λ x. λ y. case a +# b of c -> ...
• But if there’s a tick in there, we cannot do this optimisation because it would change the counts • -fno-state-hack!

  16. many -> 1 • Does GHC ever reduce the number of evals? • yes of course: full laziness
    λ x. f x (tick L (fib y))
      ->  let z = tick L (fib y) in λ x. f x z
• We do not automatically disable full laziness when using ticks: • FL can make a big difference to runtime • the user gets to see how many times the expression was really evaluated. • Tradeoff between predictability and being faithful to the real operational behaviour • Right now: -fno-state-hack -fno-full-laziness gives predictable results • Results are always correct for zero/non-zero (coverage works)

  17. tick was straightforward, now for push • push L E: “push label L on the stack while evaluating E” • So let’s write a semantics for that • Define stacks:
    type Stack = [Label]
    push :: Label -> Stack -> Stack
    call :: Stack   -- the call stack at the call site
         -> Stack   -- the stack of the function being called
         -> Stack

  18. Executable semantics • E is a State monad containing the Heap: a mapping from Var to (Stack, Expr) • eval takes the current stack as its first argument:
    eval :: Stack -> Expr -> E (Stack, Expr)

    -- values are straightforward
    eval stk (EInt i)   = return (stk, EInt i)
    eval stk (ELam x e) = return (stk, ELam x e)

    -- push L on the stack, evaluate the body
    eval stk (EPush l e) = eval (push l stk) e

    -- suspend the computation e1 on the heap, capturing the current stack
    eval stk (ELet (x,e1) e2) = do
      insertHeap x (stk, e1)
      eval stk e2

    eval stk (EPlus e1 e2) = do
      (_, EInt x) <- eval stk e1
      (_, EInt y) <- eval stk e2
      tick stk                 -- a side effect, used to record the value of the current stack
      return (stk, EInt (x+y))

    -- application continues with the stack returned by evaluating the lambda
    eval stk (EApp f x) = do
      (lam_stk, ELam y e) <- eval stk f
      eval lam_stk (subst y x e)
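The slide leans on a few supporting definitions that are not shown. A minimal sketch of plausible versions, assuming E is a State monad over a Data.Map heap (the constructor names follow the slide; everything else here, like the heap operations, is guesswork, and tick and subst are also left out):

    import qualified Data.Map as Map
    import Control.Monad.State

    type Var   = String
    type Label = String
    type Stack = [Label]

    data Expr
      = EInt  Int
      | EVar  Var
      | ELam  Var Expr
      | EPush Label Expr
      | ELet  (Var, Expr) Expr
      | EPlus Expr Expr
      | EApp  Expr Var        -- arguments are atoms, so application is to a variable

    -- the heap maps a variable to a suspended (Stack, Expr) pair
    type Heap = Map.Map Var (Stack, Expr)
    type E a  = State Heap a

    insertHeap :: Var -> (Stack, Expr) -> E ()
    insertHeap x v = modify (Map.insert x v)

    lookupHeap :: Var -> E (Stack, Expr)
    lookupHeap x = gets (\h -> h Map.! x)

    deleteHeap :: Var -> E ()
    deleteHeap x = modify (Map.delete x)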

  19. Executable semantics (variables)
    eval stk (EVar x) = do
      r <- lookupHeap x
      case r of
        (stk', EInt i)   -> return (stk', EInt i)
        -- here is where we are "calling" a function
        (stk', ELam y e) -> return (call stk stk', ELam y e)
        (stk', e)        -> do
          deleteHeap x
          (stkv, v) <- eval stk' e
          insertHeap x (stkv, v)
          eval stk (EVar x)

  20. Given this semantics, define push & call • The problem now is to find suitable definitions of push and call that • Behave like a call stack • Have nice properties: • transformation-friendly • predictable/robust • implementable

  21. Aside: cost-centres • This is a slight generalisation of Sansom’s cost-centre profiling semantics, where • stacks have one element • push L S = [L] • call Sapp Slam = Sapp “lexical scoping” • call Sapp Slam = Slam “evaluation scoping” • (Sansom used a hybrid scheme, choosing between the two alternatives in different situations)

  22. Aside(2): lazy evaluation is not the problem • Lazy evaluation is dealt with by • capturing the current stack when we suspend a computation as a thunk in the heap • temporarily restoring the stack when the thunk is evaluated • Nothing controversial at all - we just need a mechanism for capturing and restoring the stack.
    eval stk (ELet (x,e1) e2) = do
      insertHeap x (stk, e1)
      eval stk e2

    eval stk (EVar x) = do
      r <- lookupHeap x
      case r of
        ...
        (stk', e) -> do
          deleteHeap x
          (stkv, v) <- eval stk' e
          insertHeap x (stkv, v)
          eval stk (EVar x)

  23. Aside(3): tail calls • The semantics says nothing about tail calls – there is no call stack, so no way to express the difference. • Any implementation should respect the semantics.

  24. Examples
    f = λ x. push “f” x+x
    main = λ x. push “main”
             let y = 1 in f y
• Let’s assume, for now, call Sapp Slam = Sapp • The heap is initialised with the top-level bindings (give each the stack <CAF>) • When we get to (f y), the current stack is <main> • f is already evaluated • call <main> <CAF> = <main> • eval <main> (push “f” y+y) • eval <main,f> (y+y) • at the +, the current stack is <main,f>

  25. Use the call-site stack? • call sapp slam = sapp • Previous example suggests this might be a good choice? • After all, this gives exactly the call stack you would get in a strict language

  26. But we have to be careful • If instead of this:
    f = λ x. push “f” x+x
    main = λ x. push “main”
             let y = 1 in f y
• We wrote this:
    f = push “f” (λ x . x+x)
    main = λ x. push “main”
             let y = 1 in f y
• Now it doesn’t work so well: the “f” label is lost. • In this semantics, the scope of push does not extend into lambdas

  27. Just label all the lambdas? • Idea: make the compiler label all the lambdas automatically • e.g. the compiler inserts a push inside any lambda:
    f = push “f” (λ x . push “f1” x+x)
    main = λ x. push “main”
             let y = 1 in f y
• Now we get a useful stack again: <main,f1>

  28. Properties • This semantics has some nice properties, since the stack attached to a lambda is irrelevant (except for heap profiling):
    push L x            =>  x
    push L (λ x . e)    =>  λ x . e
    push L (C x1 .. xn) =>  C x1 .. xn

    let x = λ y . e in push L e'
      =>  push L (let x = λ y . e in e')

    push L (let x = e in e')
      =>  let x = push L e in push L e'
• O(1) change to cost attribution, no change to profile shape • Note if e is a value, the push L will disappear

  29. Inlining • We expect to be able to substitute a function’s definition for its name without affecting the stack, e.g.
    f = λ x . push “f1” x+x
    main = λ x. push “main”
             let y = 1 in f y
• should be the same as
    main = λ x. push “main”
             let y = 1 in
               (λ x . push “f1” x+x) y
• and indeed it is in this semantics. • (inlining functions is crucial for optimisation in GHC)

  30. More properties • Adding an extra binding doesn’t change the stack • recall ‘push L x == x’ • arguably useful: the stack is robust with respect to this transformation (by the compiler or user)
    f = push “f” (λ x . push “f1” x+x)
    g = push “g” f
    main = λ x. push “main”
             let y = 1 in g y

  31. But... • eta-expansion loses this property • Now the stack at the + will be <main,g,f>
    f = push “f” (λ x . push “f1” x+x)
    g = λ x . push “g” f x
    main = λ x. push “main”
             let y = 1 in g y

  32. Concrete example • When we tried this for real, we found that in functions like
    h = f . g
  h does not appear on the stack, although in
    h x = (f . g) x
  now it does. This is surprising and undesirable.

  33. Worse... • Let’s make a state monad: • Suppose we want the stack when error is called, for debugging
    newtype M s a = M { unM :: s -> (s,a) }

    instance Monad (M s) where
      (M m) >>= k = M $ λ s -> case m s of
                                 (s', a) -> unM (k a) s'
      return a    = M $ λ s -> (s, a)

    errorM :: String -> M s a
    errorM s = M $ λ _ -> error s

    runM :: M s a -> s -> a
    runM (M m) s = case m s of (_, a) -> a
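(As written the instance predates the Applicative superclass requirement; a minimal sketch of the extra instances a current GHC would additionally need, not part of the original slide:)

    import Control.Monad (ap)

    instance Functor (M s) where
      fmap f (M m) = M $ \s -> case m s of (s', a) -> (s', f a)

    instance Applicative (M s) where
      pure a = M $ \s -> (s, a)
      (<*>)  = ap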

  34. Using a monad • Simple example:
    main = print (runM (bar ["a","b"]) "state")

    bar :: [String] -> M s [String]
    bar xs = mapM foo xs

    foo :: String -> M s String
    foo x = errorM x
• We are looking for a stack like <main,runM,bar,mapM,foo,errorM> • Stack we get: <runM>

  35. Why? • Take a typical monadic function:
    f = do p; q
• Desugaring gives
    f = p >> q
• Adding push:
    f = push “f” (p >> q)
• Expanding out (>>):
    f = push “f” (λ s -> case p s of (s', a) -> q s')
• recall that push L (λ x . e) = λ x. e, so the push “f” now wraps a lambda and is dropped: “f” never appears on the stack

  36. The IO monad • In GHC the IO monad is defined like the state monad given earlier. • We found that with this stack semantics, we get no useful stacks for IO monad code at all. • When profiling, all the costs were attributed to main.

  37. call Sapp Slam = Sapp? • Summary: • attractive: easy to explain, easy to implement • but results are sometimes surprising, at worst useless • Note: nothing to do with tail-calls or laziness • don’t strict languages have the same problem? Seems to be a consequence of functional abstractions.

  38. Can we find a better semantics? • call Sapp Slam = ? • non-starter: call Sapp Slam = Slam • ignores the calling context • gives a purely lexical stack, not a call stack • (possibly useful for flat profiling though) • Clearly we want to take into account both Sapp and Slam somehow.

  39. Think about what properties we want • Push inside lambda:
    push L (λ x. e) == λ x. push L e
• (recall that the previous semantics allowed dropping the push here) • This will give us a push that scopes over the inside of lambdas, not just outside. • which will in turn give us that stacks are robust to eta-expansion/contraction

  40. What does it take to make this true? • Consider
    let f = λ x . push “f” e in ... f ...
  versus
    let f = push “f” (λ x . e) in ... f ...
• If we work through the details, we find that we need
    call S (push L Sf) == push L (call S Sf)
• Not difficult, e.g.:
    type Stack = [Label]
    push = (:)
    call = foldr push     -- like flip (++), but useful to define it this way
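A minimal sketch checking this law with the simple definitions above (plain lists for stacks; the property name is illustrative):

    import Test.QuickCheck

    type Label = String
    type Stack = [Label]

    push :: Label -> Stack -> Stack
    push = (:)

    call :: Stack -> Stack -> Stack
    call = foldr push            -- call sapp slam == slam ++ sapp

    -- the law needed for push-inside-lambda
    prop_callPush :: Stack -> Label -> Stack -> Bool
    prop_callPush s l sf = call s (push l sf) == push l (call s sf)

    main :: IO ()
    main = quickCheck prop_callPush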

  41. Recursion? • We do want finite stacks • the mutator is using tail recursion • Simplest approach: push is a no-op if the label is already on the stack somewhere:
    push l s | l `elem` s = s
             | otherwise  = l : s
• still satisfies the push-inside-lambda property • but: not so good for profiling or debugging • the label on top of the stack is not necessarily where the program counter is
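For example, with this definition (labels made up for illustration):

    push "f" ["g","main"]     == ["f","g","main"]
    push "f" ["g","f","main"] == ["g","f","main"]   -- no-op: "f" is already on the stack

so a recursive function contributes at most one frame, but after the no-op the top of the stack no longer names the function actually executing.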

  42. Inlining of functions • (remember, allowing inlining is crucial) • Consider
    let g = λ x. e in
    let f = push “f” g in
    f y
  and its inlined form
    let f = push “f” (λ x. e) in
    f y
• Work through the details, and we need that
    call (push L S) S == push L S
• interesting: calling a function whose stack is a prefix of the current stack should not change the stack.

  43. Inlining
    call (push L S) S == push L S
• if push == (:), this doesn’t hold • does hold if push ignores duplicate labels • The definitions I want to use:
    call Sapp Slam = foldr push Spre Slam'
      where (Spre, Sapp', Slam') = commonPrefix Sapp Slam

    push l s | l `elem` s = dropWhile (/= l) s
             | otherwise  = l : s
• more useful for profiling/debugging: the top-of-stack label is always correct, we just truncate the stack on recursion.
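The commonPrefix helper is used but never defined on the slides. A minimal sketch of a plausible version, assuming Stack = [Label] with the top of the stack at the head of the list, so the "common prefix" is the shared portion at the base of both stacks (the longest common suffix of the lists):

    type Label = String
    type Stack = [Label]

    -- returns (shared base, remainder of the first stack, remainder of the second stack)
    commonPrefix :: Stack -> Stack -> (Stack, Stack, Stack)
    commonPrefix sapp slam = (reverse pre, reverse app', reverse lam')
      where
        (pre, app', lam') = go (reverse sapp) (reverse slam)
        go (x:xs) (y:ys) | x == y = let (p, a, l) = go xs ys in (x:p, a, l)
        go xs ys                  = ([], xs, ys)

    -- illustrative values:
    --   commonPrefix ["f","main"] ["g","main"] == (["main"], ["f"], ["g"])
    --   push "f" ["g","f","main"]              == ["f","main"]   -- truncated back to the earlier "f"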

  44. Break out the proof tools • QuickCheck:
    prop_append2 =
      forAllShrink stacks shrinkstack $ \s ->
      forAll labels $ \x ->
        call (s `push` x) s == s `push` x

    *** Failed! Falsifiable (after 8 tests and 2 shrinks):
    (E :> "e") :> "b"
    "e"
• but this corresponds to something very strange:
    push “f” ...
      let g = λ x. e in
      let f = push “f” g in
      f y

  45. A more restricted property
    prop_stack2a =
      forAllShrink stacks shrinkstack $ \s ->
      forAll labels $ \x ->
        x `elemstack` s || call (s `push` x) s == s `push` x

    *Main> quickCheck prop_stack2a
    +++ OK, passed 100 tests.
• This is a limited form of the real property we need for inlining • The push-inside-lambda property behaves similarly: we need to restrict the use of duplicate labels to make it go through.

  46. Open questions • Can we formulate the properties in such a way that admits a useful definition of call/push? • Are there other, more detailed, definitions of call/push that we can use? • e.g. Allwood’s “finding the needle”, where recursion results in part of the stack being elided as “...”

  47. Status • GHC 7.4.1 has a new implementation of profiling and coverage using push & tick • Optimiser assumes that inlining and push-inside-lambda are safe, along with a collection of other transformations • compile-time switchable between two semantics for call/push: • (1) drop duplicate labels • (2) (default) call using “common-prefix”, push truncates on recursion • a bit naughty: we know that (1) is correct but not very useful, (2) is more useful and we think correct in the cases we care about • Results look good • we’ve already found and fixed performance bugs in GHC using the new profiler • Have a bunch of test cases to verify that we get the same stacks with and without optimisation • Profiling now works on parallel programs too

  48. Programmatic access to stack trace • The GHC.Stack module provides runtime access to the stack trace • On top of which is built this:
    -- | like 'trace', but additionally prints a call stack if one is
    -- available.
    traceStack :: String -> a -> a
• e.g. now when GHC panics it emits a stack trace (if it was compiled with profiling)
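A minimal usage sketch (the stack is only populated when the program is compiled for profiling, e.g. with -prof -fprof-auto; the compute function below is made up for illustration):

    import Debug.Trace (traceStack)
    import GHC.Stack (currentCallStack)

    compute :: Int -> Int
    compute n = traceStack "entering compute" (n * 2)  -- prints the current cost-centre stack, if any

    main :: IO ()
    main = do
      print (compute 21)
      stack <- currentCallStack      -- raw access to the current cost-centre stack, as strings
      mapM_ putStrLn stack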
