What’s optimal about N choices? Tyler McMillen & Phil Holmes, PACM/CSBMB/Conte Center, Princeton University. Banbury, May 2005, at CSH. Thanks to NSF & NIMH.
Neuro-inspired decision-making models*
1. The two-alternative forced-choice task (2-AFC). Optimal decisions: SPRT, LAM and DDM**.
2. Optimal performance curves.
3. MSPRT: an asymptotically optimal scheme for n > 2 choices (Dragalin et al., 1999-2000).
4. LAM realizations of n-AFC; mean RT vs ER; Hick’s law.
5. Summary (the maximal order statistics).
* Optimality viewpoint: maybe animals can’t do it, but they can’t do better.
** Sequential probability ratio test, leaky accumulator model, drift-diffusion model.
2-AFC, SPRT, LAM & DDM
Choosing between 2 alternatives with noisy incoming data drawn from one of two densities, p1(x) or p2(x). Set thresholds -Z, +Z and form the running product of likelihood ratios (equivalently, the sum of their logs):
R_n = Σ_{j≤n} log[p2(x_j)/p1(x_j)].
Decide 1 (resp. 2) when R_n first falls below -Z (resp. exceeds +Z).
Theorem (Wald, 1947; Barnard, 1946): SPRT is optimal among fixed or variable sample size tests in the sense that, for a given error rate (ER), the expected # samples to decide is minimal. (Or, for a given # samples, ER is minimal.)
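To make the test concrete, here is a minimal Python sketch (not from the talk) of the SPRT for two Gaussian hypotheses; the unit-variance Gaussians with means ∓mu, and all parameter values, are illustrative assumptions:

```python
import numpy as np

def sprt_trial(mu=0.5, Z=3.0, true_hyp=2, rng=None):
    """One SPRT trial: x ~ N(-mu, 1) under hyp 1, N(+mu, 1) under hyp 2.

    Accumulates R_n = sum_j log[p2(x_j)/p1(x_j)] and stops when R_n
    first leaves (-Z, +Z). Returns (decision, # samples).
    """
    rng = rng or np.random.default_rng()
    mean = mu if true_hyp == 2 else -mu
    R, n = 0.0, 0
    while -Z < R < Z:
        x = rng.normal(mean, 1.0)
        R += 2.0 * mu * x   # log[p2(x)/p1(x)] = 2*mu*x for these Gaussians
        n += 1
    return (2 if R >= Z else 1), n

# Estimate ER and mean sample size over repeated trials:
rng = np.random.default_rng(0)
trials = [sprt_trial(rng=rng) for _ in range(5000)]
print("ER =", np.mean([d != 2 for d, _ in trials]),
      " mean # samples =", np.mean([n for _, n in trials]))
```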
DDM is the continuum limit of SPRT. Let the accumulated log likelihood ratio become a continuous variable x, evolving as a drift-diffusion process dx = a dt + c dW, with drift a, noise strength c, and thresholds ±Z. Extensive modeling of behavioral data (Stone, Laming, Ratcliff et al., ~1960-2005).
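A matching sketch of the continuum limit, integrating dx = a dt + c dW by Euler-Maruyama between thresholds ±Z (step size and parameters are illustrative assumptions):

```python
import numpy as np

def ddm_trial(a=0.2, c=1.0, Z=1.0, dt=1e-3, rng=None):
    """Simulate dx = a*dt + c*dW from x(0) = 0 until |x| reaches Z.

    Returns (correct, decision_time); 'correct' means +Z was hit first.
    """
    rng = rng or np.random.default_rng()
    x, t = 0.0, 0.0
    while abs(x) < Z:
        x += a * dt + c * np.sqrt(dt) * rng.normal()
        t += dt
    return x >= Z, t
```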
There’s also increasing neural evidence for DDM: FEF: Schall, Stuphorn & Brown, Neuron, 2002. LIP: Gold & Shadlen, Neuron, 2002.
Balanced LAM reduces to DDM on an invariant line (linearized: race model if a = b = 0). With leak a and inhibition b, the sum coordinate y1 = x1 + x2 uncouples and follows a stable OU flow (strongly attracting if a, b are large), while the difference y2 = x2 - x1 undergoes pure drift-diffusion if a = b. Absolute thresholds in (x1, x2) become relative thresholds ±Z on (x2 - x1)!
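A sketch of the linearized two-unit LAM with leak a and mutual inhibition b (the explicit equations below are a reconstruction of the standard leaky-accumulator form, and parameters are illustrative). With a = b the difference x2 - x1 is a pure drift-diffusion, so a relative threshold on it reproduces the DDM:

```python
import numpy as np

def lam2_trial(a=2.0, b=2.0, I=(1.0, 1.2), c=0.3, Z=0.4, dt=1e-3, rng=None):
    """Linearized 2-unit LAM: dx_i = (-a*x_i - b*x_j + I_i)*dt + c*dW_i.

    The sum x1 + x2 relaxes at rate a + b (the stable OU direction);
    the difference x2 - x1 relaxes at rate a - b, i.e. pure DD when a = b.
    A relative threshold +/-Z is applied to the difference.
    """
    rng = rng or np.random.default_rng()
    x = np.zeros(2)
    t = 0.0
    while abs(x[1] - x[0]) < Z:
        dW = rng.normal(size=2) * np.sqrt(dt)
        x += (-a * x - b * x[::-1] + np.asarray(I)) * dt + c * dW
        t += dt
    return (2 if x[1] > x[0] else 1), t
```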
LAM sample paths collapse towards an attracting invariant manifold (cf. C. Brody: Machens et al., Science, 2005). [Figure: sample paths x_i(t) vs. time t.] First passage across threshold determines choice.
Simple expressions for first passage times and ERs: with drift a, noise c and thresholds ±z, reduction to 2 parameters ã = (a/c)^2 and z̃ = z/a gives
ER = 1/(1 + e^{2ãz̃}), DT = z̃ tanh(ãz̃).
Can compute thresholds that maximize reward rate RR = (1 - ER)/(DT + D):
(1) e^{2ãz̃} - 1 = 2ã(D - z̃).
(Gold-Shadlen, 2002; Bogacz et al., 2004-5.) This leads to …
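A sketch that evaluates these closed forms and finds the RR-maximizing threshold numerically (the mean non-decision delay D per trial is an assumed parameter); a grid search stands in for solving (1):

```python
import numpy as np

def er_dt(atil, ztil):
    """DDM error rate and mean decision time in reduced parameters
    atil = (a/c)^2, ztil = z/a."""
    er = 1.0 / (1.0 + np.exp(2.0 * atil * ztil))
    dt = ztil * np.tanh(atil * ztil)
    return er, dt

def optimal_threshold(atil=1.0, D=2.0):
    """Maximize RR = (1 - ER)/(DT + D) over the threshold ztil by grid search."""
    ztil = np.linspace(1e-4, 5.0, 20000)
    er, dt = er_dt(atil, ztil)
    return ztil[np.argmax((1.0 - er) / (dt + D))]

print("optimal ztil =", optimal_threshold())
```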
Optimal performance curves (OPCs): Human behavioral data: the best subjects are optimal, but what about the rest? Bad objective function, or bad learners? [Figure: left, OPC for the RR defined previously; right, a family of RRs weighted for accuracy, with increasing accuracy weight.] Learning not considered here. (Bogacz et al., 2004; Simen, 2005.)
n-AFC: MSPRT & LAM
MSPRT chooses among n alternatives by a max vs. next test: with running log likelihoods L_i = Σ_{j≤t} log p_i(x_j), decide i when L_i - max_{k≠i} L_k first exceeds a threshold Z. MSPRT is asymptotically optimal in the sense that the # samples is minimal in the limit of low ERs (Dragalin et al., IEEE Trans., 1999-2000). A LAM realization of MSPRT (Usher-McClelland 2001) asymptotically predicts mean RT growing as log(n-1) (cf. Usher et al., 2002).
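A sketch of the max vs next stopping rule for n Gaussian channels (the observation model, with mean mu on the correct channel and 0 elsewhere, is an illustrative assumption):

```python
import numpy as np

def msprt_trial(n=4, mu=0.5, Z=3.0, rng=None):
    """Max vs next test: stop when max_i L_i - max_{k != i} L_k >= Z.

    L_i is the running log likelihood of alternative i up to a constant
    common to all i (which cancels in the differences); for unit-variance
    Gaussian channels the increment is mu * x_i.
    """
    rng = rng or np.random.default_rng()
    L = np.zeros(n)
    samples = 0
    while True:
        x = rng.normal(0.0, 1.0, size=n)
        x[0] += mu                # alternative 0 is the correct one
        L += mu * x
        samples += 1
        top2 = np.sort(L)[-2:]    # runner-up and leader
        if top2[1] - top2[0] >= Z:
            return int(np.argmax(L)), samples
```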
The log(n-1) dependence is similar to Hick’s Law: RT = A + B log n or RT = B log (n+1). W.E. Hick, Q.J. Exp. Psych, 1952. We can provide a theoretical basis and predict explicit SNR and ER dependence in the coefficients A, B.
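Using the msprt_trial sketch above, the predicted scaling can be checked numerically by regressing mean sample counts on log(n-1) (a toy illustration, not the talk's derivation of A and B):

```python
import numpy as np

rng = np.random.default_rng(1)
ns = np.array([2, 3, 4, 6, 8, 12, 16])
mean_rt = np.array([np.mean([msprt_trial(n=k, rng=rng)[1] for _ in range(500)])
                    for k in ns])
B, A = np.polyfit(np.log(ns - 1), mean_rt, 1)   # RT ~ A + B*log(n - 1)
print("A =", round(A, 2), " B =", round(B, 2))
```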
Multiplicative constants blow up logarithmically as ER -> 0. Behavior for small and larger ERs is captured by an empirical formula (2), which generalizes (1).
But a running max vs next test is computationally costly (?). LAM can approximately execute a max vs average test via absolute thresholds. The n-unit LAM is decoupled by splitting off the mean activity y1: y1 is attracted to the hyperplane y1 = A, so max vs average becomes an absolute test! Drift-diffusion occurs on the hyperplane. Attraction is faster for larger n: the stable eigenvalue λ1 ~ n.
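A sketch of the n-unit LAM read out with absolute thresholds (equations again in the reconstructed leaky-accumulator form, parameters illustrative). The mean activity is attracted to a hyperplane at rate a + (n-1)b, which grows ~ n as noted above, so an absolute test on the units approximates max vs average there:

```python
import numpy as np

def lam_n_trial(n=4, a=2.0, b=1.0, mu=0.5, c=0.3, theta=0.3, dt=1e-3, rng=None):
    """n-unit LAM, dx_i = (-a*x_i - b*sum_{k != i} x_k + I_i)*dt + c*dW_i,
    with an absolute threshold theta on each unit.

    I_0 = mu (correct alternative), I_k = 0 otherwise. The mean coordinate
    decays at rate a + (n-1)*b, so the attracting hyperplane is reached
    faster for larger n.
    """
    rng = rng or np.random.default_rng()
    I = np.zeros(n)
    I[0] = mu
    x = np.zeros(n)
    t = 0.0
    while x.max() < theta:
        dW = rng.normal(size=n) * np.sqrt(dt)
        x += (-a * x - b * (x.sum() - x) + I) * dt + c * dW
        t += dt
    return int(np.argmax(x)), t
```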
Max vs average is not optimal, but it’s not so bad: [Figure: RT-ER performance of unbalanced LAMs (OU processes) under absolute, max vs average, and max vs next tests.] Max vs next and max vs ave coincide for n = 2. As n increases, max vs ave deteriorates, approaching absolute test performance. But it’s still better than the absolute test for n < 8-10!
Simple LAM/DD predicts log (n-1), not log n or log (n+1) as in Hick’s law: but a distribution of starting points gives approx log n scaling for 2 < n < 8, and ER and SNR effects may also enter.
Nonlinear LAMs: The effect of nonlinear activation functions, bounded below, is to shift scaling toward linear in n. The limited dynamic range degrades performance, but can be offset by suitable bias (recentering). [Figure: RT scaling for nonlinear LAMs vs. the linearized LAM.]
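A sketch of a nonlinear variant, clipping activities below at zero (a threshold-linear activation standing in for "bounded below" is an assumption, as is the bias term used for recentering):

```python
import numpy as np

def nonlinear_lam_trial(n=4, a=2.0, b=1.0, mu=0.5, c=0.3, bias=0.2,
                        theta=0.3, dt=1e-3, rng=None):
    """n-unit LAM with activities clipped below at 0 (limited dynamic range).

    The bias input recenters units away from the floor, partially offsetting
    the performance lost to the clipped range.
    """
    rng = rng or np.random.default_rng()
    I = np.full(n, bias)
    I[0] += mu                         # signal on the correct unit
    x = np.zeros(n)
    t = 0.0
    while x.max() < theta:
        dW = rng.normal(size=n) * np.sqrt(dt)
        x += (-a * x - b * (x.sum() - x) + I) * dt + c * dW
        x = np.maximum(x, 0.0)         # activation bounded below at zero
        t += dt
    return int(np.argmax(x)), t
```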
Summary: N-AFC
• MSPRT max vs next test is asymptotically optimal in the low ER limit.
• LAM (& race model) can perform the max vs next test.
• Hick’s law emerges for max vs next, max vs ave & absolute tests; A, B smallest for max vs next, OK for max vs ave.
• LAM executes a max vs average test on its attracting hyperplane using absolute thresholds.
• Variable start points give log n scaling for ‘small n’.
• Nonlinear LAMs degrade performance: RT ~ n for sufficiently small dynamic range.
More info: http://mae.princeton.edu/people/e21/holmes/profile.html