310 likes | 501 Views
Hardware Acceleration of Monte-Carlo Structural Financial Instrument Pricing Using a Gaussian Copula Model. Presenter: Alexander Kaganov Supervisor: Paul Chow. Increasing Computational Requirements (1/3). In recent years the financial industry has seen:
E N D
Hardware Acceleration ofMonte-Carlo Structural Financial Instrument Pricing Using a Gaussian Copula Model Presenter: Alexander Kaganov Supervisor: Paul Chow
Increasing Computational Requirements (1/3) In recent years the financial industry has seen: 1. Increasing contract/model complexity • Every year new models are developed • Unavailability of closed-form solution • Necessitate Monte-Carlo pricing
N instruments Y time 3xN instruments 3xY time (at least) Increasing Computational Requirements (2/3) 2. Increasing portfolio sizes • Increase in simple to price instruments • Increase in complex derivate security • CDO issuance has increased from $157 billion in 2004 to $507 billion in 2007 (>3x)¹ ¹ SIFMA
Increasing Computational Requirements (3/3) 3. Ever-present need to make real-time decisions • Market trends can change quickly • Instruments traded electronically 1 ms in Latency is Worth $100 M in Stock Trading Business Value (AMD Analyst Day-26 july 2007)
Coarse-Grain Fine-Grain Typical MC Financial simulation Trends in Financial Monte-Carlo Algorithms 1. Computationally intensive • Accuracy Time2 2. Highly repetitive • A large portion of the calculation time is spent in a small portion of the code • (~90% of the time is spent in ~10% of the code) 3. High degree of coarse and fine-grain parallelism
CDO Problem: • Banks typically hold portfolios with highly volatile assets. Solution: • Sell assets to an outside entity (SPV), which combines the different assets together into one collateral pool • Repackage the pool as CDO tranches. • Sell tranches as form of protection to investors in return for premium payments
CDO Structure/Pricing (1/2) • Each tranche has attachment and detachment points • Losses below attachment point → the tranche is unaffected • Losses above the detachment point → the tranche becomes inactive • Investor premium is paid based on the tranche width minus tranche losses Investors Borrowers Super Senior: 12%-100% Bonds Senior: 6% -12% Collateral Pool Loans (Credit Default Swap) Mezzanine: 3% -6% CDS SPV CDOs Equity: 0% -3% Sponsor (Bank) Tranches
CDO Structure/Pricing (2/2) Mezzanine Tranche: Detachment (6%) • Paid premium based on the full investment Investor Premium Payments Investor Premium Payments • Losses 1/3 of the principal investment. Premium paid based on 2/3 of the original investment 4% Tranche Losses Attachment (3%) CDO Tranche Value = Premium Leg – Default Leg S =tranche thickness si= Premium di= Discount factorLi= Tranche loses at time interval i
1. Generate: or Li’s Gaussian Copula Model • Monte-Carlo (MC) pricing model • For each path: One-Factor Multi-Factor 2. Compare: 3. Record losses:
Multi-Core Architecture • Three portions: Distributor, Pricing cores, and Collector. • Coarse Grain Parallelism: MC paths divided among OFGC cores • All cores have the same input data except for random generator seeds • Data transfer occurs in parallel to calculations • Double Buffering • Minimal required data transfer rate with an external processor of: 11.3MBytes/sec • 1-Lane PCI express- 250 MBytes/sec • Data transfer latency can be hidden
Pricing Core Design Phase 1: Generate Yi Phase 2: Compare Yi<Φ-1[P(τi<t)]. Record partial losses Phase 3: Combine the partial sums, L(ti)’s. Phase 4: Convert collateral pool losses to tranche losses Phase 5: Accumulate tranche losses
Phase 1 • Four factors accumulated per clock cycle • Accumulation is performed in parallel to Phase 2 execution One-Factor Multi-Factor
Pricing Core Design Phase 1: Generate Yi Phase 2: Compare Yi<Φ-1[P(τi<t)]. Record partial losses Phase 3: Combine the partial sums, L(ti)’s. Phase 4: Convert collateral pool losses to tranche losses Phase 5: Accumulate tranche losses
Phase 2 • Compare Yi<Φ-1[P(τi<t)]. Record Losses • Fine-grain parallelism: parallelize over time • 8 replicas • More replicas → higher speedup (potentially) • However, large portions of the hardware become underutilized • Pipelined adder latency creates multiple partial sums
Phase 3: Combine the partial sums, L(ti)’s. Phase 4: Convert collateral pool losses to tranche losses Phase 5: Accumulate tranche losses Pricing Core Design Phase 1: Generate Yi Phase 2: Compare Yi<Φ-1[P(τi<t)]. Record partial losses Phase 3: Combine the partial sums, L(ti)’s. Phase 4: Convert collateral pool losses to tranche losses Phase 5: Accumulate tranche losses
One-Factor Utilization* *Utilization for Virtex 5 SX50T chip
Performance: Benchmarks • Credit rating and number of instruments are based on Dow Jones CDX • Notionals obtained from Moody’s, range from $600,000 to $6.6 billion • α: uniformly distributed in [0, 1] • Recovery rate: Normally distributed, N (0.4,0.15) • # of Time Steps: Normally distributed, N (20,10) • # of Systemic Factors: varied from 1 to 16
Processor vs. FPGA setup • 3.4 GHz Intel Xeon Processor • 3GB RAM • C++ program • 100,000 Monte-Carlo paths • Virtex 5 SX50T speed grade -3 • Connected to host through PCI express • 100,000 Monte-Carlo paths
Performance: One-Factor Single Core Results (3/3) Single Core Average Acceleration: Double Precision: 10.8 X Single Precision: 14.0 X Single/Double Hybrid: 13.8 X Integer: 15.8 X
Performance: One-Factor Multi-Core Results • Monte-Carlo paths independence allows for a linear speedup as more pricing cores are incorporated.
Producer: Phase 1 Ratep: 1/SFRF Consumer: Phase 2 Ratec: 1/TRF Performance: Multi-Factor Results (1/3) • SFRF is the number of cycles before Phase 1 generates a new Yi • TRF is the number of cycles before Phase 2 requires a new Yi
Performance: Multi-Factor Results (2/3) Ratec<= Ratep Ratec>Ratep
Summary • Presented a hardware architecture for pricing Collateralized Debt Obligations using Li’s models • Demonstrating the flexibility of an FPGA in exploiting both coarse- and fine-grain parallelism • Established that either a single/double hybrid or “integer” representations could be used to balance resource utilization and accuracy • Achieved a maximal acceleration of over 64-fold and 71-fold, for the one-factor and multi-factor models, respectively
Future Work • Expand to a multi-FPGA system • Attempt the algorithm on a different accelerator architectures • GPU • 3. Incorporate into a financial software suite
Thank You (Questions?)