250 likes | 387 Views
HPC in linguistic research. Andrew Meade University Of Reading a.meade@reading.ac.uk. HPC use in linguistic research. Linguistic and biological models Phylogenies Linguistic data Models of evolution Parallelism Scaling Results On going work Key challenges.
E N D
HPC in linguistic research Andrew Meade University Of Reading a.meade@reading.ac.uk
HPC use in linguistic research • Linguistic and biological models • Phylogenies • Linguistic data • Models of evolution • Parallelism • Scaling • Results • On going work • Key challenges
Inferring evolutionary histories form linguistic data • Evolutionary histories, phylogenies • Tools for understand evolution • Depicts relationships between languages • Identify groups which share a common ancestor • Calculate timing events • Account for lack of independence in the data • Inferred from data, taken from different languages • Using an explicate statistical model of evolution • Problem is NP-hard, growth is a double factorial. • Markov chain Monte Carlo search methods, heuristic search, hill climber • Product of Data + Model
Greek Indo-Iranian Slavic Celtic Germanic Romance
The Data • Swadesh list, Morris Swadesh 1940, onwards • 200 meaning, present in all languages (all most) • Chosen to be stable, slowly evolving and resistant to borrowing • Some what of a language “gene”
Cognate classes • Word with a common evolutionary ancestry and meaning English Fish Danish Fisk Dutch Visch Czech Ryba Russian Ryba Bulgarian Riba Fish Ryba 34other languages 23 other languages
Data coding, Cognates • Cognates, words and meaning what are derived from a common ancestor • Languages evolve by a processes of descent with modification “When” 1 cognate “Water” 3 cognates Englishwhen water Germanwannwasser Frenchquandeau Italianquandoacqua Greekqotenero Hittitekuwapiwatar English11 0 0 German 1 1 0 0 French10 1 0 Italian1 0 1 0 Greek10 0 1 Hittite11 0 0
Continuous-time Markov Model Q10 0 Non cognate 1 Cognate Q01 Q01 Rate at which cognates are gained Q10 Rate at which cognates are lost
The Likelihood Model • Calculates the probability of a tree (T), given the data (D) and model of evolution (M). Fitness / evaluation • Accounts for > 99% of the run time Product over the model 1 – 12 categories Product over the data 200 – 100,000 sites
Level of parallelism Data – Analysis of multiple datasets (3-5) Model – Test a range of models (10-20) Trivially parallel Run – Stochastic process multiple runs (5-10) Code – individual run can still take years
The problem • 2003 – 16 taxa, 125 sites, 1 x model • 2005 – 87 taxa, 2450 sites, 4 x model • 2007 – 400 taxa, 34,440 sites, 100 x model • Complexity 700,000x, 5-6 order of magnitude • 4.8 years per run, typically 5 publication quality runs + 10 model tests • 4.8 years < attention span of academics • results are required in days
Parallel method 1Distribute the data (MPI) Cognates Data ……………………..…………….. Languages ……………………..…………….. Core 1 Core 2 Core 3
Parallel method 2 Distribute the model (OpenMP) Pass 1 Pass 2 Pass 3 Pass 4 Data Data Data Data Core 1 Core 2 Core 3 Core 4
Distribute the data and the model (MPI + OpenMP) Pass 1 Pass 2 Pass 4 Pass 3 Data Data Data Data Core 1 Core 7 Core 3 Core 5 Core 6 Core 4 Core 2 Core 8
Seconds - log 10 Cores
Efficiency Cores
Results • Runtime reduced from 4.8 years to • Good scaling, but not sustainable • HPC has allowed for the accurate analysis of large complex data sets with statistically justifiable models.
Current work • Phoneme data • Modelling sound utterances • Better resolution than cogency data • Relevant linguistics patterns are emerging • 120 phonemes, 2 cogency judgments • Another 3 order of magnitude complexity • Accelerator implementation CUDA / OpenCL
Scalable computing • Last 10 years, 5-6 order of magnate increase in complexity • Reasonably scalable code redesign needed. • Need to change the how not the what • What – statistical framework, realistic models • How – algorithm, language, parallelisation method, hardware • Scalable algorithms
Convergence Parallel Burn in Serial
Key challenges • Computing is a rate limiting step • Trending water / drowning • Widening gap between computing power and data models complexity • Data set size and model complexity restricted • 20-30 year old methods, which are less accurate and non statistical are returning • Connecting researchers with results not HPC • HPC is a nuisance in science • Steep learning curve • High cost. Hardware, running costs and personnel • Access and flexibility • Not one off activity, thousands of data sets are produced each year, 3000+ published in 2011
Acknowledgments Mark Pagel