1 / 25

HPC in linguistic research

HPC in linguistic research. Andrew Meade University Of Reading a.meade@reading.ac.uk. HPC use in linguistic research. Linguistic and biological models Phylogenies Linguistic data Models of evolution Parallelism Scaling Results On going work Key challenges.

Download Presentation

HPC in linguistic research

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. HPC in linguistic research Andrew Meade University Of Reading a.meade@reading.ac.uk

  2. HPC use in linguistic research • Linguistic and biological models • Phylogenies • Linguistic data • Models of evolution • Parallelism • Scaling • Results • On going work • Key challenges

  3. Linguistic and biological systems

  4. Inferring evolutionary histories form linguistic data • Evolutionary histories, phylogenies • Tools for understand evolution • Depicts relationships between languages • Identify groups which share a common ancestor • Calculate timing events • Account for lack of independence in the data • Inferred from data, taken from different languages • Using an explicate statistical model of evolution • Problem is NP-hard, growth is a double factorial. • Markov chain Monte Carlo search methods, heuristic search, hill climber • Product of Data + Model

  5. Greek Indo-Iranian Slavic Celtic Germanic Romance

  6. The Data • Swadesh list, Morris Swadesh 1940, onwards • 200 meaning, present in all languages (all most) • Chosen to be stable, slowly evolving and resistant to borrowing • Some what of a language “gene”

  7. Cognate classes • Word with a common evolutionary ancestry and meaning English Fish Danish Fisk Dutch Visch Czech Ryba Russian Ryba Bulgarian Riba Fish Ryba 34other languages 23 other languages

  8. Data coding, Cognates • Cognates, words and meaning what are derived from a common ancestor • Languages evolve by a processes of descent with modification “When” 1 cognate “Water” 3 cognates Englishwhen water Germanwannwasser Frenchquandeau Italianquandoacqua Greekqotenero Hittitekuwapiwatar English11 0 0 German 1 1 0 0 French10 1 0 Italian1 0 1 0 Greek10 0 1 Hittite11 0 0

  9. Continuous-time Markov Model Q10 0 Non cognate 1 Cognate Q01 Q01 Rate at which cognates are gained Q10 Rate at which cognates are lost

  10. The Likelihood Model • Calculates the probability of a tree (T), given the data (D) and model of evolution (M). Fitness / evaluation • Accounts for > 99% of the run time Product over the model 1 – 12 categories Product over the data 200 – 100,000 sites

  11. Level of parallelism Data – Analysis of multiple datasets (3-5) Model – Test a range of models (10-20) Trivially parallel Run – Stochastic process multiple runs (5-10) Code – individual run can still take years

  12. The problem • 2003 – 16 taxa, 125 sites, 1 x model • 2005 – 87 taxa, 2450 sites, 4 x model • 2007 – 400 taxa, 34,440 sites, 100 x model • Complexity 700,000x, 5-6 order of magnitude • 4.8 years per run, typically 5 publication quality runs + 10 model tests • 4.8 years < attention span of academics • results are required in days

  13. Parallel method 1Distribute the data (MPI) Cognates Data ……………………..…………….. Languages ……………………..…………….. Core 1 Core 2 Core 3

  14. Parallel method 2 Distribute the model (OpenMP) Pass 1 Pass 2 Pass 3 Pass 4 Data Data Data Data Core 1 Core 2 Core 3 Core 4

  15. Distribute the data and the model (MPI + OpenMP) Pass 1 Pass 2 Pass 4 Pass 3 Data Data Data Data Core 1 Core 7 Core 3 Core 5 Core 6 Core 4 Core 2 Core 8

  16. Seconds - log 10 Cores

  17. Efficiency Cores

  18. Results • Runtime reduced from 4.8 years to • Good scaling, but not sustainable • HPC has allowed for the accurate analysis of large complex data sets with statistically justifiable models.

  19. Current work • Phoneme data • Modelling sound utterances • Better resolution than cogency data • Relevant linguistics patterns are emerging • 120 phonemes, 2 cogency judgments • Another 3 order of magnitude complexity • Accelerator implementation CUDA / OpenCL

  20. Scalable computing • Last 10 years, 5-6 order of magnate increase in complexity • Reasonably scalable code redesign needed. • Need to change the how not the what • What – statistical framework, realistic models • How – algorithm, language, parallelisation method, hardware • Scalable algorithms

  21. Convergence Parallel Burn in Serial

  22. Parallel sampling using multiple chains

  23. Key challenges • Computing is a rate limiting step • Trending water / drowning • Widening gap between computing power and data models complexity • Data set size and model complexity restricted • 20-30 year old methods, which are less accurate and non statistical are returning • Connecting researchers with results not HPC • HPC is a nuisance in science • Steep learning curve • High cost. Hardware, running costs and personnel • Access and flexibility • Not one off activity, thousands of data sets are produced each year, 3000+ published in 2011

  24. Acknowledgments Mark Pagel

More Related