
Ontologically-based Searching for Jobs in Linguistics

Ontologically-based Searching for Jobs in Linguistics. Deryle Lonsdale, lonz@byu.edu. The BYU Data Extraction Group: a group of faculty (5) and students (15) from CS, Linguistics, SOAIS. Goal: ontology-based data extraction. NSF funding: CISE/IIS/IDM TIDIE. Website: www.deg.byu.edu/


Presentation Transcript


  1. Ontologically-based Searching for Jobs in Linguistics Deryle Lonsdale lonz@byu.edu Funded by:

  2. The BYU Data Extraction Group • Group of faculty (5) and students (15) from CS, Linguistics, SOAIS • Goal: ontology-based data extraction • NSF funding: CISE/IIS/IDM TIDIE • Website: www.deg.byu.edu/ • Papers, presentations • Tools • Demos

  3. The BYU Data Extraction Group

  4. Overview • Ontology-based extraction • Building knowledge sources • Jobs in linguistics (Sproat) • Putting it all together • Some sample results

  5. Ontologies and IE Source Target

  6. Document-based IE

  7. Conceptual modeling (OSM) [Diagram: a Car object set linked by "has" relationships to Year, Make, Model, Mileage, Price, and Feature, and by "is for" to PhoneNr, which in turn has an Extension; participation constraints such as 0..1, 0..*, and 1..* label each link.]

  8. Recognition and Extraction

     Car   Year  Make    Model      Mileage  Price  PhoneNr
     0001  1989  Subaru  SW                  $1900  (336)835-8597
     0002  1998          Elantra                    (336)526-5444
     0003  1994  HONDA   ACCORD EX  100K            (336)526-1081

     Car   Feature
     0001  Auto
     0001  AC
     0002  Black
     0002  4 door
     0002  tinted windows
     0002  Auto
     0002  pb
     0002  ps
     0002  cruise
     0002  am/fm
     0002  cassette stereo
     0002  a/c
     0003  Auto
     0003  jade green
     0003  gold

  9. Car-Ads Ontology (textual)

     Car [->object];
     Car [0..1] has Year [1..*];
     Car [0..1] has Make [1..*];
     Car [0..1] has Model [1..*];
     Car [0..1] has Mileage [1..*];
     Car [0..*] has Feature [1..*];
     Car [0..1] has Price [1..*];
     PhoneNr [1..*] is for Car [0..*];
     PhoneNr [0..1] has Extension [1..*];
     Year matches [4]
       constant { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d[^\d]"; substitute "^" -> "19"; },
       …
     …
     End;
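One way to read the Year rule above: the context pattern locates a plausible two-digit year, the extract pattern pulls the digits out of that context, and the substitution prepends the century. The following is a minimal Python sketch of that reading, not the group's implementation; the function name `extract_years` and the order in which the three steps are applied are assumptions.

```python
import re

# Patterns copied from the Year rule on the slide.
CONTEXT = r"([^\$\d]|^)[4-9]\d[^\d]"
EXTRACT = r"\d{2}"


def extract_years(text):
    """Return four-digit years recovered from two-digit mentions in text.

    Assumed reading of the rule: find each context match, extract the
    two digits inside it, then apply substitute "^" -> "19".
    """
    years = []
    for ctx in re.finditer(CONTEXT, text):
        hit = re.search(EXTRACT, ctx.group(0))
        if hit:
            # Replacing the start-of-string anchor prepends the century.
            years.append(re.sub(r"^", "19", hit.group(0)))
    return years


print(extract_years("89 Subaru SW $1900 (336)835-8597"))  # ['1989']
```

The context pattern is what keeps the price and phone number from being mistaken for years: a candidate must not be preceded by `$` or another digit, and must not be followed by a digit.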

  10. The data-frame library • Low-level patterns implemented as regular expressions • Match items such as email addresses, phone numbers, names, etc.

      Mileage matches [8]
        constant
          { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; },
          { extract "[1-9]\d{0,2}?,\d{3}"; context "[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]"; substitute "," -> ""; },
          { extract "[1-9]\d{0,2}?,\d{3}"; context "(mileage\:\s*)[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]"; substitute "," -> ""; },
          { extract "[1-9]\d{3,6}"; context "[^\$\d][1-9]\d{3,6}\s*mi(\.|\b|les\b)"; },
          { extract "[1-9]\d{3,6}"; context "(mileage\:\s*)[^\$\d][1-9]\d{3,6}\b"; };
        keyword "\bmiles\b", "\bmi\.", "\bmi\b", "\bmileage\b";
      end;
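To illustrate how a data frame with several alternative rules might be applied, here is a hedged Python sketch using the first two Mileage rules from the slide. The `Rule` class and `apply_rules` function are hypothetical names, and case-insensitive matching is an assumption (suggested by the `[kK]` substitution, but not stated in the slides).

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Rule:
    extract: str                   # pattern for the value itself
    context: Optional[str] = None  # optional pattern the value must sit inside
    substitutions: List[Tuple[str, str]] = field(default_factory=list)


# First two Mileage rules, copied from the slide.
MILEAGE_RULES = [
    Rule(r"\b[1-9]\d{0,2}k", substitutions=[(r"[kK]", "000")]),
    Rule(r"[1-9]\d{0,2}?,\d{3}",
         context=r"[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]",
         substitutions=[(",", "")]),
]


def apply_rules(text, rules):
    """Apply each rule in turn and return the normalized values found."""
    values = []
    for rule in rules:
        if rule.context:
            # Only look for values inside spans that match the context.
            spans = [m.group(0)
                     for m in re.finditer(rule.context, text, re.IGNORECASE)]
        else:
            spans = [text]
        for span in spans:
            for m in re.finditer(rule.extract, span, re.IGNORECASE):
                value = m.group(0)
                for pattern, replacement in rule.substitutions:
                    value = re.sub(pattern, replacement, value)
                values.append(value)
    return values


print(apply_rules("1994 HONDA ACCORD EX 100K", MILEAGE_RULES))  # ['100000']
print(apply_rules("only 45,000 miles", MILEAGE_RULES))          # ['45000']
```

This sketch ignores the `keyword` patterns; in a fuller system they would presumably help confirm that a matched number really is a mileage.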

  11. Lexicons • Repositories of enumerable classes of lexical information • FirstNames, LastNames, USstates, ProvoOremApts, CarMakes, Drugs, CampGroundFeats, etc.
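A lexicon can be thought of as a list of known strings compiled into a single matcher. The sketch below shows one possible way to do that in Python. The entries in `CAR_MAKES` are illustrative placeholders (the contents of the real CarMakes lexicon are not shown in the slides), and the helper names are hypothetical.

```python
import re

# Hypothetical entries; the real CarMakes lexicon is not shown in the slides.
CAR_MAKES = {"Subaru", "Honda", "Hyundai", "Ford"}


def compile_lexicon(entries):
    """Build one case-insensitive, whole-word pattern from lexicon entries."""
    # Longest entries first, so longer entries win over their prefixes.
    alternatives = sorted((re.escape(e) for e in entries), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def find_entries(text, lexicon):
    """Return every lexicon entry found in text, as written in the text."""
    return [m.group(0) for m in lexicon.finditer(text)]


MAKES = compile_lexicon(CAR_MAKES)
print(find_entries("1994 HONDA ACCORD EX 100K", MAKES))  # ['HONDA']
```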

  12. Accessing the output • Extracted information is stored in a relational database • Results can be queried using SQL • Wide range of views is possible
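As a rough illustration of querying extracted records with SQL, the sketch below loads the three car ads from the Recognition and Extraction slide into an in-memory SQLite database. The table and column names are assumptions, as is the assignment of values to the Make and Model columns; the slides do not show the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: one table per object set that can repeat for a car.
conn.executescript("""
    CREATE TABLE Car (id TEXT PRIMARY KEY, year TEXT, make TEXT, model TEXT,
                      mileage TEXT, price TEXT, phone TEXT);
    CREATE TABLE Feature (car_id TEXT REFERENCES Car(id), feature TEXT);
""")
conn.executemany(
    "INSERT INTO Car VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("0001", "1989", "Subaru", "SW", None, "$1900", "(336)835-8597"),
        ("0002", "1998", None, "Elantra", None, None, "(336)526-5444"),
        ("0003", "1994", "HONDA", "ACCORD EX", "100K", None, "(336)526-1081"),
    ],
)
conn.executemany(
    "INSERT INTO Feature VALUES (?, ?)",
    [("0001", "Auto"), ("0001", "AC"), ("0002", "Black"), ("0002", "Auto"),
     ("0003", "Auto"), ("0003", "jade green")],
)


def cars_with_feature(conn, feature):
    """Return (id, year, model) for every car that lists the given feature."""
    return conn.execute(
        "SELECT c.id, c.year, c.model FROM Car c "
        "JOIN Feature f ON f.car_id = c.id "
        "WHERE f.feature = ? ORDER BY c.id",
        (feature,),
    ).fetchall()


print(cars_with_feature(conn, "Black"))  # [('0002', '1998', 'Elantra')]
```

Keeping features in a separate table reflects the ontology's constraint that a car may have many features, while attributes such as year or price appear at most once per car.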

  13. Finding jobs in linguistics • Linguistlist.org, LSA • Email distribution lists (corpora, langage naturelle, CAAL/ACLA, etc.) • Usual commercial sites (monster.com, flipdog.com, dice.com) • Word-of-mouth sources

  14. Sproat’s analysis • Random sample (224/2250) of LinguistList postings, 1994-2001 • Development vs. research, academic vs. industrial • Linguists are most often (approx. 80% of the time) offered development jobs • Linguists are hired more for specific tasks (e.g. grammar, lexicon development) than for general research-oriented tasks (e.g. creating new technological approaches)

  15. The banner years

      Year        Academia  Industry  % Industry
      1994        27        2         7%
      1995        45        5         10%
      1996        52        3         5%
      1997        48        3         6%
      1998        57        3         5%
      1999        56        14        20%
      2000        55        43        39%
      2001 (mid)  22        10        31%

      • Dramatic rise in 1999, 2000 • Steep drop-off since 2001 • Rising demand for technical, computational skills

  16. Linguistic jobs ontology • Why? • user-specifiable constraints • Somewhat closely follows existing ontologies (e.g. jobs, software)

  17. Data frames and lexicons • Language names • Ethnologue • (Sub)fields of linguistics • Linguistlist.org • Tools, toolkits • Software components, programming languages • Linguistics-related job titles • Activities • Responsibilities • Country names

  18. The corpus • 3237 postings (LinguistList, Corpora, LN, WoM), by year: 1998: 541; 1999: 575; 2000: 871; 2001: 952; 2002: 788 • Some noise (non-English, factored, program descriptions, attachments, etc.) • Semi-automatic edits (boilerplate, publicity blurbs about institutions, etc.)

  19. Sample output • Here

  20. Observations • 270 don’t have linguist* (!) • Demand for knowledge of English equals that for all other languages combined (G, F, S, J, C) • Computer/computational background required for almost 1/3 (1116) • Noticeable amount of headhunting, particularly in Seattle, DC areas
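The first observation, postings that never use a word beginning with "linguist", is the kind of count a simple pattern check can produce. A minimal Python sketch follows; the sample postings are invented for illustration, and the exact pattern used in the study is not given in the slides.

```python
import re

# "linguist*" read as: any word starting with "linguist", in any case.
LINGUIST = re.compile(r"\blinguist\w*", re.IGNORECASE)


def postings_without_linguist(postings):
    """Return the postings that contain no word starting with 'linguist'."""
    return [p for p in postings if not LINGUIST.search(p)]


# Invented sample postings, for illustration only.
sample = [
    "Computational Linguist wanted for grammar development",
    "Speech recognition engineer, ASR tuning experience required",
    "PhD in linguistics or a related field",
]
print(len(postings_without_linguist(sample)))  # 1
```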

  21. Programming languages

  22. Popular subfields

  23. Subfields (another perspective)

  24. An engineering discipline? • 160 linguistics jobs ending in “engineer” • Software development cycle • research e., software design e. • development e., software e. • software quality e., linguistic test e., linguistic quality e. • linguistic support e., user experience e. • presales e., technical sales e. • Specific subfields • web site e. • speech e., voice recognition e., speech recognition application e., speech e., ASR tuning e., audio e. • dialog e. • tools e. • AI e., NLP e. • knowledge e. • linguist e., natural language e. • staff e. • human factors e., user interface e.

  25. Paradigms

  26. Other observations • Often a job title is not even listed (!) • More i18n (internationalization) of data frames is needed (e.g. email, phone numbers) • Great need for (preferably hierarchical) lexical repositories related to linguistics • job titles • theoretical frameworks, subfields • typical linguist job activities • linguistic research/development venues
