Ontologically-based Searching for Jobs in Linguistics Deryle Lonsdale lonz@byu.edu Funded by:
The BYU Data Extraction Group • Group of faculty (5) and students (15) from CS, Linguistics, SOAIS • Goal: ontology-based data extraction • NSF funding: CISE/IIS/IDM TIDIE • Website: www.deg.byu.edu/ • Papers, presentations • Tools • Demos
Overview • Ontology-based extraction • Building knowledge sources • Jobs in linguistics (Sproat) • Putting it all together • Some sample results
Ontologies and IE [Figure: diagram with the labels "Source" and "Target"]
Conceptual modeling (OSM) [Figure: OSM diagram of the car-ads model. Car is linked to Year, Make, Model, Mileage, Price, Feature, and PhoneNr; PhoneNr is linked to Extension. Edges carry participation constraints such as 0..1, 0..*, and 1..*.]
Recognition and Extraction

Car   Year  Make    Model      Mileage  Price  PhoneNr
0001  1989  Subaru  SW                  $1900  (336)835-8597
0002  1998          Elantra                    (336)526-5444
0003  1994  HONDA   ACCORD EX  100K            (336)526-1081

Car   Feature
0001  Auto
0001  AC
0002  Black
0002  4 door
0002  tinted windows
0002  Auto
0002  pb
0002  ps
0002  cruise
0002  am/fm
0002  cassette stereo
0002  a/c
0003  Auto
0003  jade green
0003  gold
Car-Ads Ontology (textual)

  Car [->object];
  Car [0..1] has Year [1..*];
  Car [0..1] has Make [1..*];
  Car [0..1] has Model [1..*];
  Car [0..1] has Mileage [1..*];
  Car [0..*] has Feature [1..*];
  Car [0..1] has Price [1..*];
  PhoneNr [1..*] is for Car [0..*];
  PhoneNr [0..1] has Extension [1..*];
  Year matches [4]
    constant { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d[^\d]"; substitute "^" -> "19"; },
    … …
  End;
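The Year rule above can be read as three steps: find a context match, pull the extract pattern out of it, then apply the substitution. The following is a minimal Python sketch of that reading. The two regular expressions are copied from the slide; the function name `extract_years` and the way the three parts are combined are assumptions for illustration, not the extraction engine itself.

```python
import re

# Patterns copied from the Year data frame on the slide.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Return four-digit years for two-digit years found in a valid context.

    Hypothetical helper: the real extractor may combine the rule parts
    differently.
    """
    years = []
    for ctx in CONTEXT.finditer(text):
        hit = EXTRACT.search(ctx.group(0))
        if hit:
            # substitute "^" -> "19": prefix the two digits with the century
            years.append(re.sub(r"^", "19", hit.group(0)))
    return years


print(extract_years("'89 Subaru SW, runs great"))
print(extract_years("$8900 obo"))
```

The context pattern is what keeps the two digits in a price such as $8900 from being read as a year: the digits must not be preceded by a dollar sign or another digit.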
The data-frame library • Low-level patterns implemented as regular expressions • Match items such as email addresses, phone numbers, names, etc.

  Mileage matches [8]
    constant
      { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; },
      { extract "[1-9]\d{0,2}?,\d{3}"; context "[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]"; substitute "," -> ""; },
      { extract "[1-9]\d{0,2}?,\d{3}"; context "(mileage\:\s*)[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]"; substitute "," -> ""; },
      { extract "[1-9]\d{3,6}"; context "[^\$\d][1-9]\d{3,6}\s*mi(\.|\b|les\b)"; },
      { extract "[1-9]\d{3,6}"; context "(mileage\:\s*)[^\$\d][1-9]\d{3,6}\b"; };
    keyword "\bmiles\b", "\bmi\.", "\bmi\b", "\bmileage\b";
  end;
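As a rough illustration of how such rules normalize values, here is a small Python sketch that applies the first two Mileage rules. The patterns come from the slide; the `RULES` layout, the function name `extract_mileage`, and the use of case-insensitive matching (so that "100K" is recognized) are assumptions made for this example.

```python
import re

# First two Mileage rules from the slide, as (extract, context, substitute).
# The dictionary layout is hypothetical, chosen only for this sketch.
RULES = [
    {"extract": r"\b[1-9]\d{0,2}k",
     "context": None,
     "substitute": (r"[kK]", "000")},
    {"extract": r"[1-9]\d{0,2}?,\d{3}",
     "context": r"[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]",
     "substitute": (r",", "")},
]


def extract_mileage(text):
    """Return normalized mileage strings found by the rules above."""
    values = []
    for rule in RULES:
        # Assumption: matching is case-insensitive.
        extract = re.compile(rule["extract"], re.IGNORECASE)
        if rule["context"]:
            # Only look for the value inside text matched by the context.
            spans = [m.group(0)
                     for m in re.finditer(rule["context"], text, re.IGNORECASE)]
        else:
            spans = [text]
        for span in spans:
            for hit in extract.finditer(span):
                pattern, replacement = rule["substitute"]
                values.append(re.sub(pattern, replacement, hit.group(0)))
    return values


print(extract_mileage("94 HONDA ACCORD EX, 100K, auto"))
print(extract_mileage("only 45,000 miles, $4,500"))
```

In the second call the context pattern rejects the price, because the digits are preceded by a dollar sign.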
Lexicons • Repositories of enumerable classes of lexical information • FirstNames, LastNames, USstates, ProvoOremApts, CarMakes, Drugs, CampGroundFeats, etc.
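A lexicon lookup can be sketched as matching any entry of an enumerated list against the text. The Python below is an illustrative assumption, not the group's implementation: the `CAR_MAKES` entries are sample values, and `build_matcher` and `lookup` are hypothetical helper names.

```python
import re

# Illustrative entries only; a real CarMakes lexicon would be much larger.
CAR_MAKES = ["Subaru", "Honda", "Hyundai", "Ford"]


def build_matcher(entries):
    """Compile one case-insensitive pattern that matches any lexicon entry."""
    # Longest entries first, so multi-word entries win over their prefixes.
    ordered = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(entry) for entry in ordered)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def lookup(text, entries):
    """Return every lexicon entry found in the text, as written in the text."""
    return [m.group(1) for m in build_matcher(entries).finditer(text)]


print(lookup("1994 HONDA ACCORD EX 100K", CAR_MAKES))
```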
Accessing the output • Extracted information is stored in a relational database • Results can be queried using SQL • Wide range of views is possible
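To make the querying step concrete, the sketch below loads part of the car-ad sample into an in-memory SQLite database and runs one SQL query over it. The table names, column names, and the assignment of values to columns are assumptions made for this example; the slides do not show the actual schema.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Car (
        id TEXT PRIMARY KEY, year INTEGER, make TEXT, model TEXT,
        mileage TEXT, price TEXT, phone TEXT
    );
    CREATE TABLE CarFeature (car_id TEXT, feature TEXT);
""")

# Values taken from the sample car ads; missing fields are stored as NULL.
conn.executemany("INSERT INTO Car VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("0001", 1989, "Subaru", "SW", None, "$1900", "(336)835-8597"),
    ("0002", 1998, None, "Elantra", None, None, "(336)526-5444"),
    ("0003", 1994, "HONDA", "ACCORD EX", "100K", None, "(336)526-1081"),
])
conn.executemany("INSERT INTO CarFeature VALUES (?, ?)", [
    ("0001", "Auto"), ("0001", "AC"),
    ("0002", "Black"), ("0002", "Auto"),
    ("0003", "Auto"), ("0003", "jade green"),
])

# Cars newer than 1990 that have an automatic transmission.
rows = conn.execute("""
    SELECT c.id, c.year, c.model
    FROM Car AS c
    JOIN CarFeature AS f ON f.car_id = c.id
    WHERE f.feature = 'Auto' AND c.year > 1990
    ORDER BY c.id
""").fetchall()
print(rows)
```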
Finding jobs in linguistics • Linguistlist.org, LSA • Email distribution lists (Corpora, Langage Naturel, CAAL/ACLA, etc.) • Usual commercial sites (monster.com, flipdog.com, dice.com) • Word-of-mouth sources
Sproat’s analysis • Random sample (224/2250) of LinguistList postings, 1994-2001 • Development vs. research, academic vs. industrial • Linguists are most often (approx. 80% of the time) offered development jobs • Linguists are hired more for specific tasks (e.g. grammar, lexicon development) than for more general research-oriented tasks (e.g. creating new technological approaches)
The banner years

Year        Academia  Industry  % Industry
1994        27        2         7%
1995        45        5         10%
1996        52        3         5%
1997        48        3         6%
1998        57        3         5%
1999        56        14        20%
2000        55        43        39%
2001 (mid)  22        10        31%

• Dramatic rise in 1999, 2000 • Steep drop-off since 2001 • Rising demand for technical, computational skills
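The "% Industry" column appears to be the industry count divided by the total of academic and industry postings for the year. The short Python sketch below recomputes it under that assumption; the formula and the function name `industry_share` are not stated on the slide. Under this reading every row matches the listed value except 2000, which comes out near 44% rather than 39%, so one figure in that row may differ from the original data.

```python
# (academia, industry) counts per year, copied from the table.
COUNTS = {
    "1994": (27, 2),
    "1995": (45, 5),
    "1996": (52, 3),
    "1997": (48, 3),
    "1998": (57, 3),
    "1999": (56, 14),
    "2000": (55, 43),
    "2001 (mid)": (22, 10),
}


def industry_share(academia, industry):
    """Industry postings as a rounded percentage of all postings.

    Assumed formula: industry / (academia + industry).
    """
    return round(100 * industry / (academia + industry))


shares = {year: industry_share(a, i) for year, (a, i) in COUNTS.items()}
for year, share in shares.items():
    print(f"{year}: {share}%")
```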
Linguistic jobs ontology • Why? • user-specifiable constraints • Somewhat closely follows existing ontologies (e.g. jobs, software)
Data frames and lexicons • Language names (Ethnologue) • (Sub)fields of linguistics (Linguistlist.org) • Tools, toolkits • Software components, programming languages • Linguistics-related job titles • Activities • Responsibilities • Country names
The corpus • 3237 postings (LinguistList, Corpora, LN, WoM):

Year  Postings
1998  541
1999  575
2000  871
2001  952
2002  788

• Some noise (non-English, factored, program descriptions, attachments, etc.) • Semi-automatic edits (boilerplate, publicity blurbs about institutions, etc.)
Sample output • Here
Observations • 270 postings do not contain linguist* (!) • Demand for knowledge of English equals that for all other languages combined (G, F, S, J, C) • A computer/computational background is required for almost 1/3 (1116) • Noticeable amount of headhunting, particularly in the Seattle and DC areas
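A count like the first observation could be obtained by testing each posting against a pattern for linguist*. The Python sketch below is one way to do that; the sample postings are invented for illustration, and reading linguist* as a word-initial prefix match (so "psycholinguistics" would not count) is an assumption.

```python
import re

# Assumption: linguist* means any word that starts with "linguist".
LINGUIST = re.compile(r"\blinguist\w*", re.IGNORECASE)


def count_without_linguist(postings):
    """Count postings in which no word starts with 'linguist'."""
    return sum(1 for posting in postings if not LINGUIST.search(posting))


# Invented sample postings, for illustration only.
sample = [
    "Computational linguist wanted for grammar development",
    "Speech engineer, ASR tuning",
    "Assistant Professor of Linguistics",
]
print(count_without_linguist(sample))
```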
An engineering discipline?
• 160 linguistics jobs ending in “engineer”
• Software development cycle
  - research e., software design e.
  - development e., software e.
  - software quality e., linguistic test e., linguistic quality e.
  - linguistic support e., user experience e.
  - presales e., technical sales e.
• Specific subfields
  - web site e.
  - speech e., voice recognition e., speech recognition application e., ASR tuning e., audio e.
  - dialog e.
  - tools e.
  - AI e., NLP e.
  - knowledge e.
  - linguist e., natural language e.
  - staff e.
  - human factors e., user interface e.
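Job titles of this kind can be picked out with a simple pattern over the extracted title field. The following Python sketch is illustrative only: the function name `engineer_modifiers` and the sample titles are assumptions, and it simply returns whatever precedes the final word "engineer".

```python
import re

# A title qualifies if its last word is "engineer" (case-insensitive).
ENGINEER = re.compile(r"^(?P<modifier>.+?)\s+engineer$", re.IGNORECASE)


def engineer_modifiers(titles):
    """Return the part of each title that precedes a final 'engineer'."""
    modifiers = []
    for title in titles:
        match = ENGINEER.match(title.strip())
        if match:
            modifiers.append(match.group("modifier"))
    return modifiers


# Sample titles for illustration; the first two follow the list above.
titles = ["Dialog Engineer", "ASR tuning engineer", "Computational Linguist"]
print(engineer_modifiers(titles))
```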
Other observations • Often a job title is not even listed (!) • More i18n of data frames (e.g. email, phone numbers) • Great need for (preferably hierarchical) lexical repositories related to linguistics: job titles; theoretical frameworks, subfields; typical linguist job activities; linguistic research/development venues