Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies • Why should we analyse online job vacancies? • Because the Internet increasingly serves as a new source of data: it has large potential • Because data collection from the Internet is cost-effective • Because it allows covering important gaps in data and knowledge, such as wages and understanding employers' demand • Because it allows matching demand with supply
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies • Why should we analyse online job vacancies? A wide range of policy applicability: • Demand-led approach to labour market policy: what types of skills should be 'given' to persons in a disadvantaged situation in the labour market • Education policy (second-chance education, curricula formation): if conducted over time, the analysis could help track which skills are on the rise or in decline • Social policy: labour market discrimination
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies II. An example of what you can obtain with online job vacancies:
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies III. What's a Web Crawler? A Web Crawler [or Web Spider or Scraper] is a computer program that automatically gathers, analyses and files information from the Internet at many times the speed of a human (based on Wikipedia). • Pulls information from a website
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies III. What's a Web Crawler? A computer program is a set of instructions for a computer, communicated in a certain programming language (e.g. C++, Pascal or Perl) using a software environment (operating system, compiler, interface). How do you write such a program? We use 'R', a language and software environment.
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies IV. Why should we use a Web Crawler? • It's free: the only cost is the programmer's time • It allows for greater specificity: it can go to a level of detail normally not viable with paper surveys • It's in real time: you get the latest data - it's labour market analysis for NOW • Even a country not in a position to use it today will sooner or later be able to do so
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies V. How it works Generally, the web crawling program instructs the computer to: • Visit a list of URLs ("Uniform Resource Locator", e.g. www.google.com) - the seeds • Identify further hyperlinks ("references to further data", e.g. www.google.com/answers) - the crawl frontier You can then instruct the computer to: • Gather information from the hyperlinks, such as job descriptions, and save it in e.g. an Access file. A minimal sketch of the seed-and-frontier idea follows below.
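Purely as an illustration (not in the original slides), these two steps might look as follows in base R; the simple href pattern is an assumption, and real pages often need more careful matching:
seeds    <- c("http://www.google.com")        # the list of starting URLs (the seeds)
frontier <- character(0)                      # hyperlinks discovered so far
for (url in seeds) {
  page  <- readLines(url)                                    # fetch the page as HTML text
  links <- regmatches(page, gregexpr('href="[^"]*"', page))  # pull out href="..." attributes
  frontier <- c(frontier, unlist(links))                     # add them to the crawl frontier
}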
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies • You can download a free web crawler from the web and customise it; e.g. http://en.wikipedia.org/wiki/Category:Free_web_crawlers • Buy a customised web crawler; e.g. http://ficstar.com/web-data-mining-web-scraping/?gclid=CKeYtKGi87oCFWmWtAodEB4AIQ • Or write it yourself - which I will tell you about in the next 8 minutes
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies VI. How to write your own simple web crawler: • Download and install the open-source (free) software environment 'R': http://www.r-project.org/
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies • Open the EURES website http://ec.europa.eu/eures/main.jsp?acro=job&lang=en&catId=482&parentCategory=482 • Select the "Hotel, Catering and Personal Services staff" occupation in the field "Select an occupation from the drop-down menus:". Do not choose subcategories. • Select "Austria" in the field "Single-country selection". • Press "Search".
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies • Select 10 in the "Results per page" field. • Press "Refine". • Click on the link to the job ads: for instance "Austria: 8336 job(s) matched (9858 post(s))". • Click on "Next page". • Right-click on "Previous page" and select "Open in a new tab".
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies • Copy the URL from the new tab. For instance: http://ec.europa.eu/eures/eures-searchengine/servlet/BrowseCountryJVsServlet?lg=EN&isco=51&country=AT&multipleCountries=AT-%25&multipleRegions=%25&date=01%2F01%2F1975&title=&durex=&exp=&qual=&pageSize=10&totalCount=8336&startIndexes=0-1o1-1o2-1I0-1o1-11o2-1I0-1o1-21o2-1I&page=1 • Now you have the correct URL to paste into the R program.
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies For the EURES website this pattern needs to be repeated: • twice: once for the page URLs (for instance 1734 for Belgium) and once for the job ads (for instance 52,000 for Belgium) • and in the form of loops: for each page {i in 1:N} and for each job ad {j in 1:N} A skeleton of the two nested loops is sketched below. In the following I show one step of the loop for finding the characteristics of a job ad. The commands cannot be copy-pasted as-is but need to be adjusted.
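Purely as an illustration (not in the original slides), the two nested loops might be organised like this; build_page_url() and extract_job_links() are hypothetical helper functions that have to be written for the site at hand:
N_pages <- 1734                              # e.g. the number of result pages for Belgium
for (i in 1:N_pages) {                       # loop over the result pages
  page_url <- build_page_url(i)              # hypothetical helper: insert the page index into the URL
  page     <- readLines(page_url)            # step a: read the page
  job_urls <- extract_job_links(page)        # hypothetical helper: grep the hyperlinks to the job ads
  for (j in seq_along(job_urls)) {           # loop over the job ads on this page
    job_ad <- readLines(job_urls[j])         # read one job ad
    # ... steps b and c: find patterns and extract the job characteristics ...
  }
}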
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies • Open R. You have the 'Console' to show your commands and output and a 'Script' window to write your program. Open a new script. • Write your own little program. It consists of 4 main steps: a. Find and read a webpage in HTML code b. Find a pattern in the HTML code, e.g. "Job Title" c. Extract the information from the identified HTML tag, e.g. "Car mechanic" d. Save the data in an Access or Excel file. HTML = "HyperText Markup Language", the language used to write web pages.
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies a. Find and read a webpage in HTML:
first_url <- 'PASTE YOUR URL IN HERE'
first_page <- readLines(first_url)
The command 'readLines' reads text from a specified connection, where a connection can be a URL that the computer opens. EASY.
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies b. Find a pattern in the HTML text:
Pattern <- '<th colspan="1"> Job Title: </th>'
location <- grep(Pattern, first_page)
The command 'grep' matches strings of text using regular expressions: it searches for "Job Title" on the page of interest and returns the line numbers where it occurs in the HTML text. A regular expression is a concise and flexible pattern language understood by a regular expression processor; e.g. [^abc] matches any character in a text except a, b or c, and [^0-9] matches only non-numeric characters.
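A quick illustration of such a character class (not from the slides):
grep("[^0-9]", c("123", "12a"))   # returns 2: only the second string contains a non-digit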
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies c. Extract the information from the HTML tags:
job_title <- gsub('\t\t\t\t<td colspan="3"><span>|</span></td>', '', job_ad)
The command 'gsub' replaces every match of a regular expression in a character vector; here it deletes the HTML code before and after (|) the job title in 'job_ad', leaving only the title text.
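A small illustration of this tag-stripping use of 'gsub' (not from the slides):
gsub("<b>|</b>", "", "<b>Car mechanic</b>")   # returns "Car mechanic"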
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies d. Save the data:
output_new <- as.data.frame(job_title)
eures <- "C:/Documents and Settings/thum/My Documents/eures16.csv"
write.csv(output_new, eures)
The command 'write.csv' saves the output from 'output_new' to the location 'eures'. EASY.
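Putting steps a-d together, a minimal end-to-end sketch could look as follows; the HTML patterns and the assumption that the title sits on the line after the marker are illustrative and must be adapted to the actual page source:
first_url  <- 'PASTE YOUR URL IN HERE'                 # step a: the page to read
first_page <- readLines(first_url)                     # fetch the page as lines of HTML
Pattern    <- '<th colspan="1"> Job Title: </th>'      # step b: the marker to search for
location   <- grep(Pattern, first_page)                # line numbers where the marker occurs
job_ad     <- first_page[location + 1]                 # assumption: the title sits on the next line
job_title  <- gsub('\t\t\t\t<td colspan="3"><span>|</span></td>', '', job_ad)   # step c: strip the tags
output_new <- as.data.frame(job_title)                 # step d: collect the results
write.csv(output_new, 'eures16.csv')                   # and save them as a CSV file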
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies VII. What do we obtain with it (1): See officeUK.csv (open with Access)
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies VII. What do we obtain with it (2): See Slide 3.
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies A last word on the 'politeness policy': web crawlers can degrade the functionality of websites, because they search at greater depth and speed than a human. They can: • Crash servers • Cause server overload • Disrupt networks So, always ask the webmaster for permission, and build a pause between requests into your crawler, as sketched below.
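One minimal courtesy in R (illustrative; 'page_urls' stands for whatever list of URLs your crawler visits):
for (page_url in page_urls) {
  page <- readLines(page_url)   # fetch one page
  # ... extract and store the information ...
  Sys.sleep(2)                  # wait two seconds before the next request
}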
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies VIII. Limitations: • Availability of data on the internet - but this is changing • Learning curve: the analyst needs time to learn how to use the tools
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies If all this sounded very confusing: http://statistics.berkeley.edu/computing/r-reading-webpages More references: • Spector, P. (2011): Reading Data from Web Pages with R, University of California, Berkeley, Class Notes s133. • Spector, P. (2011): Stat 133 Class Notes - Spring 2011, pp. 70-88. • Jockers, M. (2013): Text Analysis with R, under review with Springer. • R-help mailing list, ETH Zurich, https://stat.ethz.ch/mailman/listinfo/r-help
Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies Prepared by Yamina Guidoum from the work of: Dr. Anna-Elisabeth Thum (anna.thum@ceps.eu), Associate Research Fellow at CEPS and Economic Analyst at DG ECFIN, European Commission. "The views in this presentation do not reflect the views of the European Commission." And of: Dr. Lucia Mytna Kureková, Slovak Governance Institute & Central European University. From the INGRID Winter School "Skills and Occupations in Europe", November 2013, CEPS, Brussels.