270 likes | 399 Views
Scraping the Glass: NBA Rebounding A nalysis. Web Scraping and Predictive Analytics with Alteryx. Greg Murray Hector Amaya. Use Alteryx to scrape BasketballReference.com Return a list of all basketball players in catalog Return season statistics for all players in 2000 or greater
E N D
Scraping the Glass: NBA Rebounding Analysis Web Scraping and Predictive Analytics with Alteryx Greg Murray Hector Amaya
Use Alteryx to scrape BasketballReference.com • Return a list of all basketball players in catalog • Return season statistics for all players in 2000 or greater • Use the resulting data set to build a predictive model Objectives
Batch Macros • Download Tool • XML parse Tool • Regex Tool and Regex Parsing • Alteryx Predictive Suite Key Concepts
Step 1: Examine the webpage(s) that will be scraped Objective 1a:Collect list of Players from BR Catalog
Step 2: Examine the structure of HTML for the desired section • Use your web browsers inspect function (right click Inspect) Objective 1a:Collect list of Players from BR Catalog
Basic structure for HTML tables Objective 1a:Collect list of Players from BR Catalog <Table> - Table Tag <thead> - Table Heading Tag <tr> - Table Row Tag <th>Column 1</th> - Column Headers <th>Column 2</th> <th>Column 3</th> </tr>- Table Row End Tag </thead>- Table Heading End Tag <tbody> - Table Body Tag <tr> <td> data 1</td> - Table Detail (cell) <td> data 2</td> <td> data 3</td> </tr> </tbody>-Table Body End Tag </Table> - Table End Tag
Step 3: Use Alteryx to retrieve the HTML • Pass first URL to download tool using a text input tool • Use the Text columns tool parsing the HTML into rows • Add record IDs to rows Objective 1a:Collect list of Players from BR Catalog
Step 4: Isolate code containing table data • Challenge: We need to identify the table dynamically • Table can be located on different lines on each page • Table can have different # of rows depending on # of players Objective 1a:Collect list of Players from BR Catalog
Step 5: Isolate and parse the table headers Objective 1a:Collect list of Players from BR Catalog • Isolate table headers by reusing the section of tools we used to isolate the table • Parse the table headers using an XML Parse tool
Step 6: Isolate and parse the table detail Objective 1a:Collect list of Players from BR Catalog • Isolate table detail by reusing the section of tools we used to isolate the table and table headers
Step 6: Isolate and Parse Table Detail Objective 1a:Collect list of Players from BR Catalog • Parse the table detail using a combination of XML Parsing and Regex
Step 7: Union headers to detail Objective 1a:Collect list of Players from BR Catalog • Use the Autoc0nfig by Position option • Ensure the header data stream in first in the Output Order
Step 8: Validate the output Objective 1a:Collect list of Players from BR Catalog • Compare data in the workflow to the webpage
Step 9: Convert the workflow into a batch macro Objective 1a:Collect list of Players from BR Catalog • Add a Control Parameter tool to Text input tool at the beginning of the workflow • Add a Macro Output tool the end of the workflow
Step 9: Convert the workflow into a batch macro Objective 1a:Collect list of Players from BR Catalog • Save the workflow as a Batch Macro into your macro repository • If necessary add a new macro repository through the user settings
Step 10: Create a new workflow using the new macro Objective 1a:Collect list of Players from BR Catalog • Pass the macro the player catalog URLs to download and parse all of the Player Catalog webpages
Step 1: Examine the webpage(s) that will be scraped Objective 1b:Return Season Statistics for individual Players
Step 2: Examine the structure of HTML for the desired section • Use your web browsers inspect function (right click Inspect) Objective 1b:Return Season Statistics for Individual Players
Step 3: Use Alteryx to retrieve the HTML • Pass first URL to download tool using a text input tool • Use the Text columns tool parsing the HTML into rows • Add record IDs to rows Objective 1b:Return Season Statistics for Individual Players
Step 4: Isolate code containing table data • Challenge: We need to identify the table dynamically • Table can be located on different lines on each page • Table can have different # of rows depending on seasons played • There are multiple tables on a player’s page Objective 1b:Return Season Statistics for Individual Players
Step 5: Isolate and parse the table headers Objective 1b:Return Season Statistics for Individual Players • Isolate table headers by reusing the section of tools we used to isolate the table • Parse the table headers using an XML Parse tool
Step 6: Isolate and the parse table detail Objective 1b:Return Season Statistics for Individual Players • Isolate table detail by reusing the section of tools we used to isolate the table and table headers • Parse the table detail using a combination of XML Parsing and Regex
Step 7: Union headers to detail/Append URL suffix Objective 1b:Return Season Statistics for Individual Players • Use the Autoc0nfig by Position option • Ensure the header data stream in first in the Output Order
Step 8: Convert the workflow into a batch macro Objective 1b:Return Season Statistics for Individual Players • Add a Control Parameter tool to Text input tool at the beginning of the workflow • Add a Macro Output tool the end of the workflow
Step 8: Convert the workflow into a batch macro Objective 1b:Return Season Statistics for Individual Players • Open the Interface Designer (CTRL +ALT + D) • Change Batch Macro Output mode to : ‘Auto Configure by Name’ • Save the workflow as a Batch Macro into your macro repository
Step 9: Use the batch macro to parse the player pages Objective 1b:Return Season Statistics for Individual Players • Open a New workflow • Use the Player Catalog list as your input data • Concatenate ‘https://www.basketball-reference.com’ with the Player URL suffix • Pass the Concatenated URL into the batch macro • Limit the number of records per run using sample or filter tool
Resources • Regex : www.regexr.com • Batch Macros: https://community.alteryx.com/t5/Live-Training/Live-Training-Build-Your-First-Batch-Macro/m-p/52900 • Alteryx Tool Mastery: • Download • Regex