Web scraping
Documentation / Tutorials
Python
https://instaloader.github.io/index.html
Google sheets
Why scrape?
Application
General: Competitive and price analysis, Lead generation, Keyword research
Research: Science, Product research, Finding / filling a job, Government oversight
Financial: Stock analysis, Insurance and risk management, News gathering and analysis
Example: scraping for a car
• Scrape car-buying websites to find all the Teslas
• Evaluate the prices to find great deals
• Scrape airfares and adjust each deal for the airfare expense
• Send an email digest of the top deal each day
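The evaluation step in the car example can be sketched as follows. This is a minimal sketch under stated assumptions: the listings and airfares are made-up sample values, and `Listing`, `rank_deals`, and `digest_line` are hypothetical names introduced here. A deal is scored as the median listing price minus the total cost (price plus airfare); other scoring rules are equally plausible.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Listing:
    model: str
    city: str
    price: float


def rank_deals(listings, airfare_by_city):
    """Rank listings by how far price + airfare falls below the median price."""
    typical_price = median(item.price for item in listings)
    ranked = []
    for item in listings:
        # Cities without a known airfare are treated as costing nothing to reach.
        total_cost = item.price + airfare_by_city.get(item.city, 0.0)
        ranked.append({
            "listing": item,
            "total_cost": total_cost,
            "savings": typical_price - total_cost,
        })
    ranked.sort(key=lambda row: row["savings"], reverse=True)
    return ranked


def digest_line(row):
    """Format one ranked row as a line for a daily email digest."""
    item = row["listing"]
    return (f"Top deal: {item.model} in {item.city}, "
            f"total cost {row['total_cost']:,.0f}, "
            f"savings vs. median price: {row['savings']:,.0f}")


# Made-up sample data standing in for scraped listings and airfares.
listings = [
    Listing("Model 3", "Denver", 30000.0),
    Listing("Model 3", "Austin", 27000.0),
    Listing("Model Y", "Seattle", 31000.0),
]
airfares = {"Denver": 200.0, "Austin": 350.0, "Seattle": 150.0}

ranked = rank_deals(listings, airfares)
digest = digest_line(ranked[0])
print(digest)
```

The digest string could then be handed to whatever mail mechanism is available (for example the standard-library `smtplib`); sending is left out of the sketch.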
Python libraries
• Beautiful Soup
• Scrapy
• Selenium
Process
• Set a start_url variable
• Download the HTML
• Parse the HTML
• Extract useful information
• Transform or aggregate
• Save the data
• Go to the next URL
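The process above can be sketched with the standard library alone. This is an illustration, not a reference implementation: `PageParser`, `download`, `crawl`, `to_csv`, and the `FAKE_SITE` pages are hypothetical names and data introduced here, and the "next URL" is assumed to be marked with `<a rel="next">`. The fetch function is injectable so the loop can be tried offline.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class PageParser(HTMLParser):
    """Collects the <title> text and the href of the first <a rel="next">."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.next_href = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("rel") == "next" and self.next_href is None:
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def download(url):
    """Download the HTML for one URL (real network call)."""
    req = Request(url, headers={"User-Agent": "notes-example-scraper/0.1"})
    with urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=download, max_pages=10):
    """Download, parse, extract, then follow the next URL until none is left."""
    rows, seen, url = [], set(), start_url
    while url and url not in seen and len(rows) < max_pages:
        seen.add(url)
        parser = PageParser()
        parser.feed(fetch(url))                                   # download + parse
        rows.append({"url": url, "title": parser.title.strip()})  # extract + transform
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text, ready to be written to a file."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["url", "title"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up two-page site so the loop runs without network access.
FAKE_SITE = {
    "https://example.com/page1": '<html><head><title> Page one </title></head>'
                                 '<body><a rel="next" href="/page2">next</a></body></html>',
    "https://example.com/page2": '<html><head><title>Page two</title></head><body></body></html>',
}

rows = crawl("https://example.com/page1", fetch=FAKE_SITE.__getitem__)
csv_text = to_csv(rows)
print(csv_text)
```

Calling `crawl(start_url)` without the `fetch` argument would use `download` and hit the network; the `max_pages` cap and the `seen` set keep the loop from running away on cyclic links.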
HTTP overview
• Request > response via HTTPS
• A request carries a web address, a verb, and a user agent
• GET retrieves data
• POST sends data to the server
• The user agent identifies the browser or web scraper
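The three parts of a request named above (web address, verb, user agent) can be seen by building requests with the standard-library `urllib.request.Request` without sending them. The URLs and the user agent string below are placeholders chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder user agent; a real scraper would identify itself honestly.
USER_AGENT = "notes-example-scraper/0.1"

# GET: retrieves data; parameters travel in the web address.
get_req = Request(
    "https://example.com/search?q=tesla",
    headers={"User-Agent": USER_AGENT},
    method="GET",
)

# POST: sends data to the server in the request body.
form = urlencode({"q": "tesla"}).encode("ascii")
post_req = Request(
    "https://example.com/search",
    data=form,
    headers={"User-Agent": USER_AGENT},
    method="POST",
)

for req in (get_req, post_req):
    # urllib stores header names capitalized as "User-agent".
    print(req.get_method(), req.full_url, req.get_header("User-agent"), req.data)
```

Passing either object to `urllib.request.urlopen` would actually send it and return the response.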
HTML & CSS Selectors
• css => 'title'
• h1
• li - list item
• ul - unordered list
• ol - ordered list
CSS Selectors
XPath
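A small sketch of selecting the elements listed above (title, h1, li, ul, ol) with XPath, using the limited XPath subset in the standard-library `xml.etree.ElementTree`. The page is a made-up, well-formed snippet; real-world HTML is often not valid XML and would usually need an HTML-aware parser such as Beautiful Soup. The rough CSS selector equivalent is given in a comment next to each query.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page used only for illustration.
PAGE = """<html>
  <head><title>Deals</title></head>
  <body>
    <h1>Today</h1>
    <ul>
      <li class="deal">Model 3</li>
      <li>Model Y</li>
    </ul>
    <ol>
      <li>First step</li>
    </ol>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# CSS: title        XPath: .//title
title = root.findtext(".//title")

# CSS: h1           XPath: .//h1
heading = root.findtext(".//h1")

# CSS: ul > li      XPath: .//ul/li  (items of unordered lists only)
unordered_items = [li.text for li in root.findall(".//ul/li")]

# CSS: li           XPath: .//li     (every list item, in document order)
all_items = [li.text for li in root.findall(".//li")]

# CSS: li.deal      XPath: .//li[@class='deal']
# Note: the XPath form matches the exact attribute value, while the CSS
# class selector also matches elements carrying additional classes.
deal_items = [li.text for li in root.findall(".//li[@class='deal']")]

print(title, heading, unordered_items, all_items, deal_items)
```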
Legal Risks
getting sued for copyright infringement / private websites
Safe
public government website
scraping for personal use
aggregated data project and research
terms & conditions that allow scraping
More risky
scraping for personal use even though prohibited by the terms and conditions
scraping data you don't own while logged in
large scale scraping to publish widely promoted "news" reports
Relatively risky
large scale scraping for profit
create a commercial product
scraping large company websites for profit
creating and selling derivative works
scraping personally identifiable data (PII)
Case study: hiQ vs. LinkedIn