Web Scraping

Documentation / Tutorials

Python

    https://instaloader.github.io/index.html

Google Sheets

Why scrape?

Applications

    General: competitive and price analysis, lead generation, keyword research
    Research: scientific and product research, finding / filling a job, government oversight
    Financial: stock analysis, insurance and risk management, news gathering and analysis
Scraping for a car:
    Scrape car-buying websites to find all the Teslas
    Evaluate the prices to find great deals
    Scrape airfares and adjust the deals for airfare expenses
    Send an email digest of the top deal each day
Python libraries: Beautiful Soup, Scrapy, Selenium
Process
    Set a start_url variable
    Download the HTML
    Parse the HTML
    Extract useful information
    Transform or aggregate
    Save the data
    Go to the next URL
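The process above can be sketched as a loop. This is a minimal, runnable illustration: the URLs, page contents, and helper names (fetch_html, extract_listings) are made up, and the "download" step returns canned HTML so the sketch works offline.

```python
import re

def fetch_html(url):
    # A real scraper would use requests.get(url).text; canned
    # pages keep this sketch runnable without a network.
    pages = {
        'https://example.com/page1': '<li>Tesla Model 3 - $35,000</li>',
        'https://example.com/page2': '<li>Tesla Model S - $60,000</li>',
    }
    return pages[url]

def extract_listings(html):
    # Crude regex extraction; later sections use BeautifulSoup instead.
    return re.findall(r'<li>(.*?)</li>', html)

# Set a start_url variable; next_page stands in for "find the next link"
start_url = 'https://example.com/page1'
next_page = {'https://example.com/page1': 'https://example.com/page2'}

results = []
url = start_url
while url:
    html = fetch_html(url)                  # download the HTML
    results.extend(extract_listings(html))  # parse and extract
    url = next_page.get(url)                # go to the next URL

print(results)  # the "save the data" step, simplified to printing
```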
HTTP overview
    Request > response via HTTP(S)
    A request carries a web address (URL), a verb, and a user agent
    GET - retrieves data
    POST - sends data to the server
    The user agent identifies the browser or web scraper
# URL hacking: query strings
# Python URL strings
host = 'www.iseecars.com'
path = '/used-cars/used-tesla-for-sale'
location = '66592'
query_string = f'?Location={location}&Radius=all&Make=Tesla&Model=Model+3'
start_url = f'http://{host}{path}{query_string}'

# Python requests
import requests
start_url = ''
downloaded_page = requests.get(start_url)
print(downloaded_page.text)
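The verb and user agent from the HTTP overview can also be set explicitly. A standard-library sketch (the agent string is made up; the requests library accepts the same information via its headers parameter):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Build the same query string and attach an explicit User-Agent header.
params = {'Location': '66592', 'Radius': 'all',
          'Make': 'Tesla', 'Model': 'Model 3'}
url = 'http://www.iseecars.com/used-cars/used-tesla-for-sale?' + urlencode(params)
request = Request(url, headers={'User-Agent': 'my-car-scraper/1.0'}, method='GET')
print(request.get_method(), request.full_url)
```

Note that urlencode() handles the `Model 3` → `Model+3` escaping that the hand-built query string above did manually.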
HTML & CSS selectors: a CSS selector is a string such as 'title'
    h1 - heading
    li - list item
    ul - unordered list
    ol - ordered list

CSS Selectors

li
title
h1
'#vin3827'                # HTML ID selector
'.auto-listing'           # class selector
'ul li'                   # descendant selector
'ul.listings li#vin3827'  # combined class and ID selectors
# Read HTML from a local file (or download it with requests)
example = open("example.html", "r")
html = example.read()
# html = requests.get(url).text
example.close()

from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
print(soup.prettify())

soup.title
soup.li
soup.find_all('li')
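The CSS selectors listed earlier can be applied directly with soup.select() and soup.select_one(). A small sketch, using a made-up listing snippet:

```python
from bs4 import BeautifulSoup

# Made-up HTML matching the selector examples above.
html = '''
<ul class="listings">
  <li id="vin3827">Tesla Model 3</li>
  <li id="vin4511">Tesla Model S</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')
print(len(soup.select('ul.listings li')))   # both listings match
print(soup.select_one('li#vin3827').text)   # one listing, by ID
```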

XPath

//ul[@class="listings"]/li[@id="vin3827"]
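That XPath expression can be exercised with the standard library's limited XPath support (a real scraper would typically use lxml or Selenium for full XPath). The HTML snippet is made up and must be well-formed XML for ElementTree to parse it:

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed listing markup.
html = ("<html><body><ul class='listings'>"
        "<li id='vin3827'>Tesla Model 3</li>"
        "</ul></body></html>")
root = ET.fromstring(html)
match = root.find(".//ul[@class='listings']/li[@id='vin3827']")
print(match.text)
```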

Legal Risks

Main risk: getting sued for copyright infringement or for scraping private websites.
    1.
    Safe
      1.
      public government websites
      2.
      scraping for personal use
      3.
      aggregated data projects and research
      4.
      terms & conditions that allow scraping
    2.
    More risky
      1.
      scraping for personal use even though it is prohibited by the terms and conditions
      2.
      scraping data you don't own while logged in
      3.
      large-scale scraping to publish widely promoted "news" reports
    3.
    Most risky
      1.
      large-scale scraping for profit
      2.
      creating a commercial product
      3.
      scraping large company websites for profit
      4.
      creating and selling derivative works
      5.
      scraping personally identifiable information (PII)
Case study: hiQ v. LinkedIn

Python

Scraping environment with JupyterLab

Download the page
JupyterLab environment
pandas
# Import packages
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Download and parse the HTML
start_url = 'https://...'

# Download the HTML from start_url
downloaded_html = requests.get(start_url)

# Parse the HTML with BeautifulSoup and create a soup object
soup = BeautifulSoup(downloaded_html.text)

# Save a local copy
with open('downloaded.html', 'w') as file:
    file.write(soup.prettify())
# Setup & Install Packages
# pyenv install 3.7.4
# pyenv local 3.7.4

# pipenv --python 3.7.4
# pipenv install requests
# pipenv install beautifulsoup4
# pipenv install pandas

# pipenv install jupyterlab

# Check Installation
# !pyenv local
# !python -V
# !pip list
# Select the table with the CSS selector 'table.wikitable'

table_head = full_table.select('tr th')

print('----------')
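In the snippet above, full_table would come from a previously downloaded page. A self-contained sketch of the same header-extraction step, using a made-up wikitable-style table:

```python
from bs4 import BeautifulSoup

# Made-up table standing in for a downloaded Wikipedia-style page.
html = '''
<table class="wikitable">
  <tr><th>Make</th><th>Model</th><th>Price</th></tr>
  <tr><td>Tesla</td><td>Model 3</td><td>$35,000</td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
full_table = soup.select_one('table.wikitable')
table_head = full_table.select('tr th')
print([th.text for th in table_head])  # the column names
```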
