NoSleepCreative Wiki
Webscraping


Last updated 1 year ago


Documentation / Tutorials

Python

  • https://instaloader.github.io/index.html

Google sheets

Why scrape?

Application

  • General: Competitive and price analysis, Lead generation, Keyword research

  • Research: Science, Product research, Finding / filling a job, Government oversight

  • Financial: Stock analysis, Insurance and risk management, News gathering and analysis

Example: scraping for a car

  • Scrape the car-buying websites to find all the Teslas

  • Evaluate the prices to find great deals

  • Scrape airfares and adjust the deals for airfare expenses

  • Send an email digest of the top deal each day
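
The car-shopping workflow above could be sketched roughly as follows. The listings, prices, and airfares are made-up placeholder data, and `rank_deals` and `build_digest` are hypothetical helper names. The sketch only builds the email message; it does not send it.

```python
from email.message import EmailMessage

# Hypothetical scraped listings: asking price plus the airfare needed to pick the car up.
listings = [
    {"title": "Model 3 (Topeka)", "price": 38000, "airfare": 0},
    {"title": "Model 3 (Denver)", "price": 36500, "airfare": 250},
    {"title": "Model 3 (Miami)", "price": 36900, "airfare": 600},
]


def rank_deals(listings):
    """Sort listings by total cost (price plus airfare), cheapest first."""
    return sorted(listings, key=lambda car: car["price"] + car["airfare"])


def build_digest(listings, recipient):
    """Build (but do not send) an email describing the top deal."""
    best = rank_deals(listings)[0]
    total = best["price"] + best["airfare"]
    message = EmailMessage()
    message["To"] = recipient
    message["Subject"] = f"Top deal today: {best['title']}"
    message.set_content(f"{best['title']}: ${total:,} including airfare")
    return message


digest = build_digest(listings, "me@example.com")
print(digest["Subject"])
```

Actually sending the digest would typically go through `smtplib` with your own mail server details, which are left out here.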

Python libraries

  • Beautiful Soup

  • Scrapy

  • Selenium

Process

  • Set a start_url variable

  • Download the HTML

  • Parse the HTML

  • Extract useful information

  • Transform or aggregate

  • Save the data

  • Go to the next URL
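
The process steps above can be sketched as a small loop. This is a minimal illustration, not a full crawler: `FAKE_SITE`, `PageParser`, and `crawl` are hypothetical names, the pages live in a dictionary so the sketch runs without network access, and the standard library's `html.parser` stands in for Beautiful Soup.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "/page1": '<h1>Page 1</h1><a href="/page2">next</a>',
    "/page2": "<h1>Page 2</h1>",
}


class PageParser(HTMLParser):
    """Collect the h1 text and the first link found on the page."""

    def __init__(self):
        super().__init__()
        self.heading = ""
        self.next_url = None
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_heading = True
        elif tag == "a" and self.next_url is None:
            self.next_url = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.heading += data


def crawl(start_url, download):
    """Follow the first link on each page until there is none left."""
    results, seen, url = [], set(), start_url
    while url and url not in seen:
        seen.add(url)
        html = download(url)              # download the HTML
        parser = PageParser()
        parser.feed(html)                 # parse the HTML
        results.append({"url": url, "heading": parser.heading.strip()})  # extract and transform
        url = parser.next_url             # go to the next URL
    return results


rows = crawl("/page1", FAKE_SITE.__getitem__)

# Save the data (to an in-memory CSV here; a real run would open a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["url", "heading"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

In a real run, `download` would be something like `lambda url: requests.get(url).text`.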

HTTP overview

  • Request > response via HTTPS

  • A request consists of a web address, a verb, and a user agent

  • GET retrieves data

  • POST sends data to the server

  • The user agent identifies the browser or web scraper
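
To make the verb and user agent concrete, here is a small sketch that builds a GET and a POST request with the standard library without sending them. The `example.com` URLs and the `my-scraper/0.1` user agent string are placeholders, not values from these notes.

```python
from urllib.parse import urlencode
from urllib.request import Request

USER_AGENT = "my-scraper/0.1"  # hypothetical user agent string

# GET: retrieve data. No body is attached.
get_request = Request(
    "https://example.com/listings",
    headers={"User-Agent": USER_AGENT},
)

# POST: send data to the server as a form-encoded body.
post_request = Request(
    "https://example.com/search",
    data=urlencode({"make": "Tesla"}).encode("utf-8"),
    headers={"User-Agent": USER_AGENT},
)

for request in (get_request, post_request):
    print(request.get_method(), request.full_url, request.get_header("User-agent"))
```

Passing either object to `urllib.request.urlopen` would send it. `requests.get` and `requests.post` are the corresponding calls in the `requests` library.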

# URL hacking: building the query string
# Python URL strings
host = 'www.iseecars.com'
path = '/used-cars/used-tesla-for-sale'
location = '66592'
query_string = f'#Location={location}&Radius=all&Make=Tesla&Model=Model+3'
start_url = f'http://{host}{path}{query_string}'
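
As an alternative to typing the parameter string by hand, `urllib.parse.urlencode` can assemble it from a dictionary. This is only a sketch: the `params` dictionary is my own naming, and the `#` separator simply follows the notes above.

```python
from urllib.parse import urlencode

host = "www.iseecars.com"
path = "/used-cars/used-tesla-for-sale"
params = {"Location": "66592", "Radius": "all", "Make": "Tesla", "Model": "Model 3"}

# urlencode joins the pairs with "&" and turns the space in "Model 3" into "+".
encoded = urlencode(params)
start_url = f"http://{host}{path}#{encoded}"
print(start_url)
```

Changing a value in `params` (for example the location) then regenerates the whole URL without manual string editing.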

# Python requests
import requests
start_url = ''
downloaded_page = requests.get(start_url)
print(downloaded_page.text)

HTML & CSS Selectors

  • css => 'title', h1

  • li: list item

  • ul: unordered list

  • ol: ordered list

CSS Selectors

li
title
h1
'#vin3827'        // HTML ID
'.auto-listing'   // class

ul li             // descendant: every li inside a ul
// class and ID selectors combined
'ul.listings li#vin3827'
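# Read the HTML from a local file
example = open("example.html", "r")
html = example.read()
# html = requests.get(url).text
example.close()

from bs4 import BeautifulSoup
# Name the parser explicitly to avoid a warning
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())

soup.title            # the <title> tag
soup.li               # the first <li> tag
soup.find_all('li')   # every <li> tag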
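
The selectors listed above can be tried out with Beautiful Soup's `select` and `select_one`. The HTML here is invented for illustration: it reuses the `vin3827`, `listings`, and `auto-listing` names from the notes, while `vin9911` and the text content are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML reusing the ids and classes from the selector list above.
html = """
<html><head><title>Listings</title></head>
<body>
  <h1>Used Teslas</h1>
  <ul class="listings">
    <li id="vin3827" class="auto-listing">Model 3</li>
    <li id="vin9911" class="auto-listing">Model S</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.select_one("title").get_text())                   # tag selector
print(soup.select_one("#vin3827").get_text())                # ID selector
print([li["id"] for li in soup.select(".auto-listing")])     # class selector
print([li.get_text() for li in soup.select("ul li")])        # descendant selector
print(soup.select_one("ul.listings li#vin3827").get_text())  # combined
```

`select` always returns a list, while `select_one` returns the first match or `None`.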

XPath

//ul[@class="listings"]/li[@id="vin3827"]
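
A quick way to test an expression like this is the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. The markup below is hypothetical and has to be well-formed XML for this parser to accept it; `vin9911` is an invented second entry.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; ElementTree needs valid XML, unlike a browser.
document = ET.fromstring(
    """
<html>
  <body>
    <ul class="listings">
      <li id="vin3827">Model 3</li>
      <li id="vin9911">Model S</li>
    </ul>
  </body>
</html>
"""
)

# ElementTree paths are relative to an element, so the leading "//" becomes ".//".
matches = document.findall(".//ul[@class='listings']/li[@id='vin3827']")
print([li.text for li in matches])
```

For real-world HTML and full XPath support, a third-party parser such as `lxml` is the more usual choice.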

Legal Risks

The main risk is getting sued, for example for copyright infringement or for scraping private websites.

  1. Safe

    1. public government website

    2. scraping for personal use

    3. aggregated data project and research

    4. terms and conditions that allow scraping

  2. More risky

    1. scraping for personal use even though prohibited by the terms and conditions

    2. scraping data you don't own while logged in

    3. large scale scraping to publish widely promoted "news" reports

  3. Relatively risky

    1. large scale scraping for profit

    2. create a commercial product

    3. scraping large company websites for profit

    4. creating and selling derivative works

    5. scraping personally identifiable data (PII)

case-study: hiQ vs LinkedIn

Python

Scraping environment with JupyterLab

# Import packages
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Download and parse the HTML
start_url = 'https://...'

# Download the HTML from start_url
downloaded_html = requests.get(start_url)

# Parse the HTML with BeautifulSoup and create a soup object
soup = BeautifulSoup(downloaded_html.text, 'html.parser')

# Save a local copy
with open('downloaded.html', 'w') as file:
    file.write(soup.prettify())
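
Saving a local copy pays off when later runs read it back instead of downloading again. A minimal sketch of that idea follows; `load_or_download` and `fake_download` are hypothetical names, and the fake download stands in for `requests.get(start_url).text` so the sketch runs offline.

```python
import tempfile
from pathlib import Path


def load_or_download(path, download):
    """Return the cached HTML at `path`, calling `download()` only if it is missing."""
    path = Path(path)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = download()
    path.write_text(html, encoding="utf-8")
    return html


calls = []


def fake_download():
    # Hypothetical stand-in for requests.get(start_url).text
    calls.append(1)
    return "<html><body>cached page</body></html>"


with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "downloaded.html"
    first = load_or_download(target, fake_download)   # downloads and saves
    second = load_or_download(target, fake_download)  # reads the saved copy

print(len(calls))
```

Working from the cached file while developing selectors also keeps the load on the target site low.
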

# Setup & Install Packages
# pyenv install 3.7.4
# pyenv local 3.7.4

# pipenv --python 3.7.4
# pipenv install requests
# pipenv install beautifulsoup4
# pipenv install pandas

# pipenv install jupyterlab

# Check Installation
# !pyenv local
# !python -V
# !pip list
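
# Select table.wikitable
# Assumed: this line is missing from the notes; 'wikitable' is inferred from the comment above
full_table = soup.select_one('table.wikitable')

table_head = full_table.select('tr th')

print('----------')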
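
The table snippet above is only a fragment, so here is a hedged, self-contained sketch of how the header cells and rows might be pulled out. The table markup is invented, and the `wikitable` class is an assumption based on the comment in the notes.

```python
from bs4 import BeautifulSoup

# Hypothetical table markup; the "wikitable" class is an assumption.
html = """
<table class="wikitable">
  <tr><th>Model</th><th>Year</th></tr>
  <tr><td>Model 3</td><td>2017</td></tr>
  <tr><td>Model Y</td><td>2020</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
full_table = soup.select_one("table.wikitable")

# Header cells become column names; each remaining row becomes a dict.
table_head = full_table.select("tr th")
columns = [th.get_text(strip=True) for th in table_head]
rows = [
    dict(zip(columns, (td.get_text(strip=True) for td in tr.select("td"))))
    for tr in full_table.select("tr")
    if tr.select("td")
]
print(columns)
print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)` for further analysis.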

XPath Syntax
Import XML Guide
Monitor YouTube and Instagram data