to do

How do script args work in Python?
Set up rotating user agents
Develop a UI that runs locally with Streamlit, and add a .bat file so non-technical users can launch it without typing shell commands
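
A partial answer to the script-args question, as a minimal sketch using the standard library's argparse. The argument names (`start_url`, `--term`, `--delay`) and the `parse_args` helper are my own placeholders, not anything the scraper actually uses yet.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments; argv=None means 'read sys.argv[1:]'."""
    parser = argparse.ArgumentParser(
        description="Find mentions of search terms across a site."
    )
    # Positional argument: required, identified by its position.
    parser.add_argument("start_url", help="URL where the crawl begins")
    # Optional flag that may be repeated: --term OIS --term "Office of ..."
    parser.add_argument("--term", action="append", default=None,
                        help="search term (repeatable)")
    # Optional flag with a type conversion and a default value.
    parser.add_argument("--delay", type=float, default=1.0,
                        help="seconds to wait between requests")
    args = parser.parse_args(argv)
    if args.term is None:
        args.term = ["OIS"]  # hypothetical default term
    return args


# Passing a list stands in for what the shell would put in sys.argv[1:].
args = parse_args(["https://www.washington.edu", "--term", "OIS", "--delay", "2"])
```

From a shell, the same call would look like `python finder.py https://www.washington.edu --term OIS --delay 2`, where `finder.py` is a hypothetical file name.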

problem

My colleague asked me to find all mentions of “OIS” and “Office of Information Security” on all UW-related websites outside of UW-IT. I did not know what all of the UW subdomains were, and I did not feel particularly inclined to do manual data entry.

motivation

I had already been intent on learning Python and web scraping tools, so I saw this as the perfect opportunity to practice.

process

Heard about BeautifulSoup through a textbook
Wasn’t really sure what the difference between Scrapy and BeautifulSoup was, so I went with the solution that looked simpler.
Ran into logic errors that I got a second pair of eyes on.
429 (Too Many Requests) errors
Inefficient
- Sequential processing
- Redundant requests
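
Two of these problems, the 429s and the redundant requests, can be sketched without any scraping library. This is not my original BeautifulSoup code. The `fetch` callable, `fetch_with_backoff`, and `crawl` are hypothetical names, and `fetch` is assumed to return a `(status, body)` pair. The loop is still sequential, so it does nothing for the third problem.

```python
import time
from urllib.parse import urldefrag


def fetch_with_backoff(fetch, url, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on HTTP 429, wait 1x, 2x, 4x... base_delay and retry."""
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    return status, body  # still 429 after all retries


def crawl(fetch, urls, **backoff_options):
    """Fetch each distinct URL once, treating page#a and page#b as one page."""
    seen = set()
    results = {}
    for url in urls:
        key = urldefrag(url).url  # drop the #fragment before comparing
        if key in seen:
            continue  # skip the redundant request
        seen.add(key)
        results[key] = fetch_with_backoff(fetch, key, **backoff_options)
    return results
```

Passing `sleep` in as a parameter makes the waiting easy to stub out when trying the logic offline.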

solution

Scrapy Spider

limitations

JavaScript-based pages

takeaways

Python getters
XPath selectors
How to not get banned
Scraping etiquette
It would have been tragic if I had been banned from the UW website on my own IP address
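
One piece of scraping etiquette can be sketched with the standard library alone: checking robots.txt before fetching. The bot name, the helper functions, and the sample rules are made up for illustration. Against a live site, `RobotFileParser.set_url()` followed by `read()` would load the real file instead.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "mention-finder-bot"  # hypothetical bot name


def load_rules(robots_lines):
    """Build a parser from the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def polite_delay(parser, default=1.0):
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


# Made-up rules, parsed in memory so no network access is needed.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = load_rules(SAMPLE_ROBOTS.splitlines())
allowed = rules.can_fetch(USER_AGENT, "https://example.edu/public/page")
blocked = rules.can_fetch(USER_AGENT, "https://example.edu/private/page")
```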

bellalee
