GitHub repo

to do

  • How do script args work in Python?
  • Set up rotating user agents
  • Develop a UI that runs locally with Streamlit, and add a .bat file so non-technical users can launch it without typing shell commands
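
On the first item: script arguments in Python are usually handled with the standard-library argparse module. Below is a minimal sketch of what a command-line interface for this scraper might look like. The argument names (start_url, --terms, --delay) are my own assumptions for illustration, not something the current script defines.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments; argv=None means 'read sys.argv'."""
    parser = argparse.ArgumentParser(
        description="Search crawled pages for one or more terms."
    )
    # Positional argument: required, identified by position.
    parser.add_argument("start_url", help="URL to start crawling from")
    # Optional argument that accepts one or more values.
    parser.add_argument(
        "--terms",
        nargs="+",
        default=["OIS", "Office of Information Security"],
        help="terms to look for on each page",
    )
    # Optional argument converted to float by argparse.
    parser.add_argument(
        "--delay",
        type=float,
        default=1.0,
        help="seconds to wait between requests",
    )
    return parser.parse_args(argv)


# Passing an explicit list stands in for what the shell would supply.
args = parse_args(["https://www.washington.edu/", "--delay", "2"])
print(args.start_url, args.terms, args.delay)
```

Saved as a script that calls parse_args() with no argument, this would be run as something like `python scraper.py https://www.washington.edu/ --delay 2`, where scraper.py is a hypothetical filename.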

problem

My colleague asked me to find all mentions of “OIS” and “Office of Information Security” on all UW-related websites outside of UW-IT. I did not know what all of the UW subdomains were, and I did not feel particularly inclined to do manual data entry.

motivation

I had already been intent on learning Python and web scraping tools, so I saw this as the perfect opportunity to practice.

process

  • Heard about BeautifulSoup through a textbook
  • Wasn’t really sure what the difference between Scrapy and BeautifulSoup was, so I went with the solution that looked simpler.
  • Ran into logic errors, which I got a second pair of eyes on.
  • 429 (Too Many Requests) errors
  • Inefficient
    • Sequential processing
    • Redundant requests
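
The 429 and redundant-request problems above can be illustrated with a small sketch. This is not the code I actually wrote; it assumes two hypothetical callables supplied by the caller, fetch(url) returning a (status_code, body) pair and extract_links(body) returning URLs, so the crawl logic stays independent of any HTTP or parsing library. It keeps a set of seen URLs so each page is requested once, and backs off exponentially when the server answers 429. It is still sequential.

```python
import time
from collections import deque


def crawl(start_url, fetch, extract_links,
          max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Breadth-first crawl that queues each URL at most once.

    fetch and extract_links are hypothetical caller-supplied callables.
    """
    seen = {start_url}          # URLs already queued: avoids redundant requests
    queue = deque([start_url])
    pages = {}
    while queue:
        url = queue.popleft()
        for attempt in range(max_retries + 1):
            status, body = fetch(url)
            if status != 429 or attempt == max_retries:
                break
            # Too Many Requests: wait 1x, 2x, 4x, ... base_delay, then retry.
            sleep(base_delay * 2 ** attempt)
        if status != 200:
            continue            # give up on this URL, keep crawling the rest
        pages[url] = body
        for link in extract_links(body):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

In practice fetch might wrap an HTTP client and extract_links might wrap BeautifulSoup; injecting them (and sleep) also makes the loop easy to test without touching the network.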

solution

Scrapy Spider

limitations

  • JavaScript-based pages

takeaways

  • Python getters
  • XPath selectors
  • How to not get banned
  • Scraping etiquette
  • It would have been tragic if I had been banned from the UW website on my own IP address
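
As a concrete version of "how to not get banned", here is a sketch of Scrapy settings that slow a crawl down and make it more polite. The setting names are real Scrapy settings; the values and the user agent string are assumptions to tune per site, not the values I used.

```python
# settings.py (values are illustrative assumptions)

# Respect each site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Wait between requests to the same site, and limit parallelism per domain.
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy adapt the delay to how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0

# Retry throttled or temporarily failing responses instead of dropping them.
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]

# Identify the crawler so site owners can get in touch (placeholder contact).
USER_AGENT = "uw-mention-audit (contact: you@example.edu)"
```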