to do
- How do script args work in Python?
- Set up rotating user agents
- Develop a Streamlit UI that runs locally, and add a .bat file so non-technical users can run the shell commands easily
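On the first to-do item: a minimal sketch of how script arguments could work using the standard library's argparse module. The argument names (start_url, --keywords, --delay) and the example.edu URL are placeholders I made up for illustration, not part of the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical keyword-search script."""
    parser = argparse.ArgumentParser(
        description="Search web pages for keyword mentions."
    )
    # Positional argument: required, identified by its position.
    parser.add_argument("start_url", help="URL where the crawl starts")
    # Optional argument that accepts one or more values.
    parser.add_argument(
        "--keywords",
        nargs="+",
        default=["OIS", "Office of Information Security"],
        help="phrases to look for",
    )
    # Optional argument converted to float by argparse.
    parser.add_argument(
        "--delay", type=float, default=1.0, help="seconds between requests"
    )
    return parser


# Passing an explicit list makes the behavior easy to try out;
# calling parse_args() with no list reads sys.argv[1:] instead.
args = build_parser().parse_args(
    ["https://www.example.edu", "--keywords", "OIS", "--delay", "2.5"]
)
print(args.start_url, args.keywords, args.delay)
```

From a shell, the same values would be passed after the script name, with each space-separated token becoming one list item.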
problem
My colleague asked me to find all mentions of “OIS” and “Office of Information Security” on all UW-related websites outside of UW-IT. I did not know what all of the UW subdomains were, and I did not feel particularly inclined to do manual data entry.
motivation
I had already been intent on learning Python and web scraping tools, so I saw this as the perfect opportunity to practice.
process
- Heard about BeautifulSoup through a textbook
- Wasn’t really sure what the difference between Scrapy and BeautifulSoup was, so I went with the one that looked simpler (BeautifulSoup).
- Ran into logic errors and got a second pair of eyes on them.
- 429 (Too Many Requests) errors
- Inefficient:
  - Sequential processing
  - Redundant requests
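Two of the problems above, the 429 errors and the redundant requests, can be sketched with the standard library alone. This is an illustration rather than the code I actually wrote: the helper names (normalize, dedupe, retry_delay) and the example URLs are hypothetical, and the Retry-After handling only covers the seconds form of that header, not the HTTP-date form.

```python
from typing import Iterable, List, Optional
from urllib.parse import urldefrag


def normalize(url: str) -> str:
    """Drop the #fragment and trailing slash so equivalent URLs compare equal."""
    without_fragment, _fragment = urldefrag(url)
    return without_fragment.rstrip("/")


def dedupe(urls: Iterable[str]) -> List[str]:
    """Return each distinct page once, preserving first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def retry_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Seconds to wait after a 429 before retrying.

    Honors a numeric Retry-After header if the server sent one;
    otherwise falls back to exponential backoff (base * 2**attempt).
    """
    if retry_after is not None and retry_after.strip().isdigit():
        return min(float(retry_after), cap)
    return min(base * 2 ** attempt, cap)


queue = dedupe([
    "https://a.example.edu/x/",
    "https://a.example.edu/x#top",
    "https://a.example.edu/y",
])
print(queue)
print([retry_delay(n) for n in range(4)])
```

A fetch loop would sleep for retry_delay(attempt, header_value) whenever a response comes back with status 429, and would only request URLs that survive dedupe.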
solution
Scrapy Spider
limitations
- JavaScript-based pages
takeaways
- Python getters
- XPath selectors
- How to not get banned
- Scraping etiquette
- It would have been tragic if I had been banned from the UW website on my own IP address
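One piece of scraping etiquette that is easy to show in code is checking robots.txt before fetching a page. Below is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content, the user agent string, and the URLs are all invented for the example; a real crawler would download the site's own robots.txt instead of parsing a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, parsed offline for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    return robots


robots = build_robots(SAMPLE_ROBOTS_TXT)
user_agent = "my-research-bot"  # hypothetical user agent

# can_fetch() says whether this user agent may request the URL.
print(robots.can_fetch(user_agent, "https://www.example.edu/about"))
print(robots.can_fetch(user_agent, "https://www.example.edu/private/notes"))

# crawl_delay() returns the requested pause in seconds, or None if unset.
print(robots.crawl_delay(user_agent))
```

Sleeping for at least the crawl delay between requests, and skipping anything can_fetch() rejects, goes a long way toward not getting banned.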