I always try to be a good citizen for web scraping; I pull the minimum information I need, close connections once I get the response, insert delays between multiple page views, etc. I always try to put only as much load on a service as a regular user would when web browsing.
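For the curious, here's roughly what that "good citizen" habit looks like in code. This is only a minimal sketch using Python's standard library: the `PoliteFetcher` name, the five-second default delay, and the User-Agent string are all made up for illustration, not taken from any of my actual scripts.

```python
import time
import urllib.request


class PoliteFetcher:
    """Fetch pages one at a time, pausing between requests (illustrative sketch)."""

    def __init__(self, min_delay=5.0, opener=None,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay            # assumed delay; tune per site
        self._opener = opener or self._fetch  # swappable, so it can be tested offline
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    @staticmethod
    def _fetch(url):
        # Hypothetical User-Agent so the site owner can tell who is asking.
        request = urllib.request.Request(
            url, headers={"User-Agent": "order-status-checker/0.1 (hypothetical)"}
        )
        # The with-block closes the response as soon as the body has been read.
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()

    def get(self, url):
        # Wait out whatever is left of the delay since the previous request.
        if self._last_request is not None:
            remaining = self.min_delay - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        try:
            return self._opener(url)
        finally:
            self._last_request = self._clock()
```

Every page view goes through `get()`, so a script can't hammer a site by accident even when it loops over a long list of URLs.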
"What does that have to do with Stargates?"
I really like Stargate. SG-1, Atlantis, or Universe, doesn't matter (except the animated series...I pretend that doesn't exist).
Some people hate it when geeks watch movies and get nitpicky about details. "CAN'T YOU JUST ENJOY THE MOVIE?!"
Not always, no. When I enjoy something, I'm the type of person who enjoys not just the story, but the universe in which it is set; this means learning about the feasibility of that story universe. Oh, sure, there are some rules you have to accept in order for that story to work (such as hand-waved faster-than-light travel, or using lightsabers and not having them vaporize anything too close to the wielder since, you know, REALLY HOT PLASMA...).
One of the key bits of Stargate involves using the gate itself; the dial-home device for Earth's portal was not found with the gate. The gate can, however, be manually "dialed", which is what SG command does...they have a computer control massive motors that set each of the chevrons into a lock position, as well as read diagnostic signals from the gate.
The show handwaves a lot of this process away, but I think it's implied that someone had to program the computer to attempt dialing control and reading (and sending) signals to control the gate. It's a black box; they needed to figure out "If I do X, do I get Y?" and more importantly, "Do I get Y consistently?" (Then maybe figure out what Y means. I mean, you're screwing around with an alien device that connects to other worlds, after all...) I like to think about what it took for that person to approach that black box and coax information out of it in a way that was useful.
Getting information from these websites, designed for human interaction using a web client, is like trying to programmatically poke a stargate. In the process I've discovered that many of our websites are frustrating and inconsistent (I sometimes wonder, when I just want to get a list of text to parse, how many common websites work properly with the devices used by people with poor eyesight, such as braille systems).
For example, I tried looking at a way to query the status of my orders from a frequently used store site. I thought it would be simple...log in and pull the orders page. Nope. If you order too many items, you might have to query another page with more order details. Sometimes order statuses change in unexpected ways. The sort order of your items isn't always consistent, either. And those were the simpler problems I encountered...figuring out consistency in delivery estimate
I tried a similar quick command line checker for a computer parts company. Turned out they had far more order statuses than I thought they did, and alerting me to changes in that order status was an interesting exercise in false alarms when they'd abruptly change from shipped to unknown and back again.
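One way to tame that kind of flapping, sketched here in hindsight rather than copied from what I actually ran, is to alert on a new status only after it shows up in several polls in a row. The `StatusWatcher` name and the `confirmations` threshold are invented for this example.

```python
class StatusWatcher:
    """Report a status change only after it repeats in consecutive polls."""

    def __init__(self, confirmations=3):
        self.confirmations = confirmations  # assumed threshold; tune per site
        self.confirmed = None               # last status we trusted
        self._candidate = None              # status waiting for confirmation
        self._streak = 0

    def observe(self, status):
        """Return the status if it just became confirmed, otherwise None."""
        if status == self.confirmed:
            # Back to the known status: the blip was a false alarm.
            self._candidate = None
            self._streak = 0
            return None
        if status == self._candidate:
            self._streak += 1
        else:
            self._candidate = status
            self._streak = 1
        if self._streak >= self.confirmations:
            self.confirmed = status
            self._candidate = None
            self._streak = 0
            return status
        return None


# A shipped -> unknown -> shipped blip produces no alert.
polls = ["shipped", "shipped", "unknown", "shipped", "delivered", "delivered"]
watcher = StatusWatcher(confirmations=2)
alerts = [s for s in map(watcher.observe, polls) if s]
```

The trade-off is that real changes get reported a poll or two late, which seemed like a fair price for not being paged by a status that changes its mind.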
Another mini-utility I worked on was checking validity of town locations. Pray you never have to work with FIPS...
The website I chose seemed to be fairly consistent in the format of the information. Turns out I was naive about how various towns are designated, and this website was not internally consistent in showing information in a particular order. I got all sorts of interesting but very weird results for different areas around the country.
I'm sure that if I had a dial-home device (in this case, a clear API to the websites or access to an internal database) these lookups would be more straightforward. As it stands, the closest API I can use is the same as anyone with a mouse and keyboard...parsing the web page.
While frustrating at times, I am thankful that these mini-projects have taught me a few things.
- Websites, even ones I've routinely used, are not as internally standardized as I thought. I just hadn't noticed the inconsistencies while clicking through to whatever information I was after.
- I end up rethinking a lot of parsing logic when digging and sorting through human language.
- Websites implement some seemingly convoluted logic for interacting with clients, and I now have a new appreciation for web browsers.
- I also have a new appreciation for the usefulness of a good API. If I start a business and there's anything that can be exposed through API, I'm making it available through an API.