You Shall Not Parse

If you’re a web developer who’s responsible for obfuscating the HTML source of web pages, then hear me out: to hell with you.

Part of the obfuscated HTML of The Economist web page, seen in web developer tools

Easier said than done? I know all the arguments: we did it at management’s request, we must eat too, I can’t do any other work than making other people’s lives more miserable, blah blah blah.

This is bullshit. I don’t care for your excuses. They are bullshit. Your work is bullshit and this whole practice of obfuscating every piece of publicly available information is unethical and bullshit, and you are part of it. Go to hell.

If you haven’t closed this page yet, you’re probably asking: “why do you think it’s unethical, you seemingly all-knowing piece of garbage?” I’m happy to explain!

It’s exactly the same as with Free Software. Application developers have an unfair advantage over their users, so users must have a minimum set of rights regarding the software that they run on their computers. In this case you remotely send a piece of code to my computer and try to make my life harder by refusing me any right to inspect and modify the source code of the page. And what’s worse, for no good reason! Maybe you want to make my task of removing ads harder. Or to prevent me from printing the page without these nonsense sidebars? Maybe you were told that this is a security measure[1], which it really isn’t. No matter what, you try to dictate to me how I am supposed to do computing on my machine.

And this, my friend, is unethical. We have never even met, after all, so you’re in no position to tell me what I can or cannot do on my computer.

Unfortunately for you, the language of the web is nothing like the native language of computers, whose unreadability all proprietary software relies on. It’s all plain text (instead of binary), created in better times to be readable by humans. You can’t change it too much, because then web browsers won’t understand it and won’t show it to me. So you’re trying to be clever, but with the constraints imposed by HTML itself, your cleverness can only go so far: you change the names of CSS classes to random strings. Child’s play, which makes searching for the data I’m interested in a little harder, but not even close to impossible.

And thankfully, all your “hard” work is futile, because de-obfuscating your web page for my particular need only took a few minutes. In fact, writing this rant took me longer than changing the parser for your page after you implemented the obfuscation. It’s not that it is hard, because it isn’t. It’s that it is necessary at all.
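To show why renamed classes are only a speed bump, here is a minimal sketch in Python using nothing but the standard library. It is an illustration, not the parser mentioned above: the markup in `SAMPLE` is made up, and the names `ArticleText` and `extract` are hypothetical. The assumption is that the page still wraps its content in ordinary elements such as `<article>`, `<h1>` and `<p>`, so the parser can follow the document structure and ignore the `class` attribute altogether.

```python
from html.parser import HTMLParser

# Made-up markup with randomised class names; not taken from any real site.
SAMPLE = """
<html><body>
<div class="css-1x9fk2a"><span class="css-zz81q">Advertisement</span></div>
<article class="css-8hq0lm">
  <h1 class="css-k3j2d1">Headline goes here</h1>
  <p class="css-p0o9i8">First paragraph.</p>
  <p class="css-u7y6t5">Second paragraph.</p>
</article>
</body></html>
"""


class ArticleText(HTMLParser):
    """Collect the text of <h1> and <p> elements nested inside <article>."""

    def __init__(self):
        super().__init__()
        self.in_article = False
        self.current = None  # tag currently being captured, or None
        self.buffer = []
        self.blocks = []     # list of (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        # attrs (including the obfuscated class) is deliberately ignored.
        if tag == "article":
            self.in_article = True
        elif self.in_article and tag in ("h1", "p") and self.current is None:
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article = False
        elif tag == self.current:
            self.blocks.append((tag, "".join(self.buffer).strip()))
            self.current = None


def extract(html):
    """Return (tag, text) pairs for the headline and paragraphs."""
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    for tag, text in extract(SAMPLE):
        print(tag, text)
```

Because nothing here depends on class names, shuffling them on every deploy changes nothing. A real page will need different structural anchors, but the idea carries over.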

But fear not, foul web developer, I have a solution for you! If you really don’t want me to parse your web page, then… fanfare… don’t send it to me. Next time your web server, which you probably use because it is reliable Free Software, receives a GET request from me, simply respond to it with a 404. Or hide the page behind a paywall, so I won’t be bothered to come back ever again. You won’t capitalise on me and others of my kind anyway.

I fear, however, that one day the majority of humankind will agree on a binary web protocol with built-in obfuscation capabilities. DRM for text. This will be the end of hobby web scraping. The end of the web for me.

  1. Banks often obfuscate their web pages for this reason, I believe. They refuse us the right to any kind of automation other than the braindead ones they developed themselves. It is a real pain in the butt if you want to automate downloading your transaction history.