Read This If You Are Using a Script to Pull Data From This Site

Published: 2017-05-10
Last Updated: 2017-05-10 14:05:53 UTC
by Johannes Ullrich (Version: 1)
2 comment(s)

I love it when people write tools to pull data from this site, and we try to accommodate automated tools like this with our API. But sometimes scripts go bad, and we keep having cases where scripts pull the same data several times a second. I would love to let the owner of the script know, but often this is hard.

To prevent some of these issues, I am going to enforce a new rule going forward: Your User-Agent has to include a contact for the script. I prefer a simple e-mail address. A URL will do if that is easier for you. The data will exclusively be used to contact you in case of a problem.
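For a Python script, a minimal sketch of this could look like the following. It uses only the standard library's urllib.request. The script name, the contact address, the "(contact: ...)" layout, and the example.org URL are all placeholders of mine and not a required format; any User-Agent that clearly includes an e-mail address or URL should serve the same purpose.

```python
import urllib.request

# Hypothetical values: replace with your own script name and contact.
SCRIPT_NAME = "my-data-puller/0.1"
CONTACT = "you@example.com"


def build_user_agent(script_name: str, contact: str) -> str:
    """Return a User-Agent string that embeds a contact e-mail or URL."""
    contact = contact.strip()
    if not contact:
        raise ValueError("a contact e-mail address or URL is required")
    return f"{script_name} (contact: {contact})"


def build_request(url: str,
                  script_name: str = SCRIPT_NAME,
                  contact: str = CONTACT) -> urllib.request.Request:
    """Create a request whose User-Agent header carries the contact."""
    headers = {"User-Agent": build_user_agent(script_name, contact)}
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    # Placeholder URL; point this at the page or API call you actually use.
    request = build_request("https://example.org/")
    print(request.get_header("User-agent"))
```

The request object can then be passed to urllib.request.urlopen() as usual. Other HTTP clients have an equivalent way to set the header.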

To enforce this, generic User-Agents (like "Python-urllib/2.7", "Wget/1.12 (linux-gnu)", "curl/7.38.0") will be blocked. I will start doing so with older pages that should no longer be used by automated scripts anyway (as they are not designed for automation like our API), and initially only block specific User-Agents.

If you hit the page with a blocked User-Agent, the server will return a "403" (Forbidden) error [1] along with a simple text message pointing to this post.
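A script can treat the 403 as a signal to stop and show the message rather than retry in a loop. The sketch below is one way to do that with Python's standard library. The fetch() helper and its opener parameter are hypothetical names of mine, added so the function can be exercised without network access; the exact text of the server's message is not assumed.

```python
import urllib.error
import urllib.request


def fetch(url, user_agent, opener=urllib.request.urlopen):
    """Fetch url and return (status, body_text).

    On a 403 the server's text message is returned instead of raising,
    so the caller can show it and stop instead of retrying.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(request, timeout=30) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, body
    except urllib.error.HTTPError as err:
        if err.code == 403:
            # Blocked: retrying with the same User-Agent will not help.
            return 403, err.read().decode("utf-8", errors="replace")
        raise  # other HTTP errors are left to the caller


if __name__ == "__main__":
    # Placeholder URL and contact; replace with your own.
    status, body = fetch("https://example.org/",
                         "my-data-puller/0.1 (contact: you@example.com)")
    if status == 403:
        print("Blocked by the server:", body)
```

Returning the message instead of retrying keeps a misconfigured script from hammering the site while it is blocked.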

[1] https://tools.ietf.org/html/rfc7231#section-6.5.3

---
Johannes B. Ullrich, Ph.D., Dean of Research, SANS Technology Institute

Keywords: api rate limit

Comments

I recently decided to take a similar approach to SSH bots. If you're not using openssh or putty, it's getting blocked by my firewall.
Being someone who just wrote a script to use the API, I have done as requested and put email in. Since it was a learning script, that was a new thing to learn!
