File this one under sysadmin tricks for journalists.
Tuesday the JFK files — a gaggle of PDFs (that's what a bunch of PDFs are called) — were supposed to land. Supposedly in the afternoon. And maybe they would land at https://www.archives.gov/research/jfk/jfkbulkdownload.
The goal: Get an alert when that URL updates.
The enemy here is refreshing a URL by hand until you see the change you wish to see on the page. In this little article I'll show you how to use shell scripts to defeat that enemy and make your life easier. Note that this won't help you if you're on a Windows machine; I'm on a Macintosh, and this procedure is aimed at Macs and maybe Linux boxes.
There are a few ways to do this, and I'll go through two slightly different ones. Both assume you're on a Macintosh and that you're within earshot of the computer.
Method 1: Look for changes in the HTML document
This approach will alert you if anything in an HTML document changes. It's a little bit of a sledgehammer and can produce false positives, but this is the essence of it: Save a representation of the contents of the HTML document once, then every ten seconds re-download the document and see if anything's changed.
How you do it
- Download the web page with curl.
- Hash the output (i.e. algorithmically convert it) with md5 and save that as your "base" copy (the snippet after this list shows that step on its own).
- Then do that again and again and compare the md5 hashes; if any subsequent hash you take doesn't match the base, you know something on the web page has changed.
- When you ascertain something's changed, speak out an alert (in this case it’s 'oh boy oh boy oh boy').
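If it helps to see those first two steps in isolation, this is all the "base" copy is: download the page with curl, hash it with md5, and stash the hash in a file called base. (The filename is just the one the full one-liner below uses.)

curl --silent https://www.archives.gov/research/jfk/jfkbulkdownload | md5 > base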
The code you use to monitor a web page for changes
This code will do all the steps above, monitoring a URL and alerting you when something changes. It's good for situations where you're expecting a change in the near future; the alert only works if you're in the same room as your computer, though it could probably be adjusted for other situations.
To use this code, copy it and edit the URL so it pulls from the URL you want to monitor, then paste it into your terminal program and press return.
curl --silent https://www.archives.gov/research/jfk/jfkbulkdownload | md5 > base; watch -n 10 'curl --silent https://www.archives.gov/research/jfk/jfkbulkdownload | md5 > new; if ! cmp base new > /dev/null; then say "oh boy oh boy oh boy"; fi'
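If the one-liner is hard to read or tweak, here's the same logic written out as a plain loop, as a sketch using the same ten-second interval and the same base/new filenames. It also sidesteps watch, which doesn't ship with macOS by default. Press control-C to stop it.

# Save the base copy once, then re-check every ten seconds.
curl --silent https://www.archives.gov/research/jfk/jfkbulkdownload | md5 > base
while true; do
  curl --silent https://www.archives.gov/research/jfk/jfkbulkdownload | md5 > new
  # cmp is silent when the hashes match; any difference means the page changed.
  if ! cmp base new > /dev/null; then
    say "oh boy oh boy oh boy"
  fi
  sleep 10
done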
As they say, there is another way.
Method 2: Look for changes in the HTTP request headers
Every response on the web has two parts: The actual document, and the meta information (i.e. "headers") attached to that document.
For example, this is what the headers on the JFK files page look like:
HTTP/2 200
content-type: text/html; charset=utf-8
content-length: 34220
date: Wed, 19 Mar 2025 22:20:58 GMT
content-language: en
last-modified: Wed, 19 Mar 2025 19:48:23 GMT
x-content-type-options: nosniff
etag: W/"1742413703-0-gzip"
v-ttl: 77245
cache-control: public, max-age=60, s-maxage=86400
v-cache-ttl: 77245
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-frame-options: SAMEORIGIN
accept-ranges: bytes
vary: Cookie,Accept-Encoding
x-cache: Miss from cloudfront
via: 1.1 801c4cdd177872a11b03f54a2b3b464e.cloudfront.net (CloudFront)
x-amz-cf-pop: SFO53-C1
x-amz-cf-id: 2d4HH0RiR3c8Y2nd553ocDfimjWP-bJexT_6uYtKHZRgitblsRVeHw==
Most of that is gobbledegook. But there is one piece we care about: That last-modified timestamp. Not every website will reliably use this, but if your target is one that does, here's how to monitor and alert for changes in the last-modified timestamp.
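To peek at just that header yourself, ask curl for the headers only and grep for the line you care about; this is the same curl --head plus grep pairing the monitoring one-liner below is built from.

curl --silent --head https://www.archives.gov/research/jfk/jfkbulkdownload | grep modified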
The how-to is pretty much the same as above, and the code is similar but not identical.
curl --silent --head https://www.archives.gov/research/jfk/jfkbulkdownload | grep modified | md5 > base; watch -n 10 'curl --silent --head https://www.archives.gov/research/jfk/jfkbulkdownload | grep modified | md5 > new; if ! cmp base new > /dev/null; then say "oh boy oh boy oh boy"; fi'
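Because the last-modified line is a single short string, you can also skip the hashing and compare the strings directly. Here's a sketch of that variation as a plain loop; the base and new variable names are mine, and if you're on one of those Linux boxes without say, swap in a terminal bell (printf '\a') or whatever notifier you like.

# Grab the last-modified line once as the baseline.
base=$(curl --silent --head https://www.archives.gov/research/jfk/jfkbulkdownload | grep modified)
while true; do
  new=$(curl --silent --head https://www.archives.gov/research/jfk/jfkbulkdownload | grep modified)
  # Any difference in the last-modified line means the page was republished.
  if [ "$new" != "$base" ]; then
    say "oh boy oh boy oh boy"
  fi
  sleep 10
done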
You can email me at joe.murphy@gmail.com.