Friday, July 5, 2013

Nerd alert: scraping TA trail notes from the web

Half of the fun of these long trips is the planning.  I used to think it was bizarre to enjoy the complicated planning and logistics that go into any long trip (hiking or not), but I've since realized that it's just part of my nature as a computer scientist/engineer, and I've embraced it.

One of the things I've been working on lately is gathering the trail notes for the TA into a compact form.  The trust that created and maintains the TA has a fantastic website with a lot of essential information.  One of the most important things available there is the set of trail notes - essentially, directions for the entire 3,000 km route.  They are available on the web as nicely formatted pages like this one.  They are also available for download in PDF format.

This is great!  Unfortunately, if you print out the PDFs, it ends up being close to 250 pages of notes.  Even if you split that up and only carry the notes for one section of the trail at a time, it's way more paper than we want to carry, and it's not as convenient to reference while we're on the move.


So, I set out to consolidate these.  Other people have done this by hand, and posted the result for others to use.  Unfortunately, the notes on the website are updated regularly - trail sections open and close, and the route itself is still changing, e.g. as road walks are replaced with new trails.  This means that, if you want the most up-to-date notes in compact form, you have to wait until just before you leave, and compact them by hand.  Sounds tedious, right?

Computers to the rescue!  I set out to write some code that would pull the data from somewhere, and put it together in a compact, readable form.  Before I describe the tortuous path I took to get there, I'll show you the results: Compact Te Araroa trail notes.  If you don't care about the details of how I got there: stop here!  Just print that spreadsheet and start hiking.  If you want to nerd out a little, keep reading for the gory details on how I did this, including full source code if you want to mess around with it yourself.

The road to that spreadsheet was way longer than I expected it to be when I set out to do this.  I did this all with a Google Apps Script.  It's JavaScript run "in the cloud" (so it must be good, right?), and it presents some pretty decent APIs for working with Google Docs.  I've also managed to avoid using JavaScript very much up until this point, so this was a chance to play around with it.

I started by downloading the PDFs from the TA website, uploading them as Google Documents, and writing some code to parse the text from the documents into a spreadsheet.  Turns out there are a few serious problems with doing it this way:

  • Google will only convert 10 pages of a PDF at once, which means I had to manually split the PDFs up into 10 page segments before uploading.
  • Google converts the PDFs to text using OCR, which is great and magical, but sometimes gets things wrong, and does strange things with formatting (spurious newlines, inconsistent spacing, etc.) - which makes programmatically parsing the resulting docs more difficult.
  • The PDFs themselves are not totally consistent - some sections are named differently from file to file, sometimes the distance for a segment ends up in the "time" section, etc.

I got fairly far into the process before realizing these problems - they are hard to spot without trying to convert everything, then scrolling through to look for weirdness.  I eventually scrapped this approach for something superior: scraping the trail notes web pages directly.  This requires parsing the HTML, but the data is much cleaner, and it removes the manual PDF conversion step.
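
To give a flavor of what "parsing the HTML" involves, here is a minimal sketch, not the actual script.  The helper name htmlToText, the sample markup, and the "Distance:" label format are all made up for illustration; the real trail note pages will be structured differently.  In Apps Script, the page text would typically come from UrlFetchApp.fetch(url).getContentText().

```javascript
// Hypothetical helper: reduce an HTML fragment to compact plain text.
function htmlToText(html) {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, ' ') // drop script/style blocks entirely
    .replace(/<[^>]+>/g, ' ')                        // drop all remaining tags
    .replace(/&nbsp;/g, ' ')                         // decode the two entities used below
    .replace(/&amp;/g, '&')
    .replace(/\s+/g, ' ')                            // collapse runs of whitespace
    .trim();
}

// Made-up sample markup (an assumption, not copied from the real site).
var sample = '<div class="notes"><h2>Example   Track</h2>\n' +
             '<p>Distance:&nbsp;12km</p><p>Time: 4 &amp; a half hours</p></div>';

var text = htmlToText(sample);

// Pull one labeled value out of the flattened text (label format is assumed).
var match = /Distance:\s*([\d.]+)\s*km/i.exec(text);
var distanceKm = match ? parseFloat(match[1]) : null;
```

Regular expressions are a blunt tool for HTML in general, but for pages generated from a single template they are often good enough, and they avoid depending on a parser library.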

So I ended up with a script that crawls the set of trails from each of the 10 major sections, then scrapes the individual trail note pages, and dumps it all to a spreadsheet in a very compact form.  You can see the source for the script here, with some light comments.
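
The overall crawl can be sketched roughly as follows.  This is an illustration under stated assumptions, not the script linked above: the function names (extractLinks, crawl), the URL pattern, and the fake pages are all hypothetical.  The page fetcher is passed in as a parameter so the logic can run anywhere; inside Apps Script it could be something like function (url) { return UrlFetchApp.fetch(url).getContentText(); }.

```javascript
// Collect the unique href values in a page that match a (non-global) pattern.
function extractLinks(html, pattern) {
  var links = [];
  var re = /href="([^"]+)"/g;
  var m;
  while ((m = re.exec(html)) !== null) {
    var url = m[1];
    if (pattern.test(url) && links.indexOf(url) === -1) {
      links.push(url);
    }
  }
  return links;
}

// Section index pages -> trail page URLs -> one record per trail page.
function crawl(sectionUrls, fetchPage, trailPattern) {
  var records = [];
  for (var i = 0; i < sectionUrls.length; i++) {
    var trailUrls = extractLinks(fetchPage(sectionUrls[i]), trailPattern);
    for (var j = 0; j < trailUrls.length; j++) {
      records.push({
        section: sectionUrls[i],
        url: trailUrls[j],
        html: fetchPage(trailUrls[j])
      });
    }
  }
  return records;
}

// A tiny fake "site" (made up) standing in for real HTTP requests.
var fakeSite = {
  '/sections/north': '<a href="/trails/a">A</a> <a href="/about">About</a> ' +
                     '<a href="/trails/b">B</a> <a href="/trails/a">A again</a>',
  '/trails/a': '<p>Notes for A</p>',
  '/trails/b': '<p>Notes for B</p>'
};
function fakeFetch(url) { return fakeSite[url]; }

var records = crawl(['/sections/north'], fakeFetch, /^\/trails\//);
```

Each record still holds raw HTML at this point; turning it into compact rows is a separate step.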

Sadly, actually using that code is a little more complicated than I'd like - there doesn't seem to be a great way to share these scripts.  Google does have some sort of script gallery that makes the code available to everyone, but it requires approval, and I'm still waiting for that.  So, until then, here are the steps if you want to run it yourself:

  1. Create a new Google Spreadsheet, and open it
  2. Click on Tools->Script Editor
  3. If you get a pop-up asking what kind of script you want to create, click "Blank Project" 
  4. Paste the source code into the window, click the "Save" button, and give your project a name (it doesn't matter much what you name it)
  5. Back in your spreadsheet, click Tools->Script Manager.  You should see a single option called 'make_all_notes'.
  6. Click 'Run', and then wait for 5 minutes or so as it populates your spreadsheet.

If you want to tweak the format, check out the dump_rows_() function in the script.  It takes a single map with all the information for a single section in it.  The most relevant spreadsheet API docs are here.
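
As a rough idea of what a function like that has to do, here is a minimal sketch of turning one section's map into spreadsheet rows.  The function name sectionToRows and the keys (name, distance, time, notes) are assumptions for illustration; the real script's map may look quite different.

```javascript
// Hypothetical shape of the per-section map; the real script's keys may differ.
function sectionToRows(section) {
  var rows = [];
  // Summary row: name, distance, time.
  rows.push([section.name, section.distance, section.time]);
  // One row per note, padded to 3 columns so the result stays rectangular.
  for (var i = 0; i < section.notes.length; i++) {
    rows.push(['', section.notes[i], '']);
  }
  return rows;
}

var example = {
  name: 'Example Track',   // made-up data
  distance: '12km',
  time: '4 hours',
  notes: ['Start at the car park.', 'Follow the orange markers.']
};

var rows = sectionToRows(example);

// In Apps Script, the rows could then be written in one call, e.g.:
//   sheet.getRange(startRow, 1, rows.length, 3).setValues(rows);
```

Keeping every row the same length matters because setValues() expects a rectangular two-dimensional array that matches the dimensions of the range.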

The first thing the script does is clear out your spreadsheet, so it's easy to quickly iterate - run it for a few seconds, check the format, stop the script, make some changes, repeat.  If you come up with a prettier or more compact format, please let me know!

Along the way, I learned some awesome things about JavaScript:

  • Semicolons are optional (mostly); you only really need to use them if you want to feel cool.
  • Always, always declare your variables with 'var'.  Apparently IE will complain if you omit it, but everywhere else, unexpected things will happen, because you are essentially creating a global variable.
  • JavaScript pretends to have the 'for(x in y)' construct for iterating over arrays, but it doesn't do what you think it does: it walks the array's property names (the indices, as strings, plus anything else attached to the array), not its elements.  So you're stuck with 'for(var i = 0; i < foo.length; i++)', which, after 2 years of Python development, is amazingly tedious to type.
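
A short sketch of two of these gotchas, using made-up variable names.  It shows what 'for(x in y)' actually hands you when used on an array, and one case where leaving out a semicolon changes what the code means.

```javascript
var trails = ['a', 'b'];
trails.extra = 'oops';        // arrays are objects, so they can carry extra properties

// for...in yields property names (strings), including the non-index one.
var keys = [];
for (var k in trails) {
  keys.push(k);
}
// keys is now ['0', '1', 'extra']

// The index loop visits only the real elements.
var elements = [];
for (var i = 0; i < trails.length; i++) {
  elements.push(trails[i]);
}
// elements is now ['a', 'b']

// Where ES5 array methods are available, forEach is a less tedious option.
var viaForEach = [];
trails.forEach(function (t) { viaForEach.push(t); });

// Semicolons: a line break after 'return' ends the statement right there,
// so this function returns undefined rather than the object.
function broken() {
  return
  { ok: true };
}
var result = broken();        // undefined
```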

I think that about wraps this up.  It took me a few days of hacking around, but I'm pretty happy with what I ended up with - the resulting spreadsheet is 60 pages (so, only 30 double-sided sheets), and has all the info included in the 250 PDF pages.  And, I can update it with the latest notes in a few clicks and 5 minutes!  Now here's hoping that there isn't a site re-design right around the corner for http://www.teararoa.org.nz...

4 comments:

  1. impressive work! thanks a lot for sharing. greets from zermatt

  2. Thanks for this amazing resource. Do you know whether your code still works with the teararoa.org.nz website? I've followed your instructions but the resulting spreadsheet only has two rows. Thanks again either way!

  3. Fixed this up, the new code is here -> https://pastebin.com/ZZr9c6Nv
