Monday, September 19, 2016

Ruby Poltergeist gem: the best way to scrape data

Over the years I have used several different gems to scrape data.  My two favorites are Nokogiri and Mechanize.  Both are very similar, but recently I had a challenge that neither Nokogiri nor Mechanize could handle.





Here's the situation:

I needed to make an HTTP POST, passing basic auth credentials to a login form, then go to another URL and scrape some data.  All of this can be done using the Mechanize gem.  The problem is that after the POST, the site used Auth0 for authentication, which was implemented in JavaScript.  The JavaScript redirects to another URL, looking for the successful login code from Auth0.

THE PROBLEM???

Mechanize and Nokogiri don't execute JavaScript.  The good news is that Poltergeist handles JavaScript, no sweat!  After using Poltergeist once to solve this challenge, it has become my "go to" gem for anything and everything!
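
To make that flow concrete, here is a minimal sketch of a login-then-scrape run, assuming the capybara and poltergeist gems are installed and PhantomJS is on the PATH.  The method names, URLs, field labels, and CSS selectors are all placeholders I made up for illustration; they are not from the real site.

```ruby
# Sketch only: field labels, button text, and selectors are hypothetical.

# Build a headless browser session. The gems are required lazily so the
# scraping logic below can be loaded without PhantomJS installed.
def build_session
  require 'capybara'
  require 'capybara/poltergeist'

  Capybara.register_driver :poltergeist do |app|
    # js_errors: false keeps errors in the site's own scripts from raising in Ruby.
    Capybara::Poltergeist::Driver.new(app, js_errors: false, timeout: 30)
  end
  Capybara::Session.new(:poltergeist)
end

# Log in, wait for the JavaScript redirect to finish, then collect the text
# of each row on a second page. `session` is anything that responds like a
# Capybara::Session.
def login_and_scrape(session, login_url:, data_url:, user:, password:)
  session.visit(login_url)
  session.fill_in('Email', with: user)         # placeholder field label
  session.fill_in('Password', with: password)  # placeholder field label
  session.click_button('Log In')               # placeholder button text

  # has_css? polls until the element appears or Capybara's wait time runs
  # out, which gives the redirect script time to run.
  raise 'Login did not complete' unless session.has_css?('#dashboard')

  session.visit(data_url)
  session.all('table.report tr').map(&:text)   # placeholder selector
end

# Example call (needs the gems and PhantomJS, so it is left commented out):
# rows = login_and_scrape(build_session,
#                         login_url: 'https://example.com/login',
#                         data_url:  'https://example.com/report',
#                         user:      ENV.fetch('SCRAPE_USER'),
#                         password:  ENV.fetch('SCRAPE_PASSWORD'))
# puts rows
```

The step that matters is the wait after the button click: Mechanize would return as soon as the POST response arrives, while a headless browser lets the page's scripts run and follow the redirect before you move on.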

Poltergeist uses PhantomJS to run as a headless browser, and I can still use the awesome Crack gem to parse any JSON or XML.  I can't show you the exact example I was working on, as I am not allowed, but I can show you something similar.
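
On the parsing side, a rough sketch of handing a scraped string to Crack might look like the following, assuming the crack gem is installed.  The helper names detect_format and parse_body are hypothetical, and sniffing the first character is just a simple heuristic for illustration.

```ruby
# Guess whether a string is JSON or XML from its first non-whitespace
# character. A heuristic for illustration only.
def detect_format(body)
  case body.lstrip[0]
  when '{', '[' then :json
  when '<'      then :xml
  else :unknown
  end
end

# Parse the string with Crack. The gem is required lazily, only when a
# body actually needs parsing.
def parse_body(body)
  case detect_format(body)
  when :json
    require 'crack/json'
    Crack::JSON.parse(body)
  when :xml
    require 'crack/xml'
    Crack::XML.parse(body)
  else
    raise ArgumentError, 'body is neither JSON nor XML'
  end
end
```

Either branch hands back plain Ruby hashes and arrays, so the rest of a script does not need to care which format the site returned.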

There is an old web-based game called Hyperiums II.  I honestly don't play the game, but my friend does :-)  This isn't a post about how to cheat the game (although you could).  I want you to fall in love with this gem!  Once you use it, it will become the gem you grab whenever you need to scrape data or have a simple task that you want to automate.  Poltergeist is my secret weapon when doing any web scraping!

Here is a sample Poltergeist script that logs into Hyperiums II and navigates to build factories.  You can modify this code to do almost any small task or test that you need!

Hyperiums II script