Mining the Web: Ruby + Tableau = Data Gems from Raw Content

Mountains of information.

The web is chock full of information. Completely, utterly stuffed full of interesting facts about pretty much everything and anything. Most of the information is not in the traditional record-set form of data.

Tableau and the Web

Tableau is a superb tool for exploring record-set data, and for communicating the interesting insights that can be refined from it.

Happily, Tableau is able to recognize some of the Web's information as data and make it available for examination, inspection and exploration.

Sadly, Tableau's abilities in this area are pretty limited, e.g. copying and pasting the contents of HTML tables—and not any old HTML tables, only those that are "well-formed" in a narrow sense particular to Tableau–see here

Mining the Web

Fortunately, pretty much anything on the Web that's regularly formed can be mined and refined into data that Tableau can understand. Tunneling through the Web and extracting the information gems and processing them into data is much easier than is commonly thought, if the right tools for the job are used.

Enter Ruby

Ruby is a terrific tool, as proficient in its domain as Tableau is in its. Ruby was designed to remove the barriers between programmers and their ideas and goals in very much the same way Tableau was designed to remove the barriers to data visualization.

Accessing Web content with Ruby is simple and easy, and transforming the content into data is similarly straightforward, dependent of course upon the complexity and regularity of the content.

"Show Me," you say?

You want proof. Fair enough.

A while back I was listening to some music and browsing through Rolling Stone's 500 GREATEST SONGS OF ALL TIME for inspiration and reminiscing. Each song has its own page, with lots of interesting information and buttons to the previous and next songs in the list.

Here's song #500: link, and below.

As I was browsing through the songs I started keeping a list of them, copying and pasting their names, rank, and artist into a spreadsheet so that I could use Tableau to organize them and answer questions like: which artists had the most songs in the list; how many songs did the Stones have in it; and so on and so forth, and such like.

I wanted to be able to use Tableau to help me answer these questions, along the lines of:

It didn't take long before I was annoyed with all the copying and pasting. It was a huge pain—boring, mechanical, slow; error prone, and unrewarding. Worse, it detracted from my listening enjoyment.

I needed a better way.

What if, I thought to myself, the information in the individual pages is coded consistently across the pages? If so, it should be possible to dig into them and extract the songs' information and capture it as data in a form that Tableau can recognize.

As it turns out, the individual songs' pages are coded consistently and it was pretty easy to write a Ruby program to do exactly what I needed. Here it is:


require 'rubygems' require 'nokogiri' # Nokogiri in an add-on, installed with: gem install nokogiri require 'open-uri' $root = 'http://www.rollingstone.com' csvHeader = 'Record #,Reference,Rank,Artist,Song,Link' $recordNum = 0 HEADERS_HASH = {"User-Agent" => "Ruby/1.9.3"} topOfList = 'http://www.rollingstone.com/music/lists'\ '/the-500-greatest-songs-of-all-time-20110407'\ '/smokey-robinson-and-the-miracles-shop-around-19691231' def pull_song songPage doc = Nokogiri::HTML(open(songPage, HEADERS_HASH)) reference = doc.xpath("//div[@class=\"listItemDescriptonDiv\"]/h3").text artist, song = split_title reference place = doc.xpath("//span[@class=\"ListItemNumber\"]").text csvRecord = "\"#{$recordNum += 1}\","\ "\"#{reference}\","\ "#{place},"\ "\"#{artist}\","\ "\"#{song}\","\ "#{songPage}" $f.puts csvRecord unless $f.nil? get_next doc end def split_title reference parts = reference.split(/, [\']?/,2) artist = parts[0] if parts[1] =~ /.*'$/ song = parts[1].chop else song = parts[1] end return artist, song end def get_next doc nextNode = doc.xpath("//a[@class=\"listPaginationControls next\"]") nextHref = nextNode.xpath("@href") nextItem = $root + nextHref.text if nextItem == $root || $recordNum > 501 nextItem = nil end return nextItem end puts "Starting with: #{topOfList}" $f = File.open('RSSongs.csv','w') $f.puts csvHeader unless $f.nil? nextItem = pull_song topOfList while nextItem nextItem = pull_song nextItem end $f.close unless $f.nil?

The Ruby - Tableau Partnership

OK, it's not really a partnership. Neither of the actors knows of the other, so it's more like a happy (from my perspective) convergence of functionality, but that makes a lousy section heading.


The common element in this data mining system is the CSV file "RSSongs.csv". The Ruby script creates it. Tableau accesses it and makes its contents available for analysis. There's nothing magical about the name "RSSongs.csv" either, any other would do but I had a bigger plan and sensible naming is always a good policy.

Go ahead – give it a go.

I hereby release the Ruby code into the wild. It's free, as in speech and beer. Take it. Run it and get your own data set of the Rolling Stone 500 GREATEST SONGS OF ALL TIME.

The usual caveats.

RSSongs.rb works but it's definitely not bulletproof and hardened. Running from my home it purrs along perfectly. Running it from my hotel the data before TCC 2012 it grabs a bunch of the songs and then fails, with an error I think has to do with proxies or the specification of a User Agent in the URL open (see here), but I don't have the time to dig into it.

I hope it works for you, but make no guarantees. If you do use it and make improvements I hope that you'll post them back here as comments so I can learn from them, and hopefully other people can benefit from them too.

The Rest Of The Story

Mining these songs was a really interesting exercise, one that freed me from the shackles of a too-rigid conception of what Tableau-analyzable data is. Once I got this little project working I found myself wanting more and richer information to explore and investigate.

There will be more articles in this vein, starting with an examination of this Ruby code and a description of the process of writing it.

No comments:

Post a Comment