wshaffer

I'm working on a project where we're trying to get a bunch of data that has been kept on internal wiki pages into a database, so that it can be searchable, with automatic detection of duplicates and various other stuff.

Part of my contribution to this effort is to get the data off these wiki pages and into CSV files that can be imported into the database. It's a pretty trivial effort if you've got the Ruby gem Nokogiri (which parses HTML and XML files).

Well, it's sort of trivial. So far, about 20% of my time has been spent writing the part of the script that does the real work, and 80% has been spent dealing with oddities caused by unexpected white space, white space that Ruby does not recognize as white space by default (the non-breaking space, &nbsp;), and quirks of people's wiki markup.
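
Here's a minimal sketch of the kind of cleanup that tames this, assuming the troublemaker is the non-breaking space (U+00A0) that HTML writes as &nbsp;. String#strip and \s in a regex only know about ASCII white space, but the POSIX class [[:space:]] matches Unicode white space too. The normalize_space helper and the sample string are made up for illustration.

```ruby
# Hypothetical helper (not from the original script): collapse every run of
# white space, including non-breaking spaces, into a single ordinary space.
def normalize_space(text)
  text.gsub(/[[:space:]]+/, " ").strip
end

cell = "\u00A0 Widget\u00A0Co. \t\n"   # made-up cell text scraped from a wiki table

puts cell.match?(/\A\s/)           # false: \s does not match U+00A0
puts cell.match?(/\A[[:space:]]/)  # true
puts normalize_space(cell)         # Widget Co.
```

Running every scraped cell through something like this before writing the CSV keeps the odd characters out of the database.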

My guess is that this is probably par for the course when web scraping.

Also, I wrote documentation for my homebrew hacky script that probably 2 other people besides me are ever gonna use, because that's how I roll.

When I'm done with this project, I'm considering switching from Ruby to Python. I like working in Ruby, but Python is quite literally what all the cool kids are using, since it seems to be the current language of choice for teaching children to program.

My efforts to teach myself Ruby programming have not advanced much in the past few months, having been displaced by the acquisition of other skills more immediately needed at Ye Daye Jobbe. However, I have a solid skill base in writing scripts that take information in one format, slice it and dice it a bit, and spit it out in another format. This is surprisingly useful.

This morning I was in a meeting where we found ourselves saying, "We have a pile of XML files with a ton of information - if only we could turn them into a CSV file containing that information." I have now written a script that does just that.

Using the Nokogiri gem actually makes it pretty stupidly easy to parse XML/HTML. One of these days I'll set myself a programming challenge in Ruby that does not basically reduce to 1) Find the gem that provides the correct objects and methods to handle your data set. 2) Perform basic string and array manipulations. 3) Profit!

Today is not that day.
I don't remember how I did programming in the days before Google and Stack Overflow. I think I spent a lot of time flipping through books. Or possibly actually memorized language syntax, which seems like a horrible waste of neurons that might be more profitably put to other uses.

However, if you want to get good results out of Google, you've got to know how to search for the right thing. There is a thing that I often want to do in Ruby, where I've got a variable that holds a string, and I want to put the value of that variable into another string. And somehow I always end up searching for something like "ruby string substitution" and getting pages and pages of stuff on the sub method, which is nice but not what I want at all.

So this is a note to myself: the thing you want to do is called "variable interpolation" and it works a little something like this:

thingy = "variable interpolation"
puts "When you want to put a variable into a string, that's called #{thingy}."

Today's little Ruby conundrum that I'm noting for future reference: I've written a script that takes two CSV files, File A and File B, and removes entries from File A if they match up with certain things in File B. The script is invoked at the command line by typing ruby my_script.rb filenameA filenameB.

Having done this, I wanted to ask the user if they wanted to do additional filtering based on an additional criterion, so I added the following.


puts "If you want to do additional filtering, enter a filter string. Otherwise press Enter."
response = gets.chomp!


To my bewilderment, when I ran the script, it blew through that gets statement without pausing for input and proceeded to filter the output in an unexpected way.

I did eventually find the answer on Stack Overflow, but it was obscure enough that I'm blogging here for my own reference. Basically, it turns out that gets doesn't necessarily read from standard input. It reads from ARGF, a stream that reads through the files named in ARGV (the array of command-line arguments) and only falls back to standard input when ARGV is empty. So, my script as written grabbed the first line of File A and filtered based on that.

To make gets grab input from standard input, I should have used $stdin.gets.
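
A minimal sketch of the fixed prompt, under a few assumptions: the ask_for_filter helper and its keyword arguments are hypothetical, and taking the input stream as a parameter is just a convenience so the thing can be tried with canned input. It also uses chomp rather than chomp!, since chomp! returns nil when there is nothing to remove, and it guards against gets returning nil at end of input.

```ruby
require "stringio"

# Hypothetical helper: ask for an optional filter string on an explicit stream.
def ask_for_filter(input: $stdin, output: $stdout)
  output.puts "If you want to do additional filtering, enter a filter string. Otherwise press Enter."
  line = input.gets              # IO#gets on this stream, so ARGF never gets involved
  return nil if line.nil?        # end of input, e.g. Ctrl-D
  response = line.chomp          # chomp (no bang) always returns a string
  response.empty? ? nil : response
end

# Demo with canned input so this runs without anyone typing anything.
canned = StringIO.new("widgets\n")
puts ask_for_filter(input: canned, output: StringIO.new).inspect   # "widgets"
```

In the real script the call would just be filter = ask_for_filter, which reads from $stdin no matter what is sitting in ARGV.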

I assume that this isn't more widely documented because few people mix command-line arguments and interactive input in a single script. In fact, I'm not sure I should really be mixing command-line arguments and interactive input in a single script, but I'm already asking the user to type a hell of a lot at the command-line, and this seems like a reasonable hack to get something useable while I decide on a better way to handle this.

After a bit of research, I found a much better way to do the file name manipulation I was talking about in my previous post.

Basically, it boils down to:

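require "pathname"
# The first command-line argument is the input file, e.g. data/foo.csv
input_file = Pathname.new(ARGV[0])
# Build "foo_counts.csv": base name without extension + suffix + original extension
new_base = input_file.basename(input_file.extname).to_s + "_counts" + input_file.extname
# Put the new file in the same directory as the input file
output_file = input_file.dirname + Pathname.new(new_base)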


I used Pathname instead of File because the documentation suggests that it's more robust at dealing with different file pathing conventions on different OSes.

I'm a little dubious about the dance I had to do there of converting path fragments to strings, concatenating them, and then converting back to a pathname, but trying to concatenate the path fragments directly kept giving me extra / in the path.
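
One possible way to skip the string dance, offered as a sketch rather than a verdict on the code above: Pathname has a sub_ext method that swaps the extension for whatever string you give it, so the suffix and the old extension can go in together. The counts_path helper and the sample path are made up for illustration.

```ruby
require "pathname"

# Hypothetical helper: "data/foo.csv" -> "data/foo_counts.csv"
def counts_path(path)
  input_file = Pathname.new(path)
  # Replace ".csv" with "_counts.csv"; the directory part is left alone.
  input_file.sub_ext("_counts#{input_file.extname}")
end

puts counts_path("data/foo.csv")   # data/foo_counts.csv
```

The result is still a Pathname, so there is no converting back and no stray / to worry about.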
I'm doing a project at work where I've got a bunch of CSV files with several thousands of lines of data. I need to slice and dice this data in various ways, mostly by pulling out subsets of lines with certain strings occurring in them, counting the number of times certain values occur, and so on.
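
To make that concrete, here's a small sketch of both operations using Ruby's csv library. The column names, the sample rows, and the search string are all invented; the real files obviously look different.

```ruby
require "csv"

# Made-up data standing in for one of the CSV files.
data = <<~ROWS
  id,component,status
  1,parser,open
  2,parser,closed
  3,exporter,open
ROWS

rows = CSV.parse(data, headers: true)

# Pull out the subset of rows where any field contains a given string.
needle = "parser"
matching = rows.select { |row| row.fields.any? { |field| field.to_s.include?(needle) } }

# Count how many times each value occurs in one column.
status_counts = Hash.new(0)
rows.each { |row| status_counts[row["status"]] += 1 }

puts matching.map { |row| row["id"] }.inspect   # ["1", "2"]
puts status_counts.inspect
```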

Looking at this data, it became clear that I could either a) become a serious Microsoft Excel power user, or b) put my slowly growing Ruby scripting skills to work. That wasn't much of a contest.

I actually managed to knock together the skeleton of a useful script pretty quickly. Now I'm polishing it up to make it useable and adding a bit of basic error-checking. I encountered two little issues that strike me as the kind of thing that I'm likely to forget about and then encounter again at some point in the future. So, blogging for my own reference, and because it might possibly be useful to some other Ruby newbie.
How do I find the file name extension?
How do I unfreeze my string?
Actually, as I was checking in my most recent changes, it occurred to me that Ruby probably has a class with built-in methods for doing things like handling file name extensions. But reinventing the occasional wheel is educational.
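
The write-ups behind those two questions aren't reproduced here, so what follows is only a sketch of the usual standard-library answers, not necessarily what the original script did. The file names are made up.

```ruby
# File name extension: File.extname returns everything from the last dot on.
puts File.extname("issues_2012.csv")   # .csv
puts File.extname("README").inspect    # ""

# "Unfreezing": a frozen string can't be thawed in place, but dup returns
# an unfrozen copy that can be modified.
frozen = "report".freeze
copy = frozen.dup
copy << ".csv"
puts copy            # report.csv
puts frozen.frozen?  # true
puts copy.frozen?    # false
```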
A couple of months ago, I started learning to program in Ruby. Largely because I somehow stumbled across Brian Marick's book, Everyday Scripting with Ruby, which poses the questions: "Do you spend too much of your working life copying and pasting or otherwise manually manipulating data? Wouldn't you like to be able to get a computer to do that for you?" To which I answered, "Yes!"

So, I've been slowly working my way through Marick's book, and playing around with Ruby Koans, and reading bits of why's (poignant) Guide to Ruby, which has got to be one of the most extraordinary pieces of technical writing ever committed.

And I've written some little toy scripts, but today I wrote my very first properly useful script. To give a bit of background: For FOGcon, we keep the code for our website backend in a source code repository hosted on GitHub. We use GitHub, among other things, to track the bugs and issues that people report in using the database. And one of the things I periodically have to do is get information about issues out of GitHub and pass it on to people who don't have GitHub accounts and so don't have the ability to access our issues tracker directly. Up until today, I have been doing this by clicking around in GitHub and copying and pasting.

But today, I thought, "I should be able to script this." And lo and behold, with a bit of tinkering, I now have a script that logs into GitHub, downloads all the open issues in a given milestone, and prints out the issue number, title, and description for each one.

Interestingly, the breakdown of how I developed the script went something like this:
~30 minutes: Determining that there is a GitHub API, that a suitable Ruby wrapper for the API exists, and researching whether this wrapper can access the information I want.
~1 hour 30 minutes: Installing various necessary Ruby 'gems' (a.k.a. libraries) and dinking around trying to resolve various dependency issues.
15 minutes: Writing and testing the script.

This was pretty much tinker toy programming, in that the Octokit Ruby library that I used did all the hard work. All I had to know was how to do some basic operations on an array, and how to print out the information. The entire script is 13 lines long.

So, yeah, this decision to learn a bit of Ruby scripting seems to be paying off.