I've been writing a database importer plugin for a Rails application that needs to pull data from some "legacy" production databases (well, not really legacy, but their schemas differ from ActiveRecord conventions), with the intention of scheduling a cron job to run the imports. Why not just connect the Rails app to the legacy databases directly? Hmm, let's see:

  • the records don't have to be up to date (so I can afford to, say, import yesterday's records today),
  • less jumping through hoops molding ActiveRecord models to the legacy databases,
  • the production database schema is liable to change - but this should not affect my Rails application,
  • there will be lower loads on the legacy databases which are in full-blown production use, and
  • most importantly, it gives me an excuse to figure out writing a data importer for a Rails application.

And I was surprised that writing the importer plugin actually turned out to be rather fun (data importing is normally one of the most unexciting things a programmer can do, right next to writing lengthy requirements documentation and any kind of contact sport). It's basically a plugin that defines ActiveRecord models on the source (legacy) databases and then creates our Rails app's models from them. Importer classes then allow me to run the imports using script/runner like so:
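
The shape of the plugin, sketched below with hypothetical model, table and column names (the real schema obviously differs), is a set of source models pointed at the legacy database via establish_connection, plus an importer class that copies rows into the app's own models. This is a sketch, not the actual plugin - it assumes a "legacy" entry in config/database.yml and won't run outside a Rails environment:

```ruby
module Legacy
  class Hotel < ActiveRecord::Base
    # "legacy" would be a separate entry in config/database.yml
    establish_connection :legacy
    set_table_name "HOTEL_MASTER"   # non-conventional table name (hypothetical)
    set_primary_key "hotel_code"    # non-conventional primary key (hypothetical)
  end
end

class HotelsImporter
  # Copy legacy rows updated in the given date range into the app's Hotel model
  def self.import(options = {})
    Legacy::Hotel.find(:all,
      :conditions => ["updated_on BETWEEN ? AND ?", options[:start], options[:end]]
    ).each do |src|
      Hotel.create!(:name => src.hotel_name, :city => src.city_code)
    end
  end
end
```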

script/runner "HotelsImporter.import :start => 2.days.ago.to_date, :end => 1.day.ago.to_date" -e production

Put that in a cron job and there you go, scheduled daily (or hourly, whatever) imports.
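
For the record, the cron side is just a crontab entry along these lines (the app path, schedule and log file here are made up):

```shell
# m h dom mon dow  command -- run the import every day at 02:00
0 2 * * * cd /path/to/app && script/runner "HotelsImporter.import :start => 2.days.ago.to_date, :end => 1.day.ago.to_date" -e production >> log/import.log 2>&1
```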

But I digress. What was I actually going to talk about? Oh yes, checking for duplicate ActiveRecord objects. Now, the importers I wrote were run daily, but there was a risk of re-importing the same data (due to failed cron jobs, running the same job twice, acts of God, etc.). To be defensive, I needed to check that there were no existing records before importing them from the legacy databases.

At this point I could either run uniqueness checks on the natural keys of each table (and Rails makes this really easy with AR validations, as we all know), or go the whole hog with a more convenient field-by-field comparison. I settled on the field-by-field comparison after realizing that:

  • it's easier and I don't have to specify which fields constitute the natural keys, and
  • some tables don't really have a natural key (these generally sit on the has_many side of an association).

Update: As choonkeat pointed out in a comment below, I can simply use Post.find(:all, :conditions => new_post.attributes), which is very clearly the way to do it. This was actually the first way I tried, but it didn't work in the importer - I must have been doing something stupid! Doh! Thanks choonkeat for pointing out my blooper. Anyway, you can mostly ignore what follows, but I'll keep it here to remind myself of my error.
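
One gotcha worth noting (my guess at the "something stupid", not confirmed): for an unsaved record, the attributes hash includes id and any timestamp columns as nil, and a nil in hash conditions translates to IS NULL - so no saved row will ever match and the check silently passes. Filtering those columns out first, in plain Ruby, would look like this (the column list is illustrative):

```ruby
# attributes roughly as ActiveRecord returns them for an unsaved record:
# string keys, with id and timestamps still nil
attributes = { "id" => nil, "title" => "Hello", "body" => "First post",
               "created_at" => nil, "updated_at" => nil }

# drop the columns that can never match an existing row
conditions = attributes.reject do |name, value|
  %w(id created_at updated_at).include?(name)
end

conditions  # => { "title" => "Hello", "body" => "First post" }
```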

So I went looking for an easy way - or a Railism - to check whether a new ActiveRecord object already exists in the database. Hmm, I couldn't find anything helpful - I guess everyone relies on AR validations. Still, I went ahead and mixed a to_conditions instance method into ActiveRecord::Base - it seems my answer to everything nowadays is re-opening existing classes.

module Bezurk #:nodoc:
  module ActiveRecord #:nodoc:
    module Extensions
      # Build a conditions hash from this record's attributes
      def to_conditions
        attributes.inject({}) do |hash, (name, value)|
          hash.merge(name.intern => value)
        end
      end
      alias :to_conditions_hash :to_conditions
    end
  end
end

# ...
ActiveRecord::Base.send(:include, Bezurk::ActiveRecord::Extensions)
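
Outside of Rails, the inject is doing nothing more than symbolizing the keys of the attributes hash so it fits find's :conditions option. A plain-Ruby illustration with a made-up attributes hash:

```ruby
# What to_conditions does to a typical attributes hash:
# turn the string keys into symbol keys, leaving values untouched.
attributes = { "title" => "Hello", "body" => "First post", "view_count" => 0 }

conditions = attributes.inject({}) do |hash, (name, value)|
  hash.merge(name.intern => value)
end

conditions  # => { :title => "Hello", :body => "First post", :view_count => 0 }
```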

So now in my importers I can easily check for potential duplicate entries:

new_post.save! if Post.find(:all, :conditions => new_post.to_conditions).empty?

Now, I just have this nagging suspicion that there is a better way to do this...