Refactoring Ruby programmatically.

Published on February 13, 2018.

When you join a new team, one of the first things you’ll figure out is their preferred coding style. They probably have a linter like rubocop or flake8 so that style arguments can be delegated to computers, the supreme pedants. Sometimes though, you’ll find reasons to change a repository’s coding style or to merge in another codebase with different style choices.

At a certain scale, you’ll probably just fix things by hand, but for projects that span thousands of files, no amount of caffeine can mask the pain.

Linting is just one example of a broader category of problems: how do you refactor large codebases? Although a common answer is simply not to refactor at scale, that tends to cause codebases to degrade rapidly over time. It can be better!

Most languages come with libraries to represent source code as s-expressions, which you can then modify in new ways to generate modified source code. For Ruby, two libraries that do just that are:

ruby_parser which parses a Ruby file and outputs its s-expressions,
ruby2ruby which translates s-expressions into Ruby code.

Let’s try them out. (Full source code is available on GitHub.)

Imagine we have a method incr which used to require two parameters, but most invocations incremented by 1, so we added 1 as the default value for the second parameter. Now we want to rewrite all calls to incr to only pass a second value if it is different from the default.

So we could imagine some code looking like this:

def incr(x, i = 1)
  x + i
end
incr(5, 100)
incr(3, 1)
incr(10, 1)
incr(17, 17)

That we want to rewrite to look like this:

def incr(x, i = 1)
  x + i
end
incr(5, 100)
incr(3)
incr(10)
incr(17, 17)

The interesting parts of our code will be in a rewrite function, so first let’s write the scaffolding that function will live within, and then rush to the fun parts.

The scaffolding is thankfully pretty short: require a few libraries, parse stdin into s-expressions, pass them through rewrite, translate the result back into Ruby code, and print it to stdout.

require 'ruby_parser'
require 'ruby2ruby'

def rewrite(expr)
  expr
end

parsed = RubyParser.new.parse(ARGF.read)
puts Ruby2Ruby.new.process rewrite(parsed)

Assuming you have the example input above in a file named refactor.input and you name this file refactor.rb, then you can run it using:

ruby refactor.rb < refactor.input

This is actually pretty cool, because we’re taking in some code, parsing it, and then recombining it, but the really fun part is what comes next: modifying it!

Astute readers will notice the output version has some extraneous parentheses. I’m skimming over that because it’s equivalent Ruby code, but it’s a bit annoying, and perhaps an astute reader will propose a non-regex based solution.

The rewrite function gets called on the top-level s-expression produced by ruby_parser, from which you can recursively explore all the program’s s-expressions. To explore the structure of individual s-expressions a bit, consider the input:

incr(3, 1)

Which is represented by a Ruby object whose structure is:

s(:call, nil, :incr, s(:lit, 3), s(:lit, 1))

In order, these values are:

:call is the kind of s-expression, used for invoking a function (some other common kinds are :block, :lasgn and :defn),
the second value, nil, is the receiver of the call; it is nil here because incr is invoked without an explicit receiver,
:incr is the name of the function invoked,
the remaining values are the arguments passed to the invoked function.
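To make that layout concrete, here is a small sketch that names each position of a :call node. It uses plain nested arrays as a stand-in for Sexp objects (an assumption, so that it runs without the gems installed), and describe_call is a hypothetical helper, not part of ruby_parser.

```ruby
# Plain arrays stand in for Sexp objects here (an assumption for illustration).
# describe_call is a hypothetical helper, not part of ruby_parser.
def describe_call(node)
  kind, receiver, name, *args = node
  return nil unless kind == :call

  { receiver: receiver, name: name, args: args }
end

# incr(3, 1) as a :call node: no receiver, two literal arguments.
node = [:call, nil, :incr, [:lit, 3], [:lit, 1]]
info = describe_call(node)
puts info[:name].inspect  # :incr
puts info[:args].inspect  # [[:lit, 3], [:lit, 1]]
```

Destructuring the node this way tends to read better than indexing with expr[2] and expr[4] once the conditions grow.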

Reminding ourselves of our original problem statement: can we remove the second parameter of calls to the :incr function if they specify the same value as the default parameter? Yup, we now know enough to write that function:

def rewrite(expr)
  if expr.is_a? Sexp
    if expr[0] == :call &&
       expr[2] == :incr &&
       expr.size == 5 &&
       expr[4][0] == :lit &&
       expr[4][1] == 1

      # remove the second parameter
      expr.pop
    end

    # descend into children
    expr.each { |x| rewrite(x) }
  end
  expr
end

There are three interesting parts here:

We should only rewrite objects that are s-expressions!
If we’re calling incr and the second parameter is the new default parameter, a lit of value 1, then we should remove it.
Recursively descend into the contents of each s-expression. Otherwise you’ll only see the top-level :block s-expression which is pretty boring.
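The same three steps can also be sketched in a form that returns a new tree instead of mutating the input, which makes the rewrite easier to check in isolation. This is a variation on the function above, not a replacement for it: drop_default_arg is a hypothetical name, the method name and default value are passed in as parameters, and plain nested arrays stand in for Sexp objects so the sketch runs without the gems.

```ruby
# Plain arrays stand in for Sexp objects here (an assumption for illustration).
# drop_default_arg is a hypothetical helper that returns a new tree.
def drop_default_arg(node, method_name, default)
  # 1. Only rewrite things that look like s-expressions.
  return node unless node.is_a?(Array)

  # 3. Recursively rewrite the children first.
  node = node.map { |child| drop_default_arg(child, method_name, default) }

  # 2. Drop the second argument when it is the default literal.
  kind, _receiver, name, *args = node
  if kind == :call && name == method_name &&
     args.size == 2 && args.last == [:lit, default]
    node = node[0...-1]
  end
  node
end

tree = [:block,
        [:call, nil, :incr, [:lit, 5], [:lit, 100]],
        [:call, nil, :incr, [:lit, 3], [:lit, 1]]]
rewritten = drop_default_arg(tree, :incr, 1)
puts rewritten.last.inspect  # [:call, nil, :incr, [:lit, 3]]
```

The original tree is left untouched, so the before and after versions can be compared directly.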

Stepping back, I think this is pretty awesome! We’re now programmatically rewriting code. We can use this to maintain even large codebases without doing huge amounts of manual toil.
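To apply a rewrite across many files rather than one stdin stream, a thin driver is enough. The sketch below is an assumption about how you might wire it up: rewrite_files is a hypothetical helper that takes the source-to-source transformation as a block, and only writes a file back when the transformation changed something.

```ruby
# rewrite_files is a hypothetical helper, not part of ruby_parser or ruby2ruby.
# The block receives a file's source as a string and returns the new source.
def rewrite_files(paths)
  changed = []
  paths.each do |path|
    source = File.read(path)
    updated = yield(source)
    next if updated == source # leave untouched files alone

    File.write(path, updated)
    changed << path
  end
  changed # paths that were actually rewritten
end
```

The block would typically parse the source, call rewrite, and regenerate Ruby code as in the scaffolding earlier, and the paths could come from something like Dir.glob. Running it on a clean checkout keeps the resulting diff easy to review.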

Let’s try it again, doing something a bit more ambitious. Imagine we’ve hired a bunch of Python programmers onto our team who keep writing Python-style for loops instead of learning Ruby’s each idiom, and that we want to rewrite those loops to use each.

Your input might be something like:

def count(lst)
  i = 0
  for ele in lst
    i += 1
  end
end

And you’d want this output:

def count(lst)
  i = 0
  lst.each { |ele| i = i + 1 }
end

Taking another stab at our rewrite function, this is a bit messier:

def rewrite(expr)
  if expr.is_a? Sexp
    if expr[0] == :for
      lst = expr[1]
      param = expr[2]
      func = expr[3]
      expr.clear
      expr[0] = :iter
      expr[1] = Sexp.new(:call, lst, :each)
      expr[2] = Sexp.new(:args, param[1])
      expr[3] = func
    end

    # descend into children
    expr.each { |x| rewrite(x) }
  end
  expr
end
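The shape change is easier to see when the node is built fresh instead of cleared and refilled. The sketch below mirrors the :iter layout constructed in the function above, using plain nested arrays as a stand-in for Sexp objects (an assumption, so it runs without the gems). for_to_each is a hypothetical name, and a loop variable of any shape other than a single :lasgn is left alone.

```ruby
# Plain arrays stand in for Sexp objects here (an assumption for illustration).
# for_to_each is a hypothetical helper that converts a single :for node.
def for_to_each(node)
  kind, iterable, var, body = node
  # Only handle the simple case of one loop variable.
  return node unless kind == :for && var.is_a?(Array) && var[0] == :lasgn

  [:iter, [:call, iterable, :each], [:args, var[1]], body]
end

# for ele in lst; i += 1; end
loop_node = [:for,
             [:lvar, :lst],
             [:lasgn, :ele],
             [:lasgn, :i, [:call, [:lvar, :i], :+, [:lit, 1]]]]
each_node = for_to_each(loop_node)
puts each_node[1].inspect  # [:call, [:lvar, :lst], :each]
puts each_node[2].inspect  # [:args, :ele]
```

The iterable becomes the receiver of the each call, and the loop variable moves into the block’s argument list.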

A bit messier, but also a pretty neat demonstration of what you can do once you start playing around with this technique. For example, you could imagine only doing this if the complexity of the refactored for loop is low enough.
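One rough way to express “low enough complexity” is to count the nodes in the loop body and skip the rewrite above some threshold. This is only a sketch: node_count and simple_enough? are hypothetical names, the default threshold of 10 is arbitrary, and plain nested arrays again stand in for Sexp objects.

```ruby
# Plain arrays stand in for Sexp objects here (an assumption for illustration).
# Counts every nested node; symbols, numbers and nil count as zero.
def node_count(node)
  return 0 unless node.is_a?(Array)

  1 + node.sum { |child| node_count(child) }
end

# True when the node is a :for loop whose body is small enough to rewrite.
def simple_enough?(for_node, max_nodes = 10)
  for_node[0] == :for && node_count(for_node[3]) <= max_nodes
end

# for ele in lst; i += 1; end
loop_node = [:for,
             [:lvar, :lst],
             [:lasgn, :ele],
             [:lasgn, :i, [:call, [:lvar, :i], :+, [:lit, 1]]]]
puts node_count(loop_node[3])       # 4
puts simple_enough?(loop_node)      # true
puts simple_enough?(loop_node, 3)   # false
```

Node count is a crude proxy, but it is cheap, and it keeps the automated rewrite away from loops a human should look at.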

What next?

These are very contrived examples, but I think they are enough to let you start dreaming about ways this technique could be applied usefully to your work, particularly if your work involves migrating large codebases to new implementations. Google’s Large-Scale Automated Refactoring Using ClangMR is an interesting case study of doing that at immense scale, and Source Code Rejuvenation is Not Refactoring is another exploration of this topic.

Most importantly, I think this is a good reminder to avoid falling into the “I’ll just work through it” mindset for large migrations, which I believe can become the limit on your company’s overall throughput.

Thanks to Ingrid and KF for shaping this post.