Reading Files in Clojure

11/15/2009

After spending some time this weekend getting started with Clojure (using the excellent Setting Up Clojure for Mac OS X Leopard to guide me through setting up my development environment), it still took me longer than expected to find a concise answer to reading in files, so here is a stab at collecting some thoughts on that topic.

slurp

The simplest approach to reading in a file is simply to use slurp.

user> (slurp "tokenize.clj")
"(ns tokenize\n  (:import (java.io BufferedReader FileR....\n"

For simple scripts, this may be all you need, but it suffers from a few issues:

  • it reads the entire file into memory, making it unsuitable for large files,
  • and it doesn't break the contents into lines (or by any other delimiter, for that matter), which is usually what you want.
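For small files, the second issue is easy enough to work around by splitting slurp's output yourself. A minimal sketch (slurp-lines is a hypothetical helper, not part of Clojure):

```clojure
;; Split slurp's output on newlines. This still reads the whole file
;; into memory at once, so it is only suitable for small files.
(defn slurp-lines [file-name]
  (vec (.split (slurp file-name) "\n")))

(slurp-lines "tokenize.clj")
```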

java.io.BufferedReader

For a more scalable streaming approach, java.io.BufferedReader is a simple but efficient bet.

(ns tokenize
  (:import (java.io BufferedReader FileReader)))

(defn process-file [file-name]
  (with-open [rdr (BufferedReader. (FileReader. file-name))]
    (doseq [line (line-seq rdr)] (println line))))

(process-file "tokenize.clj")

The BufferedReader wraps the FileReader to provide buffered, incremental access to the file, and line-seq lets us treat the stream as a lazy sequence of lines, while behind the scenes it calls the readLine method on the BufferedReader.
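Because line-seq is lazy, you only pay for the lines you actually consume. One sketch of that (first-lines is an illustrative name, and note the doall, which forces the lines before with-open closes the reader):

```clojure
(ns tokenize
  (:import (java.io BufferedReader FileReader)))

;; Read only the first n lines of a file. line-seq is lazy, so the
;; rest of the file is never read; doall realizes the lines we want
;; before with-open closes the underlying reader.
(defn first-lines [file-name n]
  (with-open [rdr (BufferedReader. (FileReader. file-name))]
    (doall (take n (line-seq rdr)))))

(first-lines "tokenize.clj" 3)
```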

Modifying the above code, you could write a generic function for reducing over the lines of a file.

(ns tokenize
  (:import (java.io BufferedReader FileReader)))

(defn process-file [file-name line-func line-acc]
  (with-open [rdr (BufferedReader. (FileReader. file-name))]
    (reduce line-func line-acc (line-seq rdr))))

(defn process-line [acc line]
  (inc acc))

(prn (process-file "tokenize.clj" process-line 0))

The above snippet only counts the lines in a file, but you could rewrite process-line to detect tokens in the file as well.

(defn process-line [acc line]
  (reduce #(assoc %1 %2 (+ (get %1 %2 0) 1)) acc (.split line " ")))

At which point, running the script:

bash-3.2$ clj read1.clj
{"" 20, "rdr))))" 1, "*command-line-args*))))" 1, ...}
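To see what this version of process-line does in isolation, you can run it against a single string: the inner reduce walks the words of the line and bumps each word's count in the accumulator map (the definition is reproduced here so the snippet is self-contained):

```clojure
(defn process-line [acc line]
  (reduce #(assoc %1 %2 (+ (get %1 %2 0) 1)) acc (.split line " ")))

;; Each word in the line increments its entry in the accumulator map.
(process-line {} "the quick the")
;; => {"the" 2, "quick" 1}
```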

In general, BufferedReader and line-seq should be adequate for most file reading (BufferedReader is the go-to way to read large files in Java, too), but there are always more ways to do things.
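For what it's worth, newer versions of Clojure also ship a clojure.java.io namespace whose reader function constructs the BufferedReader for you. Assuming that namespace is available, the streaming example reduces to:

```clojure
(ns tokenize
  (:require [clojure.java.io :as io]))

;; io/reader returns a java.io.BufferedReader, so line-seq can be
;; used on it directly, just as in the explicit-import version.
(defn process-file [file-name]
  (with-open [rdr (io/reader file-name)]
    (doseq [line (line-seq rdr)]
      (println line))))
```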

clojure.contrib.duck-streams

If you have clojure.contrib installed, then a slightly more compact approach to parsing files as streams is available: duck-streams.

Modifying the BufferedReader example to use duck-streams, we get this code:

(ns tokenize
  (:use [clojure.contrib.duck-streams :only (read-lines)]))

(defn process-file [file-name line-func line-acc]
  (reduce line-func line-acc (read-lines file-name)))

(defn process-line [acc line]
  (reduce #(assoc %1 %2 (+ (get %1 %2 0) 1)) acc (.split line " ")))

(prn (process-file "tokenize.clj" process-line {}))

It's a bit more compact, but you'd probably reach for duck-streams' read-lines because you were already using other duck-streams functionality (like spit for writing out files, or append-spit for appending to them).

There are many other approaches to reading in files in Clojure, but these should be enough to get started.

All Rights Reserved, Will Larson 2007 - 2014.