Irrational Exuberance!

Reading Files in Clojure

November 15, 2009. Filed under clojure

After spending some time over this weekend getting started with Clojure (using the excellent Setting Up Clojure for Mac OS X Leopard to guide me through setting up my development environment), it still took me a bit longer than expected to find a concise answer on how to read in files, so here is a stab at collecting some thoughts on that topic.

slurp

The simplest approach to reading in a file is simply to use slurp.

user> (slurp "tokenize.clj")
"(ns tokenize\n  (:import (java.io BufferedReader FileR....\n"  

For simple scripts, this may be all you need, but it suffers from a few issues:

  • it reads the entire file into memory, making it unsuitable for large files,
  • and it doesn't split the contents by line (or by any other delimiter, for that matter), which is usually what you want.
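When the file is small enough to slurp, the second issue can be worked around by splitting the string afterwards. The following is only a sketch: it assumes a Clojure version that ships clojure.string (1.2 or later, which postdates this post), and both the slurp-lines helper and the temporary file are made up for illustration.

```clojure
(ns slurp-example
  (:require [clojure.string :as str])
  (:import (java.io File)))

;; Hypothetical helper: read the whole file, then split it into a vector of lines.
;; The entire file is still loaded into memory, so this only suits small files.
(defn slurp-lines [file-name]
  (str/split-lines (slurp file-name)))

;; Write a small temporary file so the example is self-contained.
(def tmp (File/createTempFile "slurp-example" ".txt"))
(spit tmp "first line\nsecond line\n")

(prn (slurp-lines (.getPath tmp)))
;; prints ["first line" "second line"]
```

This does nothing for the memory issue, which is what the streaming approaches address.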

java.io.BufferedReader

For a more scalable streaming approach, java.io.BufferedReader is a simple but efficient bet.

(ns tokenize
  (:import (java.io BufferedReader FileReader)))

(defn process-file [file-name]
  (with-open [rdr (BufferedReader. (FileReader. file-name))]
    (doseq [line (line-seq rdr)] (println line))))

(process-file "tokenize.clj")

The BufferedReader wraps the FileReader to buffer reads from the file, and line-seq lets us treat the stream as a lazy sequence of lines; behind the scenes it calls the readLine method on BufferedReader.
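One consequence of that laziness is worth a quick sketch: the lines have to be consumed while the reader is still open. Here doseq does that inside with-open, but a function that returns the result of line-seq directly hands back a sequence whose reader is already closed. The function names and the temporary file below are hypothetical, chosen only to show the difference.

```clojure
(ns lazy-example
  (:import (java.io BufferedReader FileReader File)))

;; Broken on purpose: the lazy seq escapes with-open, so realizing anything
;; past the first line happens after the reader has been closed.
(defn read-lines-lazy [file-name]
  (with-open [rdr (BufferedReader. (FileReader. file-name))]
    (line-seq rdr)))

;; doall realizes every line while the reader is still open.
(defn read-lines-eager [file-name]
  (with-open [rdr (BufferedReader. (FileReader. file-name))]
    (doall (line-seq rdr))))

;; Small temporary file so the example is self-contained.
(def tmp (File/createTempFile "lazy-example" ".txt"))
(spit tmp "a\nb\nc\n")

(prn (read-lines-eager (.getPath tmp)))
;; prints ("a" "b" "c")

;; Walking the lazy version fails with an IOException once it tries
;; to read from the closed reader.
(prn (try (doall (read-lines-lazy (.getPath tmp)))
          (catch java.io.IOException e :stream-closed)))
```

Realizing everything with doall gives up the streaming benefit, so for large files it is usually better to do the work inside with-open.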

Modifying the above code, you could write a generic function for reducing over the lines in a file.

(ns tokenize
  (:import (java.io BufferedReader FileReader)))

(defn process-file [file-name line-func line-acc]
  (with-open [rdr (BufferedReader. (FileReader. file-name))]
    (reduce line-func line-acc (line-seq rdr))))

(defn process-line [acc line]
  (+ acc 1))

(prn (process-file "tokenize.clj" process-line 0))

The above snippet only counts the lines in a file, but you could rewrite process-line to count the tokens in the file as well (passing an empty map instead of 0 as the initial accumulator).

(defn process-line [acc line]
  (reduce #(assoc %1 %2 (+ (get %1 %2 0) 1)) acc (.split line " ")))

At which point, running the script produces:

bash-3.2$ clj read1.clj
{"" 20, "rdr))))" 1, "*command-line-args*))))" 1, ...}
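The "" entry in that output is a side effect of splitting on a single space: leading indentation and blank lines produce empty strings, which get counted as tokens. One possible variation, sketched here under the assumption that a token is any run of non-whitespace characters, uses re-seq instead of .split. The name count-tokens and the temporary file are hypothetical.

```clojure
(ns token-example
  (:import (java.io BufferedReader FileReader File)))

(defn process-file [file-name line-func line-acc]
  (with-open [rdr (BufferedReader. (FileReader. file-name))]
    (reduce line-func line-acc (line-seq rdr))))

;; Variant of process-line: re-seq with #"\S+" returns only runs of
;; non-whitespace, so indentation never yields an empty-string token.
(defn count-tokens [acc line]
  (reduce #(assoc %1 %2 (inc (get %1 %2 0))) acc (re-seq #"\S+" line)))

;; Small temporary file with indentation, a doubled space, and a blank line.
(def tmp (File/createTempFile "token-example" ".txt"))
(spit tmp "  a b\n\na  c\n")

;; The accumulator starts as an empty map rather than 0.
(def counts (process-file (.getPath tmp) count-tokens {}))
(prn counts)
```

For this input, counts maps "a" to 2 and "b" and "c" to 1 each, with no entry for the empty string.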

In general, BufferedReader and line-seq should be adequate for most file reading (BufferedReader is the go-to way to read large files in Java), but there are always more ways to do things.

clojure.contrib.duck-streams

If you have clojure.contrib installed, then a slightly more compact approach to parsing files as streams is available: duck-streams.

Modifying the BufferedReader example to use duck-streams, we get this code:

(ns tokenize
  (:use [clojure.contrib.duck-streams :only (read-lines)]))

(defn process-file [file-name line-func line-acc]
  (reduce line-func line-acc (read-lines file-name)))

(defn process-line [acc line]
  (reduce #(assoc %1 %2 (+ (get %1 %2 0) 1)) acc (.split line " ")))

(prn (process-file "tokenize.clj" process-line (hash-map)))

It's a bit more compact, but you'd probably be using read-lines from duck-streams because you were already using other duck-streams functionality (like spit for writing out files and append-spit for appending to files).
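For readers on Clojure 1.2 or later, where much of duck-streams was folded into clojure.java.io and spit moved into clojure.core, a roughly equivalent version might look like the sketch below. It assumes those later APIs (which postdate this post), and the temporary file is only there to make the example self-contained.

```clojure
(ns io-example
  (:require [clojure.java.io :as io])
  (:import (java.io File)))

;; Same shape as the BufferedReader version; io/reader returns a
;; BufferedReader, so line-seq works on it directly.
(defn process-file [file-name line-func line-acc]
  (with-open [rdr (io/reader file-name)]
    (reduce line-func line-acc (line-seq rdr))))

;; spit writes the file; the :append option plays the role of append-spit.
(def tmp (File/createTempFile "io-example" ".txt"))
(spit tmp "one\n")
(spit tmp "two\n" :append true)

;; Count the lines, as in the earlier process-line example.
(prn (process-file (.getPath tmp) (fn [acc _] (inc acc)) 0))
;; prints 2
```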

There are many other approaches to reading in files in Clojure, but these should be enough to get started.