Reading Files in Clojure
After spending some time this weekend getting started with Clojure (using the excellent Setting Up Clojure for Mac OS X Leopard guide to set up my development environment), it still took me longer than expected to find a concise answer on how to read files, so here is a stab at collecting some thoughts on that topic.
slurp
The simplest approach to reading a file is simply to use slurp.
user> (slurp "tokenize.clj")
"(ns tokenize\n (:import (java.io BufferedReader FileR....\n"
For simple scripts, this may be all you need, but it suffers from a few issues:
- it reads the entire file into memory, making it unsuitable for large files,
- and it doesn't split the contents by line (or any other delimiter, for that matter), which is usually what you want.
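If the file is small enough that slurping it is acceptable, the second issue is easy to work around by splitting the string yourself; in newer Clojure releases, clojure.string/split-lines does exactly this. A minimal sketch (the file name sample.txt and its contents are made up for illustration):

```clojure
(require '[clojure.string :as str])

;; write a small sample file so the example stands on its own
(with-open [w (java.io.FileWriter. "sample.txt")]
  (.write w "hello world\nhello clojure\n"))

;; split-lines breaks the slurped string on line endings -- note the
;; whole file is still read into memory, so this only suits small files
(str/split-lines (slurp "sample.txt"))
;; => ["hello world" "hello clojure"]
```

This keeps the convenience of slurp while giving you a line-oriented result, but it does nothing for the memory issue.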
java.io.BufferedReader
For a more scalable streaming approach, java.io.BufferedReader is a simple but efficient bet.
(ns tokenize
  (:import (java.io BufferedReader FileReader)))

(defn process-file [file-name]
  (with-open [rdr (BufferedReader. (FileReader. file-name))]
    (doseq [line (line-seq rdr)]
      (println line))))

(process-file "tokenize.clj")
The BufferedReader wraps the FileReader to provide buffered access to the file, and line-seq lets us treat the stream as a lazy sequence of lines, while behind the scenes it calls the readLine method on the BufferedReader.
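To see the laziness at work, you can ask for just the first line; only as much of the file as you consume is actually read. A small sketch (sample.txt and its contents are made up for illustration):

```clojure
(import '(java.io BufferedReader FileReader FileWriter))

;; write a small sample file so the example stands on its own
(with-open [w (FileWriter. "sample.txt")]
  (.write w "first line\nsecond line\n"))

;; line-seq is lazy: taking the first element reads a single line,
;; not the whole file -- just be sure to consume what you need before
;; with-open closes the reader
(with-open [rdr (BufferedReader. (FileReader. "sample.txt"))]
  (first (line-seq rdr)))
;; => "first line"
```

The one gotcha is that the sequence is backed by the open reader, so any lazy portion you haven't realized becomes unreadable once with-open closes it.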
Modifying the above code, you could write a generic function for reducing over the lines of a file.
(ns tokenize
  (:import (java.io BufferedReader FileReader)))

(defn process-file [file-name line-func line-acc]
  (with-open [rdr (BufferedReader. (FileReader. file-name))]
    (reduce line-func line-acc (line-seq rdr))))

(defn process-line [acc line]
  (+ acc 1))

(prn (process-file "tokenize.clj" process-line 0))
The above snippet only counts the lines in a file, but you could rewrite process-line to count token occurrences as well (passing an empty map as the initial accumulator instead of 0).
(defn process-line [acc line]
  (reduce #(assoc %1 %2 (+ (get %1 %2 0) 1)) acc (.split line " ")))
At which point, running the script:
bash-3.2$ clj read1.clj
{"" 20, "rdr))))" 1, "*command-line-args*))))" 1, ...}
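Since process-file simply returns whatever the reduce accumulates, the token counts come back as an ordinary Clojure map that you can post-process like any other, for instance sorting tokens by frequency. A self-contained sketch (sample.txt and its contents are made up for illustration):

```clojure
(ns tokenize
  (:import (java.io BufferedReader FileReader FileWriter)))

(defn process-file [file-name line-func line-acc]
  (with-open [rdr (BufferedReader. (FileReader. file-name))]
    (reduce line-func line-acc (line-seq rdr))))

(defn process-line [acc line]
  (reduce #(assoc %1 %2 (+ (get %1 %2 0) 1)) acc (.split line " ")))

;; write a small sample file so the example stands on its own
(with-open [w (FileWriter. "sample.txt")]
  (.write w "to be or not to be\nthat is the question\n"))

;; the result is an ordinary map, so it can be sorted by count
(prn (sort-by val > (process-file "sample.txt" process-line {})))
```

Here "to" and "be" each appear twice and everything else once, so those two tokens lead the sorted output.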
In general, BufferedReader and line-seq should be adequate for most file reading (BufferedReader is the go-to way to read large files in Java), but there are always more ways to do things.
clojure.contrib.duck-streams
If you have clojure.contrib installed, then a slightly more compact approach to parsing files as streams is available: duck-streams.
Modifying the BufferedReader example to use duck-streams, we get this code:
(ns tokenize
  (:use [clojure.contrib.duck-streams :only (read-lines)]))

(defn process-file [file-name line-func line-acc]
  (reduce line-func line-acc (read-lines file-name)))

(defn process-line [acc line]
  (reduce #(assoc %1 %2 (+ (get %1 %2 0) 1)) acc (.split line " ")))

(prn (process-file "tokenize.clj" process-line (hash-map)))
It's a bit more compact, but you'd probably only reach for duck-streams/read-lines because you were already using other duck-streams functionality (like spit for writing out files, or append-spit for appending to them).
There are many other approaches to reading in files in Clojure, but these should be enough to get started.