After spending some time over this weekend getting started with Clojure (using the excellent Setting Up Clojure for Mac OS X Leopard to guide me through setting up my development environment), it still took me a bit longer than expected to find a concise answer to reading in files, so here is a stab at collecting some thoughts on that topic.
slurp
The simplest approach to reading in a file is to simply use slurp.
user> (slurp "tokenize.clj")
"(ns tokenize\n (:import (java.io BufferedReader FileR....\n"
For simple scripts, this may be all you need, but it suffers from a few issues:
- it reads the entire file into memory, making it unsuitable for large files,
- and it doesn't break contents by line (or any other delimiter for that matter) which is usually what you want.
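For small files, the second issue is easy to work around by splitting the slurped string yourself. Here is a minimal sketch, assuming Clojure 1.2 or later (for clojure.string); the helper name slurp-lines is made up for illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: read a whole (small) file and return its lines.
;; It still loads the entire file into memory, so the first issue remains.
(defn slurp-lines [file-name]
  (str/split-lines (slurp file-name)))
```

Calling (slurp-lines "tokenize.clj") would then give you one string per line instead of one big string.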
java.io.BufferedReader
For a more scalable streaming approach, java.io.BufferedReader is a simple but efficient bet.
(ns tokenize
  (:import (java.io BufferedReader FileReader)))

(defn process-file [file-name]
  (with-open [rdr (BufferedReader. (FileReader. file-name))]
    (doseq [line (line-seq rdr)] (println line))))

(process-file "tokenize.clj")
The BufferedReader wraps the FileReader to buffer reads from the file,
and line-seq allows us to treat the stream as
a lazy sequence of lines, while behind the scenes it calls the readLine method on BufferedReader.
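To make the readLine connection concrete, here is a rough sketch of a hand-written loop next to line-seq. A StringReader stands in for a file so the example is self-contained, and read-lines-manually is a made-up name; unlike line-seq, this loop is eager rather than lazy.

```clojure
(import '(java.io BufferedReader StringReader))

;; Hypothetical helper: collect lines by calling .readLine until it returns nil,
;; which is how BufferedReader signals end of input.
(defn read-lines-manually [^BufferedReader rdr]
  (loop [lines []]
    (if-let [line (.readLine rdr)]
      (recur (conj lines line))
      lines)))

(def manual-lines
  (with-open [rdr (BufferedReader. (StringReader. "first\nsecond\nthird"))]
    (read-lines-manually rdr)))

(def seq-lines
  (with-open [rdr (BufferedReader. (StringReader. "first\nsecond\nthird"))]
    ;; doall forces every line while the reader is still open
    (doall (line-seq rdr))))
```

Both manual-lines and seq-lines end up holding the same three strings; line-seq just saves you from writing the loop and lets you consume lines on demand.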
Modifying the above code, you could write a generic function for reducing over the lines in a file.
(ns tokenize
  (:import (java.io BufferedReader FileReader)))

(defn process-file [file-name line-func line-acc]
  (with-open [rdr (BufferedReader. (FileReader. file-name))]
    (reduce line-func line-acc (line-seq rdr))))

(defn process-line [acc line]
  (+ acc 1))

(prn (process-file "tokenize.clj" process-line 0))
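The above snippet only counts the lines in a file, but you could rewrite process-line
to count the tokens in the file as well. The accumulator is then a map from token to count, so the initial value passed to process-file has to be a map such as {} rather than 0.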
(defn process-line [acc line]
  (reduce #(assoc %1 %2 (+ (get %1 %2 0) 1)) acc (.split line " ")))
At which point, running the script prints something like:
bash-3.2$ clj read1.clj
{"" 20, "rdr))))" 1, "*command-line-args*))))" 1, ...}
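The "" entry in that output is a side effect of splitting on a single space: indentation and blank lines produce empty strings. If you only care about non-blank tokens, one possible variant matches runs of non-whitespace with re-seq instead. This is just a sketch, and the name count-tokens is made up here.

```clojure
;; Hypothetical replacement for process-line: counts whitespace-separated tokens.
;; re-seq returns nil for a blank line, so reduce simply hands back acc unchanged.
(defn count-tokens [acc line]
  (reduce (fn [counts token]
            (assoc counts token (inc (get counts token 0))))
          acc
          (re-seq #"\S+" line)))
```

You would pass it to process-file the same way as process-line, with an empty map as the starting accumulator.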
In general, BufferedReader and line-seq should be adequate for
most file reading (BufferedReader is the go-to way to read large files
in Java), but there are always more ways to do things.
clojure.contrib.duck-streams
If you have clojure.contrib installed, then a slightly more compact approach to parsing files as streams is available: duck-streams.
Modifying the BufferedReader example to use duck-streams, we get this code:
(ns tokenize
  (:use [clojure.contrib.duck-streams :only (read-lines)]))

(defn process-file [file-name line-func line-acc]
  (reduce line-func line-acc (read-lines file-name)))

(defn process-line [acc line]
  (reduce #(assoc %1 %2 (+ (get %1 %2 0) 1)) acc (.split line " ")))

(prn (process-file "tokenize.clj" process-line (hash-map)))
It's a bit more compact, but you'd probably be using duck-streams' read-lines
because you were already using other duck-streams functionality (like
spit for writing out files, or append-spit for appending to files).
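If you don't have clojure.contrib around, a similar shape is possible with clojure.java.io, which ships with Clojure itself from version 1.2 onward. The following is a sketch under that assumption, not a drop-in replacement for duck-streams; it reuses the process-file name from above for comparison.

```clojure
(require '[clojure.java.io :as io])

;; io/reader returns a BufferedReader for a file name, so line-seq works on it
;; directly and with-open closes it once reduce has consumed every line.
(defn process-file [file-name line-func line-acc]
  (with-open [rdr (io/reader file-name)]
    (reduce line-func line-acc (line-seq rdr))))
```

The line-counting and token-counting versions of process-line work with this unchanged.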
There are many other approaches to reading in files in Clojure, but these should be enough to get started.
Something to remember in Clojure is that sequences are lazy by default. So if you produce a sequence from reading a file and return it from the code that opened the file,
you have to wrap the sequence-producing code in a doall.
This will make sure that all the side-effects in the sequence, i.e. reading from the file, are all realised while the file is still open.
doall is not necessary for the reduce examples, because reduce will realise the whole sequence before returning.
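A small sketch of the difference, using a StringReader in place of a file so it runs on its own (lines-lazy and lines-eager are made-up names):

```clojure
(import '(java.io BufferedReader StringReader))

;; Problematic: the lazy sequence escapes with-open, so most of the lines
;; would only be read after the reader has already been closed.
(defn lines-lazy [text]
  (with-open [rdr (BufferedReader. (StringReader. text))]
    (line-seq rdr)))

;; Safe: doall realises every line while the reader is still open.
(defn lines-eager [text]
  (with-open [rdr (BufferedReader. (StringReader. text))]
    (doall (line-seq rdr))))
```

Here (lines-eager "a\nb\nc") returns all three lines, while walking the result of (lines-lazy "a\nb\nc") past its first element typically fails with a "Stream closed" IOException on recent Clojure versions.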
-- Lauri
Thanks for your comment, I'll readily admit some confusion about the distinction between
doall and doseq. In the REPL, the code with doseq worked as I imagined it would (read the entire file), but I guess I'll need to look into this a bit more.
Thanks a lot for your post! It was helpful to me to get started with basic tasks in Clojure!