[email protected]
[Top] [All Lists]

[Haskell-cafe] Spelling checker exercise

Subject: [Haskell-cafe] Spelling checker exercise
From: Matthew Phillips
Date: Fri, 22 Jan 2010 17:21:27 +1030
Hello all,

sorry to bring up an old chestnut, but I’m trying to improve my Haskell-fu by 
writing a small program, and chose Peter Norvig’s spelling checker as my 
exercise material (http://norvig.com/spell-correct.html).

While I’ve gotten it working, and found it quite illuminating, I also found it 
to to be very slow. And then I discovered that, of course, others have been 
here well before me ([1] and [2]). Those discussions were very interesting, but 
the code they refer to is mostly not available, so the most I’ve gotten out of 
it so far is that:

  (a) I should be using strict left folds and strict Map insertions for the 
word frequency map (this shaved off about a second: ~5s -> ~4s for a single 
word on my 2.4GHz MacBook Pro, GHC 6.10.4)
  (b) I should probably be using ByteString’s
  (c) Using Set’s for the edit permutations probably isn’t worth it (although I 
found using plain lists made it about a second slower)

(b) is difficult because I’ve used matching patterns plus list comprehensions 
to generate the potential edits, and I really like how elegantly it pans out 
that way. Because ByteString’s are not lists, I can’t see a way to keep the 
current structure and use them.

The code is at [3] (link to version at time of post). Profiling [4] shows:

  $ ./spelling becuase +RTS -p 
  becuase -> because
  $ cat spelling.prof
    total time =  4.02 secs (201 ticks @ 20 ms) 
    total alloc = 1,544,257,792 bytes (excludes profiling overheads) 

  COST CENTRE                    MODULE               %time %alloc

  train                          Main                  52.7   19.7
  readFile                       Main                  28.9    8.6
  wordsBy                        Main                  10.9   49.5
  toLower                        Main                   7.0   21.8
  ...

So it appears that “train" (building the freq map) and “readFile” in “nwords" 
are the places to hone. I will look at using Bloom Filters or Trie’s instead of 
Data.Map, but I wonder if readFile should be taking nearly %30 of the run time, 
even for a 6MB file?

Sorry to dump such a long post on the list — I’ll understand if no one can be 
bothered rehashing this. But, in summary I’d like to know:

  (a) how could I use ByteString’s for this to speed up I/O and reduce memory 
usage without losing the nice readability?

  (b) should readFile be so slow?

  (c) any other tips

Possibly all my questions could be answered if someone has the code from the 
old posts.

Cheers,

Matthew.

[1]: 
http://haskell.markmail.org/search/?q=norvig%20spelling#query:norvig%20spelling+page:1+mid:jrvpyjpms5dumd3k+state:results

[2]: http://thread.gmane.org/gmane.comp.lang.haskell.cafe/21780

[3]: 
http://github.com/scramjet/spelling/blob/44fd336ef4f62d49c7087dbc1ffb9c009b78cec0/spelling.hs

[4]: 
http://github.com/scramjet/spelling/blob/f5146b24b77443c22975ff69fb91aa922cc1d4a3/spelling.prof

_______________________________________________
Haskell-Cafe mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/haskell-cafe

<Prev in Thread] Current Thread [Next in Thread>