[email protected]
[Top] [All Lists]

Re: [Haskell-cafe] Space Efficiency When Sorting a List of Many Lists

Subject: Re: [Haskell-cafe] Space Efficiency When Sorting a List of Many Lists
From: Peter Green
Date: Sat, 02 Jan 2010 22:33:02 +0800

On 31/12/2009, at 6:38 PM, Luke Palmer wrote:

On Wed, Dec 30, 2009 at 9:39 PM, Peter Green <[email protected]> wrote:
I can guess that there might be be less laziness and more instantiation when
 sorting is introduced,

Yes, by a lot.  Sorting requires keeping the entire list in memory.
And Haskell lists, unfortunately, are not that cheap in terms of space
usage (I think [Int] uses 3 words per element).

but my questions are:
       (1) Am I doing anything terribly stupid/naive here?
       (2) If so, what can I do to improve space efficiency?


import Data.List (sort)
import Data.List.Split (splitOn)

-- Cartesian Product over a List of Lists
-- cp [[1,2],[3],[4,5,6]] -->
cp          :: [[a]] -> [[a]]
cp []       =  [[]]
cp (xs:xss) =  [y:ys | y <- xs, ys <- cp xss]

This cartesian product varies in its tail faster than its head, so
every head gets its own unique tail.  If you reverse the order of the
bindings so that it varies in its head faster, then tails are shared.
If my quick and dirty reasoning is correct, it improves the space
usage by a factor of the number of sublists.

cp' [] = [[]]
cp' (xs:xss) = [y:ys | ys <- cp' xss, y <- xs]

But if you're serious, you can probably do better than just generating
them all and passing them to sort.  I get the impression that there is
some structure here that can be taken advantage of.

-- fromCSV ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] -->
-- [[[8,12],[11],[7,13],[10]],[[1,2,3],[1,9],[3,6],[4]]]
fromCSV :: [String] -> [[[Int]]]
fromCSV = map parseOneLine
   where parseOneLine = map parseGroup . splitOn ","
             where parseGroup = map read . splitOn "+"

-- explode [[[1,2],[3],[4,5,6]], [[1, 2], [14,15], [16]]] -->
-- [1,3,6],[2,3,4],[2,3,5],[2,3,6],[1,14,16],[1,15,16],[2,14,16],[2,15,16]]
explode :: [[[a]]] -> [[a]]
explode =  concatMap cp

-- toCSV [[8,11,7,10,12],[8,11,7,10,12],[8,11,7,10,12]] -->
-- ["8,11,7,10,12","8,11,7,10,12","8,11,7,10,12"]
toCSV :: (Show a) => [[a]] -> [String]
toCSV = map $ tail . init . show
--toCSV = map (intercalate "," . map show)

main = interact (unlines . toCSV . sort . explode . fromCSV . lines)

Thank you everyone for the very helpful suggestions so far!

I think I should re-state the problem and provide more background info. In answer to the poster who suggested using a Trie, there *is* a real world application for what I'm attempting to do. I'll also hint at that below:

I have text files containing 'compressed lists' Compressed lists  
look like this:


Sublists are comma-delimited, and sublist elements are separated by  
'+' character. Sublists contain integers in the range 1..20.

I parse these to look like so:


I need to explode these and produce a lex-sorted list of exploded lists:


I then output this data as comma-delimited lists:


It is a property of the underlying problem 'structure' that none of these comma-delimited lists in the exploded data is repeated. i.e. we can never see two instances of (say) 1,1,3,4.

{ Begin Aside on Why I Would Want to do Such a Silly Thing:

'Compressed lists' are in fact compressed wager combinations for multi-leg exotic horse racing events. 'Exploded lists' are the single wager combinations covered by the grouped combinations.

Why use compressed combinations? Well, it's a lot easier to submit 5,000 compressed tickets (whether physically or electronically) than 1M single tickets for the individual combinations we have somehow managed to represent by the 5,000 compressed tickets. 

However, single combinations are the currency of the realm. These are the underlying wagers and compression is merely a convenience/necessity to enable single wagers to be placed.

It's  not uncommon to generate large numbers of single combinations: Imagine 6 races with 15 runners in each race, and a wager requiring one to select first place getters in all 6 legs. The number of potential outcomes is 15^6 = 11,390,625. One might decide (somehow) that it makes sense in a positive expectation way to wager (say) 500K of these potential outcomes.

Getting from theoretically desirable single combinations to optimally compressed wagers is actually NP-Complete and another story for another day. Suffice it to say that one might use some nasty hacks and heuristics to arrive at compressed wagers - but in the process doing violence to one's precise coverage of the theoretically desirable single combinations. That's one reason why I need to recover (by 'exploding') the *actual* wagered single combinations from the compressed/ wagers.

I need to explode these compressed wagers back to single combinations because, I need to (eventually) do set comparisons on collections  of single combinations. i.e. given File A and File B of compressed wagers, I need to be able to  answer questions like:

(1) How many single combinations are common to File A and File B
(2) How many single combinations in File A are missing from File B
etc. i.e a few of the common set comparison operations.

And the dumbest, quickest and dirtiest and most idiot-proof way of doing this as a baseline approach before I start using maps, tries, etc... is to explode Files A and B into lex sorted single combinations order and then use diff -y with judicious grepping for '>' and <''

End Aside }

I'm probably going to go with an external sort for starters to keep memory usage down. For most use-cases, I can get away with my current in-memory sort - just have to beware edge cases.

However, later on, I'm keen to do everything with set theoretic operations and avoid writing large files of single combinations to disk.

So, I guess the things I'd like to get a feel for are:

(1) Is it realistic to expect to do in-memory set comparisons on sets with ~1M elements where each element is a (however-encoded) list of (say) 6 integers? I mean realistic in terms of execution time and space usage, of course.

(2) What would be a good space-efficient encoding for these single combinations? I know that there will never be more than 8 integers in a combination, and none of these values will be < 1 or > 20. So perhaps I should map them to ByteString library Word8 strings? Presumably sorting a big list of ByteStrings is going to be faster than sorting a big list of lists of int?

Haskell-Cafe mailing list
[email protected]
<Prev in Thread] Current Thread [Next in Thread>