[email protected]
[Top] [All Lists]

Re: [Haskell-cafe] Re: FASTER primes

Subject: Re: [Haskell-cafe] Re: FASTER primes
From: Daniel Fischer
Date: Mon, 4 Jan 2010 01:29:22 +0100

Am Sonntag 03 Januar 2010 09:54:37 schrieb Will Ness:


> Exactly the point I tried to make. :)



> again, yes. :)



> yes.



> yes, that's what I meant - the cost of calling all the fuctions that - we

> know in advance will - have nothing to do eventually.



> >

> > But there's a lot of list constructuion and deconstruction necessary for

> > the Euler sieve.


> yes. Unless, of course, s smart compiler recognizes there's only one

> consumer for the values each multiples-list is producing, and keeps it

> stripped down to a generator function, and its current value. I keep

> thinkig a smart compiler could eliminate all these "span" calls and replace

> them with just some pointers manipulating...


Of course I'm no smart compiler, but I don't see how it could be even possible to replace the span calls with pointer manipulation when dealing with lazily generated (infinite, if we're really mean) lists. Even when you're dealing only with strict finite lists, it's not trivial to do efficiently.

> > That may be more work than the multiple hits cause.


> so that too would make filters win; only _if_ the cheap primality test

> existed!..


> > > > the multiples of p with 2*p, you have

> > > > ...

> > >

> > > There are two questions here - where to start crossing off numbers, and

> > > when. If you'd start at 2*p maybe the overall complexity would remain

> > > the same but it'll add enormous overhead with all those duplicate

> > > multiples.

> >

> > The additional duplicate multiples aren't the problem. Sure, the numbers

> > having a prime divisor larger than the square root would be crossed off

> > one additional time, but that isn't so much per se. The additional

> > crossings off are O(bound), so they don't harm the time complexity. But

> > they can have effects which multiply the running time by a large

> > constant.


> yes, exactly what I wanted to say. :)


*g* I wondered what you meant by the distinction between where and when to cross off.

> > > No, the question is not where to start, but when. PQ might hide the

> > > problem until the memory blows up. Anything that we add that won't have

> > > any chance of contributing to the final result, is added for nothing

> > > and only drives the total cost up needlessly.

> >

> > Well, if you start crossing off at p^2 and feed your PQ from a separate

> > prime generator, the memory blow-up is far away. E.g., when the main

> > sieve is at 10^12, the PQ contains less than 80000 multiples. No space

> > problem in sight. At 10^14, the PQ's size is still less than 700000.

> > That's noticeable, but there's still plenty of space.


> again, what I mean is, not _where_ I start crossing them off in a PQ, but

> _when_. The article's code starts crossing them off _at_ p^2 - by adding

> p^2+2p into the PQ - _as_ _soon_ as p itself is reached. It won't surface

> until p^2 will be considered for a prime; it'll lay dormant deep inside the

> queue's guts. When reaching 7919, the thousand (i.e. pi(7919) ) entries

> will hang out inside the PQ - instead of just 24. A memory blowup. (this is

> of course fixed in Melissa's ZIP package). Of course due to the nature of

> PQ it might actually not hurt the performance for a while, depending on

> partcular PQ implementation. Complexity _and_ constant factors.


It will start to have an impact pretty soon. Assuming at least one of the relevant PQ operations to be Theta(log size), each composite between ~400 and ~400000 (rough estimates) will take something like twice as long to handle. It will start to hurt really badly only a while later, though, as a guesstimate, with more than a million primes in the PQ, memory will have a hard time.

> > If, on the other hand, you start crossung off at 2*p, when the main sieve

> > is at 10^7, the size of the PQ is > 650000, at 10^8, the size is more

> > than 5.5 million. That starts to become a memory problem rather soon.


> here you don't have a choice or when to add it - you have to add it at p

> itself - so the problem is clear. But even when you cross at p^2, the

> question remains, of when you add the p^2 entry into the PQ. That was my

> point.


> Postponed Filters code makes this clear, and thus hard to err on.

> Unfortunately, it wasn't present the article.


> > > > > I was under impression that the first shows O(n^2) approx., and the

> > > > > second one O(n^1.5) (for n primes produced).

> > > >

> > > > In Turner/postponed filters, things are really different. Actually,

> > > > Ms. O'Neill is right, it is a different algorithm. In the above, we

> > > > match each

> > >

> > > what _is_ different is divisibility testing vs composites removal,

> > > which follows from her in-depth analysis although is never quite

> > > formulated in such words in her article. But nothing matters until the

> > > premature starting up is eliminated, and that key observation is

> > > missing for the article either - worse, it is brushed off with the

> > > casual remark that it is "a pleasing but minor optimization". Which

> > > remark, as you show here, is true in the imperative, mutable-storage

> > > setting, but is made in an article abut functional code, in regard to

> > > the functional code of Turner's sieve.

> >

> > I think that remark was meant to apply to composite removal, not Turner's

> > sieve.


> It is right there on page 2, right when the Turner's sieve is presented and

> discussed. The only explanation that I see is that she thought of it in

> regards to the imperative code, just as her analysis concentrates only on

> calculation aspects of the imperative code itself.


To be fair, she writes:

"Let us ïrst describe the original âby handâ sieve algorithm as practiced by



The starting point of p^2 is a pleasing but minor optimization, which can be made

because lower multiples will have already been crossed oï when we found the primes

prior to p.

... (This optimization does not aïect the

time complexity of the sieve, however, so its absence from the code in Section 1 is

not our cause for worry.)"

So it's in context of the imperative code (although rather numbers on paper than bits in RAM).

For imperative (be it array or paper), it is indeed minor; for her PQ sieve, it can be considered minor from a theoretical point of view (doesn't change time complexity), but practically, anything that changes performance by a factor of two or more is only 'minor' for hopelessly inadequate algorithms (whether a computation would take ten or twenty billion years is indeed a minor difference; one hour or two is a major difference).

However, as you say, the important point is not whether p's multiples get crossed off starting from 2*p or from p^2, but whether p's multiples get inserted into the queue when you look at p or at p^2.

> > But even then, the 'minor' isn't adequate considering her PQ approach

> > where, although it doesn't change time complexity (in theory), it changes

> > space complexity - and that may affect running time drastically. Probably

> > she was thinking of the typical array sieve when she made that remark.


> In short, that remark was not referring to what it seemingly refers to, in

> the article, and as such was very confusing. It was a big STOP sign on the

> way to Postponed Filters - Euler's - Bird's merged multiples - tree-merging

> (with wheel) road of little steps, and used as a justification for her to

> make a big leap across the chasm towards the PQ code.


> So to speak. :)


> > > So the key understanding is overlooked.

> > >

> > > And what we're after here is the insight anyway. Skipping steps of

> > > natural development does not contribute to garnering an insight.


> .. and that is of course a matter of personal preference. A genius is

> perfectly capable of making big leaps across chasms. Heck they might even

> be able to fly :)


> > > > In the postponed filters, we match each prime p with all primes <=

> > > > sqrt p (again, the composites contribute less). I think that gives a

> > > > complexity of O(pi(bound)^1.5*log (log bound)) or O(n^1.5*log (log

> > > > n)) for n primes produced.

> > >

> > > Empirically, it was always _below_ 1.5, and above 1.4 , and PQ/merged

> > > multiples removal around 1.25..1.17 .

> >

> > I should really stop doing those calculations in my head, it seems I'm

> > getting too old for that. Doing it properly, on paper, the cost for

> > identifying p_k is pi(sqrt p_k) [trial division by all primes less than

> > sqrt p_k].

> > p_k is approximately k*log k, so pi(sqrt p_k) is approximately

> > (sqrt $ k* log k)/log (sqrt $ k*log k) ~ 2*sqrt (k/log k).

> > Summing that, the cost for identifying the first n primes is

> > Theta(n^1.5/log n).


> that would correspond to what I've seen much better.


It should. Though empirical measurements don't always adhere exactly to the theoretical results, the deviation is usually small (for large enough input).

> > Using a feeder, i.e. primes = 2:3.sieve feederprimes [5, 7 .. ], where

> > feederprimes are generated as above, to reduce memory pressure and GC

> > impact, that fits pretty well with my measurements of the time to find

> > p_k for several k between 500,000 and 20,000,000.


> The quesion of a memory blowup with the treefolding merge still remains.

> For some reason using its second copy for a feeder doesn't reduce the

> memory (as reported by standalone compiled program, GHCi reported values

> are useless) - it causes it to increase twice.....


I have a partial solution. The big problem is that the feeder holds on to the beginning of comps while the main runner holds on to a later part. Thus the entire segment between these two points must be in memory.

So have two lists of composites (yeah, you know that, but it didn't work so far).

But you have to force the compiler not to share them: enter -fno-cse.

The attached code does that (I've also expanded the wheel), it reduces the memory requirements much (a small part is due to the larger wheel, a factor of ~5 due to the non-sharing).

It still uses much more memory than the PQ, and while the PQ's memory requirements grow very slowly, the tree-fold merge's still grow rather fast (don't go much beyond the 10,000,000th prime), I'm not sure why.

Attachment: V13Primes.hs
Description: Text Data

Haskell-Cafe mailing list
[email protected]
<Prev in Thread] Current Thread [Next in Thread>