zodb-dev@zope.org

Subject: Re: [ZODB-Dev] ZODB Ever-Increasing Memory Usage even with cache-size-bytes
From: Ryan Noon
Date: Mon, 10 May 2010 17:16:20 -0700
Hi all,

I've incorporated everybody's advice, but I still can't get memory to obey cache-size-bytes.  I'm using the new 3.10 from PyPI (the same behavior happens on the server, where I was using 3.10 from the new Lucid apt repos).

I'm going through a mapping from one long integer "docid" to a collection of long integers ("wordset") and trying to invert it into a mapping from each "wordid" in those wordsets to the set of original docids ("docset").

I've even tried calling cacheMinimize after every single docset append, but the memory reported by the OS never goes down and the process continues to allocate like crazy.

I'm wrapping ZODB in a "ZMap" class that just forwards all the dictionary methods to the ZODB root and allows easy interchangeability with my old sqlite OODB abstraction.

Here's the latest version of my code (lightly instrumented; see below):

        try:
            max_docset_size = 0
            for docid, wordset in docid_to_wordset.iteritems():
                for wordid in wordset:
                    if wordid_to_docset.has_key(wordid):
                        docset = wordid_to_docset[wordid]
                    else:
                        docset = array('L')
                    docset.append(docid)
                    if len(docset) > max_docset_size:
                        max_docset_size = len(docset)
                        print 'Max docset is now %d (owned by wordid %d)' % (max_docset_size, wordid)
                    wordid_to_docset[wordid] = docset
                    wordid_to_docset.garbage_collect()
                    wordid_to_docset.connection.cacheMinimize()
                
                n_docs_traversed += 1

                if n_docs_traversed % 100 == 1:
                    status_tick()
                if n_docs_traversed % 50000 == 1:
                    self.do_commit()

            self.do_commit()
        except KeyboardInterrupt, ex:
            self.log_write('Caught keyboard interrupt, committing...')
            self.do_commit()

I'm keeping track of the largest docset (which would be the largest single thing that can't be paged out), and it's only 10,152 longs (8 bytes each, according to the array module's documentation) at the point 75 seconds into the operation, when the process has allocated 224 MB (on a cache_size_bytes of 64*1024*1024).
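
For scale, the figures in the paragraph above work out as follows. This is only back-of-the-envelope arithmetic on the numbers quoted in this message. It assumes "MB" means MiB and that each array item is exactly 8 bytes, and the variable names are made up for illustration.

```python
# Figures quoted in the message; all names here are illustrative only.
ITEM_BYTES = 8                        # assumed size of one array('L') item
largest_docset_len = 10152            # longs in the largest docset so far
cache_target = 64 * 1024 * 1024       # the cache_size_bytes setting
observed = 224 * 1024 * 1024          # process allocation, assuming MB == MiB

docset_bytes = largest_docset_len * ITEM_BYTES
share_of_target = docset_bytes / float(cache_target)
overshoot = observed / float(cache_target)

print("largest docset: %d bytes (%.2f%% of the cache target)"
      % (docset_bytes, 100 * share_of_target))
print("observed allocation is %.1fx the cache target" % overshoot)
```

On these numbers the largest single docset is far too small to account for the gap on its own.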


On a lark I just made an empty ZMap in the interpreter and filled it with 1M unique strings.  It took up something like 190 MB.  I committed it and memory usage went up to 420 MB.  I then ran cacheMinimize (memory stayed at 420 MB).  Then I inserted another 1M entries (strings keyed on ints) and memory usage went up to 820 MB.  Then I committed, and memory usage dropped to ~400 MB and then went back up to 833 MB.  Then I ran cacheMinimize again and memory usage stayed there.  Does this example (totally decoupled from any of my other operations) make sense to experienced ZODB people?  I really have no functional mental model of ZODB's memory usage patterns.  I love using it, but I really want to find some way to get its allocations under control.  I'm currently running this on a MacBook Pro, but it seems to behave the same way on Windows and Linux.

I really appreciate all of the help so far, and if there're any other pieces of my code that might help please let me know.

Cheers,
Ryan

On Mon, May 10, 2010 at 3:18 PM, Jim Fulton <jim@xxxxxxxx> wrote:
On Mon, May 10, 2010 at 5:39 PM, Ryan Noon <rmnoon@xxxxxxxxx> wrote:
> First off, thanks everybody.  I'm implementing and testing the suggestions
> now.  When I said ZODB was more complicated than my solution I meant that
> the system was abstracting a lot more from me than my old code (because I
> wrote it and knew exactly how to make the cache enforce its limits!).
>
>> > The first thing to understand is that options like cache-size and
>> > cache-size-bytes are suggestions, not limits. :)  In particular, they
>> > are only enforced:
>> >
>> > - at transaction boundaries,
>
> If it's already being called at transaction boundaries how come memory usage
> doesn't go back down to the quota after the commit (which is only every 25k
> documents?).

Because Python generally doesn't return memory back to the OS. :)

It's also possible you have a problem with one of your data
structures.  For example if you have an array that grows effectively
without bound, the array will have to be in memory, no matter how big
it is.  Also, if the persistent object holding the array isn't seen as
changed, because you're appending to the array, then the size of the
array won't be reflected in the cache size. (The size of objects in
the cache is estimated from their pickle sizes.)
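
The point about pickle sizes can be illustrated with the standard library alone. The sketch below pickles an array, appends to it in place, and pickles it again. It does not use ZODB at all: `estimate_at_last_store` is a hypothetical stand-in for whatever size a cache recorded the last time the object was stored, and `array('q')` is used only because its items are 8 bytes on every platform.

```python
import pickle
from array import array

# An array as it looked the last time it was (hypothetically) stored.
docset = array('q', range(1000))
estimate_at_last_store = len(pickle.dumps(docset))

# In-place appends: the array grows, but nothing re-pickles it.
docset.extend(range(1000, 10152))
actual_now = len(pickle.dumps(docset))

# How far a size estimate taken at the last store would now be off.
stale_by = actual_now - estimate_at_last_store

print("estimate at last store: %d bytes" % estimate_at_last_store)
print("pickle size now:        %d bytes" % actual_now)
print("estimate is stale by:   %d bytes" % stale_by)
```

If an estimate is only refreshed when the containing object is seen as changed, growth like this goes unaccounted for in the meantime.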

I assume you're using ZODB 3.9.5 or later. If not, there's a bug in
handling new objects that prevents cache suggestions from working
properly.

If you don't need list semantics, and set semantics will do, you might
consider using a BTrees.LLBTree.LLTreeSet, which provides compact
scalable persistent sets.  (If your word ids can be signed, you could
use the IIBTree variety, which is more compact.) Given the variable
name is wordset, then I assume you're dealing with sets. :)
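
A minimal sketch of the inversion using set semantics rather than appended arrays. The function name `invert` and its `make_map`/`make_set` parameters are hypothetical, and the demo uses the built-in dict and set so that it runs without ZODB. With the BTrees package installed, passing `LLBTree` and `LLTreeSet` from `BTrees.LLBTree` as the two factories should give the persistent variant, assuming the tree set's `add` behaves like the built-in set's; that substitution is not exercised here.

```python
def invert(docid_to_wordset, make_map=dict, make_set=set):
    """Build wordid -> set of docids from docid -> iterable of wordids."""
    wordid_to_docset = make_map()
    for docid, wordset in docid_to_wordset.items():
        for wordid in wordset:
            docset = wordid_to_docset.get(wordid)
            if docset is None:
                # First time this wordid is seen: start a new set for it.
                docset = make_set()
                wordid_to_docset[wordid] = docset
            # Sets ignore duplicates, so repeated words add nothing.
            docset.add(docid)
    return wordid_to_docset


# Tiny made-up input for illustration.
docs = {1: [10, 20], 2: [20, 30], 3: [20]}
index = invert(docs)
print(sorted(index[20]))
```

Keeping each docset as its own set object, rather than one ever-growing array, is what lets a scalable set type split the data into smaller pieces.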

What is wordid_to_docset? You don't show its creation.

Jim

--
Jim Fulton



--
Ryan Noon
Stanford Computer Science
BS '09, MS '10
_______________________________________________
For more information about ZODB, see the ZODB Wiki:
http://www.zope.org/Wikis/ZODB/

ZODB-Dev mailing list  -  ZODB-Dev@xxxxxxxx
https://mail.zope.org/mailman/listinfo/zodb-dev