java-dev@lucene.apache.org

Re: Proposal about Version API "relaxation"

Subject: Re: Proposal about Version API "relaxation"
From: Shai Erera
Date: Thu, 15 Apr 2010 14:52:57 +0300
Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of a version upgrade. Such a decision is key to Lucene adoption in large-scale projects. It's not at all about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable.

Up until now, Lucene migrated my segments gradually, and before I upgraded from X+1 to X+2 I could run optimize() to ensure my index would be readable by X+2. I don't think I can agree to giving that up myself, let alone convince all the stakeholders in my company, who adopt Lucene today in numerous projects, to let go of such a capability. We've been there before (requiring reindexing on version upgrades) w/ some offerings, and customers simply didn't like it and were forced to use an enterprise-class search engine which offered less (and didn't use Lucene, up until recently!). Until we moved to Lucene ...

What's Solr's take on it?

I differentiate between structural changes and runtime changes. I, myself, don't mind if we let go of back-compat support for runtime changes, such as those generated by analyzers. There are a couple of reasons, the most important being (1) these are not so frequent (but neither are index structural changes) and (2) that's a decision I, as the application developer, make - whether or not to use a newer version of an Analyzer. I don't mind working hard to make a 2.x Analyzer version work in the 3.x world, but I cannot make a 2.x index readable by a 3.x Lucene jar if the latter doesn't support it. That's the key difference, in my mind, between the two. I can choose not to upgrade at all to a newer analyzer version ... but I don't want to be forced to stay w/ older Lucene versions and features because of that ... well, people might say that it's not Lucene's problem, but I beg to differ. Lucene benefits from wider and faster adoption, and we rely on new features being adopted quickly. That might be jeopardized if we let go of that strong capability, IMO.

What we can do is provide an index migration tool ... but personally I don't see what the difference is between that and gradually migrating segments as they are merged, code-wise. I mean - it has to be the same code. Except that an index migration tool may take days to complete on a very large index, while the ongoing migration takes ~0 time when you come to upgrade to a newer Lucene release.
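
To make the "same code" point concrete, here is a toy sketch in Java. Everything in it (SegmentMigrationSketch, Segment, rewrite, mergeSome, migrateAll) is a hypothetical name made up for illustration; it is not Lucene's segment or merge API. It only assumes that a merge reads segments in any supported format and writes the result in the current one.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: hypothetical types, not Lucene's segment or merge API. */
final class SegmentMigrationSketch {
    /** Assumed format number written by the "new" release. */
    static final int CURRENT_FORMAT = 3;

    /** A segment tagged with the format version it was written in. */
    static final class Segment {
        final String name;
        final int format;
        final int docCount;

        Segment(String name, int format, int docCount) {
            this.name = name;
            this.format = format;
            this.docCount = docCount;
        }
    }

    /** The single piece of conversion code: reads old or new segments, writes the current format. */
    static Segment rewrite(String newName, List<Segment> inputs) {
        int docs = 0;
        for (Segment s : inputs) {
            docs += s.docCount;
        }
        return new Segment(newName, CURRENT_FORMAT, docs);
    }

    /** Gradual path: an ordinary merge upgrades whichever segments it happens to pick (here, the first few). */
    static List<Segment> mergeSome(List<Segment> index, int count) {
        int n = Math.min(count, index.size());
        List<Segment> result = new ArrayList<>();
        if (n > 0) {
            result.add(rewrite("merged", index.subList(0, n)));
        }
        result.addAll(index.subList(n, index.size()));
        return result;
    }

    /** One-shot path: a migration tool is the same rewrite applied to every segment at once. */
    static List<Segment> migrateAll(List<Segment> index) {
        return mergeSome(index, index.size());
    }

    /** True when no segment in an old format remains. */
    static boolean allCurrent(List<Segment> index) {
        for (Segment s : index) {
            if (s.format != CURRENT_FORMAT) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Segment> index = List.of(
            new Segment("_0", 2, 100),
            new Segment("_1", 2, 50),
            new Segment("_2", 3, 10));
        System.out.println("after one small merge, all current? " + allCurrent(mergeSome(index, 1)));
        System.out.println("after full migration, all current? " + allCurrent(migrateAll(index)));
    }
}
```

In this model the two paths differ only in how much of the index is rewritten per call, which is the cost difference described above: the one-shot path pays for the whole index at upgrade time, while the gradual path spreads it over normal merges.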

And the note about Terrier requiring reindexing ... well, I wouldn't call that a strength but a damn big weakness, IMO.

About the release pace, I don't think we can suddenly release every 2 years ... it makes people think the project is stuck. And some out there are not so fond of using a 'trunk' version and releasing it w/ their products, because trunk is perceived as ongoing development (which it is) and thus less stable, or likely to change, and most importantly harder to maintain (as the consumer). So I still think we should release often rather than rarely.

That's why I wanted to differentiate X and Y, but I don't mind if we release just X ... if that's so important to people. BTW Mike, Eclipse's releases are like Lucene's, and in fact I don't know of many projects that release just X ... many of them seem to release X.Y.

I don't understand why we're treating this as an "all or nothing" thing. We can let go of API back-compat, which clearly has no effect on index structure and content. We can even let go of index runtime changes for all I care. But I simply don't think we can let go of index structure back-support.

Shai

On Thu, Apr 15, 2010 at 1:12 PM, Michael McCandless <lucene@xxxxxxxxxxxxxxxxxx> wrote:
2010/4/15 Shai Erera <serera@xxxxxxxxx>:

> One way is to define 'major' as X and minor X.Y, and another is to define major as 'X.Y' and minor as 'X.Y.Z'. I prefer the latter but don't have any strong feelings against the former.

I prefer X.Y, ie, changes to Y only is a minor release (mostly bug
fixes but maybe small features); changes to X is a major release.  I
think that's more "standard", ie, people will generally grok that 3.3
-> 4.0 is a major change but 3.3 -> 3.4 isn't.

So this proposal would change how Lucene releases are numbered.  Ie,
the next release would be 4.0.  Bug fixes / small features would then
be 4.1.

> Index back compat should be maintained between major releases, like it is today, STRUCTURE-wise.

No... in the proposal, you must re-index on upgrading to the next
major release (3.x -> 4.0).
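
As a rough illustration of the numbering scheme and the re-index rule described above, here is a minimal Java sketch. ReleaseNumbers, classify and reindexRequired are hypothetical names invented for this example; they are not part of Lucene, and the rule encoded here is only the proposal as stated in this thread.

```java
/** Hypothetical helper for illustration; not part of Lucene. */
final class ReleaseNumbers {
    enum Kind { SAME, MINOR, MAJOR }

    /** Parses a release number of the form "X.Y" into {X, Y}. */
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected X.Y, got: " + version);
        }
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    /** A change in X is a major release; a change in Y only is a minor release. */
    static Kind classify(String from, String to) {
        int[] a = parse(from);
        int[] b = parse(to);
        if (a[0] != b[0]) {
            return Kind.MAJOR;
        }
        if (a[1] != b[1]) {
            return Kind.MINOR;
        }
        return Kind.SAME;
    }

    /** Under the proposal as described here, crossing a major release means re-indexing. */
    static boolean reindexRequired(String from, String to) {
        return classify(from, to) == Kind.MAJOR;
    }

    public static void main(String[] args) {
        System.out.println("3.3 -> 3.4: " + classify("3.3", "3.4"));
        System.out.println("3.3 -> 4.0: " + classify("3.3", "4.0"));
    }
}
```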

I think supporting old indexes badly (what we do today) is not a
great solution.  EG on upgrading to 3.1 you'll immediately see a
search perf hit since the flex emulation layer is running.  It's a
trap.

It's this freedom, I think, that'd let us drop Version entirely.  It's
the back-compat of the index that is the major driver for having
Version today (eg so that the analyzers can produce tokens matching
your old index).

EG Terrier seems to have the same requirement -- note the bold "All
indexes must be rebuilt":

 http://terrier.org/docs/current/whats_new.html

Also, Lucene isn't a primary store (like a filesystem or a database).
We expect that your "true" content still lives somewhere else.  So why
do we go to such great lengths to keep the index format for so
long...?

> BTW, w/ all that - does it mean 'backwards' can be dropped, or at least test-backwards activated only on a branch which we decide needs it? That'll be really great.

I think the stable branches (2.x, 3.x) would have backwards tests
created the moment they are branched, to make sure as we fix bugs /
backport minor features we don't break back compat, along that branch.

I don't think we need the .Z part of a release numbering -- our
numbers would look like most other software projects.  3.0 is a major
release, 3.1, 3.2, 3.3 fix bugs / add minor features, etc.

If flex were done in this world I would've finished it a lot faster!  A
huge amount of time went into the cross back compat emulation layers
(pre-flex APIs and pre-flex index).

> Also, we will still need to maintain the Backwards section in CHANGES (or move it to API Changes), to help people upgrade from release to release.

I think we'd create a migration guide to explain how apps migrate to
the next major release (this is what other projects do), eg like this:

 http://community.jboss.org/wiki/Hibernate3MigrationGuides#A42

> Unless you're telling me we'll start releasing major releases more often?

I think this is mostly orthogonal?  We could still do major releases
frequently or rarely with this model... however, it would give us more
freedom to do major releases frequently (vs today where every major
release sets a scary back-compat-burden stake in the ground).

> I don't see why anyone would release a 3.x after 4.0 is out unless someone really wants to work hard on maintaining back-compat of some features

I think the minor releases on the stable branch (3.1, 3.2, 3.3) would
be mostly bug fixes, but maybe also minor features if
contributors/developers had the itch to make them available on the
stable (3.x) branch.  How much dev happens on the stable branch can be
largely determined by itch...

Mike


