
Re: Lucene Index backboned by DB

Subject: Re: Lucene Index backboned by DB
From: jian chen
Date: Tue, 15 Nov 2005 22:59:46 -0800

Dear All,

I have some thoughts on this issue as well.

1) It might be OK to implement retrieving field values separately for a
document. However, I think from a simplicity point of view, it might be
better to have the application code do this drudgery. Adding this feature
could complicate the nice and simple design of Lucene without much benefit.

2) The application could split a document into several documents: for
example, one document mainly for indexing, and other documents for storing
the binary values of different fields. Then, given the relevant doc id, the
associated binary value for a particular field could be loaded very fast
with just a disk lookup (looking up the fdx file).

This way, only the relevant field is loaded into memory rather than all of
the fields for a doc. There is no change on the Lucene side, only some more
work for the application code.
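
As a rough illustration of why the lookup in (2) can be cheap: as I
understand it, the field index (.fdx) keeps one fixed-width pointer per
document into the field data, so locating a document's stored value is a
single read at a position computed from the doc id. The sketch below is not
Lucene code. PointerIndexedStore and its two in-memory buffers are
hypothetical stand-ins for an index file and a data file, and the 4-byte
length prefix is my own assumption for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration (not Lucene code): fixed-width pointers into a data area. */
final class PointerIndexedStore {
    private final ByteBuffer index; // 8 bytes per doc: start offset of its value in data
    private final ByteBuffer data;  // per doc: 4-byte length, then the value bytes

    PointerIndexedStore(byte[][] values) {
        int total = 0;
        for (byte[] v : values) {
            total += Integer.BYTES + v.length;
        }
        index = ByteBuffer.allocate(values.length * Long.BYTES);
        data = ByteBuffer.allocate(total);
        for (byte[] v : values) {
            index.putLong(data.position()); // remember where this doc's value starts
            data.putInt(v.length).put(v);
        }
    }

    /** One fixed-position read in the index, then one read in the data area. */
    byte[] load(int docId) {
        int offset = (int) index.getLong(docId * Long.BYTES);
        byte[] value = new byte[data.getInt(offset)];
        for (int i = 0; i < value.length; i++) {
            value[i] = data.get(offset + Integer.BYTES + i);
        }
        return value;
    }

    public static void main(String[] args) {
        PointerIndexedStore store = new PointerIndexedStore(new byte[][] {
            "binary value of doc 0".getBytes(StandardCharsets.UTF_8),
            "binary value of doc 1".getBytes(StandardCharsets.UTF_8),
        });
        // Only the requested document's value is touched.
        System.out.println(new String(store.load(1), StandardCharsets.UTF_8));
    }
}
```

With real files the two buffers would become seeks on two files, but the
point is the same: no other document's fields need to be read.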

My view is that a search library (or any library, in general) should be
small and efficient. Since it is used by a lot of applications, any
additional feature could potentially hurt its robustness and expose it to
performance drawbacks.

Any criticism or comments are welcome.

Jian

On 11/15/05, Robert Kirchgessner <[email protected]> wrote:
>
> Hi,
>
> a discussion in
>
> http://issues.apache.org/jira/browse/LUCENE-196
>
> might be of interest to you.
>
> Did you think about storing the large pieces of documents
> in a database to reduce the size of the Lucene index?
>
> I think there are good reasons for adding support for
> storing fields in separate files:
>
> 1. One could define a binary field of fixed length and store it
> in a separate file, then load it into memory and have fast
> access to the field contents.
>
> A use case might be: store a calendar date (YYYY-MM-DD)
> in three bytes, 4 bits for months, 5 bits for days and up to
> 15 bits for years. If you want to retrieve hits sorted by date
> you can load the fields file of size (3 * documents in index) bytes
> and support sorting by date without accessing the hard drive
> for reading dates.
>
> 2. One could store document contents in a separate
> file and fields of small size like title and some metadata
> in the way they are stored now. It could speed up access to
> fields. It would be interesting to know whether you gain
> significant performance by leaving the big chunks out, i.e.
> not storing them in the index.
>
> In my opinion 1. is the most interesting case: storing some
> binary fields (dates, prices, length, any numeric metrics of
> documents) would enable *really* fast sorting of hits.
>
> Any thoughts about this?
>
> Regards,
>
> Robert
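
The three-byte date layout from Robert's point 1 can be sketched roughly as
follows. This is only an illustration under my own assumptions: PackedDates
and its methods are hypothetical names, not Lucene API, and I put the year in
the top 15 bits (then 4 bits of month, 5 bits of day) so that the packed
integers compare in chronological order. The table is a plain byte array of
3 * numDocs bytes, standing in for the separate fields file loaded into
memory.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;

/** Hypothetical helper: calendar dates packed into 3 bytes per document. */
final class PackedDates {
    static final int BYTES_PER_DOC = 3;

    /** Layout (24 bits): 15-bit year, 4-bit month, 5-bit day. */
    static int pack(int year, int month, int day) {
        if (year < 0 || year > 32767 || month < 1 || month > 12 || day < 1 || day > 31) {
            throw new IllegalArgumentException("date out of range");
        }
        return (year << 9) | (month << 5) | day;
    }

    /** Stores the packed date of docId, most significant byte first. */
    static void write(byte[] table, int docId, int packed) {
        int o = docId * BYTES_PER_DOC;
        table[o] = (byte) (packed >>> 16);
        table[o + 1] = (byte) (packed >>> 8);
        table[o + 2] = (byte) packed;
    }

    /** Reads the packed date of docId straight from memory. */
    static int read(byte[] table, int docId) {
        int o = docId * BYTES_PER_DOC;
        return ((table[o] & 0xFF) << 16) | ((table[o + 1] & 0xFF) << 8) | (table[o + 2] & 0xFF);
    }

    /** Returns a copy of the hit doc ids ordered by date, oldest first. */
    static Integer[] sortByDate(byte[] table, Integer[] hits) {
        Integer[] sorted = hits.clone();
        Arrays.sort(sorted, Comparator.comparingInt((Integer id) -> read(table, id)));
        return sorted;
    }

    static String format(int packed) {
        return String.format(Locale.ROOT, "%04d-%02d-%02d",
            packed >>> 9, (packed >>> 5) & 0xF, packed & 0x1F);
    }

    public static void main(String[] args) {
        byte[] table = new byte[3 * BYTES_PER_DOC]; // 3 bytes * documents in index
        write(table, 0, pack(2005, 11, 15));
        write(table, 1, pack(1999, 12, 31));
        write(table, 2, pack(2005, 1, 2));
        for (Integer docId : sortByDate(table, new Integer[] {0, 1, 2})) {
            System.out.println(docId + " " + format(read(table, docId)));
        }
    }
}
```

Sorting then only compares integers that are already in memory, which is the
"no hard drive access" property described above.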
>
>
>
> We have a similar problem
>
> On Tuesday, 15 November 2005 23:23, Karel Tejnora wrote:
> > Hi all,
> > we are using Lucene 1.4.3 in our testing application. Thank you guys
> > for that great job.
> > We have an index of around 12 GiB in one (merged) file. Retrieving hits
> > takes a nicely small amount of time, but reading the stored fields takes
> > 10-100 times longer, I think because all the fields are read.
> > I would like to try implementing the Lucene index files as tables in a
> > DB with some lazy field loading. Searching the web, I have found only an
> > implementation of store.Directory (bdb), but it only holds data as
> > binary streams. That technique will not be very helpful because BLOB
> > operations are not fast. On the other hand, I will lose some freedom in
> > the variability of document fields, but I can avoid a lot of the
> > skipping and many open files. Also, IndexWriter could have document/term
> > locking granularity.
> > So I think the way forward is to extend IndexWriter / IndexReader and
> > have my own implementation of the index.Segment* classes. Is this the
> > best way, or am I missing something about how to achieve this?
> > If it is a bad idea, I will be happy to hear about other possibilities.
> >
> > I would also like to join the development of Lucene. Are there some
> > pointers on how to start?
> >
> > Thanks for reading this,
> > sorry if I made any mistakes
> >
> > Karel
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]