java-user@lucene.apache.org
[Top] [All Lists]

Re: problems with deleteDocuments

Subject: Re: problems with deleteDocuments
From: "Erick Erickson"
Date: Sun, 8 Jul 2007 18:18:48 -0400
First, let me say that I ran a few tests to determine the behavior, so
it's entirely possible someone who actually understands the code will
tell me I'm all wet.

The problem here is that for every scenario in which deleting on partial
field matches would be good, I can create one where it would be bad. Or
worse, disastrous <G>....

The issue isn't deleting multiple documents at once, it's deleting on
partial term matches. In your example, you could have, say, two
IDs for the attachments, one the unique ID and one the parent ID. This
would allow you to remove all the attachments in one delete, and use
a separate delete for the original. Or you could have a "group" id
for an  e-mail and it's attachments, and delete them all with a single
delete on the "group" term. Or.....

But I'd *much* rather have a system that required me to jump through
a hoop when I wanted this kind of behavior than have a system that
allowed me to shoot myself in the foot. Or blow off the whole leg. And
deleting on, say, a text field with a partial term match on "the" is just
too terrible to contemplate <G>....

Best
Erick


On 7/8/07, Nadav Har'El <nyh@xxxxxxxxxxxxxxxxxxx> wrote:

On Wed, Jul 04, 2007, Erick Erickson wrote about "Re: problems with
deleteDocuments":
> Consider what would happen otherwise. Say you have documents
> with the following values for a field (call it blah).
> some data
> some data I put in the index
> lots of data
> data
>
> Then I don't want deleting on the term blah:data to remove all
> of them. Which seems to be what you're asking. Even if
> you restricted things to "phrases", then deleting on the term
> 'blah:some data' would remove two documents.
>
> So, while UN_TOKENIZED isn't a *requirement*, exact total term
> matches *is* the requirement. By that, I meant that whatever
> goes into the field must not be broken into pieces by the indexing
> tokenizer for deletes to work as you expect.

I disagree, and frankly, am very surprised that "exact total term matches"
is
actually a requirement (I never tried it, so you may be absolutely right,
I
just hope you aren't).

Let me give you just one example where id fields containing multiple
words,
and the ability for a delete query to match several documents, are useful.

Consider an application for indexing emails with attachments. The email
text,
and each document attachment, is indexed as a separate document. When an
email is deleted, we also need to delete its attachments. How shall we do
this? One simple implementation is to have an "id" field for each
document;
The email text document will have a unique id, and the attachment document
will have two ids: its own unique id, and the containing email's id. When
we need to remove an email and all its attachments, we just remove all
documents that match the email's id - and this will include the main text
and the attachments.

By the way, the method is called "deleteDocuments" - doesn't that imply
that it's perfectly acceptable to delete many documents with one term?


--
Nadav Har'El                        |      Sunday, Jul  8 2007, 22 Tammuz
5767
IBM Haifa Research
Lab              |-----------------------------------------
                                    |I am not a complete idiot - some
parts
http://nadav.harel.org.il           |are missing.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx


<Prev in Thread] Current Thread [Next in Thread>