java-user@lucene.apache.org
[Top] [All Lists]

Re: Phrase search using quotes -- special Tokenizer

Subject: Re: Phrase search using quotes -- special Tokenizer
From: "Erick Erickson"
Date: Sat, 2 Sep 2006 09:43:44 -0400
Well, you're flying blind. Is the behavior rooted in the indexing or
querying? Since you can't answer that, you're reduced to trying random
things hoping that one of them works. A little like voodoo. I've wasted
faaaaarrrrrr too much time trying to solve what I was *sure* was the problem
only to find it was somewhere else (the last place I look, of course) <G>...

Using Luke on a RAMDir. No, I don't know how to, but it should be a simple
thing to write the index to an FSDir at the same time you create your RAMDir
and use Luke then. This is debugging, after all.

I'd be really, really, really reluctant to modify the query parser and/or
the tokenizer, since whenever I've been tempted it's usually because I don't
understand the tools already provided. Then I have to maintain my custom
code. Which sucks. Although it sure feels more productive to hack a bunch of
code and get something that works 90% of the time, then spend weeks making
the other 10% work than taking two days to find the 3 lines you *really*
need <G>.

Have you thought of a PatternAnalyzer? It takes a regular expression as the
tokenizer and  (from the Javadoc)
<<< Efficient Lucene analyzer/tokenizer that preferably operates on a String
rather than a Reader<http://java.sun.com/j2se/1.4/docs/api/java/io/Reader.html>,
that can flexibly separate text into terms via a regular expression
Pattern<http://java.sun.com/j2se/1.4/docs/api/java/util/regex/Pattern.html>(with
behaviour identical to
String.split(String)<http://java.sun.com/j2se/1.4/docs/api/java/lang/String.html#split%28java.lang.String%29>),
and that combines the functionality of
LetterTokenizer<file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/analysis/LetterTokenizer.html>,
LowerCaseTokenizer<file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/analysis/LowerCaseTokenizer.html>,
WhitespaceTokenizer<file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/analysis/WhitespaceTokenizer.html>,
StopFilter<file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/analysis/StopFilter.html>into
a single efficient multi-purpose class.>>>

One word of caution, the regular expression consists of expressions that
*break* tokens, not expressions that *form* words, which threw me at first.
Just like the doc says, like splitstring <G>.... This is in 2.0, although I
*believe* it's also in the contrib section of 1.9 (or is in the regular API,
I forget).

Best
Erick

On 9/1/06, Philip Brown <pmb@xxxxxxxxxx> wrote:


No, I've never used Luke.  Is there an easy way to examine my RAMDirectory
index?  I can create the index with no quoted keywords, and when I search
for a keyword, I get back the expected results (just can't search for a
phrase that has whitespace in it).  If I create the index with phrases in
quotes, then when I search for anything in double quotes, I get back
nothing.  If I create the index with everything in quotes, then when I
search for anything by the keyword field, I get nothing, regardless of
whether I use quotes in the query string or not.  (I can get results back
by
searching on other fields.)  What do you think?

Philip


Erick Erickson wrote:
>
> OK, I've gotta ask. Have you examined your index with Luke to see if
what
> you *think* is in the index actually *is*???
>
> Erick
>
> On 9/1/06, Philip Brown <pmb@xxxxxxxxxx> wrote:
>>
>>
>> Interesting...just ran a test where I put double quotes around
everything
>> (including single keywords) of source text and then ran searches for a
>> known
>> keyword with and without double quotes -- doesn't find either time.
>>
>>
>> Mark Miller-5 wrote:
>> >
>> > Sorry to hear you're having trouble. You indeed need the double
quotes
>> in
>> > the source text. You will also need them in the query string. Make
sure
>> > they
>> > are in both places. My machine is hosed right now or I would do it
for
>> you
>> > real quick. My guess is that I forgot to mention...no only do you
need
>> to
>> > add the <QUOTED> definiton to the TOKEN section, but below that you
>> will
>> > find the grammer...you need to add <QUOTED> to the grammer. If you
look
>> > how
>> > <NUM> and <APOSTROPHE> are done you will prob see what you should do.
>> If
>> > not, my machine should be back up tomarrow...
>> >
>> > - Mark
>> >
>> > On 9/1/06, Philip Brown <pmb@xxxxxxxxxx> wrote:
>> >>
>> >>
>> >> Well, I tried that, and it doesn't seem to work still.  I would be
>> happy
>> >> to
>> >> zip up the new files, so you can see what I'm using -- maybe you can
>> get
>> >> it
>> >> to work.  The first time, I tried building the documents without
>> quotes
>> >> surrounding each phrase.  Then, I retried by enclosing every phrase
>> >> within
>> >> double quotes.  Neither seemed to work.  When constructing the query
>> >> string
>> >> for the search, I always added the double quotes (otherwise, it'd
>> think
>> >> it
>> >> was multiple terms).  (I didn't even test the underscore and
>> hyphenated
>> >> terms.)  I thought Lucene was (sort of by default) set up to search
>> >> quoted
>> >> phrases.  From http://lucene.apache.org/java/docs/api/index.html -->
A
>> >> Phrase is a group of words surrounded by double quotes such as
"hello
>> >> dolly".  So, this should be easy, right?  I must be missing
something
>> >> stupid.
>> >>
>> >> Thanks,
>> >>
>> >> Philip
>> >>
>> >>
>> >> Mark Miller-5 wrote:
>> >> >
>> >> > So this will recognize anything in quotes as a single token and
'_'
>> and
>> >> > '-' will not break up words. There may be some repercussions for
the
>> >> NUM
>> >> > token but nothing I'd worry about. maybe you want to use Unicode
for
>> >> '-'
>> >> > and '_' as well...I wouldn't worry about it myself.
>> >> >
>> >> > - Mark
>> >> >
>> >> >
>> >> > TOKEN : {                      // token patterns
>> >> >
>> >> >   // basic word: a sequence of digits & letters
>> >> >   <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
>> >> >
>> >> > | <QUOTED:     "\"" (~["\""])+ "\"">
>> >> >
>> >> >   // internal apostrophes: O'Reilly, you're, O'Reilly's
>> >> >   // use a post-filter to remove possesives
>> >> > | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
>> >> >
>> >> >   // acronyms: U.S.A., I.B.M., etc.
>> >> >   // use a post-filter to remove dots
>> >> > | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
>> >> >
>> >> >   // company names like AT&T and Excite@Home.
>> >> > | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
>> >> >
>> >> >   // email addresses
>> >> > | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
>> >> > (("."|"-") <ALPHANUM>)+ >
>> >> >
>> >> >   // hostname
>> >> > | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
>> >> >
>> >> >   // floating point, serial, model numbers, ip addresses, etc.
>> >> >   // every other segment must have at least one digit
>> >> > | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>> >> >        | <HAS_DIGIT> <P> <ALPHANUM>
>> >> >        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>> >> >        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>> >> >        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P>
>> <HAS_DIGIT>)+
>> >> >        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P>
>> <ALPHANUM>)+
>> >> >         )
>> >> >   >
>> >> > | <#P: ("_"|"-"|"/"|"."|",") >
>> >> > | <#HAS_DIGIT:                      // at least one digit
>> >> >     (<LETTER>|<DIGIT>)*
>> >> >     <DIGIT>
>> >> >     (<LETTER>|<DIGIT>)*
>> >> >   >
>> >> >
>> >> > | < #ALPHA: (<LETTER>)+>
>> >> > | < #LETTER:                      // unicode letters
>> >> >       [
>> >> >        "\u0041"-"\u005a",
>> >> >        "\u0061"-"\u007a",
>> >> >        "\u00c0"-"\u00d6",
>> >> >        "\u00d8"-"\u00f6",
>> >> >        "\u00f8"-"\u00ff",
>> >> >        "\u0100"-"\u1fff",
>> >> >        "-", "_"
>> >> >       ]
>> >> >   >
>> >> >
>> >> >
>> ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
>> >> > For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx
>> >> >
>> >> >
>> >> >
>> >>
>> >> --
>> >> View this message in context:
>> >>
>>
http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920
>> >> Sent from the Lucene - Java Users forum at Nabble.com.
>> >>
>> >>
>> >>
---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
>> >> For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx
>> >>
>> >>
>> >
>> >
>>
>> --
>> View this message in context:
>>
http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6107649
>> Sent from the Lucene - Java Users forum at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
>> For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx
>>
>>
>
>

--
View this message in context:
http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6109067
Sent from the Lucene - Java Users forum at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx


<Prev in Thread] Current Thread [Next in Thread>