java-user@lucene.apache.org
[Top] [All Lists]

RE: Using org.apache.lucene.analysis.compound

Subject: RE: Using org.apache.lucene.analysis.compound
From: Benjamin Douglas
Date: Wed, 21 Oct 2009 15:09:58 -0400
OK, that makes sense. So I just need to add all of the sub-compounds that are 
real words at posIncr=0, even if they are combinations of other sub-compounds.

Thanks!

-----Original Message-----
From: Robert Muir [mailto:rcmuir@xxxxxxxxx] 
Sent: Wednesday, October 21, 2009 11:49 AM
To: java-user@xxxxxxxxxxxxxxxxx
Subject: Re: Using org.apache.lucene.analysis.compound

yes, your dictionary :)

if Ãberwachungsgesetz is a real word, add it to your dictionary.

for example, if your dictionary is { "Rind", "Fleisch", "Draht", "Schere",
"Gesetz", "Aufgabe", "Ãberwachung" }, and you index
RindfleischÃberwachungsgesetz, then all 3 queries will have the same score.
but if you expand the dictionary to { "Rind", "Fleisch", "Draht", "Schere",
"Gesetz", "Aufgabe", "Ãberwachung", "Ãberwachungsgesetz" }, then this makes
a big difference.

all 3 queries will still match, but Ãberwachungsgesetz will have a higher
score. this is because now things are analyzed differently:
RindfleischÃberwachungsgesetz will be decompounded as before, but with an
additional token: Ãberwachungsgesetz.
so back to your original question, these 'concatenations' of multiple
components, yes compounds will do that, if they are real words. but it won't
just make them up.

"Ãberwachungsgesetz"
0.23013961 = (MATCH) sum of:
  0.057534903 = (MATCH) weight(field:Ãberwachungsgesetz in 0), product of:
    0.5 = queryWeight(field:Ãberwachungsgesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:Ãberwachungsgesetz in 0), product
of:
      1.0 = tf(termFreq(field:Ãberwachungsgesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.057534903 = (MATCH) weight(field:Ãberwachung in 0), product of:
    0.5 = queryWeight(field:Ãberwachung), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:Ãberwachung in 0), product of:
      1.0 = tf(termFreq(field:Ãberwachung)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.057534903 = (MATCH) weight(field:Ãberwachungsgesetz in 0), product of:
    0.5 = queryWeight(field:Ãberwachungsgesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:Ãberwachungsgesetz in 0), product
of:
      1.0 = tf(termFreq(field:Ãberwachungsgesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.057534903 = (MATCH) weight(field:gesetz in 0), product of:
    0.5 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)

"gesetzÃberwachung"
0.064782135 = (MATCH) sum of:
  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
    0.2814906 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.032391068 = (MATCH) weight(field:Ãberwachung in 0), product of:
    0.2814906 = queryWeight(field:Ãberwachung), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:Ãberwachung in 0), product of:
      1.0 = tf(termFreq(field:Ãberwachung)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)

"fleischgesetz"
0.064782135 = (MATCH) sum of:
  0.032391068 = (MATCH) weight(field:fleisch in 0), product of:
    0.2814906 = queryWeight(field:fleisch), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:fleisch in 0), product of:
      1.0 = tf(termFreq(field:fleisch)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
    0.2814906 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)




On Wed, Oct 21, 2009 at 1:40 PM, Benjamin Douglas
<bbdouglas@xxxxxxxxxxxxx>wrote:

> Thanks for all of the answers so far!
>
> Paul's question is similar to another aspect I am curious about:
>
> Given the way the sample word is analyzed, is there anything in the scoring
> mechanism that would rank "Ãberwachungsgesetz" higher than
> "gesetzÃberwachung" or "fleischgesetz"?
>
>

-- 
Robert Muir
rcmuir@xxxxxxxxx
<Prev in Thread] Current Thread [Next in Thread>