java-user@lucene.apache.org

Re: Document Similarities lucene(particularly using doc id's)

Subject: Re: Document Similarities lucene(particularly using doc id's)
From: Grant Ingersoll
Date: Mon, 20 Aug 2007 06:47:27 -0400
You would have to write it yourself, as it doesn't exist in Lucene (but it could be a useful contribution). The easiest version is probably cosine similarity, described at http://en.wikipedia.org/wiki/Vector_space_model

Essentially, you have two term vectors, and you need to calculate the cosine of the angle between them. You can then repeat this for each pair of documents in your sets.
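
As a rough sketch of that calculation (this is not an existing Lucene API), assume each document's term vector has already been unpacked into a String[] of terms and a parallel int[] of frequencies, which is the shape TermFreqVector.getTerms() and getTermFrequencies() hand back. The class name TermVectorCosine and the method cosine are made up for illustration, and the weights are raw term frequencies with no idf applied:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative helper (hypothetical, not part of Lucene): cosine of two term-frequency vectors. */
final class TermVectorCosine {

    private TermVectorCosine() {}

    /**
     * Each vector is a pair of parallel arrays: terms[i] occurs freqs[i] times.
     * Terms are assumed to be unique within a vector.
     */
    static double cosine(String[] termsA, int[] freqsA, String[] termsB, int[] freqsB) {
        if (termsA.length != freqsA.length || termsB.length != freqsB.length) {
            throw new IllegalArgumentException("terms and frequencies must have equal length");
        }
        // Index vector A by term so that terms shared with B can be looked up directly.
        Map<String, Integer> a = new HashMap<>();
        double normA = 0.0;
        for (int i = 0; i < termsA.length; i++) {
            a.put(termsA[i], freqsA[i]);
            normA += (double) freqsA[i] * freqsA[i];
        }
        // Only terms present in both vectors contribute to the dot product.
        double dot = 0.0;
        double normB = 0.0;
        for (int i = 0; i < termsB.length; i++) {
            normB += (double) freqsB[i] * freqsB[i];
            Integer fa = a.get(termsB[i]);
            if (fa != null) {
                dot += (double) fa * freqsB[i];
            }
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector has no direction; treat it as dissimilar
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        String[] t1 = {"lucene", "search", "index"};
        int[] f1 = {2, 1, 1};
        String[] t2 = {"lucene", "index", "query"};
        int[] f2 = {1, 1, 1};
        // Shared terms are "lucene" and "index"; the result is roughly 0.707.
        System.out.println(cosine(t1, f1, t2, f2));
    }
}
```

To compare two sets of documents you would call cosine once per pair, so the cost grows with the product of the two set sizes.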

What is the bigger goal you are trying to get at? What do you need the similarity score for? Do you need to compare every item in set 1 against every item in set 2?

On Aug 19, 2007, at 11:19 PM, Lokeya wrote:


Hi,

Thanks for your reply.

I can use getTermFreqVector() on IndexReader to get the term vectors. But I am
wondering which API should be used to find the similarity between two such vectors, giving a score (doc-doc similarity, in essence).

Thanks.



Grant Ingersoll-6 wrote:

Hi,


On Aug 16, 2007, at 2:20 PM, Lokeya wrote:


Hi All,

I have the following setup: a) Indexed a set of docs. b) Ran the 1st
query and got the top docs. c) Fetched the ids from those and stored
them in a data structure. d) Ran the 2nd query, got the top docs,
fetched the ids and stored them in a data structure.

Now I have 2 sets of doc ids (set 1) and (set 2).

I want to find out the document content similarity between these 2
sets (just using the doc id information which I have).


Not sure what you mean here.  What do the doc ids have to do with the
content?

Qn 1: Is it possible using any Lucene APIs? If so, can you point me
to the appropriate APIs? I did a search at
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/index.html
but couldn't find anything.


It is possible if you use Term Vectors (see
IndexReader.getTermFreqVector).  You will need to store (when you
construct your Field) and load the term vectors and then calculate
the similarity.  A common way of doing this is by calculating the
cosine of the angle between the two vectors.
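
As a small worked example of that cosine (the two frequency vectors are made up purely for illustration), take A = (2, 1, 0) and B = (1, 1, 1) over the same three terms:

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{2 \cdot 1 + 1 \cdot 1 + 0 \cdot 1}{\sqrt{2^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}}
  = \frac{3}{\sqrt{5}\,\sqrt{3}}
  \approx 0.775
```

A value near 1 means the two documents use terms in similar proportions; 0 means they share no terms.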

-Grant

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx








