|
|
28 mar 2007 kl. 15.24 skrev Sengly Heng:
Thank you but I still have have no clue of how to do that by using
Weka
after taking a look at its API. Let me reformulate my problem :
I have a collection of vector of terms (actually each vector of terms
represents the list of tokens extracted from a file) and I do not
have the
original files. I would like to calculate TF as well as TFIDF of
each term
and sorted them by these value respectively. As suggested by Grant
Ingersoll, I could index those vectors of terms again using Lucene
and then
use its API to measure TF and TFIDF. However I guess there should be a
simpler way or API just fit-in this case.
To my knowledge there is no thing in Lucene that makes it simpler for
you than what Grant suggests. And according to me, Weka really must
be the simplest way around. However, perhaps you should supply us
with an example of what these vectors look like. That might change
everything. Perhaps we are talking of completely different things here.
Let me reformulate my suggestion:
1. rebuild your vector to a string.
2. put the data in a file called myvectors.arff:
@relation termvectors
@attribute the_vector string
@data
"first term vector as a string"
"second term vector as a string"
3. open the file in the weka explorer application.
4. select filter/unsupervised/attribute/string to word vector
5. set your preferences of normalization, et c.
6. apply the filter.
7. save the output.
All this can be done progamatically too, with only a few lines of code.
Thanks once again everyone.
Best regards,
Sengly
On 3/28/07, karl wettin <karl.wettin@xxxxxxxxx> wrote:
28 mar 2007 kl. 10.36 skrev Sengly Heng:
> Does anyone of you know any Java API that directly handle this
> problem?
> or I have to implement from scratch.
You can also try
weka.filters.unsupervised.attribute.StringToWordVector, it has many
neat features you might be interested in. And if applicable to what
you attempt to do, the feature selection algorithms of the same
project (Weka) does a great job reducing the data set.
http://www.cs.waikato.ac.nz/ml/weka/
It is GPL.
--
karl
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@xxxxxxxxxxxxxxxxx
For additional commands, e-mail: java-user-help@xxxxxxxxxxxxxxxxx
|
|