I wonder if anyone knows.....
- is there a place I can get already-crawled internet web pages in an
archive (10-100 GB of data)?
- is there a place I can get an already-built Lucene index for these pages?
- is there such a thing as a 'categorized-terms' index, meaning each page
is processed by an NLP engine that assigns 'category tokens' to terms in
the text (eg, it can detect company names, location names, addresses, etc),
so that on top of the inverted index of pages you also get a list of
detected term categories, plus forward and inverted indexes of term
categories (ie, 'location names': 'toronto', 'new york', etc; 'location
names docids': doc1, doc2, doc3; 'toronto as location name': doc1, doc5)?
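To make the idea concrete, here is a minimal sketch of the index layers I have in mind; the document set, category names, and structure are purely illustrative assumptions, not the output of any particular NLP engine:

```python
# Sketch: layering NER "category tokens" on top of a plain inverted index.
# The annotation format (term, category) and the example docs are assumed.
from collections import defaultdict

# Pretend NER output: doc id -> list of (term, category) annotations.
docs = {
    "doc1": [("toronto", "location"), ("acme corp", "company")],
    "doc2": [("new york", "location")],
    "doc5": [("toronto", "location")],
}

term_index = defaultdict(set)          # term -> doc ids (plain inverted index)
category_terms = defaultdict(set)      # category -> terms seen with that category
category_docs = defaultdict(set)       # category -> doc ids
term_category_docs = defaultdict(set)  # (term, category) -> doc ids

for doc_id, annotations in docs.items():
    for term, category in annotations:
        term_index[term].add(doc_id)
        category_terms[category].add(term)
        category_docs[category].add(doc_id)
        term_category_docs[(term, category)].add(doc_id)

print(sorted(category_terms["location"]))                   # ['new york', 'toronto']
print(sorted(category_docs["location"]))                    # ['doc1', 'doc2', 'doc5']
print(sorted(term_category_docs[("toronto", "location")]))  # ['doc1', 'doc5']
```

In a real Lucene index the category postings could presumably live in separate fields (eg a 'location' field holding the detected spans), so the same query machinery works over both layers.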
- can anyone recommend any particular open source NLP library?
- I've browsed through a bunch of them and the most impressive one was
GATE (http://gate.ac.uk/). Has anyone used this library?
- can these libraries complement each other (ie, run the same content
through several libraries and then have a 'vote' on the most probable
annotation)?
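The 'vote' I'm picturing is just majority voting over per-library labels for the same span; a tiny sketch, where the library outputs are hypothetical placeholders rather than real API calls:

```python
# Sketch of ensemble voting across NLP libraries: each library labels the
# same text span, and the most common label wins. The votes below are
# made-up placeholders, not output from GATE or any other toolkit.
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among the per-library outputs."""
    label, _count = Counter(labels).most_common(1)[0]
    return label

# Hypothetical labels for the span "Toronto" from three different libraries.
votes = ["location", "location", "organization"]
print(majority_vote(votes))  # location
```

Weighting votes by each library's confidence score (when one is available) would be the obvious refinement.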
PS: thanks to everyone who answered recent questions, especially Otis,
Mark, Andrzej and Grant.