A little while ago I announced the existence of the Aperture project,
founded by my company together with the DFKI institute.
We just released Aperture 2006.1 alpha 2, which may be of interest to
all Lucene users dealing with crawling and text extraction.
The project page is located at:
To summarize, Aperture now has code for the following tasks:
- Crawling of file systems, websites and IMAP folders. An Outlook
mailbox crawler is also in the works; any help is welcome.
- Text and metadata extraction of a large and growing number of document
formats, e.g. MS Office files, MS Works, OpenOffice, OpenDocument, RTF,
PDF, WordPerfect, Quattro, Presentations, HTML, XML, plain text...
- A robust magic number-based MIME type identifier, a must for choosing
the right extractor for a given document.
- Security-related classes for handling self-signed certificates when
communicating using SSL.
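To illustrate what magic number-based identification means, here is a minimal sketch in plain Java. It is not Aperture's actual API: the class name MagicMimeSketch, the detect method and the small signature table are hypothetical, and the MIME type strings chosen for each signature are just illustrative labels. The idea is to compare the first bytes of a document against known format signatures rather than trusting the file extension.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of Aperture: identifies a MIME type
// from the leading "magic" bytes of a document.
class MagicMimeSketch {

    /** One table entry: the leading bytes of a format and the MIME type reported for it. */
    private static final class Signature {
        final byte[] magic;
        final String mimeType;

        Signature(byte[] magic, String mimeType) {
            this.magic = magic;
            this.mimeType = mimeType;
        }
    }

    // Assumed, deliberately tiny signature table; a real identifier needs many more entries.
    private static final List<Signature> SIGNATURES = new ArrayList<>();

    static {
        SIGNATURES.add(new Signature(ascii("%PDF-"), "application/pdf"));
        SIGNATURES.add(new Signature(ascii("{\\rtf"), "application/rtf"));
        // ZIP container, also used by OpenDocument and OpenOffice files.
        SIGNATURES.add(new Signature(new byte[] {'P', 'K', 3, 4}, "application/zip"));
        // OLE2 compound file, the container of legacy MS Office documents.
        SIGNATURES.add(new Signature(
                new byte[] {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1},
                "application/x-ole-storage"));
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /** Returns the MIME type of the first matching signature, or a generic fallback. */
    static String detect(byte[] header) {
        for (Signature s : SIGNATURES) {
            if (startsWith(header, s.magic)) {
                return s.mimeType;
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = ascii("%PDF-1.4\n...");
        System.out.println(detect(sample));
    }
}
```

A matching signature only narrows things down: a ZIP or OLE2 header identifies the container, so a production identifier would probably have to look inside it (or fall back on the file name) to pick the right extractor.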
Most of the code is already in good shape. It is still labeled "alpha"
because we only recently started applying Aperture in our own software,
which may still lead to some (probably minor) API changes.
Future plans include continuously extending the set of extractors, e.g.
by including extractors for mp3, images, videos, etc., adding support
for Thunderbird and other mail clients, support for expanding and
crawling archives, address books, ...
Furthermore, we are working on metadata storage facilities that build
upon Lucene and Sesame, an RDF storage and query engine (see
www.openrdf.org). This should combine the expressiveness of RDF and the
performance and scalability of Sesame with Lucene's full-text indexing.
For questions please consider joining the aperture-devel mailing list.
Prinses Julianaplein 14-b
3817 CS Amersfoort
+31 33 465 9987 phone
+31 33 465 9987 fax