Implementation in C & Some Questions

Subject: Implementation in C & Some Questions
From: Robert Kirchgessner
Date: Sat, 12 Nov 2005 02:02:18 +0100

Please excuse me if I'm completely wrong here.
I know there is a Lucene4c in the Incubator, but there
does not seem to be much traffic on its mailing list.

First I want to thank all people involved in the
project for this great software.

I've made a port of Lucene to C. We use it in a
corporate environment as a PHP module and as a
standalone CGI in pure C for indexing and searching.
It's a great success. The code is Open Source
(Apache License).

The current implementation supports the following:

- indexing and searching in pure C
- binary storage
- omit norms
- portable, runs on Linux, Windows, Mac, .... (it's ANSI C ... I hope)
- unit tested code
- no known memory leaks (checked with valgrind)

It lacks some features:

- Unicode support (a deliberate decision, but may be fixed in the future)
- no thread support
- no file locking
- no publicly available QueryParser (we've written a highly specific
version for our corporate needs, but it will be an easy task for us to
write a version matching Java Lucene. We use re2c for regular
expressions and the lemon parser generator)
- as we use Lucene in a German environment, there is no support
for other languages.
- no span, range, wildcard and fuzzy search yet.

I am now updating the code to the latest development version of
Java Lucene. Along the way I am introducing consistent memory management
with APR pools (APR, the Apache Portable Runtime, is great!),
APR-like error handling with apr_status_t and APR-like code conventions.

So here are my questions:

1. Is anyone interested in this? If so, what's the best way to share
the sources? A very early version of the project is on SourceForge. I could
check in the current sources there.

2. I'd like to keep the development of the C code in sync with Java.
As we use this C library very heavily at our company, we have
some ideas for extending the search engine. Some of them are:

- support for separately stored fields, e.g. like norms in the current
- support for binary fields of fixed length (e.g. for purposes of sorting,
numerical comparison, optimization of memory consumption and
fast file access)
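
To illustrate the fixed-length idea: if every document stores exactly
`width` bytes for a field, the value for a document sits at offset
doc_id * width, so no per-document length or pointer table is needed.
The following is only a minimal sketch under that assumption. The names
fixed_field_t, fixed_field_value and fixed_field_compare are hypothetical
and are not taken from the C port or from Java Lucene.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: one fixed-width value per document,
 * stored back to back in a single block (e.g. a mapped file). */
typedef struct {
    const unsigned char *data; /* start of the value block */
    size_t width;              /* bytes per document value */
    size_t doc_count;          /* number of documents in the block */
} fixed_field_t;

/* Return a pointer to the value of doc_id, or NULL if out of range.
 * The offset is a plain multiplication, no lookup table required. */
static const unsigned char *fixed_field_value(const fixed_field_t *f,
                                              size_t doc_id)
{
    if (f == NULL || doc_id >= f->doc_count)
        return NULL;
    return f->data + doc_id * f->width;
}

/* Compare two documents by their raw field bytes. If values are
 * written as big-endian unsigned integers, byte order equals
 * numerical order, which is enough for sorting hits. */
static int fixed_field_compare(const fixed_field_t *f,
                               size_t doc_a, size_t doc_b)
{
    const unsigned char *va = fixed_field_value(f, doc_a);
    const unsigned char *vb = fixed_field_value(f, doc_b);
    if (va == NULL || vb == NULL)
        return 0; /* treat out-of-range documents as equal */
    return memcmp(va, vb, f->width);
}
```

With 4-byte values, document 3 would be read from offset 12 of the block.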

I also have many questions concerning the Java implementation, such as:

- Why are the tokenized, binary and compressed flags stored in the field
data instead of in the field info as global field attributes? In cases
where these attributes are constant for a field, this costs a byte per
document, which could be saved if they were stored in the field info file.

- Why the assumption that NO_NORMS for a field implies that the
field is not tokenized with an analyzer?

>    /** Index the field's value without an Analyzer, and disable
>     * the storing of norms.  No norms means that index-time boosting
>     * and field length normalization will be disabled.  The benefit is
>     * less memory usage as norms take up one byte per indexed field
>     * for every document in the index.
>     */
>    public static final Index NO_NORMS = new Index("NO_NORMS");

We have many use cases in our applications where we omit norms while
still tokenizing fields (e.g. think of a use case such as retrieving hits
with custom sorting that depends on some field).
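
Both questions can be pictured with a small sketch in C. It assumes the
attributes are kept once per field in the field info as independent bits,
so that "tokenized" and "omit norms" can be combined freely, and it
assumes one flag byte per stored field per document in the current
layout. All names here (FIELD_*, field_info_t, flag_bytes_saved) are
hypothetical and do not describe the actual Java Lucene file format or
the C port.

```c
#include <stddef.h>

/* Hypothetical attribute bits, stored once per field in the field info. */
enum {
    FIELD_INDEXED    = 1 << 0,
    FIELD_TOKENIZED  = 1 << 1,
    FIELD_OMIT_NORMS = 1 << 2, /* independent of FIELD_TOKENIZED */
    FIELD_BINARY     = 1 << 3,
    FIELD_COMPRESSED = 1 << 4
};

typedef struct {
    const char *name;
    unsigned flags; /* combination of FIELD_* bits */
} field_info_t;

/* A field is run through the analyzer only if it is indexed and tokenized. */
static int field_is_tokenized(const field_info_t *fi)
{
    return (fi->flags & FIELD_INDEXED) && (fi->flags & FIELD_TOKENIZED);
}

/* Norms are written only for indexed fields that do not omit them. */
static int field_has_norms(const field_info_t *fi)
{
    return (fi->flags & FIELD_INDEXED) && !(fi->flags & FIELD_OMIT_NORMS);
}

/* Bytes saved if a one-byte flag per stored field per document
 * is replaced by a single entry in the field info. */
static size_t flag_bytes_saved(size_t doc_count, size_t stored_fields_per_doc)
{
    return doc_count * stored_fields_per_doc;
}
```

Under these assumptions, a field declared as FIELD_INDEXED | FIELD_TOKENIZED |
FIELD_OMIT_NORMS is analyzed but carries no norms, and an index with one
million documents and three stored fields each would save about 3 MB of
flag bytes.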

Can we discuss such questions in this mailing list? If these discussions
result in some decisions, it would be no problem for me to implement
some ideas in Java.

Thank you very much in advance,

Robert Kirchgessner
