Subject: Re: bytecount as String and prefix length
From: Marvin Humphrey
Date: Mon, 31 Oct 2005 22:36:12 -0800
On Oct 31, 2005, at 5:15 PM, Robert Engels wrote:

> All of the JDK source is available via download from Sun.

Thanks.  I believe the UTF-8 coding algos can be found in...

j2se > src > share > classes > sun > nio > cs > UTF_8.java

It looks like the translator methods have fairly high loop overhead, since they have to keep track of the member variables of ByteBuffer and CharBuffer objects and prepare to return a result object on each loop iteration. They also have robust error-checking for malformed source data, which Lucene traditionally has not had. The algo below my sig should be faster.
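
For comparison, here is a minimal sketch of decoding through the public java.nio.charset API, which sits on top of the Sun decoder discussed above. The class name JdkUtf8DecodeSketch and the choice to rethrow malformed input as IllegalArgumentException are my assumptions for illustration, not anything from the JDK source or from Lucene.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, for illustration only.
final class JdkUtf8DecodeSketch {

  static String decode(byte[] bytes, int start, int length) {
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);

    // Both buffers carry position/limit state that the decoder updates.
    ByteBuffer in = ByteBuffer.wrap(bytes, start, length);
    // UTF-8 never decodes to more UTF-16 chars than it has bytes.
    CharBuffer out = CharBuffer.allocate(length);

    try {
      // Each call reports its outcome as a CoderResult object.
      CoderResult result = decoder.decode(in, out, true);
      if (result.isError()) {
        result.throwException();
      }
      result = decoder.flush(out);
      if (result.isError()) {
        result.throwException();
      }
    } catch (CharacterCodingException e) {
      // Assumption: surface malformed input as an unchecked exception.
      throw new IllegalArgumentException("malformed UTF-8", e);
    }

    out.flip();
    return out.toString();
  }
}
```

The buffer bookkeeping and the CoderResult handling are the per-call costs that a hand-rolled loop over plain arrays avoids, at the price of giving up the malformed-input checks.
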
I wrote...

> So my next step is to write a utf8ToString method that's as efficient
> as I can make it.

Ok, this time we made a little headway. We're down from 20% slower
to around 10% slower indexing than the current implementation. But I
don't see how I'm going to get it any faster. There's maybe one
conditional in FieldsReader that can be simplified.

There's another downside to the way I'm implementing this right now.
The byteBuf and charBuf have to be kept somewhere. Currently, I'm
allocating a ByteBuffer for each TermInfosWriter and a charBuf for
each TermBuffer. That's something of a memory hit, though it's hard
to say exactly how much. IndexInput and IndexOutput are still using
the Sun methods -- when I gave them Buffers, they slowed down.

I've got one more idea... time to try overriding readString and
writeString in BufferedIndexInput and BufferedIndexOutput, to take
advantage of buffers that are already there.

Marvin Humphrey
Rectangular Research


  public static final CharBuffer utf8ToChars (
        byte[] bytes, int start, int length, CharBuffer charBuf) {
    int i = start;
    int j = 0;
    final int end = start + length;
    char[] chars = charBuf.array();
    try {
      while (i < end) {
        byte b = bytes[i++];
        // The table gives the number of continuation bytes for this lead byte.
        switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
          case 0:
            chars[j++] = (char)(b & 0x7F);
            break;
          case 1:
            chars[j++] = (char)(((b & 0x1F) << 6)
              | (bytes[i++] & 0x3F));
            break;
          case 2:
            chars[j++] = (char)(((b & 0x0F) << 12)
              | ((bytes[i++] & 0x3F) << 6)
              |  (bytes[i++] & 0x3F));
            break;
          case 3:
            // Outside the BMP: a 4-byte lead byte (11110xxx) carries 3
            // payload bits, and the result is a UTF-16 surrogate pair.
            int utf32 = (((b & 0x07) << 18)
              | ((bytes[i++] & 0x3F) << 12)
              | ((bytes[i++] & 0x3F) << 6)
              |  (bytes[i++] & 0x3F));
            chars[j++] = (char)((utf32 >> 10) + 0xD7C0);
            chars[j++] = (char)((utf32 & 0x03FF) + 0xDC00);
            break;
        }
      }
    }
    catch (ArrayIndexOutOfBoundsException e) {
      // charBuf was too small: estimate chars per byte from what has been
      // decoded so far (plus 10% slack) and retry with a larger buffer.
      float bytesProcessed = (float)(i - start);
      float charsPerByte = (j / bytesProcessed) * 1.1f;
      float bytesLeft = length - bytesProcessed;
      float targetSize = (float)chars.length
        + charsPerByte * bytesLeft + 1.0f;
      return utf8ToChars(bytes, start, length,
        CharBuffer.allocate((int)targetSize));
    }
    return charBuf;
  }
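
The method above relies on a TRAILING_BYTES_FOR_UTF8 lookup table that is not shown in this message. Below is a minimal, self-contained sketch of how such a table could be built and used, assuming it maps each unsigned lead byte to its number of continuation bytes. The table contents, the class name Utf8TableSketch, and the decode method are my reconstruction for illustration. To keep the sketch short it allocates a worst-case char[] up front rather than using the catch-and-retry growth strategy above, and like the original it does no validation of malformed input.

```java
// Hypothetical class, for illustration only.
final class Utf8TableSketch {

  // Assumed layout: index = unsigned lead byte, value = continuation bytes.
  static final byte[] TRAILING_BYTES_FOR_UTF8 = new byte[256];
  static {
    // 0x00-0x7F (ASCII) keep the default of 0.
    for (int b = 0xC0; b < 0xE0; b++) TRAILING_BYTES_FOR_UTF8[b] = 1;
    for (int b = 0xE0; b < 0xF0; b++) TRAILING_BYTES_FOR_UTF8[b] = 2;
    for (int b = 0xF0; b < 0xF8; b++) TRAILING_BYTES_FOR_UTF8[b] = 3;
    // 0x80-0xBF and 0xF8-0xFF are not valid lead bytes; they are left
    // at 0 here because this sketch does not validate its input.
  }

  static String decode(byte[] bytes, int start, int length) {
    // Well-formed UTF-8 never yields more UTF-16 chars than input bytes.
    char[] chars = new char[length];
    int i = start;
    int j = 0;
    final int end = start + length;
    while (i < end) {
      byte b = bytes[i++];
      switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
        case 0:
          chars[j++] = (char) (b & 0x7F);
          break;
        case 1:
          chars[j++] = (char) (((b & 0x1F) << 6)
              | (bytes[i++] & 0x3F));
          break;
        case 2:
          chars[j++] = (char) (((b & 0x0F) << 12)
              | ((bytes[i++] & 0x3F) << 6)
              | (bytes[i++] & 0x3F));
          break;
        default: // 3 continuation bytes: emit a surrogate pair
          int utf32 = ((b & 0x07) << 18)
              | ((bytes[i++] & 0x3F) << 12)
              | ((bytes[i++] & 0x3F) << 6)
              | (bytes[i++] & 0x3F);
          chars[j++] = (char) ((utf32 >> 10) + 0xD7C0);
          chars[j++] = (char) ((utf32 & 0x03FF) + 0xDC00);
          break;
      }
    }
    return new String(chars, 0, j);
  }
}
```

Calling Utf8TableSketch.decode(bytes, 0, bytes.length) on the UTF-8 encoding of a string should give back the same string, including characters outside the BMP. The worst-case allocation trades memory for simplicity, which is the opposite of the reusable-buffer approach discussed in the message.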
