java-dev@lucene.apache.org

Re: bytecount as String and prefix length

From: Marvin Humphrey
Date: Mon, 31 Oct 2005 16:31:14 -0800
I wrote...

I think I'll take a stab at a custom charsToUTF8 converter algo.

Still no luck. Still 20% slower than the current implementation. The algo is below, for reference.

It's entirely possible that my patches are doing something dumb that's causing this, given my limited experience with Java. But if that's not the case, I can think of two other explanations.

One is that the passage of the text through an intermediate buffer before blasting it out is considerably more expensive than anticipated.

The other is that the pre-allocation of a char[] array based on the length VInt yields a significant benefit over the standard techniques for reading in UTF-8. That wouldn't be hard to believe. Without that number, there's a lot of guesswork involved. English text averages about 1.1 bytes per code point in UTF-8; Japanese, about 3. Multiple memory allocation ops may be required as bytes get read in, especially if the final String object kicked out HAS to use the bare minimum amount of memory. I don't suppose there's any way for me to snoop on just what's happening under the hood in these CharsetDecoder classes or String constructors, is there?
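For anyone who wants to check the bytes-per-character figures on their own text, here is a minimal sketch. The class and method names (Utf8Ratio, bytesPerChar) are made up for illustration, and it assumes a JDK recent enough to have StandardCharsets, which is newer than the one discussed in this thread. It counts Java chars (UTF-16 code units) rather than code points, so the two differ for supplementary characters. The last lines print the sizing hints that CharsetDecoder itself exposes, which is about as far as the public API lets you look under the hood.

```java
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

class Utf8Ratio {
    /** Average UTF-8 bytes per Java char (UTF-16 code unit) in s; 0 for an empty string. */
    static double bytesPerChar(String s) {
        if (s.isEmpty()) {
            return 0.0;
        }
        return (double) s.getBytes(StandardCharsets.UTF_8).length / s.length();
    }

    public static void main(String[] args) {
        System.out.println(bytesPerChar("plain ASCII text"));   // 1 byte per char
        System.out.println(bytesPerChar("\u65e5\u672c\u8a9e")); // 3 bytes per char

        // The decoder's own sizing hints, for whatever JDK this runs on.
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        System.out.println(dec.averageCharsPerByte() + " avg, "
                + dec.maxCharsPerByte() + " max chars per byte");
    }
}
```

Real English text lands a little above 1.0 because of the occasional accented letter or typographic quote, which is where a figure like 1.1 comes from.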

Scanning through a SegmentTermEnum with next() doesn't seem to be any slower with a byte-based TermBuffer, and my index-1000-wikipedia-docs benchmarker doesn't slow down that much when IndexInput is changed to use a String constructor that accepts UTF-8 bytes rather than chars. However, it's possible that the modified toTerm method of TermBuffer is a bottleneck, as it also uses the UTF-8 String constructor. It doesn't get exercised under SegmentTermEnum.next(), but during merging of segments I believe it sees plenty of action -- maybe a lot more than IndexInput's readString.

So my next step is to write a utf8ToString method that's as efficient as I can make it. After that... I dunno, I'm running out of ideas.
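As a rough idea of what such a utf8ToString method could look like, here is a sketch. It is not the code that ended up in any patch. The class name and signature are hypothetical, and it assumes well-formed UTF-8: there is no validation of continuation bytes, overlong forms, or truncated sequences. The one trick it relies on is that a UTF-8 sequence never decodes to more Java chars than it has bytes, so a char[] sized to the byte count can be allocated once and never regrown.

```java
class Utf8Decode {
    /**
     * Decodes length bytes of UTF-8 starting at offset.
     * Hypothetical helper; assumes the input is well-formed UTF-8.
     */
    static String utf8ToString(byte[] bytes, int offset, int length) {
        // Upper bound: 1-3 byte sequences yield one char, 4-byte sequences yield two.
        char[] chars = new char[length];
        final int end = offset + length;
        int i = offset;
        int n = 0;
        while (i < end) {
            final int b = bytes[i++] & 0xFF;
            if (b < 0x80) {
                // one byte: ASCII
                chars[n++] = (char) b;
            } else if (b < 0xE0) {
                // two bytes
                chars[n++] = (char) (((b & 0x1F) << 6) | (bytes[i++] & 0x3F));
            } else if (b < 0xF0) {
                // three bytes
                chars[n++] = (char) (((b & 0x0F) << 12)
                        | ((bytes[i++] & 0x3F) << 6)
                        | (bytes[i++] & 0x3F));
            } else {
                // four bytes: rebuild the code point, then split into a surrogate pair
                int cp = ((b & 0x07) << 18)
                        | ((bytes[i++] & 0x3F) << 12)
                        | ((bytes[i++] & 0x3F) << 6)
                        | (bytes[i++] & 0x3F);
                cp -= 0x10000;
                chars[n++] = (char) (0xD800 | (cp >> 10));
                chars[n++] = (char) (0xDC00 | (cp & 0x3FF));
            }
        }
        // The String constructor copies, so the oversized scratch array is not retained.
        return new String(chars, 0, n);
    }
}
```

The final new String(chars, 0, n) still costs one copy. Avoiding that copy would need access to String internals, which the public API does not offer.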

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


  public static final ByteBuffer stringToUTF8(
        String s, int start, int length, ByteBuffer byteBuf) {
    byteBuf.clear();
    int i = start;
    int j = 0;
    try {
      final int end = start + length;
      byte[] bytes = byteBuf.array();
      for ( ; i < end; i++) {
        final int code = (int)s.charAt(i);
        if (code < 0x80)
          bytes[j++] = (byte)code;
        else if (code < 0x800) {
          bytes[j++] = (byte)(0xC0 | (code >> 6));
          bytes[j++] = (byte)(0x80 | (code & 0x3F));
        } else if (code < 0xD800 || code > 0xDFFF) {
          bytes[j++] = (byte)(0xE0 | (code >>> 12));
          bytes[j++] = (byte)(0x80 | ((code >> 6) & 0x3F));
          bytes[j++] = (byte)(0x80 | (code & 0x3F));
        } else {
          // surrogate pair
          int utf32;
          // confirm valid high surrogate
          if (code < 0xDC00 && (i < end-1)) {
            utf32 = ((int)s.charAt(i+1));
            // confirm valid low surrogate and write pair
            if (utf32 >= 0xDC00 && utf32 <= 0xDFFF) {
              utf32 = ((code - 0xD7C0) << 10) + (utf32 & 0x3FF);
              i++;
              bytes[j++] = (byte)(0xF0 | (utf32 >>> 18));
              bytes[j++] = (byte)(0x80 | ((utf32 >> 12) & 0x3f));
              bytes[j++] = (byte)(0x80 | ((utf32 >> 6) & 0x3F));
              bytes[j++] = (byte)(0x80 | (utf32 & 0x3F));
              continue;
            }
          }
          // replace unpaired surrogate or out-of-order low surrogate
          // with substitution character
          bytes[j++] = (byte)0xEF;
          bytes[j++] = (byte)0xBF;
          bytes[j++] = (byte)0xBD;
        }
      }
    }
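    catch (ArrayIndexOutOfBoundsException e) {
      // guess how many more bytes it will take, plus 10%;
      // if the buffer overflowed before a single char was finished,
      // fall back to the 3-bytes-per-char worst case so the division
      // below can't produce NaN or Infinity and the retry always grows
      float charsProcessed = (float)(i - start);
      float bytesPerChar = (charsProcessed > 0f)
        ? (j / charsProcessed) * 1.1f
        : 3.0f;

      float charsLeft = length - charsProcessed;
      float targetSize
        = (float)byteBuf.capacity() + bytesPerChar * charsLeft + 1.0f;

      return stringToUTF8(s, start, length, ByteBuffer.allocate((int) targetSize));
    }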
    byteBuf.position(j);
    return byteBuf;
  }
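
A note on consuming the result: the method above writes through the backing array and then sets position to the number of bytes written, leaving limit at capacity. A caller therefore has to flip the buffer, or read array() from 0 to position(), to get at the encoded bytes. The sketch below shows that handoff in isolation. BufferHandoff and written are hypothetical names, and main fakes the converter by poking two bytes into the backing array. Like the method above, it assumes a heap buffer for which array() is available.

```java
import java.nio.ByteBuffer;

class BufferHandoff {
    /** Copies out the bytes between index 0 and the buffer's current position. */
    static byte[] written(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate(); // shares content, leaves buf's own position alone
        view.flip();                       // limit = position, position = 0
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the converter: fill the backing array, then set position.
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.clear();
        byte[] backing = buf.array();
        backing[0] = (byte) 0xC3; // U+00E9 encoded as two UTF-8 bytes
        backing[1] = (byte) 0xA9;
        buf.position(2);

        System.out.println(written(buf).length + " bytes written");
    }
}
```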




