java-user@lucene.apache.org

Re: Language identification ??

Subject: Re: Language identification ??
From: Mathieu Lecarme
Date: Fri, 14 Mar 2008 15:48:47 +0100
Itamar Syn-Hershko wrote:
> For what it's worth, I did something similar in my BidiAnalyzer so that I
> can index both Hebrew/Semitic texts and English/Latin words without
> switching analyzers, giving each the proper treatment. I did it simply by
> testing the first char and looking at its numeric value: if it falls
> between Hebrew Aleph and Tav, then it's Hebrew; otherwise it's Latin. I
> wonder how you would spot a French word in an English text, for instance
> (aren't there parallel words?)
>
> Itamar.
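
A minimal Java sketch of the first-character test described above, assuming "Hebrew" means the letter range Alef (U+05D0) through Tav (U+05EA). The class and method names are hypothetical; this is not the actual BidiAnalyzer code.

```java
final class ScriptGuess {
    // Assumed bounds: Hebrew letters Alef (U+05D0) through Tav (U+05EA).
    private static final char ALEF = (char) 0x05D0;
    private static final char TAV = (char) 0x05EA;

    /** Returns true when the token's first character is a Hebrew letter. */
    static boolean startsWithHebrew(String token) {
        if (token == null || token.isEmpty()) {
            return false; // nothing to test, treat as non-Hebrew
        }
        char first = token.charAt(0);
        return first >= ALEF && first <= TAV;
    }

    public static void main(String[] args) {
        // "shalom" written with Unicode escapes to avoid source-encoding issues
        System.out.println(startsWithHebrew("\u05E9\u05DC\u05D5\u05DD"));
        System.out.println(startsWithHebrew("hello"));
    }
}
```

Anything outside that range is treated as "not Hebrew", so digits and punctuation end up with the Latin tokens unless they are handled separately.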
By comparing n-gram statistics.
Finding a foreign word in a sentence is very difficult: many words are very similar, and some are "faux amis" (false friends), spelled the same but with a different meaning in each language. Querying in mixed languages seems a bit tricky. Mixing alphabets is more common (and easier to handle).
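
A rough sketch of what comparing n-gram statistics can look like: build a character trigram profile per language from sample text, then pick the language whose profile is closest to the input's. The class name, the tiny sample sentences and the choice of cosine similarity are all assumptions for illustration; a usable profile would need a much larger corpus.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class NgramLanguageGuess {
    /** Counts character n-grams of the lower-cased text, padded with spaces. */
    static Map<String, Integer> profile(String text, int n) {
        String padded = " " + text.toLowerCase(Locale.ROOT) + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            counts.merge(padded.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two n-gram count profiles. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int count = e.getValue();
            normA += (double) count * count;
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) count * other;
            }
        }
        for (int count : b.values()) {
            normB += (double) count * count;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the key of the reference profile most similar to the text. */
    static String guess(String text, Map<String, Map<String, Integer>> profiles, int n) {
        Map<String, Integer> input = profile(text, n);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = cosine(input, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    /** Toy reference profiles; the sample sentences are made up for this sketch. */
    static Map<String, Map<String, Integer>> demoProfiles() {
        Map<String, Map<String, Integer>> profiles = new HashMap<>();
        profiles.put("en", profile("the quick brown fox jumps over the lazy dog "
                + "and the cat sat on the mat with the other animals", 3));
        profiles.put("fr", profile("le renard brun saute par dessus le chien paresseux "
                + "et le chat est assis sur le tapis avec les autres animaux", 3));
        return profiles;
    }
}
```

A single word yields only a handful of n-grams, so its score is noisy. That is one reason spotting one foreign word inside a sentence is much harder than classifying a whole sentence or document.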

M.
