java-dev@lucene.apache.org
[Top] [All Lists]

[jira] Commented: (LUCENE-550) InstantiatedIndex - faster but memory con

Subject: [jira] Commented: (LUCENE-550) InstantiatedIndex - faster but memory consuming index
From: "Karl Wettin (JIRA)"
Date: Sat, 17 Feb 2007 00:26:06 -0800 PST
    [ 
https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473892
 ] 

Karl Wettin commented on LUCENE-550:
------------------------------------

(now proof read and all)
Package level java doc of the spell checker:

A dictionary with weighted suggestions,
ordered by user activity,
backed by algorithmic suggestions.
<p/>

<h1>What, where, when and how.</h1>

<h2>Goal trees</h2>
A user session could contain multiple quests for content.
For example,
first the user looks for the Apache licence,
spells it wrong, inspects different results,
and then the user searches for the author Ivan Goncharov.
<p/>
In this package we call them different goals.
<p/>
User activities are represented by a tree of QueryGoalNodes,
each describes a user query,
if the current query (goal node) was a suggestion from the system to a previous 
user query,
what search results was further inspected,
when it happend,
and for how long.
<p/>
The biggest task as a consumer when implementing this package
will be to keep track of what goal node the user came from,
so that the new queries (goal node) will become children to the parent.
Probably you add it as meta data to all actions,
e.g. in the &lt;a href="?goalID=>, as &lt;input type=hidden name="goalID" 
value=>, et c,
and keep track of them in a Map&lt;Integer, QueryGoalNode> in the user session.
<p/>
It is up to the QueryGoalTreeExtractor implementations to decide what
events in a session are parts of the same goal,
as we don't want to suggest the user to check out Goncharov
when they are looking for the Apache license.
<p/>
In the default query goal tree extractor,
nodes are parts of the same goal as their parent when:
<ul>
  <li>The queries are the same.</li>
  <li>The user took a suggestion from the system.</li>
  <li>The current and the parent queries are similair enough.</li>
  <li>The queries was entered within short enough time.</li>
</ul>
<p/>

<h2>Adaptive training</h2>
Adaptive means that the suggestions to a query
depends on how users previously have been acting.
This means that the dictionary could be tampered with quite easy
and you should therefore try to train only with data from trusted users.
<p/>
The default trainer implementation works like this:
<ul>
  <li>If a user accepts the suggestion made by the system, then we increase the 
score for that suggestion. (positive
    adaptation)
  </li>
  <li>If a user does not accept the suggestion made by the system, then we 
decrease the score for that suggestion.
    (negative adaptation)
  </li>
  <li>
    If the goal tree is a single query, one query only (perhaps with multiple 
inspections)
    then we adapt negative once again.
  </li>
  <li>
    Suggestions are the queries with inspections, ordered by the classification 
weight.
    All the queries in the goal witout inspections will be adpated positive with
    the query with inspections that has the shortest edit distance.
  </li>
  <li>Suggests back from best goal to second best goal. homm -> heroes of might 
and magic -> homm</li>
</ul>
<p/>

<h2>Suggesting</h2>
Suggestions are created by the suggester, that navigates a dictionary.
The default implementation works like this:
<ul>
  <li>
    Returns highest scoring suggestion available,
    unless the score is lower than the suggestion supression threadshold.
  </li>
  <li>
    If there are no suggestions available, the second level suggesters
    registred to the dictionary are used to produce the suggestions.
  </li>
  <li>
    If the top scoring suggestion is same as the query,
    and the second best is not supressed below threadshold,
    change order
  </li>
</ul>
Ignoring a suggestion 50 times or so with a DefaultTrainer makes a score hit 
0.05d.
<p/>

<h2>Second level suggestion</h2>
If the dictionary does not contain a suggestion for a given query,
it will be passed on to any available SecondLevelSuggester,
usually an algorithmic suggestion scheme
that hopefully can come up with a suggestion.
As a user accepts such a suggestion it will be trained
and become a part of the adaptive layer.
<h3>Token suggesters</h3>
The lowest level of suggestion is single token suggestions,
and the default implementation is a refactor of the contrib/spellcheck.
<h3>TokenPhraseSuggester</h3>
A layer on top of the single token suggesting that enables muti token (phrase) 
suggestions.
<p/>
For example, the user places the query "thh best game".
The matrix of similar tokens are:
<pre>
  the best game
  tho rest fame
           lame
</pre>
These can be represented in a finite number of ways:
<pre>
  tho best game
  tho best fame
  tho best lame
  tho rest game
  tho rest fame
  tho rest lame
  the best game
  the best fame
  the best lame
  the rest game
  the rest fame
  the rest lame
</pre>
A query is created for each combination, in the default SpanNearQueries, to 
find valid suggestions.
<p/>
If any of the valid hits contains a TermPositionVector
it will be analyzed and suggest the query in the order of terms in the index.
E.g. query "camel that broke the staw" is suggested with "straw that broke the 
camel"
todo: if term positions available and stored, suggest that for cosmetic 
reasons.)


<h1>Consumer interface example</h1>
Code from the test cases. 
<pre>
  private SuggestionFacade&lt;R> suggestionFacade;

  @Override
  protected void setUp() throws Exception {
    suggestionFacade = = new SuggestionFacade&lt;R>();
  }

  public void testBasicTraining() throws Exception {
    QueryGoalNode&lt;R> node;

    node = new QueryGoalNode&lt;R>(null, "heroes of nmight and magic", 3);
    node = new QueryGoalNode&lt;R>(node, "heroes of night and magic", 3);
    node = new QueryGoalNode&lt;R>(node, "heroes of might and magic", 10);
    node.new Inspection(23, QueryGoalNode.GOAL);
    suggestionFacade.queueGoalTree(node.getRoot());

    node = new QueryGoalNode&lt;R>(null, "heroes of night and magic", 3);
    node = new QueryGoalNode&lt;R>(node, "heroes of knight and magic", 7);
    node = new QueryGoalNode&lt;R>(node, "heroes of might and magic", 20);
    node.new Inspection(23, QueryGoalNode.GOAL);
    suggestionFacade.queueGoalTree(node);

    node = new QueryGoalNode&lt;R>(null, "heroes of might and magic", 20, 1l);
    suggestionFacade.queueGoalTree(node);

    node = new QueryGoalNode&lt;R>(null, "heroes of night and magic", 7, 0l);
    node = new QueryGoalNode&lt;R>(node, "heroes of light and magic", 14, 1l);
    node = new QueryGoalNode&lt;R>(node, "heroes of might and magic", 2, 6l);
    node.new Inspection(23, QueryGoalNode.GOAL);
    node.new Inspection(23, QueryGoalNode.GOAL);
    suggestionFacade.queueGoalTree(node);

    node = new QueryGoalNode&lt;R>(null, "heroes of night and magic", 4, 0l);
    node = new QueryGoalNode&lt;R>(node, "heroes of knight and magic", 17, 1l);
    node = new QueryGoalNode&lt;R>(node, "heroes of might and magic", 2, 2l);
    node.new Inspection(23, QueryGoalNode.GOAL);
    suggestionFacade.queueGoalTree(node);

    suggestionFacade.flush();

    assertEquals("heroes of might and magic", 
suggestionFacade.didYouMean("heroes of light and magic"));
    assertEquals("heroes of might and magic", 
suggestionFacade.didYouMean("heroes of night and magic"));
    assertEquals("heroes of might and magic", 
suggestionFacade.didYouMean("heroes ofnight andmagic"));
  }
</pre>
<p/>
Notice the last assertation:
<pre>
  assertEquals("heroes of might and magic", suggestionFacade.didYouMean("heroes 
ofnight andmagic"));
</pre>
The dictionary will strip keys from puctuation and whitespace,
resulting in better support for de/compositions of words.
<p/>
Above example will be user session analyzing and adaptive only,
no algorithmic suggestions if the user types in something nobody miss spelled 
before.
Simply add one to the dictionary:
<pre>
  protected void setUp() throws Exception {
    suggestionFacade = new SuggestionFacade&lt;R>();

    // your primary index that suggestions must match.
    IndexFacade aprioriIndex = new IndexFacade(new RAMDirectoryIndex());
    String aprioriField = "title";

    // build the ngram suggester
    IndexFacade ngramIndex = new IndexFacade(new RAMDirectoryIndex());
    NgramTokenSuggester ngramSuggester = new NgramTokenSuggester(ngramIndex);
    ngramSuggester.indexDictionary(new 
TermEnumIterator(aprioriIndex.getReader(), aprioriField));

    // the greater the better results but with a longer response time.
    int maxSuggestionsPerToken = 3;

    // add ngram suggester wrapped in a single token phrase suggester as second 
level suggester.
    
suggestionFacade.getDictionary().getPrioritesBySecondLevelSuggester().put(new 
SecondLevelTokenPhraseSuggester(ngramSuggester, aprioriField, false, 
maxSuggestionsPerToken, new WhitespaceAnalyzer(), aprioriIndex), 1d);
  }
</pre>
<h1>Persistence and memory usage.</h1>
By default the dictionary is soft referenced,
meaning it will consume as much memory it can get,
and if some other application is in need of memory
low prioritized (priority is decided by the JVM) instances will be released.
<p/>
There is currently no persistence but java.io.Serliazlible available for the in 
the adaptive layer.
You need to implement your own Map&lt;String, SuggestionList> that is persistent
and pass it to the constructor of your directory.


> InstantiatedIndex - faster but memory consuming index
> -----------------------------------------------------
>
>                 Key: LUCENE-550
>                 URL: https://issues.apache.org/jira/browse/LUCENE-550
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Store
>    Affects Versions: 2.0.0
>            Reporter: Karl Wettin
>         Assigned To: Karl Wettin
>         Attachments: didyoumean.jpg, lucene-550.jpg, test-reports.zip, 
> trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2
>
>
> An non file centrinc all in memory index. Consumes some 2x the memory of a 
> RAMDirectory (in a term satured index) but is between 3x-60x faster depending 
> on application and how one counts. Average query is about 8x faster. 
> IndexWriter and IndexModifier have been realized in InterfaceIndexWriter and 
> InterfaceIndexModifier. 
> InstantiatedIndex is wrapped in a new top layer index facade (class Index) 
> that comes with factory methods for writers, readers and searchers for unison 
> index handeling. There are decorators with notification handling that can be 
> used for automatically syncronizing searchers on updates, et.c. 
> Index also comes with FS/RAMDirectory implementation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@xxxxxxxxxxxxxxxxx
For additional commands, e-mail: java-dev-help@xxxxxxxxxxxxxxxxx

<Prev in Thread] Current Thread [Next in Thread>