Michael McCandless: Near-real-time readers with Lucene's SearcherManager and NRTManager

Last time, I described the useful SearcherManager class, coming in the next (3.5.0) Lucene release, to periodically reopen your IndexSearcher when multiple threads need to share it. This class presents a very simple acquire/release API, hiding the thread-safe complexities of opening and closing the underlying IndexReaders.

But that example used a non near-real-time (NRT) IndexReader, which has relatively high turnaround time for index changes to become visible, since you must call IndexWriter.commit first.

If you have access to the IndexWriter that's actively changing the index (i.e., it's in the same JVM as your searchers), use an NRT reader instead! NRT readers let you decouple durability to hardware/OS crashes from visibility of changes to a new IndexReader. How frequently you commit (for durability) and how frequently you reopen (to see new changes) become fully separate decisions. This controlled consistency model that Lucene exposes is a nice "best of both worlds" blend between the traditional immediate and eventual consistency models.

Since reopening an NRT reader bypasses the costly commit, and shares some data structures directly in RAM instead of writing/reading to/from files, it provides extremely fast turnaround time on making index changes visible to searchers. Frequent reopens, such as every 50 milliseconds, even under relatively high indexing rates, are easily achievable on modern hardware.

Fortunately, it's trivial to use SearcherManager with NRT readers: use the constructor that takes IndexWriter instead of Directory:

boolean applyAllDeletes = true;
ExecutorService es = null;
SearcherManager mgr = new SearcherManager(writer, applyAllDeletes,
                                          new MySearchWarmer(), es);

This tells SearcherManager that its source for new IndexReaders is the provided IndexWriter instance (instead of a Directory instance). After that, use the SearcherManager just as before.

Typically you'll set the applyAllDeletes boolean to true, meaning each reopened reader is required to apply all previous deletion operations (deleteDocuments or updateDocument/s) up until that point.

Sometimes your usage won't require deletions to be applied. For example, perhaps you index multiple versions of each document over time, always deleting the older versions, yet during searching you have some way to ignore the old versions. If that's the case, you can pass applyAllDeletes=false instead. This will make the turnaround time quite a bit faster, as the primary-key lookups required to resolve deletes can be costly. However, if you're using Lucene's trunk (to be eventually released as 4.0), another option is to use MemoryCodec on your id field to greatly reduce the primary-key lookup time.

Note that some or even all of the previous deletes may still be applied even if you pass false. Also, the pending deletes are never lost if you pass false: they remain buffered and will still eventually be applied.

If you have some searches that can tolerate unapplied deletes and others that cannot, it's perfectly fine to create two SearcherManagers, one applying deletes and one not.

If you pass a non-null ExecutorService, then each segment in the index can be searched concurrently; this is a way to gain concurrency within a single search request. Most applications do not require this, because the concurrency across multiple searches is sufficient. It's also not clear that this is effective in general as it adds per-segment overhead, and the available concurrency is a function of your index structure. Perversely, a fully optimized index will have no concurrency! Most applications should pass null.
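To see why the available concurrency depends on index structure, here is a rough toy model in plain Java. It is not Lucene code: a "segment" is just an int[] of values, and SegmentFanOut and countHits are hypothetical names invented for this sketch. One task is submitted per segment, so a single-segment ("fully optimized") index yields exactly one task no matter how many threads the executor has.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Toy model only: a "segment" is an int[] of document values,
// not a Lucene segment reader.
class SegmentFanOut {

    // Counts occurrences of `term` across all segments, one task per segment.
    // Returns {totalHits, tasksSubmitted}.
    static int[] countHits(List<int[]> segments, int term, ExecutorService es) {
        List<Future<Integer>> futures = new ArrayList<>();
        for (int[] segment : segments) {
            futures.add(es.submit(() -> {
                int hits = 0;
                for (int value : segment) {
                    if (value == term) {
                        hits++;
                    }
                }
                return hits;
            }));
        }
        int total = 0;
        try {
            // Merge the per-segment results on the calling thread.
            for (Future<Integer> f : futures) {
                total += f.get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return new int[] { total, futures.size() };
    }
}
```

With three segments this submits three tasks; with the same values merged into one segment it submits one, which is the "no concurrency" case described above.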



NRTManager

What if you want the fast turnaround time of NRT readers, but need control over when specific index changes become visible to certain searches? Use NRTManager!

NRTManager holds onto the IndexWriter instance you provide and then exposes the same APIs for making index changes (addDocument/s, updateDocument/s, deleteDocuments). These methods forward to the underlying IndexWriter, but then return a generation token (a Java long) which you can hold onto after making any given change. The generation only increases over time, so if you make a group of changes, just keep the generation returned from the last change you made.

Then, when a given search request requires certain changes to be visible, pass that generation back to NRTManager to obtain a searcher that's guaranteed to reflect all changes for that generation.
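The generation idea can be sketched in a few lines of plain Java. This is only a toy model under my own assumptions, not how NRTManager is implemented: GenerationTracker, recordChange, reopen and isVisible are hypothetical names, and a "reopen" here simply publishes every change made so far.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy model, not Lucene's NRTManager: each change returns a strictly
// increasing generation, and a reopen makes all changes so far visible.
class GenerationTracker {
    private final AtomicLong indexingGen = new AtomicLong(0);
    private volatile long searchingGen = 0;

    // Stand-in for addDocument/updateDocument/deleteDocuments.
    long recordChange() {
        return indexingGen.incrementAndGet();
    }

    // Stand-in for reopening the reader.
    void reopen() {
        searchingGen = indexingGen.get();
    }

    // True if a searcher obtained now would reflect the given generation.
    boolean isVisible(long generation) {
        return generation <= searchingGen;
    }
}
```

Because generations only increase, a caller that makes several changes needs to remember only the last returned value: once that one is visible, all earlier ones are too.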

Here's one example use-case: let's say your site has a forum, and you use Lucene to index and search all posts in the forum. Suddenly a user, Alice, comes online and adds a new post; in your server, you take the text from Alice's post and add it as a document to the index, using NRTManager.addDocument, saving the returned generation. If she adds multiple posts, just keep the last generation.

Now, if Alice stops posting and runs a search, you'd like to ensure her search covers all the posts she just made. Of course, if your reopen time is fast enough (say once per second), unless Alice types very quickly, any search she runs will already reflect her posts.

But pretend for now you reopen relatively infrequently (say once every 5 or 10 seconds), and you need to be certain Alice's search covers her posts, so you call NRTManager.waitForGeneration to obtain the SearcherManager to use for searching. If the latest searcher already covers the requested generation, the method returns immediately. Otherwise, it blocks, requesting a reopen (see below), until the required generation has become visible in a searcher, and then returns it.
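The "return immediately or block until a reopen catches up" behavior can be modeled with a lock and a condition variable. Again this is a hedged sketch in plain Java, not Lucene's code: GenerationGate and awaitGeneration are hypothetical names, and the timeout parameter is my own addition so the example can never hang.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Toy model of waiting for a generation; not Lucene's implementation.
class GenerationGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition reopened = lock.newCondition();
    private long indexingGen = 0;
    private long searchingGen = 0;

    // Stand-in for an indexing change; returns its generation.
    long recordChange() {
        lock.lock();
        try {
            return ++indexingGen;
        } finally {
            lock.unlock();
        }
    }

    // Stand-in for a reopen: publish all changes and wake any waiters.
    void reopen() {
        lock.lock();
        try {
            searchingGen = indexingGen;
            reopened.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Returns true once targetGen is visible, or false on timeout/interrupt.
    boolean awaitGeneration(long targetGen, long timeoutMillis) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (searchingGen < targetGen) {
                if (nanos <= 0) {
                    return false;
                }
                nanos = reopened.awaitNanos(nanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

If the target generation is already visible, the loop body never runs and the call returns at once, matching the fast path described above.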

If some other user, say Bob, doesn't add any posts and runs a search, you don't need to wait for Alice's generation to be visible when obtaining the searcher, since it's far less important that Alice's changes become immediately visible to Bob. There's (usually!) no causal connection between Alice posting and Bob searching, so it's fine for Bob to use the most recent searcher.

Another use-case is an index verifier, where you index a document and then immediately search for it to perform end-to-end validation that the document "made it" correctly into the index. That immediate search must first wait for the returned generation to become available.

The power of NRTManager is you have full control over which searches must see the effects of which indexing changes; this is a further improvement in Lucene's controlled consistency model. NRTManager hides all the tricky details of tracking generations.

But: don't abuse this! You may be tempted to always wait for the last generation you indexed for all searches, but this would result in very low search throughput on concurrent hardware since all searches would bunch up, waiting for reopens. With proper usage, only a small subset of searches should need to wait for a specific generation, like Alice; the rest will simply use the most recent searcher, like Bob.
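One way to keep the waiting confined to the searches that need it is to track the last generation per user. The sketch below is plain Java under my own assumptions (PerUserGenerations, recordChange and requiredGeneration are hypothetical names, and users are identified by a simple string); it only decides whether a search must wait, and leaves the actual waiting to whatever manages the searchers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: remembers the last generation each user produced.
class PerUserGenerations {
    private final Map<String, Long> lastGenByUser = new ConcurrentHashMap<>();

    // Call after each indexing change made on behalf of `user`.
    void recordChange(String user, long generation) {
        lastGenByUser.merge(user, generation, Long::max);
    }

    // Returns the generation this user's search must wait for, or -1 if
    // the most recent searcher is already good enough.
    long requiredGeneration(String user, long currentSearchingGen) {
        Long last = lastGenByUser.get(user);
        if (last == null) {
            return -1; // like Bob: made no changes, never waits
        }
        if (last <= currentSearchingGen) {
            lastGenByUser.remove(user, last); // already visible, forget it
            return -1;
        }
        return last; // like Alice: her own posts are not visible yet
    }
}
```

Only a user with changes newer than the current searcher gets a generation back; everyone else falls through to the most recent searcher.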

Managing reopens is a little trickier with NRTManager, since you should reopen at higher frequency whenever a search is waiting for a specific generation. To address this, there's the useful NRTManagerReopenThread class; use it like this:

double minStaleSec = 0.025;
double maxStaleSec = 5.0;
NRTManagerReopenThread thread = new NRTManagerReopenThread(
                                      nrtManager,
                                      maxStaleSec,
                                      minStaleSec);
thread.start();
...
thread.close();

The minStaleSec sets an upper bound on how frequently reopens should occur. This is used whenever a searcher is waiting for a specific generation (Alice, above), meaning the longest such a search should have to wait is approximately 25 msec.

The maxStaleSec sets a lower bound on how frequently reopens should occur. This is used for the periodic "ordinary" reopens, when there is no request waiting for a specific generation (Bob, above); this means any changes done to the index more than approximately 5.0 seconds ago will be seen when Bob searches. Note that these parameters are approximate targets and not hard guarantees on the reader turnaround time. Be sure to eventually call thread.close(), when you are done reopening (for example, on shutting down the application).
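The interplay between the two targets can be illustrated with a small scheduling function. This is a guess at the general shape of the decision, not the actual logic of NRTManagerReopenThread: ReopenSchedule and nextDelayMillis are hypothetical names, and the inputs (whether a search is waiting, and how long since the last reopen) are assumed to be supplied by the caller.

```java
// Hypothetical sketch: picks the delay before the next reopen from the
// two staleness targets. Not the real NRTManagerReopenThread logic.
class ReopenSchedule {

    // Returns how long to sleep, in milliseconds, before reopening again.
    static long nextDelayMillis(boolean searchIsWaiting,
                                long millisSinceLastReopen,
                                double minStaleSec,
                                double maxStaleSec) {
        // A waiting search (Alice) gets the short target; otherwise the
        // ordinary periodic target (Bob) applies.
        double targetSec = searchIsWaiting ? minStaleSec : maxStaleSec;
        long targetMillis = Math.round(targetSec * 1000.0);
        // Never return a negative delay if the target has already passed.
        return Math.max(0L, targetMillis - millisSinceLastReopen);
    }
}
```

With the values from the example above, a waiting search leads to a reopen within about 25 milliseconds of the previous one, while an idle index is reopened roughly every 5 seconds.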

You are also free to use your own strategy for calling maybeReopen; you don't have to use NRTManagerReopenThread. Just remember that getting it right, especially when searches are waiting for specific generations, can be tricky!
