Last time, I described the useful SearcherManager class, coming in the next (3.5.0) Lucene release, to periodically reopen your IndexSearcher when multiple threads need to share it. This class presents a very simple acquire/release API, hiding the thread-safe complexities of opening and closing the underlying IndexReaders.
But that example used a non near-real-time (NRT) IndexReader, which has relatively high turnaround time for index changes to become visible, since you must call IndexWriter.commit first.
If you have access to the IndexWriter that's actively changing the index (i.e., it's in the same JVM as your searchers), use an NRT reader instead! NRT readers let you decouple durability to hardware/OS crashes from visibility of changes to a new IndexReader. How frequently you commit (for durability) and how frequently you reopen (to see new changes) become fully separate decisions. This controlled consistency model that Lucene exposes is a nice "best of both worlds" blend between the traditional immediate and eventual consistency models.
Since reopening an NRT reader bypasses the costly commit, and shares some data structures directly in RAM instead of writing/reading to/from files, it provides extremely fast turnaround time on making index changes visible to searchers. Frequent reopens, such as every 50 milliseconds, even under relatively high indexing rates, are easily achievable on modern hardware.
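To make the split between durability and visibility concrete, here is a toy Java model. It is not Lucene code: ToyNrtIndex and all of its methods are hypothetical names invented for this sketch, and the "index" is just an in-memory list. It only illustrates that reopening changes what searches see while committing changes what would survive a crash, and that the two can happen on independent schedules.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy stand-in for an index; hypothetical, NOT part of Lucene. */
final class ToyNrtIndex {
    private final List<String> buffered = new ArrayList<>();  // every change made so far
    private List<String> visible = Collections.emptyList();   // snapshot that searches see
    private List<String> durable = Collections.emptyList();   // snapshot that survives a crash

    synchronized void addDocument(String doc) {
        buffered.add(doc);
    }

    /** Publishes a new searchable snapshot; does nothing for durability. */
    synchronized void reopen() {
        visible = Collections.unmodifiableList(new ArrayList<>(buffered));
    }

    /** Records what is durable; does nothing for visibility. */
    synchronized void commit() {
        durable = Collections.unmodifiableList(new ArrayList<>(buffered));
    }

    synchronized List<String> search() {
        return visible;
    }

    synchronized List<String> afterCrash() {
        return durable;
    }

    public static void main(String[] args) {
        ToyNrtIndex index = new ToyNrtIndex();
        index.addDocument("post-1");
        index.reopen();   // searchable from here on, but not yet durable
        System.out.println("searchable: " + index.search());
        System.out.println("durable:    " + index.afterCrash());
        index.commit();   // durable from here on
        System.out.println("durable:    " + index.afterCrash());
    }
}
```

In a real application the reopen call would typically run many times per second and the commit far less often; the toy model simply shows that neither call depends on the other.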
Fortunately, it's trivial to use SearcherManager with NRT readers: use the constructor that takes IndexWriter instead of Directory:

    // writer is your already-open IndexWriter;
    // MySearchWarmer is your own warmer implementation
    boolean applyAllDeletes = true;
    ExecutorService es = null;
    SearcherManager mgr = new SearcherManager(writer, applyAllDeletes,
                                              new MySearchWarmer(), es);

This tells SearcherManager that its source for new IndexReaders is the provided IndexWriter instance (instead of a Directory instance). After that, use the SearcherManager just as before.
Typically you'll set the applyAllDeletes boolean to true, meaning each reopened reader is required to apply all previous deletion operations (deleteDocuments or updateDocument/s) up until that point.
Sometimes your usage won't require deletions to be applied. For example, perhaps you index multiple versions of each document over time, always deleting the older versions, yet during searching you have some way to ignore the old versions. If that's the case, you can pass applyAllDeletes=false instead. This will make the turnaround time quite a bit faster, as the primary-key lookups required to resolve deletes can be costly. However, if you're using Lucene's trunk (to be eventually released as 4.0), another option is to use MemoryCodec on your id field to greatly reduce the primary-key lookup time.
Note that some or even all of the previous deletes may still be applied even if you pass false. Also, the pending deletes are never lost if you pass false: they remain buffered and will still eventually be applied.
If you have some searches that can tolerate unapplied deletes and others that cannot, it's perfectly fine to create two SearcherManagers, one applying deletes and one not.
If you pass a non-null ExecutorService, then each segment in the index can be searched concurrently; this is a way to gain concurrency within a single search request. Most applications do not require this, because the concurrency across multiple searches is sufficient. It's also not clear that this is effective in general as it adds per-segment overhead, and the available concurrency is a function of your index structure. Perversely, a fully optimized index, which has only a single segment, will have no concurrency! Most applications should pass null.

NRTManager
What if you want the fast turnaround time of NRT readers, but need control over when specific index changes become visible to certain searches? Use NRTManager!
NRTManager holds onto the IndexWriter instance you provide and then exposes the same APIs for making index changes (addDocument/s, updateDocument/s, deleteDocuments). These methods forward to the underlying IndexWriter, but then return a generation token (a Java long) which you can hold onto after making any given change. The generation only increases over time, so if you make a group of changes, just keep the generation returned from the last change you made.
Then, when a given search request requires certain changes to be visible, pass that generation back to NRTManager to obtain a searcher that's guaranteed to reflect all changes for that generation.
Here's one example use-case: let's say your site has a forum, and you use Lucene to index and search all posts in the forum. Suddenly a user, Alice, comes online and adds a new post; in your server, you take the text from Alice's post and add it as a document to the index, using NRTManager.addDocument, saving the returned generation. If she adds multiple posts, just keep the last generation.
Now, if Alice stops posting and runs a search, you'd like to ensure her search covers all the posts she just made. Of course, if your reopen time is fast enough (say once per second), unless Alice types very quickly, any search she runs will already reflect her posts.
But pretend for now you reopen relatively infrequently (say once every 5 or 10 seconds), and you need to be certain Alice's search covers her posts, so you call NRTManager.waitForGeneration to obtain the SearcherManager to use for searching. If the latest searcher already covers the requested generation, the method returns immediately. Otherwise, it blocks, requesting a reopen (see below), until the required generation has become visible in a searcher, and then returns it.
If some other user, say Bob, doesn't add any posts and runs a search, you don't need to wait for Alice's generation to be visible when obtaining the searcher, since it's far less important when Alice's changes become visible to Bob. There's (usually!) no causal connection between Alice posting and Bob searching, so it's fine for Bob to use the most recent searcher.
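The Alice and Bob scenario can be sketched with a small, self-contained Java model of generation tracking. This is not the real NRTManager: GenerationTracker, recordChange, reopen and currentSearchingGen are hypothetical names invented here, there is no actual index behind them, and unlike the real class this model does not request a reopen itself, so some other thread has to call reopen. It only shows the contract described above: every change returns an increasing generation, and waiting on a generation returns immediately if it is already visible and blocks otherwise.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Simplified model of generation tracking; hypothetical, NOT the real NRTManager. */
final class GenerationTracker {
    // Generation of the most recent indexing change.
    private final AtomicLong indexingGen = new AtomicLong();
    // Newest generation visible to searchers (guarded by this).
    private long searchingGen;

    /** Stands in for addDocument/updateDocument/deleteDocuments: returns the change's generation. */
    long recordChange() {
        return indexingGen.incrementAndGet();
    }

    /** Stands in for a reopen: every change recorded so far becomes visible, and waiters are woken. */
    synchronized void reopen() {
        searchingGen = indexingGen.get();
        notifyAll();
    }

    /** Returns at once if targetGen is already visible; otherwise blocks until a reopen covers it. */
    synchronized void waitForGeneration(long targetGen) {
        try {
            while (searchingGen < targetGen) {
                wait();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for generation " + targetGen, e);
        }
    }

    synchronized long currentSearchingGen() {
        return searchingGen;
    }

    public static void main(String[] args) throws InterruptedException {
        GenerationTracker tracker = new GenerationTracker();
        tracker.recordChange();                    // Alice's first post
        long aliceGen = tracker.recordChange();    // Alice's second post: keep only the last generation

        Thread reopener = new Thread(() -> {
            try {
                Thread.sleep(100);                 // simulate an infrequent reopen schedule
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            tracker.reopen();
        });
        reopener.start();

        tracker.waitForGeneration(0);              // Bob: nothing to wait for, returns immediately
        tracker.waitForGeneration(aliceGen);       // Alice: blocks until the reopen covers her posts
        System.out.println("visible generation: " + tracker.currentSearchingGen());
        reopener.join();
    }
}
```

Bob's call passes a generation that is always already visible, which is the model's equivalent of simply using the most recent searcher.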
Another use-case is an index verifier, where you index a document and then immediately search for it to perform end-to-end validation that the document "made it" correctly into the index. That immediate search must first wait for the returned generation to become available.
The power of NRTManager is that you have full control over which searches must see the effects of which indexing changes; this is a further improvement in Lucene's controlled consistency model. NRTManager hides all the tricky details of tracking generations.
But: don't abuse this! You may be tempted to always wait for the last generation you indexed for all searches, but this would result in very low search throughput on concurrent hardware since all searches would bunch up, waiting for reopens. With proper usage, only a small subset of searches should need to wait for a specific generation, like Alice; the rest will simply use the most recent searcher, like Bob.
Managing reopens is a little trickier with NRTManager, since you should reopen at higher frequency whenever a search is waiting for a specific generation. To address this, there's the useful NRTManagerReopenThread class; use it like this:

    // Target staleness while a search is waiting for a generation (~25 msec)
    double minStaleSec = 0.025;
    // Target staleness when no search is waiting (~5 seconds)
    double maxStaleSec = 5.0;
    NRTManagerReopenThread thread = new NRTManagerReopenThread(
                                          nrtManager,
                                          maxStaleSec,
                                          minStaleSec);
    thread.start();
    ...
    // On shutdown:
    thread.close();

The minStaleSec sets an upper bound on how frequently reopens should occur. This is used whenever a searcher is waiting for a specific generation (Alice, above), meaning the longest such a search should have to wait is approximately 25 msec. The maxStaleSec sets a lower bound on how frequently reopens should occur. This is used for the periodic "ordinary" reopens, when there is no request waiting for a specific generation (Bob, above); this means any changes done to the index more than approximately 5.0 seconds ago will be seen when Bob searches. Note that these parameters are approximate targets and not hard guarantees on the reader turnaround time. Be sure to eventually call thread.close() when you are done reopening (for example, on shutting down the application).
You are also free to use your own strategy for calling maybeReopen; you don't have to use NRTManagerReopenThread. Just remember that getting it right, especially when searches are waiting for specific generations, can be tricky!
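If you do write your own reopen strategy, the core decision is how long to sleep before the next reopen. The following Java sketch shows one way to express the two staleness targets described above. It is an assumption-laden illustration, not the implementation of NRTManagerReopenThread: ReopenPolicy and sleepNanos are hypothetical names, and the caller is assumed to know whether any search is currently waiting for a generation.

```java
/** Sketch of the staleness decision for a reopen loop; hypothetical, NOT NRTManagerReopenThread. */
final class ReopenPolicy {
    private final long minStaleNanos;  // target staleness while a search is waiting
    private final long maxStaleNanos;  // target staleness when nobody is waiting

    ReopenPolicy(double maxStaleSec, double minStaleSec) {
        if (minStaleSec > maxStaleSec) {
            throw new IllegalArgumentException("minStaleSec must not exceed maxStaleSec");
        }
        this.maxStaleNanos = Math.round(maxStaleSec * 1e9);
        this.minStaleNanos = Math.round(minStaleSec * 1e9);
    }

    /**
     * Returns how long the loop may sleep before reopening again, never negative.
     * Times are nanosecond readings from the same clock, for example System.nanoTime().
     */
    long sleepNanos(boolean searchIsWaiting, long lastReopenNanos, long nowNanos) {
        long targetStaleNanos = searchIsWaiting ? minStaleNanos : maxStaleNanos;
        long elapsedNanos = nowNanos - lastReopenNanos;
        return Math.max(0L, targetStaleNanos - elapsedNanos);
    }

    public static void main(String[] args) {
        ReopenPolicy policy = new ReopenPolicy(5.0, 0.025);
        long lastReopen = 0L;
        long now = 10_000_000L;  // 10 msec after the last reopen
        System.out.println("waiting search (Alice): sleep "
            + policy.sleepNanos(true, lastReopen, now) / 1_000_000L + " msec");
        System.out.println("no waiting search (Bob): sleep "
            + policy.sleepNanos(false, lastReopen, now) / 1_000_000L + " msec");
    }
}
```

A real loop would also need to wake up early when a search starts waiting in the middle of a long sleep, which is part of what makes getting this right tricky.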