Lucene's TokenStream class produces the sequence of tokens to be indexed for a document's fields. The API is an iterator: you call incrementToken to advance to the next token, and then query specific attributes to obtain the details for that token. For example, CharTermAttribute holds the text of the token; OffsetAttribute has the character start and end offsets into the original string corresponding to this token, for highlighting purposes. There are a number of standard token attributes, and some tokenizers add their own attributes.
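For example, a minimal consumer loop looks roughly like this (a sketch against the 3.6-era API; the field name "body", the sample text, and the choice of StandardAnalyzer are just placeholders):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

public class PrintTokens {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    TokenStream ts = analyzer.tokenStream("body", new StringReader("fast wi fi network is down"));

    // Attributes are requested once, up front; incrementToken() updates them in place.
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);

    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(termAtt + " [" + offsetAtt.startOffset() + "-" + offsetAtt.endOffset() + "]");
    }
    ts.end();
    ts.close();
  }
}
```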
The TokenStream is actually a chain, starting with a Tokenizer that splits characters into initial tokens, followed by any number of TokenFilters that modify the tokens. You can also use a CharFilter to pre-process the characters before tokenization, for example to strip out HTML markup, remap character sequences or replace characters according to a regular expression, while preserving the proper offsets back into the original input string. Analyzer is the factory class that creates TokenStreams when needed. Lucene and Solr have a wide variety of Tokenizers and TokenFilters, including support for at least 34 languages.
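As a concrete sketch, a custom chain is typically wired up inside an Analyzer subclass like this (3.6-era API, where tokenStream is overridden; MyAnalyzer and the particular Tokenizer/TokenFilter choices are just illustrative):

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical analyzer: a Tokenizer followed by one TokenFilter.
public final class MyAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_36, reader); // split on whitespace
    ts = new LowerCaseFilter(Version.LUCENE_36, ts);                     // each filter wraps the previous stage
    return ts;
  }
}
```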
Let's tokenize a simple example: fast wi fi network is down. Assume we preserve stop words. When viewed as a graph, the tokens look like this:
Each node is a position, and each arc is a token. The TokenStream enumerates a directed acyclic graph, one arc at a time. Next, let's add SynonymFilter into our analysis chain, applying these synonyms:
- fast → speedy
- wi fi → wifi
- wi fi network → hotspot
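One way to wire these rules into a chain, sketched against the 3.6-era SynonymMap/SynonymFilter API (the class name MySynonymAnalyzer is made up for this example):

```java
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.Version;

// Hypothetical analyzer applying the three synonym rules above.
public final class MySynonymAnalyzer extends Analyzer {

  private final SynonymMap synonyms;

  public MySynonymAnalyzer() throws IOException {
    SynonymMap.Builder builder = new SynonymMap.Builder(true);        // true: dedup rules
    builder.add(new CharsRef("fast"), new CharsRef("speedy"), true);  // keepOrig=true keeps "fast" too
    // Multi-word inputs are joined with SynonymMap.WORD_SEPARATOR:
    builder.add(SynonymMap.Builder.join(new String[] {"wi", "fi"}, new CharsRef()),
                new CharsRef("wifi"), true);
    builder.add(SynonymMap.Builder.join(new String[] {"wi", "fi", "network"}, new CharsRef()),
                new CharsRef("hotspot"), true);
    synonyms = builder.build();
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_36, reader);
    return new SynonymFilter(ts, synonyms, true);                     // ignoreCase=true
  }
}
```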
Now the graph is more interesting! For each token (arc), the PositionIncrementAttribute tells us how many positions (nodes) ahead this arc starts from, while the new (as of 3.6.0) PositionLengthAttribute tells us how many positions (nodes) ahead the arc arrives.
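Here is a small sketch that reads both attributes to recover each arc's start and end nodes (it reuses the hypothetical MySynonymAnalyzer from the previous sketch):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

public class PrintGraph {
  public static void main(String[] args) throws Exception {
    TokenStream ts = new MySynonymAnalyzer()
        .tokenStream("body", new StringReader("fast wi fi network is down"));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);
    PositionLengthAttribute posLenAtt = ts.addAttribute(PositionLengthAttribute.class);

    ts.reset();
    int node = -1;
    while (ts.incrementToken()) {
      node += posIncAtt.getPositionIncrement();            // node the arc leaves from
      int endNode = node + posLenAtt.getPositionLength();  // node the arc arrives at
      System.out.println(termAtt + ": " + node + " -> " + endNode);
    }
    ts.end();
    ts.close();
  }
}
```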
Besides SynonymFilter, several other analysis components now produce token graphs. Kuromoji's JapaneseTokenizer outputs the decompounded form for compound tokens. For example, tokens like ショッピングセンター (shopping center) will also have an alternate path with ショッピング (shopping) followed by センター (center). Both ShingleFilter and CommonGramsFilter set the position length to 2 when they merge two input tokens. Other analysis components should produce a graph but don't yet (patches welcome!): WordDelimiterFilter, DictionaryCompoundWordTokenFilter, HyphenationCompoundWordTokenFilter, NGramTokenFilter, EdgeNGramTokenFilter, and likely others.
Limitations
There are unfortunately several hard-to-fix problems with token graphs. One problem is that the indexer completely ignores PositionLengthAttribute; it only pays attention to PositionIncrementAttribute. This means the indexer acts as if all arcs always arrive at the very next position, so for the above graph we actually index this:
This means certain phrase queries should match but don't (e.g.: "hotspot is down"), and other phrase queries shouldn't match but do (e.g.: "fast hotspot fi"). Other cases do work correctly (e.g.: "fast hotspot"). We refer to this "lossy serialization" as sausagization, because the incoming graph is unexpectedly turned from a correct word lattice into an incorrect sausage. This limitation is challenging to fix: it requires changing the index format (and Codec APIs) to store an additional int position length per position, and then fixing positional queries to respect this value.
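To make the pitfall concrete, here is a sketch (again 3.6-era APIs, reusing the hypothetical MySynonymAnalyzer) in which the phrase "fast hotspot fi", which never occurs in the text, nevertheless matches the sausagized index:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SausagizationDemo {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_36, new MySynonymAnalyzer()));
    Document doc = new Document();
    doc.add(new Field("body", "fast wi fi network is down", Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();

    IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
    PhraseQuery pq = new PhraseQuery();   // "fast hotspot fi" -- should never match
    pq.add(new Term("body", "fast"));
    pq.add(new Term("body", "hotspot"));
    pq.add(new Term("body", "fi"));
    System.out.println("hits: " + searcher.search(pq, 10).totalHits);  // 1, because position length was discarded
  }
}
```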
QueryParser also ignores position length, but this should be easier to fix. That would mean you could run graph analyzers at query time (i.e., query-time expansion) and get the correct results.
Another problem is that SynonymFilter also unexpectedly performs its own form of sausagization when the injected synonym is more than one token. For example, if you have this rule:

- dns → domain name service
With text like dns is up, notice how name gets overlapped onto is, and service onto up. It's an odd word salad! This of course also messes up phrase queries ("domain name service is up" should match but doesn't, while "dns name up" shouldn't match but does). To work around this problem you should ensure all of your injected synonyms are single tokens! For this case, you could run the reverse mapping (domain name service → dns) at query time (as well as indexing time), and then both the query dns and the query domain name service will match any document containing either variant.
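A sketch of that reverse mapping with the same 3.6-era SynonymMap builder (a fragment; plug the resulting map into the SynonymFilter of both your index-time and query-time chains):

```java
// Map the multi-token variant down to the single token, so the injected synonym
// ("dns") is always one token; keepOrig=true keeps "domain name service" too.
SynonymMap.Builder builder = new SynonymMap.Builder(true);
builder.add(SynonymMap.Builder.join(new String[] {"domain", "name", "service"}, new CharsRef()),
            new CharsRef("dns"), true);
SynonymMap reverseMap = builder.build();
```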
The overlapping happens because SynonymFilter never creates new positions; if it did, it could make new positions for the tokens in domain name service, and then change dns to position length 3.

Another problem is that SynonymFilter, like the indexer, also ignores the position length of incoming tokens: it cannot properly consume a token graph. So if you added a second SynonymFilter, it would fail to match hotspot is down.

We've only just started, but bit by bit our token streams are producing graphs!