Analyzers in Depth

As introduced earlier, analyzers prepare your searchable text for indexing and searching.

Your choice of analyzer determines which terms are indexed, and therefore which queries can match. Analyzers are concrete classes that extend the class org.apache.lucene.analysis.Analyzer. The GSS ships with several analyzers, and you can create and use your own. If you are tempted to define a field as untokenized, consider first whether a more carefully chosen analyzer would serve you better.

Each Search Service has a default analyzer. Any Search Service Field can override that default by naming a specific analyzer for use with that field (see analyzerName). GSS uses the same analyzer for both indexing and searching.
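The resolution rule described above can be pictured as a lookup with a fallback. The following is a minimal sketch only, not GSS code: the class AnalyzerResolution, its methods, and the field names used with it are hypothetical and exist purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative model of "service default plus per-field override"; not part of GSS. */
final class AnalyzerResolution {
    // Analyzer name used when a field declares no override (hypothetical model).
    private final String defaultAnalyzer;
    // Field name -> analyzer name, for fields that override the default.
    private final Map<String, String> fieldOverrides = new HashMap<>();

    AnalyzerResolution(String defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    /** Records that the given field uses its own analyzer. */
    void override(String field, String analyzerName) {
        fieldOverrides.put(field, analyzerName);
    }

    /** Returns the field's own analyzer if one is set, else the service default. */
    String analyzerFor(String field) {
        return fieldOverrides.getOrDefault(field, defaultAnalyzer);
    }
}
```

Because the same analyzer is used at index time and at search time, a single lookup like analyzerFor is enough to model both paths.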

The Generic Search Server provides the following predefined analyzers.

LUCENESTANDARD
Splits text at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token. Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split. Recognizes email addresses and internet hostnames as one token. Normalizes token text to lower case and removes common English stop words.
STANDARD
Similar to the LUCENESTANDARD analyzer, but common stop words are removed from the tokenized terms, and content that consists of a single number is left unaltered. This makes it suitable for processing generated infrastructure IDs, which may be negative numbers.
SIMPLE
Splits text at non-letter characters and normalizes token text to lower case.
STOP
Splits text at non-letter characters, normalizes token text to lower case and removes common English stop words.
WHITESPACE
Splits text at whitespace. Adjacent sequences of non-whitespace characters form tokens.
KEYWORD
"Tokenizes" the entire stream as a single token. This is useful for data like zip codes, IDs, and some product names.
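
To make the differences between the simpler analyzers concrete, the sketch below approximates SIMPLE, STOP, WHITESPACE and KEYWORD in plain Java. It is not the Lucene implementation: the class AnalyzerSketch is hypothetical, and its stop word list is a shortened stand-in for the list Lucene actually uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-Java approximation of four predefined analyzers; not the Lucene code. */
final class AnalyzerSketch {

    // Assumption: abbreviated stop word list, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    /** SIMPLE: split at non-letter characters, then lower-case each token. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** STOP: same as SIMPLE, then drop stop words. */
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    /** WHITESPACE: split at whitespace only; case and punctuation are kept. */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** KEYWORD: the whole input becomes one token, unchanged. */
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }
}
```

For an input such as "The ZIP code is SW1A-2AA", SIMPLE and STOP break the postcode apart at the digits and the hyphen, WHITESPACE keeps it as one token alongside the other words, and KEYWORD keeps the whole string as one token. This is why identifier-like fields are usually better served by KEYWORD than by a letter-based analyzer.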

Note that if you use an analyzer other than a predefined GSS analyzer or one shipped with Lucene, its class must be available on the Generic Search Server classpath.
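
One way to check the classpath requirement before deploying a custom analyzer is to try loading the class by name. The sketch below is an assumption about how such a check could look, not a GSS facility: ClasspathCheck and isAvailable are hypothetical names, and the check must run with the same classpath as the server to be meaningful.

```java
/** Illustrative classpath check; not part of GSS. */
final class ClasspathCheck {

    /**
     * Returns true if className can be loaded and is a subtype of expectedSuperclass.
     * For a custom analyzer, expectedSuperclass would be
     * org.apache.lucene.analysis.Analyzer.class.
     */
    static boolean isAvailable(String className, Class<?> expectedSuperclass) {
        try {
            // Load without running static initializers; we only need the type.
            Class<?> candidate =
                    Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return expectedSuperclass.isAssignableFrom(candidate);
        } catch (ClassNotFoundException | LinkageError e) {
            // Class (or one of its dependencies) is missing from the classpath.
            return false;
        }
    }
}
```

A false result means either that the class cannot be found or that it does not extend the expected base class; in both cases the analyzer could not be used.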