Package org.apache.lucene.search.similarities
Similarity serves as the base for ranking functions. For searching, users can employ the models already implemented or create their own by extending one of the classes in this package.
Summary of the Ranking Methods
ClassicSimilarity is the original Lucene scoring function. It is based on a highly optimized Vector Space Model. For more information, see TFIDFSimilarity.
BM25Similarity is an optimized implementation of the successful Okapi BM25 model.
SimilarityBase provides a basic implementation of the Similarity contract and exposes a highly simplified interface, which makes it an ideal starting point for new ranking functions. Lucene ships the following methods built on SimilarityBase:
 Amati and Rijsbergen's DFR framework;
 Clinchant and Gaussier's Information-based models for IR;
 The implementation of two language models from Zhai and Lafferty's paper;
 Divergence from independence models as described in "IRRA at TREC 2012" (Dinçer).
Because SimilarityBase is not optimized to the same extent as ClassicSimilarity and BM25Similarity, a difference in performance is to be expected when using the methods listed above. However, optimizations can always be implemented in subclasses; see Extending SimilarityBase below.
Changing Similarity
Chances are the available Similarities are sufficient for all your searching needs. However, in some applications it may be necessary to customize your Similarity implementation. For instance, some applications do not need to distinguish between shorter and longer documents (see a "fair" similarity).
To change Similarity, one must do so for both indexing and searching, and the changes must happen before either of these actions takes place. Although in theory there is nothing stopping you from changing midstream, it just isn't well-defined what is going to happen.
To make this change, implement your own Similarity (likely you'll want to simply subclass an existing method, be it ClassicSimilarity or a descendant of SimilarityBase), and then register the new class by calling IndexWriterConfig.setSimilarity(Similarity) before indexing and IndexSearcher.setSimilarity(Similarity) before searching.
Extending SimilarityBase
The easiest way to quickly implement a new ranking method is to extend SimilarityBase, which provides basic implementations of the low-level Similarity methods. Subclasses are only required to implement the SimilarityBase.score(BasicStats, float, float) and SimilarityBase.toString() methods.
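The two-method contract described above can be sketched without Lucene on the classpath. The classes below are hypothetical stand-ins (StatsStandIn for BasicStats, BaseStandIn for SimilarityBase), and the scoring formula is an arbitrary illustration rather than any ranking function shipped with Lucene; only the shape of the contract, one score method plus toString, is taken from the text.

```java
/** Standalone sketch of the "implement score and toString" contract. Not Lucene code. */
final class SimilarityBaseSketch {

    /** Hypothetical stand-in for BasicStats: only the two values this sketch reads. */
    static final class StatsStandIn {
        final long numberOfDocuments;
        final long docFreq;

        StatsStandIn(long numberOfDocuments, long docFreq) {
            this.numberOfDocuments = numberOfDocuments;
            this.docFreq = docFreq;
        }
    }

    /** Hypothetical stand-in for SimilarityBase: subclasses supply exactly two methods. */
    abstract static class BaseStandIn {
        protected abstract float score(StatsStandIn stats, float freq, float docLen);

        @Override
        public abstract String toString();
    }

    /** Illustrative ranking function that deliberately ignores document length. */
    static final class LengthIgnoringTfIdf extends BaseStandIn {
        @Override
        protected float score(StatsStandIn stats, float freq, float docLen) {
            // Assumed formula for illustration: sqrt(tf) * ln(1 + (N + 1) / (df + 1)).
            double idf = Math.log(1.0 + (stats.numberOfDocuments + 1.0) / (stats.docFreq + 1.0));
            return (float) (Math.sqrt(freq) * idf);
        }

        @Override
        public String toString() {
            return "LengthIgnoringTfIdf";
        }
    }

    public static void main(String[] args) {
        BaseStandIn sim = new LengthIgnoringTfIdf();
        StatsStandIn stats = new StatsStandIn(1000, 10);
        System.out.println(sim + " score: " + sim.score(stats, 4f, 120f));
    }
}
```

In a real subclass the same two overrides would extend SimilarityBase and read their statistics from BasicStats; the rest of the Similarity contract is what the base class is described as providing.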
Another option is to extend one of the frameworks based on SimilarityBase. These Similarities are implemented modularly, e.g. DFRSimilarity delegates computation of the three parts of its formula to the classes BasicModel, AfterEffect and Normalization. Instead of subclassing the Similarity, one can simply introduce a new basic model and tell DFRSimilarity to use it.
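The modular design can be illustrated with a standalone sketch in which three small interfaces play the roles of BasicModel, AfterEffect and Normalization. Everything here is a hypothetical stand-in rather than Lucene code: the Stats holder, the interface signatures, and the way the parts are multiplied together are assumptions, and the component formulas follow commonly published DFR forms (inverse document frequency model, Laplace after-effect, H2 normalization with its free parameter fixed at 1).

```java
import java.util.Locale;

/** Standalone illustration of a modular, three-part scoring formula. Not Lucene code. */
final class DfrCompositionSketch {

    /** Hypothetical stand-in for the collection statistics passed to each component. */
    static final class Stats {
        final long numberOfDocuments;
        final long docFreq;
        final double avgFieldLength;

        Stats(long numberOfDocuments, long docFreq, double avgFieldLength) {
            this.numberOfDocuments = numberOfDocuments;
            this.docFreq = docFreq;
            this.avgFieldLength = avgFieldLength;
        }
    }

    interface BasicModel { double score(Stats stats, double tfn); }
    interface AfterEffect { double score(Stats stats, double tfn); }
    interface Normalization { double tfn(Stats stats, double tf, double docLen); }

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Assumed component formulas, kept deliberately small.
    static final BasicModel MODEL_IN =
        (stats, tfn) -> tfn * log2((stats.numberOfDocuments + 1.0) / (stats.docFreq + 0.5));
    static final AfterEffect EFFECT_L = (stats, tfn) -> 1.0 / (tfn + 1.0);
    static final Normalization NORM_H2 =
        (stats, tf, docLen) -> tf * log2(1.0 + stats.avgFieldLength / docLen);
    static final Normalization NORM_NONE = (stats, tf, docLen) -> tf;

    /** Normalizes the term frequency, then multiplies the basic model by the after-effect. */
    static double score(BasicModel model, AfterEffect effect, Normalization norm,
                        Stats stats, double tf, double docLen) {
        double tfn = norm.tfn(stats, tf, docLen);
        return model.score(stats, tfn) * effect.score(stats, tfn);
    }

    public static void main(String[] args) {
        Stats stats = new Stats(1000, 10, 100.0);
        // Swapping one component changes the ranking function without touching the others.
        System.out.printf(Locale.ROOT, "no normalization: %.4f%n",
            score(MODEL_IN, EFFECT_L, NORM_NONE, stats, 3.0, 400.0));
        System.out.printf(Locale.ROOT, "H2 normalization: %.4f%n",
            score(MODEL_IN, EFFECT_L, NORM_H2, stats, 3.0, 400.0));
    }
}
```

The point of the design is that a new basic model is a single small implementation; the combining code and the other two components stay unchanged.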
Changing ClassicSimilarity
If you are interested in use cases for changing your similarity, see the Lucene users' mailing list thread "Overriding Similarity". In summary, here are a few use cases:
The SweetSpotSimilarity in org.apache.lucene.misc gives small increases as the frequency increases a small amount and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.
Overriding tf: In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these cases people have overridden Similarity to return 1 from the tf() method.
Changing Length Normalization: By overriding Similarity.computeNorm(org.apache.lucene.index.FieldInvertState state), it is possible to discount how the length of a field contributes to a score. In ClassicSimilarity, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be 1 / (numTerms in field), all fields will be treated "fairly".
[One would override the Similarity in] ... any situation where you know more about your data than just that it's "text" is a situation where it *might* make sense to override your Similarity method.
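To make the two length-normalization formulas from the use cases above concrete, the following standalone sketch evaluates both for a few field lengths. It is plain arithmetic only: the class and method names are hypothetical, and it does not model how Lucene encodes or stores the value returned by computeNorm.

```java
import java.util.Locale;

/** Standalone arithmetic for the two length-normalization formulas. Not Lucene code. */
final class LengthNormSketch {

    /** lengthNorm = 1 / (numTerms in field)^0.5, the form quoted for ClassicSimilarity. */
    static double sqrtNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** lengthNorm = 1 / (numTerms in field), the alternative described in the text. */
    static double linearNorm(int numTerms) {
        return 1.0 / numTerms;
    }

    public static void main(String[] args) {
        for (int numTerms : new int[] {1, 4, 100, 10_000}) {
            System.out.printf(Locale.ROOT, "numTerms=%d sqrt=%.4f linear=%.4f%n",
                numTerms, sqrtNorm(numTerms), linearNorm(numTerms));
        }
    }
}
```

Going from 4 to 100 terms shrinks the square-root norm by a factor of 5 but the linear norm by a factor of 25, so the linear form weights field length much more strongly.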

Interface Summary
Interface: Description
LMSimilarity.CollectionModel: A strategy for computing the collection language model.
Class Summary
Class: Description
AfterEffect: This class acts as the base class for the implementations of the first normalization of the informative content in the DFR framework.
AfterEffect.NoAfterEffect: Implementation used when there is no aftereffect.
AfterEffectB: Model of the information gain based on the ratio of two Bernoulli processes.
AfterEffectL: Model of the information gain based on Laplace's law of succession.
Axiomatic: Axiomatic approaches for IR.
AxiomaticF1EXP: F1EXP is defined as Sum(tf(term_doc_freq)*ln(docLen)*IDF(term)), where IDF(t) = pow((N+1)/df(t), k), N = total num of docs, df = doc freq.
AxiomaticF1LOG: F1LOG is defined as Sum(tf(term_doc_freq)*ln(docLen)*IDF(term)), where IDF(t) = ln((N+1)/df(t)), N = total num of docs, df = doc freq.
AxiomaticF2EXP: F2EXP is defined as Sum(tfln(term_doc_freq, docLen)*IDF(term)), where IDF(t) = pow((N+1)/df(t), k), N = total num of docs, df = doc freq.
AxiomaticF2LOG: F2LOG is defined as Sum(tfln(term_doc_freq, docLen)*IDF(term)), where IDF(t) = ln((N+1)/df(t)), N = total num of docs, df = doc freq.
AxiomaticF3EXP: F3EXP is defined as Sum(tf(term_doc_freq)*IDF(term)-gamma(docLen, queryLen)), where IDF(t) = pow((N+1)/df(t), k), N = total num of docs, df = doc freq, gamma(docLen, queryLen) = (docLen-queryLen)*queryLen*s/avdl.
AxiomaticF3LOG: F3LOG is defined as Sum(tf(term_doc_freq)*IDF(term)-gamma(docLen, queryLen)), where IDF(t) = ln((N+1)/df(t)), N = total num of docs, df = doc freq, gamma(docLen, queryLen) = (docLen-queryLen)*queryLen*s/avdl.
BasicModel: This class acts as the base class for the specific basic model implementations in the DFR framework.
BasicModelBE: Limiting form of the Bose-Einstein model.
BasicModelD: Implements the approximation of the binomial model with the divergence for DFR.
BasicModelG: Geometric as limiting form of the Bose-Einstein model.
BasicModelIF: An approximation of the I(n_e) model.
BasicModelIn: The basic tf-idf model of randomness.
BasicModelIne: Tf-idf model of randomness, based on a mixture of Poisson and inverse document frequency.
BasicModelP: Implements the Poisson approximation for the binomial model for DFR.
BasicStats: Stores all statistics commonly used by ranking methods.
BM25Similarity: BM25 Similarity.
BooleanSimilarity: Simple similarity that gives terms a score that is equal to their query boost.
ClassicSimilarity: Expert: Historical scoring implementation.
DFISimilarity: Implements the Divergence from Independence (DFI) model based on Chi-square statistics (i.e., standardized Chi-squared distance from independence in term frequency tf).
DFRSimilarity: Implements the divergence from randomness (DFR) framework introduced in Gianni Amati and Cornelis Joost Van Rijsbergen.
Distribution: The probabilistic distribution used to model term occurrence in information-based models.
DistributionLL: Log-logistic distribution.
DistributionSPL: The smoothed power-law (SPL) distribution for the information-based framework that is described in the original paper.
IBSimilarity: Provides a framework for the family of information-based models, as described in Stéphane Clinchant and Eric Gaussier.
Independence: Computes the measure of divergence from independence for DFI scoring functions.
IndependenceChiSquared: Normalized chi-squared measure of distance from independence.
IndependenceSaturated: Saturated measure of distance from independence.
IndependenceStandardized: Standardized measure of distance from independence.
Lambda: The lambda (λ_w) parameter in information-based models.
LambdaDF: Computes lambda as (docFreq+1) / (numberOfDocuments+1).
LambdaTTF: Computes lambda as (totalTermFreq+1) / (numberOfDocuments+1).
LMDirichletSimilarity: Bayesian smoothing using Dirichlet priors.
LMJelinekMercerSimilarity: Language model based on the Jelinek-Mercer smoothing method.
LMSimilarity: Abstract superclass for language modeling Similarities.
LMSimilarity.DefaultCollectionModel: Models p(w|C) as the number of occurrences of the term in the collection, divided by the total number of tokens + 1.
LMSimilarity.LMStats: Stores the collection distribution of the current term.
MultiSimilarity: Implements the CombSUM method for combining evidence from multiple similarity values described in: Joseph A.
Normalization: This class acts as the base class for the implementations of the term frequency normalization methods in the DFR framework.
Normalization.NoNormalization: Implementation used when there is no normalization.
NormalizationH1: Normalization model that assumes a uniform distribution of the term frequency.
NormalizationH2: Normalization model in which the term frequency is inversely related to the length.
NormalizationH3: Dirichlet Priors normalization.
NormalizationZ: Pareto-Zipf Normalization.
PerFieldSimilarityWrapper: Provides the ability to use a different Similarity for different fields.
Similarity: Similarity defines the components of Lucene scoring.
Similarity.SimScorer
Similarity.SimWeight: Stores the weight for a query across the indexed collection.
SimilarityBase: A subclass of Similarity that provides a simplified API for its descendants.
TFIDFSimilarity: Implementation of Similarity with the Vector Space Model.
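The two lambda entries in the class summary are easy to misread because of operator precedence. The standalone sketch below assumes the intended grouping is (x + 1) / (numberOfDocuments + 1), which keeps the result positive even for a term that never occurs; the class and method names are hypothetical and this is not the Lucene implementation.

```java
import java.util.Locale;

/** Standalone arithmetic for the two lambda variants. Not Lucene code. */
final class LambdaSketch {

    /** lambda = (docFreq + 1) / (numberOfDocuments + 1). */
    static double lambdaDF(long docFreq, long numberOfDocuments) {
        return (docFreq + 1.0) / (numberOfDocuments + 1.0);
    }

    /** lambda = (totalTermFreq + 1) / (numberOfDocuments + 1). */
    static double lambdaTTF(long totalTermFreq, long numberOfDocuments) {
        return (totalTermFreq + 1.0) / (numberOfDocuments + 1.0);
    }

    public static void main(String[] args) {
        // A term found in 9 of 99 documents, with 19 occurrences overall.
        System.out.printf(Locale.ROOT, "lambdaDF=%.2f lambdaTTF=%.2f%n",
            lambdaDF(9, 99), lambdaTTF(19, 99));
    }
}
```

Because a term occurs at least once in every document that contains it, totalTermFreq is never smaller than docFreq, so the TTF variant is never smaller than the DF variant.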