Class DFRSimilarityFactory


  • public class DFRSimilarityFactory
    extends SimilarityFactory
    Factory for DFRSimilarity

    You must specify an implementation for each of the three components of DFR, given as strings. The basic models and after effects are parameter-free, but every normalization except none takes one floating-point parameter (see below):

    1. basicModel: Basic model of information content:
      • G: Geometric approximation of Bose-Einstein
      • I(n): Inverse document frequency
      • I(ne): Inverse expected document frequency [mixture of Poisson and IDF]
      • I(F): Inverse term frequency [approximation of I(ne)]
    2. afterEffect: First normalization of information gain:
      • L: Laplace's law of succession
      • B: Ratio of two Bernoulli processes
    3. normalization: Second (length) normalization:
      • H1: Uniform distribution of term frequency
        • parameter c (float): hyper-parameter that controls the term frequency normalization with respect to the document length. The default is 1
      • H2: Term frequency density inversely related to length
        • parameter c (float): hyper-parameter that controls the term frequency normalization with respect to the document length. The default is 1
      • H3: Term frequency normalization provided by Dirichlet prior
        • parameter mu (float): smoothing parameter μ. The default is 800
      • Z: Term frequency normalization provided by a Zipfian relation
        • parameter z (float): represents A/(A+1) where A measures the specificity of the language. The default is 0.3
      • none: No second normalization
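
    As an illustration, a schema entry that selects all three components might look like the sketch below. The enclosing <similarity> element, the solr.DFRSimilarityFactory class shorthand, and the <str>/<float> element types follow the usual Solr schema conventions and are assumptions here, since this page does not show them. The value 900 for mu is arbitrary.

    ```xml
    <!-- Sketch only: element layout assumed from standard Solr schema conventions -->
    <similarity class="solr.DFRSimilarityFactory">
      <!-- One string per DFR component -->
      <str name="basicModel">I(F)</str>
      <str name="afterEffect">B</str>
      <str name="normalization">H3</str>
      <!-- H3 reads mu; H1 or H2 would read c, and Z would read z -->
      <float name="mu">900</float>
    </similarity>
    ```

    Omitting the float element should leave the normalization at its documented default (800 for mu in this case).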

    Optional settings:

    • discountOverlaps (bool): Value passed to SimilarityBase.setDiscountOverlaps(boolean)
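
    The optional setting can presumably be supplied alongside the three required components. The fragment below is a sketch under the same assumptions about the schema syntax as any other Solr similarity declaration; the <bool> element type and the value true are chosen for illustration only.

    ```xml
    <!-- Sketch only: discountOverlaps given as a bool next to the required strings -->
    <similarity class="solr.DFRSimilarityFactory">
      <str name="basicModel">G</str>
      <str name="afterEffect">L</str>
      <str name="normalization">H2</str>
      <!-- c is the H2 parameter; 1 is its documented default -->
      <float name="c">1</float>
      <bool name="discountOverlaps">true</bool>
    </similarity>
    ```
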
    WARNING: This API is experimental and might change in incompatible ways in the next release.