This library implements the Taily algorithm, as described by Aly et al. in the 2013 paper "Taily: shard selection using the tail of score distributions".
At this early stage of development, the library interface is subject to change. If you rely on it now, I advise pinning a specific git tag.
taily is a header-only library. For now, copy the include/taily.hpp file into your project and include it.
CMake and Conan support to come...
The library compiles with GCC >= 4.9 and Clang >= 4, and it requires C++14. The only other dependency is the Boost.Math library, which is used for the gamma distribution.
Chances are you will only need to call one function, which scores all shards with respect to one query:

    std::vector<double> score_shards(
        const Query_Statistics& global_stats,
        const std::vector<Query_Statistics>& shard_stats,
        const int ntop);

global_stats contains statistics for the entire index, shard_stats holds
the statistics of the individual shards, and ntop is the parameter of
Taily: the number of top results for which a score threshold will be
estimated.
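
Once the scores are available, a typical next step is to decide which shards to query. The sketch below is not part of taily. It assumes that the returned vector holds one score per shard, in the same order as shard_stats, and that a higher score marks a more promising shard. The name select_shards and the min_score cutoff are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of taily): returns the indices of all shards
// whose score is at least min_score, ordered from highest to lowest score.
std::vector<std::size_t> select_shards(const std::vector<double>& scores,
                                       double min_score)
{
    std::vector<std::size_t> selected;
    for (std::size_t shard = 0; shard < scores.size(); ++shard) {
        // NaN scores would break the ordering used by the sort below.
        assert(!std::isnan(scores[shard]));
        if (scores[shard] >= min_score) {
            selected.push_back(shard);
        }
    }
    // A stable sort keeps the original shard order among equal scores.
    std::stable_sort(selected.begin(), selected.end(),
                     [&scores](std::size_t lhs, std::size_t rhs) {
                         return scores[lhs] > scores[rhs];
                     });
    return selected;
}
```

With the result of score_shards(global_stats, shard_stats, ntop) passed as scores, the returned indices refer to positions in shard_stats. Whether a fixed cutoff or a fixed number of shards works better depends on the application.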
Query_Statistics is a simple structure that contains the collection size
and a vector of term statistics, of length equal to the number of query
terms.

    struct Query_Statistics {
        std::vector<Feature_Statistics> term_stats;
        int size;
    };

Each element of term_stats contains the values needed for computations:

    struct Feature_Statistics {
        double expected_value;
        double variance;
        int frequency;

        template<typename FeatureRange>
        static Feature_Statistics from_features(const FeatureRange& features);

        template<typename ForwardIterator>
        static Feature_Statistics from_features(ForwardIterator first, ForwardIterator last);
    };

If you also want this library to compute the statistics from raw feature
values, you can use the from_features() helper functions:

    const std::vector<double>& features = fetch_or_generate_features(term);
    auto stats = Feature_Statistics::from_features(features);

or

    double* features = fetch_or_generate_features(term);
    auto stats = Feature_Statistics::from_features(features, features + len);

The first overload takes any forward range, such as std::vector or
std::array, for which std::begin() and std::end() return forward
iterators over doubles. The second takes a pair of such iterators.
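
To make the three fields more concrete, here is a minimal standalone sketch of what such a helper could compute. It is not the library's implementation. It assumes that expected_value is the arithmetic mean of the features, that variance is the population variance, and that frequency is the number of features in the range. The names Feature_Summary and summarize_features are hypothetical stand-ins, used so that the sketch compiles without taily.hpp.

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical stand-in mirroring the fields of Feature_Statistics.
struct Feature_Summary {
    double expected_value;
    double variance;
    int frequency;
};

// Assumed semantics: mean, population variance, and element count.
// The range is traversed twice, which is why forward (multi-pass)
// iterators are required rather than input iterators.
template <typename ForwardIterator>
Feature_Summary summarize_features(ForwardIterator first, ForwardIterator last)
{
    const auto count = std::distance(first, last);
    assert(count > 0);  // the statistics are undefined for an empty range

    double sum = 0.0;
    for (auto it = first; it != last; ++it) {
        sum += *it;
    }
    const double mean = sum / count;

    double squared_deviations = 0.0;
    for (auto it = first; it != last; ++it) {
        const double deviation = *it - mean;
        squared_deviations += deviation * deviation;
    }
    return Feature_Summary{mean, squared_deviations / count,
                           static_cast<int>(count)};
}

// Range overload: forwards to the iterator version.
template <typename FeatureRange>
Feature_Summary summarize_features(const FeatureRange& features)
{
    return summarize_features(std::begin(features), std::end(features));
}
```

For the features {1.0, 2.0, 3.0, 4.0}, this sketch yields an expected value of 2.5, a variance of 1.25, and a frequency of 4. If the library uses a different variance estimator or a different definition of frequency, its results will differ from this sketch.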