An empirical analysis of pruning techniques: performance, retrievability and bias

Chen, Ruey-Cheng and Azzopardi, Leif and Scholer, Falk (2017) An empirical analysis of pruning techniques: performance, retrievability and bias. In: CIKM 2017 - Proceedings of the 2017 ACM Conference on Information and Knowledge Management. Association for Computing Machinery, SGP, pp. 2023-2026. ISBN 9781450349185 (https://doi.org/10.1145/3132847.3133151)

Accepted Author Manuscript

Abstract

Prior work on using retrievability measures in the evaluation of information retrieval (IR) systems has laid the foundations for investigating the relation between retrieval performance and retrieval bias. While various factors influencing retrievability have been examined, showing how the retrieval model may influence bias, no prior work has examined the impact of the index (and how it is optimized) on retrieval bias. Intuitively, how the documents are represented, and what terms they contain, will influence whether they are retrievable or not. In this paper, we investigate how the retrieval bias of a system changes as the inverted index is optimized for efficiency through static index pruning. In our analysis, we consider four pruning methods and examine how they affect performance and bias on the TREC GOV2 Collection. Our results show that the relationship between these factors is varied and complex, and very much dependent on the pruning algorithm. We find that more pruning results in relatively little change or a slight decrease in bias up to a point, and then a dramatic increase. The increase in bias corresponds to a sharp decrease in early precision measures such as NDCG@10 and is also indicative of a large decrease in MAP. The findings suggest that the impact of pruning algorithms can be quite varied, but that retrieval bias could be used to guide the pruning process. Further work is required to determine precisely which documents are most affected and how this impacts upon performance.
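
As an illustration of what "retrieval bias" can mean in practice, the following is a minimal sketch and not the authors' implementation. It assumes a cumulative retrievability score (how many queries return a document within a rank cutoff) and summarizes inequality across documents with a Gini coefficient, a summary commonly used in the retrievability literature; the abstract itself does not state which bias measure is used. The function names, the cutoff of 10, and the toy rankings are hypothetical.

```python
from collections import Counter


def retrievability(rankings, num_docs, cutoff=10):
    """Count, for each document id 0..num_docs-1, how many query
    rankings contain it within the top `cutoff` positions.
    (Assumed cumulative scoring; names are illustrative.)"""
    counts = Counter()
    for ranked in rankings:
        counts.update(ranked[:cutoff])
    return [counts.get(doc_id, 0) for doc_id in range(num_docs)]


def gini(scores):
    """Gini coefficient of non-negative scores: 0 means every document
    is equally retrievable, values near 1 mean a few documents dominate."""
    xs = sorted(scores)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Toy rankings for four queries over four documents (hypothetical data).
full_index = [[0, 1], [2, 3], [1, 2], [3, 0]]
pruned_index = [[0, 1], [0, 1], [0, 2], [0, 1]]

bias_full = gini(retrievability(full_index, num_docs=4, cutoff=2))
bias_pruned = gini(retrievability(pruned_index, num_docs=4, cutoff=2))
```

In this toy setup the rankings from the pruned index concentrate on fewer documents, so `bias_pruned` is larger than `bias_full`. Tracking such a value across pruning levels is one way the idea of using bias to guide pruning could be explored.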