The impact of fielding on retrieval performance and bias

Wilkie, Colin and Azzopardi, Leif (2018) The impact of fielding on retrieval performance and bias. In: ASIS&T Annual Meeting 2018, 2018-11-10 - 2018-11-14.

[thumbnail of Wilkie-Azzopardi-ASIST-2018-The-impact-of-fielding-on-retrieval-performance]
Preview
Text. Filename: Wilkie_Azzopardi_ASIST_2018_The_impact_of_fielding_on_retrieval_performance.pdf
Accepted Author Manuscript

Download (477kB)| Preview

Abstract

Within many domains, such as news, medicine and patent, documents contain a variety of fields such as title, author, body, source, etc. As such fielded retrieval models that query across fields are often employed. It is largely presumed that fielding provides a better representation of the document and offers more control when querying, and that this will lead to improved retrieval performance. However, depending on how the fields are weighted and if the fields are populated, the retrieval algorithm may unduly favour certain documents over others. This is known as algorithmic bias and it may be detrimental to retrieval systems performance. In this paper, we explore the impact of fielding on retrieval bias and performance across a variety of TREC News Test Collections. We perform an extensive large-scale analysis on two types of fielded retrieval model variations that are based on the popular BM25 retrieval algorithm where either: fields are scored independently and then combined (Model 1), or fields are first combined and then scored (Model 2). Our findings show that for Model 1 fielding, a strong correlation exists between retrieval bias and performance such that as title fields are weighted more heavily, bias increases, while retrieval performance decreases. When weighting is applied to content-based fields, performance increases as bias decreases, showing that relying more on content may be favourable in terms of fairness and performance. On the other hand, for Model 2 fielding, the relationship between retrieval bias and performance is more complex. But, crucially we show that Model 2 fielding results in lower retrieval bias and greater performance than Model 1 fielding. And, we observed that under Model 1, news articles without titles are substantially less retrievable (i.e. more susceptible to algorithmic bias). These findings have serious ramifications as many popular Open Source Information Retrieval frameworks, commonly used by professional searchers, use the default implementation of Model 1 for their fielded search capability. This research shows the importance of analysing retrieval algorithms with respect to both bias and performance to ensure they minimize any unwanted or unintended biases when maximising performance. Further work is required to examine this phenomenon in more detail and to design fielded retrieval models that have the advantages of control and performance without detrimental biases.