Analysis of IR Effectiveness using Retrievability Measure
Most information retrieval settings, such as web search, are precision-oriented: they focus on retrieving a small number of highly relevant documents. In specific domains, however, such as patent retrieval or law, recall becomes more important than precision: the goal is to find all relevant documents, requiring algorithms to be tuned towards recall at the cost of precision. This raises important questions with respect to retrievability and search engine bias: depending on how the similarity between a query and documents is measured, certain documents may be more or less retrievable in certain systems, and some documents may not be retrievable at all within common threshold settings. Biases may be oriented towards popular documents (increasing the weight of references) or towards long documents; they may favour the use of rare or common words, or rely on structural information such as metadata or headings.
We examine the retrieval bias of different retrieval models on different collections, including patent and news documents. For each collection, we first determine the retrievability of its documents under each retrieval model, and then analyze how much the models differ in the retrieval bias they impose on the documents of the collection. The overall retrievability of a document indicates how easily it can be accessed with a given retrieval model; the inequality of retrievability scores across the documents of a collection quantifies the retrieval bias of that model.
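The two quantities above — per-document retrievability under a rank cut-off, and the inequality of those scores across a collection — can be sketched as follows. This is a minimal illustration, not the project's actual implementation; the function names, the `run` data layout, and the use of the Gini coefficient as the inequality measure are our own assumptions.

```python
from collections import defaultdict

def retrievability(run, cutoff):
    """Cumulative retrievability r(d): the number of queries for which
    document d appears within the top-`cutoff` ranks.
    `run` maps each query id to its ranked list of document ids
    (an assumed layout for this sketch)."""
    r = defaultdict(int)
    for ranked_docs in run.values():
        for doc in ranked_docs[:cutoff]:
            r[doc] += 1
    return r

def gini(scores):
    """Gini coefficient over a list of retrievability scores:
    0 means every document is equally retrievable; values near 1
    indicate strong retrieval bias."""
    values = sorted(scores)
    n = len(values)
    total = sum(values)
    if total == 0:
        return 0.0
    # G = (2 * sum_i i*x_i) / (n * sum x) - (n + 1) / n, with 1-based i
    cum = sum((i + 1) * x for i, x in enumerate(values))
    return (2 * cum) / (n * total) - (n + 1) / n
```

Comparing the Gini coefficient of the retrievability scores produced by two retrieval models on the same collection then gives a direct, model-level comparison of retrieval bias.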
Collections
The following collections are used for the retrievability analysis.
TREC Chemical Retrieval Track 2009
Download Vector_File.txt | Document TermID_Word_Mapping.txt | Download Document_IDs.txt
USPC Class 422 Patents Collection (downloaded from USPTO)
Download Vector_File.txt | Document TermID_Word_Mapping.txt | Download Document_IDs.txt
USPC Class 433 Patents Collection (downloaded from USPTO)
Download Vector_File.txt | Document TermID_Word_Mapping.txt | Download Document_IDs.txt
Austrian Newspaper Collection (German)
http://www.ifs.tuwien.ac.at/~andi/tmp/STANDARD.tgz
Download Vector_File.txt | Document TermID_Word_Mapping.txt | Download Document_IDs.txt
Retrievability Scores
The retrievability scores of the collections are calculated with 8 standard retrieval models:
(1) Normalized-TFIDF.
(2) TFIDF.
(3) OKAPI-BM25.
(4) Language Modeling (Bayesian Smoothing).
(5) Language Modeling (Jelinek Mercer).
(6) Language Modeling (TwoStage Smoothing).
(7) Language Modeling (Absolute Discounting).
(8) SMART Retrieval Model.
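To make one of these models concrete, a minimal Okapi BM25 scorer (model 3 above) might look as follows. This is a sketch using the standard formula with common default parameters (k1=1.2, b=0.75); the function signature and names are illustrative and not taken from the project's code.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, df, N, avgdl, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query (minimal sketch).
    df:    document frequency of each term in the collection
    N:     number of documents in the collection
    avgdl: average document length in the collection"""
    tf = Counter(doc_terms)   # term frequencies within this document
    dl = len(doc_terms)       # document length
    score = 0.0
    for t in query_terms:
        if t not in tf or t not in df:
            continue
        # Smoothed IDF component, kept non-negative via the +1 inside the log
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        # Saturating term-frequency component with length normalization
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
    return score
```

Ranking every document in the collection by such a score for each query, and counting how often each document lands in the top c ranks, yields the retrievability scores below.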
Each line of a retrievability scores file contains 6 fields. The first five fields give the retrievability scores at five rank cut-off levels: c=50, c=100, c=150, c=200, and c=250. The sixth field gives the total number of queries that retrieve the document when no rank cut-off is applied.
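Under the six-field layout described above, such a file could be parsed as follows. This sketch assumes whitespace-separated fields and that lines pair positionally with the entries in Document_IDs.txt; adapt the split character if the released files use a different delimiter. The function name and the returned dict layout are our own.

```python
def parse_retrievability_file(path):
    """Parse a retrievability-scores file: per line, five scores at
    rank cut-offs c=50,100,150,200,250, then the total query count."""
    cutoffs = (50, 100, 150, 200, 250)
    rows = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) != 6:
                continue  # skip blank or malformed lines
            scores = dict(zip(cutoffs, map(int, fields[:5])))
            rows.append({"scores": scores, "total_queries": int(fields[5])})
    return rows
```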
TREC Chemical Retrieval Track 2009
Download Retrievability Scores | Download Queries Set
USPC Class 422 Patents Collection (downloaded from USPTO)
Download Retrievability Scores | Download Queries Set
USPC Class 433 Patents Collection (downloaded from USPTO)
Download Retrievability Scores | Download Queries Set
Austrian Newspaper Collection (German)
http://www.ifs.tuwien.ac.at/~andi/tmp/STANDARD.tgz
Download Retrievability Scores | Download Queries Set