Analysis of IR Effectiveness using Retrievability Measure
Most information retrieval settings, such as web search, are precision-oriented: they focus on retrieving a small number of highly relevant documents. In specific domains, however, such as patent retrieval or law, recall becomes more important than precision: the goal is to find all relevant documents, requiring algorithms to be tuned towards recall at the cost of precision. This raises important questions with respect to retrievability and search engine bias: depending on how the similarity between a query and documents is measured, certain documents may be more or less retrievable in certain systems, and some documents may not be retrievable at all within common threshold settings. Biases may be oriented towards popular documents (increasing the weight of references) or towards long documents; they may favour the use of rare or common words, or rely on structural information such as metadata or headings.
We examine the retrieval bias of different retrieval models on different collections, including patent and news documents. For each collection, we first determine the retrievability of its documents under each retrieval model, and then analyze how much the models differ in the retrieval bias they impose on the documents of the collection. The overall retrievability of a document indicates how easily it can be accessed with a given retrieval model; the inequality of retrievability scores across the documents of a collection quantifies the retrieval bias of that model.
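The two quantities above — per-document retrievability under a rank cut-off, and the inequality of those scores across a collection — can be sketched as follows. This is a minimal illustration, not the project's actual implementation; the function names, the `run` data layout, and the use of the Gini coefficient as the inequality measure are our own assumptions.

```python
from collections import defaultdict

def retrievability(run, cutoff):
    """Cumulative retrievability r(d): the number of queries for which
    document d appears within the top-`cutoff` ranks.
    `run` maps each query id to its ranked list of document ids
    (an assumed layout for this sketch)."""
    r = defaultdict(int)
    for ranked_docs in run.values():
        for doc in ranked_docs[:cutoff]:
            r[doc] += 1
    return r

def gini(scores):
    """Gini coefficient over a list of retrievability scores:
    0 means every document is equally retrievable; values near 1
    indicate strong retrieval bias."""
    values = sorted(scores)
    n = len(values)
    total = sum(values)
    if total == 0:
        return 0.0
    # G = (2 * sum_i i*x_i) / (n * sum x) - (n + 1) / n, with 1-based i
    cum = sum((i + 1) * x for i, x in enumerate(values))
    return (2 * cum) / (n * total) - (n + 1) / n
```

Comparing the Gini coefficient of the retrievability scores produced by two retrieval models on the same collection then gives a direct, model-level comparison of retrieval bias.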
Collections
The following collections are used for the retrievability analysis.
TREC Chemical Retrieval Track 2009
Download Vector_File.txt | Document TermID_Word_Mapping.txt | Download Document_IDs.txt
USPC Class 422 Patents Collection (downloaded from USPTO)
Download Vector_File.txt | Document TermID_Word_Mapping.txt | Download Document_IDs.txt
USPC Class 433 Patents Collection (downloaded from USPTO)
Download Vector_File.txt | Document TermID_Word_Mapping.txt | Download Document_IDs.txt
Austrian Newspaper Collection (German)
http://www.ifs.tuwien.ac.at/~andi/tmp/STANDARD.tgz
Download Vector_File.txt | Document TermID_Word_Mapping.txt | Download Document_IDs.txt
Retrievability Scores
The retrievability scores of the collections are calculated with 8 standard retrieval models:
(1) Normalized-TFIDF.
(2) TFIDF.
(3) OKAPI-BM25.
(4) Language Modeling (Bayesian Smoothing).
(5) Language Modeling (Jelinek Mercer).
(6) Language Modeling (TwoStage Smoothing).
(7) Language Modeling (Absolute Discounting).
(8) SMART Retrieval Model.
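To make one of these models concrete, a minimal Okapi BM25 scorer (model 3 above) might look as follows. This is a sketch using the standard formula with common default parameters (k1=1.2, b=0.75); the function signature and names are illustrative and not taken from the project's code.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, df, N, avgdl, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query (minimal sketch).
    df:    document frequency of each term in the collection
    N:     number of documents in the collection
    avgdl: average document length in the collection"""
    tf = Counter(doc_terms)   # term frequencies within this document
    dl = len(doc_terms)       # document length
    score = 0.0
    for t in query_terms:
        if t not in tf or t not in df:
            continue
        # Smoothed IDF component, kept non-negative via the +1 inside the log
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        # Saturating term-frequency component with length normalization
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
    return score
```

Ranking every document in the collection by such a score for each query, and counting how often each document lands in the top c ranks, yields the retrievability scores below.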
Each line of a retrievability scores file contains 6 fields. The first five fields give the retrievability scores at five rank cut-off levels: c=50, c=100, c=150, c=200, and c=250. The sixth field gives the total number of queries that retrieve the document when no rank cut-off is applied.
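Under the six-field layout described above, such a file could be parsed as follows. This sketch assumes whitespace-separated fields and that lines pair positionally with the entries in Document_IDs.txt; adapt the split character if the released files use a different delimiter. The function name and the returned dict layout are our own.

```python
def parse_retrievability_file(path):
    """Parse a retrievability-scores file: per line, five scores at
    rank cut-offs c=50,100,150,200,250, then the total query count."""
    cutoffs = (50, 100, 150, 200, 250)
    rows = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) != 6:
                continue  # skip blank or malformed lines
            scores = dict(zip(cutoffs, map(int, fields[:5])))
            rows.append({"scores": scores, "total_queries": int(fields[5])})
    return rows
```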
TREC Chemical Retrieval Track 2009
Download Retrievability Scores | Download Queries Set
USPC Class 422 Patents Collection (downloaded from USPTO)
Download Retrievability Scores | Download Queries Set
USPC Class 433 Patents Collection (downloaded from USPTO)
Download Retrievability Scores | Download Queries Set
Austrian Newspaper Collection (German)
http://www.ifs.tuwien.ac.at/~andi/tmp/STANDARD.tgz
Download Retrievability Scores | Download Queries Set