Automatic Retrievability Classification

Predicting Document Retrievability Ranks using Document Features

August 31st, 2011


Retrievability analysis is usually performed with the help of queries. One major limitation of the query-based approach is that it is prohibitively resource-expensive. To predict document retrievability ranks quickly, we use a query-independent approach based on document feature analysis. We compute a number of document features based on surface-level characteristics, term weighting methods, and density around the nearest neighbors, and examine their correlation with the document retrievability ranks. One major advantage of this approach is that it requires fewer resources and can therefore be computed more quickly than the query-based approach. Its major disadvantage is that it can only predict the retrievability ranks of documents; it cannot quantify how unequal retrievability is across the documents of a collection (the retrieval bias) under different retrieval models.
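For context, the query-based baseline that the feature approach tries to avoid can be sketched as follows. This is a minimal illustration, not the code used in the experiments: it assumes retrievability is counted as the number of queries that return a document within a rank cut-off c, and it assumes the Gini coefficient as the inequality summary, which this post does not name. The function names and the toy ranked lists are hypothetical.

```python
from collections import Counter

def retrievability(ranked_lists, c):
    """Count, per document, how many queries return it within the top-c results."""
    r = Counter()
    for results in ranked_lists:
        for doc_id in results[:c]:
            r[doc_id] += 1
    return r

def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal).

    Assumption: used here as one possible summary of retrieval bias.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)

# Toy data: one ranked result list per query (hypothetical).
docs = ["d1", "d2", "d3", "d4"]
ranked_lists = [["d1", "d2", "d3"], ["d1", "d3", "d2"], ["d1", "d2", "d4"]]

r = retrievability(ranked_lists, c=2)
scores = [r[d] for d in docs]   # [3, 2, 1, 0]
bias = gini(scores)             # 10 / 24, roughly 0.417
```

The cost of this baseline grows with the number of queries that have to be run against the index, which is what motivates predicting the ranks from document features instead.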

The following three classes of features are used.

(1) Surface Level Features

This feature set captures the distributional characteristics of the terms within a document (based on term frequencies) and over the whole collection (based on term document frequencies).
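A minimal sketch of how such surface-level features might be computed is shown below. The specific features (length, vocabulary size, average term frequency, average document-frequency ratio) and their names are illustrative assumptions; the actual feature definitions are in the downloadable feature files, not in this post.

```python
from collections import Counter

def surface_features(doc_tokens, doc_freq, num_docs):
    """Illustrative surface-level features for one tokenized document.

    doc_freq maps each term to the number of documents that contain it.
    """
    tf = Counter(doc_tokens)
    length = len(doc_tokens)
    vocab = len(tf)
    if vocab == 0:
        return {"length": 0, "vocabulary_size": 0,
                "avg_term_frequency": 0.0, "avg_df_ratio": 0.0}
    return {
        "length": length,
        "vocabulary_size": vocab,
        # Within-document distribution: mean occurrences per distinct term.
        "avg_term_frequency": length / vocab,
        # Collection-level distribution: mean fraction of documents
        # that contain each of this document's terms.
        "avg_df_ratio": sum(doc_freq[t] for t in tf) / (vocab * num_docs),
    }

# Toy collection of tokenized documents (hypothetical).
collection = [
    ["acid", "base", "acid"],
    ["acid", "salt"],
    ["salt", "metal", "metal", "metal"],
]
doc_freq = Counter(t for doc in collection for t in set(doc))
features = [surface_features(doc, doc_freq, len(collection)) for doc in collection]
```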

Download Features

TREC Chemical Retrieval Track 2009

USPC Class 422 Patents Collection (downloaded from USPTO)

USPC Class 433 Patents Collection (downloaded from USPTO)

Austrian Newspaper Collection (German)

(2) Features Based on Term Weights

The retrieval models that we use for the retrievability analysis do not rely on absolute term frequencies within documents when calculating document relevance scores. To produce better relevance scores, they modify the absolute term frequencies using different features (e.g., document length, vocabulary size, term document frequency) and parameters. These modified scores are called term weights.
The features in this set are defined on the basis of the distributional characteristics of term weights. These characteristics are based on the average of the term weights within a document and on the ranks of the term weights relative to the other documents of the collection. The ranks of terms are characterized by the average of the term rank positions in the inverted lists, the variance of the term rank positions in the inverted lists, the differences between term weights and the median weight, and the ratio of low-ranked terms to all terms of the document.
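The rank-based features described above could be sketched roughly as follows. This is an assumption-laden illustration: a simple TF-IDF variant stands in for the term weights of the actual retrieval models, the cut-off `low_rank_cutoff` is a hypothetical parameter, the median-difference feature is left out for brevity, and every document is assumed to be non-empty.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pvariance

def tfidf_weights(collection):
    """Stand-in term weights (assumption): length-normalized TF times a smoothed IDF."""
    n = len(collection)
    df = Counter(t for doc in collection for t in set(doc))
    weights = []
    for doc in collection:
        tf = Counter(doc)
        weights.append({t: (f / len(doc)) * math.log(1 + n / df[t])
                        for t, f in tf.items()})
    return weights

def rank_features(weights, low_rank_cutoff):
    """Per-document features from rank positions in weight-sorted inverted lists."""
    inverted = defaultdict(list)
    for doc_id, w in enumerate(weights):
        for term, weight in w.items():
            inverted[term].append((weight, doc_id))
    position = {}  # (term, doc_id) -> 1-based rank in that term's inverted list
    for term, postings in inverted.items():
        postings.sort(key=lambda p: (-p[0], p[1]))  # ties broken by doc id
        for rank, (_, doc_id) in enumerate(postings, start=1):
            position[(term, doc_id)] = rank
    features = []
    for doc_id, w in enumerate(weights):
        ranks = [position[(t, doc_id)] for t in w]
        features.append({
            "avg_weight": mean(w.values()),
            "avg_rank": mean(ranks),
            "rank_variance": pvariance(ranks),
            # Share of the document's terms ranked below the cut-off.
            "low_rank_ratio": sum(r > low_rank_cutoff for r in ranks) / len(ranks),
        })
    return features

# Toy collection of tokenized documents (hypothetical).
collection = [
    ["acid", "acid", "base"],
    ["acid", "salt"],
    ["acid", "salt", "salt", "metal"],
]
features = rank_features(tfidf_weights(collection), low_rank_cutoff=1)
```

A document whose terms mostly sit far down their inverted lists is unlikely to reach the top of a result list, which is the intuition these features try to capture.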

Download Features

TREC Chemical Retrieval Track 2009

USPC Class 422 Patents Collection (downloaded from USPTO)

USPC Class 433 Patents Collection (downloaded from USPTO)

Austrian Newspaper Collection (German)

(3) Document Density based Features

This feature set is based on the density around the nearest neighbors of each document.
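One simple way to express such a density feature is the average similarity between a document and its k most similar documents. The sketch below assumes cosine similarity over raw term-frequency vectors and a hypothetical parameter `k`; the similarity measure and neighborhood size actually used are not stated in this post.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_density(collection, k):
    """Average similarity of each document to its k nearest neighbors."""
    vectors = [Counter(doc) for doc in collection]
    densities = []
    for i, v in enumerate(vectors):
        sims = sorted(
            (cosine(v, u) for j, u in enumerate(vectors) if j != i),
            reverse=True,
        )
        top = sims[:k]
        densities.append(sum(top) / len(top) if top else 0.0)
    return densities

# Toy collection: the first two documents are identical, the third is isolated.
collection = [["a", "b"], ["a", "b"], ["c", "d"]]
densities = knn_density(collection, k=1)
```

This brute-force version compares every pair of documents, so it is only meant to show the idea; a large patent collection would need an index or approximate neighbor search.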

Download Features

TREC Chemical Retrieval Track 2009

USPC Class 422 Patents Collection (downloaded from USPTO)

USPC Class 433 Patents Collection (downloaded from USPTO)

Austrian Newspaper Collection (German)

Categories: Keywords: Fast Retrievability Identification, Documents Classification, Text-based Features, IR System bias identification

Improving Retrievability by Collection Partitioning Approach

August 317th, 2011

The experiments performed on retrievability analysis reveal that a collection does indeed consist of documents that differ widely in retrievability. Some documents are returned within the top-c results for a huge number of queries, possibly suppressing others that almost never show up within the top-c results for any query. Such documents are virtually nonexistent for a searcher. One of the goals of recall-oriented retrieval domains is to ensure that all relevant documents can potentially be found. We thus need to devise ways to ensure that documents exhibiting low retrievability can also be retrieved by the queries for which they are potentially relevant. To do so, we propose to split a collection into two partitions consisting of equal numbers of high- and low-retrievability documents. Having split the collection into these two partitions, we then perform retrieval by treating them as independent collections: queries are processed independently for each partition, and the result lists are combined afterwards. This ensures that the final result list always also includes documents with low retrievability scores, i.e., documents that would rarely or never have been returned within a certain rank cut-off in a standard retrieval setting without collection partitioning.
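The partition-based strategy can be sketched as follows. Several details are assumptions made for illustration only: the collection is split at the median retrievability score, a toy term-count scorer stands in for the real retrieval model, and the two result lists are combined by interleaving and then truncated to c. The post does not specify how the lists are combined, and all function names are hypothetical.

```python
def split_by_retrievability(r_scores):
    """Split document ids into a high and a low half by retrievability score."""
    ordered = sorted(r_scores, key=lambda d: (-r_scores[d], d))
    half = len(ordered) // 2
    return set(ordered[:half]), set(ordered[half:])

def retrieve(query_terms, docs, doc_ids, c):
    """Toy scorer (assumption): rank by total query-term occurrences, keep top c."""
    scored = [(sum(docs[d].count(t) for t in query_terms), d) for d in doc_ids]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [d for _, d in scored[:c]]

def partitioned_retrieve(query_terms, docs, high, low, c):
    """Query each partition independently, then interleave the two result lists."""
    high_hits = retrieve(query_terms, docs, high, c)
    low_hits = retrieve(query_terms, docs, low, c)
    merged = []
    for i in range(max(len(high_hits), len(low_hits))):
        if i < len(high_hits):
            merged.append(high_hits[i])
        if i < len(low_hits):
            merged.append(low_hits[i])
    return merged[:c]

# Toy data (hypothetical): tokenized documents and precomputed retrievability scores.
docs = {
    "d1": ["acid", "acid", "acid"],
    "d2": ["acid", "acid"],
    "d3": ["acid"],
    "d4": ["acid", "salt"],
}
r_scores = {"d1": 9, "d2": 7, "d3": 1, "d4": 0}

high, low = split_by_retrievability(r_scores)
standard = retrieve(["acid"], docs, docs.keys(), c=2)                 # ['d1', 'd2']
partitioned = partitioned_retrieve(["acid"], docs, high, low, c=2)    # ['d1', 'd3']
```

In the toy run, the standard list is filled entirely by the two highly retrievable documents, while the partitioned list gives one slot to a low-retrievability document that matches the query.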

Figure: The retrieval bias of the standard and partition-based retrieval strategies for the TREC Chemical Retrieval Patents 2009 collection. "Partition" denotes the partition-based retrieval strategy, and "Standard" denotes the standard retrieval strategy.


Categories: Keywords: Prior-Art Search, Corpus Partitioning, Documents Classification, Fast Retrievability Identification

Shariq Bashir
