
Improving Retrievability with Cluster based Pseudo-Relevance Feedback

August 31st, 2011

 

High findability of documents within a certain cut-off rank is considered an important factor in recall-oriented application domains such as patent or legal document retrieval. Findability is hindered by two aspects: the inherent bias of the retrieval model, which favors some types of documents over others, and the failure to correctly capture and interpret the context of conventionally rather short queries. We analyze the bias impact of different retrieval models and query expansion strategies. We furthermore propose a novel query expansion strategy based on document clustering to identify dominant relevant documents. This helps to overcome limitations of conventional query expansion strategies, which suffer strongly from the noise introduced by imperfect initial query results during the selection of pseudo-relevance feedback documents. Experiments with different collections of patent documents suggest that clustering-based document selection for pseudo-relevance feedback is an effective approach for increasing the findability of individual documents and decreasing the bias of a retrieval system.
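As background, retrievability is commonly quantified with the cumulative measure r(d) = Σ_q f(k_dq, c), where f(k_dq, c) = 1 if document d occurs at or above rank cut-off c in the result list for query q, and 0 otherwise. The following minimal Python sketch computes such scores from a set of ranked result lists; the data layout and names are illustrative assumptions, not the code used in our experiments.

```python
from collections import defaultdict

def retrievability_scores(run, cutoff):
    """Cumulative retrievability: r(d) = sum over queries q of f(k_dq, c),
    where f is 1 if document d appears at or above rank cut-off `cutoff`
    in the result list for q, and 0 otherwise.

    `run` maps each query id to its ranked list of document ids."""
    r = defaultdict(int)
    for ranking in run.values():
        for doc_id in ranking[:cutoff]:
            r[doc_id] += 1
    return dict(r)

# Example with two queries and cut-off c = 2 (illustrative ids):
run = {"q1": ["d1", "d2", "d3"], "q2": ["d2", "d1"]}
print(retrievability_scores(run, cutoff=2))  # {'d1': 2, 'd2': 2}; d3 is never reachable
```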

 

Figure: The retrieval bias of different retrieval models for the TREC Chemical Retrieval Track 2009 collections.

 


Retrievability Scores

The retrievability scores of the collections are calculated using the following retrieval strategies (a query-likelihood scoring sketch follows the list).

Standard Retrieval and Language Models

(1) Normalized-TFIDF.

(2) TFIDF.

(3) OKAPI-BM25.

(4) Language Modeling (Bayesian Smoothing).

(5) Language Modeling (Jelinek Mercer).

(6) Language Modeling (TwoStage Smoothing).

(7) Language Modeling (Absolute Discounting).

(8) SMART Retrieval Model.
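To illustrate the family of language-modeling strategies above, here is a minimal query-likelihood scorer with Jelinek-Mercer smoothing (model 5 in the list); the function name, smoothing parameter value, and index layout are illustrative assumptions, not our experimental implementation.

```python
import math

def lm_jelinek_mercer(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.7):
    """Query-likelihood score with Jelinek-Mercer smoothing:
    P(t|d) = (1 - lam) * tf(t, d) / |d|  +  lam * cf(t) / |C|,
    summed in log space over the query terms."""
    score = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len
        p_coll = coll_tf.get(t, 0) / coll_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p > 0:  # skip terms unseen in the whole collection
            score += math.log(p)
    return score
```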

Query Expansion Strategies

(1) Query Expansion based on the language modeling approach (TS-LM) [Larkey02]: This method uses the top-n documents of the initial result list as the PRF set. Candidate expansion terms from the PRF set are ranked according to the sum of divergences between their probability in the documents in which they occur and their importance in the whole collection.

(2) Query Expansion based on Kullback-Leibler divergence (TS-KLD) [Croft00]: This method also uses the top-n documents as the PRF set. However, expansion terms are ranked according to their relative rareness in the PRF set as opposed to the whole collection (see the term-scoring sketch after this list).

(3) CCGen: our clustering-based approach for selecting pseudo-relevance feedback documents.
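For concreteness, the TS-KLD term ranking in (2) is commonly implemented by scoring each candidate term t with P(t|PRF) · log(P(t|PRF) / P(t|C)). A minimal sketch, assuming simple maximum-likelihood estimates over plain token lists:

```python
import math
from collections import Counter

def kld_term_scores(prf_docs, collection_tf, collection_len):
    """Rank expansion terms by their KLD contribution:
    score(t) = P(t | PRF) * log(P(t | PRF) / P(t | C)).
    `prf_docs` is a list of token lists from the pseudo-relevant documents."""
    prf_tf = Counter(t for doc in prf_docs for t in doc)
    prf_len = sum(prf_tf.values())
    scores = {}
    for t, tf in prf_tf.items():
        p_prf = tf / prf_len
        p_coll = collection_tf.get(t, 0) / collection_len
        if p_coll > 0:  # ignore terms missing from the collection statistics
            scores[t] = p_prf * math.log(p_prf / p_coll)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```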

 

Each line of a retrievability scores file contains six fields. The first five fields give the retrievability scores at five rank cut-off levels: c=50, c=100, c=150, c=200, and c=250. The last field gives the total number of queries used for retrieving the document without any rank cut-off.
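A minimal sketch for reading one line of such a file, assuming whitespace-separated fields:

```python
def parse_scores_line(line):
    """Parse one line of a retrievability scores file: five scores at
    cut-offs c = 50, 100, 150, 200, 250, followed by the total query count.
    Whitespace-separated fields are assumed."""
    fields = line.split()
    cutoffs = (50, 100, 150, 200, 250)
    scores = dict(zip(cutoffs, (float(f) for f in fields[:5])))
    total_queries = int(fields[5])
    return scores, total_queries

# Example (made-up values): parse_scores_line("12 30 47 61 80 2500")
```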

TREC Chemical Retrieval Track 2009

Download Retrievability Scores

USPC Class 422 Patents Collection (downloaded from USPTO)

Download Retrievability Scores

USPC Class 433 Patents Collection (downloaded from USPTO)

Download Retrievability Scores

Austrian Newspaper Collection (German)

http://www.ifs.tuwien.ac.at/~andi/tmp/STANDARD.tgz

Download Retrievability Scores

 

Keywords: Measurement, Algorithms, Experimentation, Relevance Feedback, Document Clustering.

Improving Retrievability of Patents in Prior-Art Search

August 31st, 2011

Prior-art search is an important task in patent retrieval. Its success relies upon the selection of relevant search queries. Typically, terms for prior-art queries are extracted from the claims fields of query patents. However, due to the complex technical structure of patents and the presence of term mismatch and vague terms, selecting relevant query terms is a difficult task. When evaluating the retrievability coverage of prior-art queries generated from query patents, we observe a large bias toward a subset of the collection: a large number of patents either have a very low retrievability score or cannot be discovered via any query. To increase the retrievability of patents, we expand prior-art queries generated from query patents using query expansion with pseudo-relevance feedback. Terms missing from query patents are discovered in the feedback patents, and better patents for relevance feedback are identified using a novel approach that checks their similarity with the query patents. We specifically focus on how to automatically select better terms from query patents, based on their proximity distribution with respect to the prior-art queries, as features for computing similarity.
Our results show that the coverage of prior-art queries can be increased significantly by incorporating relevant query terms through query expansion.
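As an illustration of the proximity idea, one simple feature is the average distance (in tokens) from each occurrence of a candidate term to the nearest prior-art query term in the patent text. The sketch below is an illustrative toy definition, not the exact feature set used in the paper:

```python
def proximity_feature(tokens, query_terms, candidate):
    """Toy proximity feature: for each occurrence of `candidate` in the
    patent's token sequence, take the distance (in tokens) to the nearest
    prior-art query term, and return the average. Smaller values mean the
    candidate tends to occur close to the query terms."""
    query_positions = [i for i, t in enumerate(tokens) if t in query_terms]
    cand_positions = [i for i, t in enumerate(tokens) if t == candidate]
    if not query_positions or not cand_positions:
        return float("inf")  # candidate or query terms absent from the text
    dists = [min(abs(c - q) for q in query_positions) for c in cand_positions]
    return sum(dists) / len(dists)
```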

Figure: The retrieval bias of different retrieval models for the TREC Chemical Retrieval Track 2009 collections.


Retrievability Scores

The retrievability scores of the collections are calculated using the following retrieval strategies.

Standard Retrieval and Language Models

(1) Normalized-TFIDF.

(2) TFIDF.

(3) OKAPI-BM25.

(4) Language Modeling (Bayesian Smoothing).

(5) Language Modeling (Jelinek Mercer).

(6) Language Modeling (TwoStage Smoothing).

(7) Language Modeling (Absolute Discounting).

(8) SMART Retrieval Model.

Query Expansion Strategies

(1) Query Expansion based on the language modeling approach (TS-LM) [Larkey02]: This method uses the top-n documents of the initial result list as the PRF set. Candidate expansion terms from the PRF set are ranked according to the sum of divergences between their probability in the documents in which they occur and their importance in the whole collection.

(2) Query Expansion based on Kullback-Leibler divergence (TS-KLD) [Croft00]: This method also uses the top-n documents as the PRF set. However, expansion terms are ranked according to their relative rareness in the PRF set as opposed to the whole collection.

(3) QP-TS: our approach, which selects the PRF documents according to their similarity with the query patent via specific terms (sketched below).
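As a sketch of the selection step in (3), PRF documents can be re-ranked by their cosine similarity to the query patent over term-frequency vectors built from the selected feature terms; the vector representation and function names here are illustrative assumptions, not the exact QP-TS implementation.

```python
import math
from collections import Counter

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse term-frequency vectors."""
    common = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in common)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_prf_docs(query_patent_terms, initial_ranking, doc_terms, k=10):
    """Re-select PRF documents by similarity to the query patent over the
    chosen feature terms, rather than taking the top-n ranking as-is."""
    q_vec = Counter(query_patent_terms)
    scored = [(doc_id, cosine_similarity(q_vec, Counter(doc_terms[doc_id])))
              for doc_id in initial_ranking]
    scored.sort(key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```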

 

Each line of a retrievability scores file contains six fields. The first five fields give the retrievability scores at five rank cut-off levels: c=50, c=100, c=150, c=200, and c=250. The last field gives the total number of queries used for retrieving the document without any rank cut-off.

TREC Chemical Retrieval Track 2009

Download Retrievability Scores | Download Term Classification Training Dataset

Keywords: Prior-Art Search, Terms-Proximity, Retrieval Systems Bias, Query Expansion, Pseudo-Relevance Feedback