Tool for Predicting Document Retrievability Ranks via Document Features

August 31st, 2011
@INPROCEEDINGS{:sigir2009:1,
  author    = { and Jianhan Zhu},
  title     = {Portfolio Theory of Information Retrieval},
  booktitle = {SIGIR09 Full Paper},
  url       = {http://web4.cs.ucl.ac.uk/staff/./blog/2009/06/17/portfolio-theory-of-information-retrieval/},
  pdf       = {http://web4.cs.ucl.ac.uk/staff/./papers/2009-sigir09-portfoliotheory.pdf},
  year      = {2009}
}

Retrievability analyses are usually performed with the help of queries. One major limitation of the query-based approach is that it is prohibitively resource expensive. In order to predict document retrievability ranks quickly, we try to predict them with a query-independent approach based on document feature analysis. We compute a number of document features on the basis of surface-level statistics, term weighting methods, and the density around a document's nearest neighbors, and examine their correlation with the document retrievability ranks. One major advantage of this approach is that it requires fewer resources, so the ranks can be computed much more quickly than with the query-based approach. One major disadvantage, on the other hand, is that it can only predict the retrievability ranks of documents; it cannot quantify the retrievability inequality between the documents of a collection (the retrieval bias) under different retrieval models.

 

(A) Input File Format

The tools require two input files in order to run. These are

1. fullText99.txt

2. ItemsetProcessing99.txt

 

1. fullText99.txt

This file contains the document vectors. 99 is the unique ID of the vector file. Each vector contains the term information of one document, with vector weights represented in term-frequency format. The term information consists of two fields: the numeric term ID and the term's frequency in the document. Additionally, each vector is terminated by the end tag (-17 -17).

Example:

1 12 2 4 3 2 4 7 -17 -17
8 14 9 7 21 7 35 6 -17 -17
1 8 9 7 35 6 21 6 -17 -17

The above example contains three document vectors. Each term ID is immediately followed by that term's frequency in the document, and -17 -17 marks the end of each vector.
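The format above can be parsed with a short script. The following is a minimal sketch in Python (the tools themselves are standalone binaries; `parse_vectors` is a hypothetical helper, not part of the distributed code):

```python
# Hypothetical parser for the fullText99.txt format: whitespace-separated
# (term ID, term frequency) pairs, with -17 -17 closing each vector.
def parse_vectors(path):
    vectors, current = [], {}
    with open(path) as f:
        tokens = f.read().split()
    for term_id, freq in zip(tokens[0::2], tokens[1::2]):
        if term_id == "-17":        # end tag: close the current vector
            vectors.append(current)
            current = {}
        else:
            current[int(term_id)] = int(freq)
    return vectors
```

Applied to the three-line example above, this yields three dictionaries mapping term IDs to frequencies.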

2. ItemsetProcessing99.txt

This file contains the total number of documents and the highest term ID occurring in fullText99.txt. For the above example, the content of this file should look as follows.

Example:

3
35
              

In the above example, 3 indicates that there are three vectors in the fullText99.txt file, while 35 is the highest numeric term ID in fullText99.txt.
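Given an existing fullText99.txt, the two values in ItemsetProcessing99.txt can be derived automatically. A minimal Python sketch (the helper name is hypothetical):

```python
# Hypothetical helper: derive ItemsetProcessing99.txt from fullText99.txt.
# The output file holds the number of vectors and the highest term ID.
def summarize_vectors(full_text_path, out_path):
    n_vectors, max_term_id = 0, 0
    with open(full_text_path) as f:
        tokens = f.read().split()
    for term_id in tokens[0::2]:    # term-ID positions, plus the -17 sentinel
        tid = int(term_id)
        if tid == -17:
            n_vectors += 1          # each sentinel closes one vector
        else:
            max_term_id = max(max_term_id, tid)
    with open(out_path, "w") as out:
        out.write(f"{n_vectors}\n{max_term_id}\n")
    return n_vectors, max_term_id
```

For the example collection above this writes 3 and 35, matching the expected file content.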

 

 

(A) Tool for Calculating Surface Level Features

This feature set captures the distributional characteristics of the terms within a document (based on term frequencies) and over the whole collection (based on term document frequencies).
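As an illustration, assuming the set includes basic statistics such as document length, vocabulary size, and the mean and deviation of term frequencies (the exact feature definitions used by the tool are not listed here), surface-level features could be sketched as:

```python
from statistics import mean, pstdev

# Hypothetical surface-level features for one document.
# doc: {term_id: term_frequency}; doc_freq: {term_id: document frequency
# over the whole collection}; n_docs: number of documents in the collection.
def surface_features(doc, doc_freq, n_docs):
    tfs = list(doc.values())
    return {
        "doc_length": sum(tfs),                 # total term occurrences
        "vocab_size": len(doc),                 # number of distinct terms
        "mean_tf": mean(tfs),
        "std_tf": pstdev(tfs),                  # population std deviation
        "mean_df_ratio": mean(doc_freq[t] / n_docs for t in doc),
    }
```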

Program Arguments:

The software accepts the following arguments for running the code.

1. Directory location where fullText99.txt, ItemsetProcessing99.txt, and Settings.txt are saved.

2. Unique ID of vector file. For example, if your vector file name is fullText99.txt, then just put 99.

3. Not required. Just type 0.

4. The total number of vectors in the fullText99.txt.

 

Download Code:

Example:

./code //my_doc_collection// 99 0 75000

 

(B) Tool for Calculating Term Weights based Features

The retrieval models that we use for the retrievability analysis do not rely on the absolute term frequencies within documents when calculating the document relevance score. In order to produce better relevance scores, they modify the absolute term frequencies using different features (e.g. document length, vocabulary size, term document frequency) and parameters. These modified scores are called term weights.
The features of this feature set are defined on the basis of the distributional characteristics of the term weights. These characteristics are based on the average of the term weights within a document, and on the ranks of the term weights relative to the other documents of the collection. The term ranks are characterized by the average of the term rank positions in the inverted lists, the variance of those rank positions, the term weight differences relative to the median weight, and the ratio of low-ranked terms among all terms of the document.
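As an illustration, a term weight and an inverted-list rank statistic could be computed as sketched below. The BM25 formula shown is one common variant and the helper names are hypothetical; the tool's exact weighting formulas may differ:

```python
from math import log

# One common form of the OKAPI-BM25 term weight (assumed variant).
def bm25_weight(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    idf = log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

# Rank position of doc_id in the inverted list of `term`: all documents
# containing the term, ordered by descending term weight.
def inverted_list_rank(weights_by_doc, term, doc_id):
    postings = sorted(
        (d for d, w in weights_by_doc.items() if term in w),
        key=lambda d: weights_by_doc[d][term],
        reverse=True,
    )
    return postings.index(doc_id) + 1
```

Averaging such rank positions over all terms of a document, and taking their variance, gives rank-based features of the kind described above.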

Program Arguments:

The software accepts the following arguments for running the code.

1. Directory location where fullText99.txt, ItemsetProcessing99.txt, and Settings.txt are saved.

2. Unique ID of vector file. For example, if your vector file name is fullText99.txt, then just put 99.

3. Not required. Just type 0.

4. The total number of vectors in the fullText99.txt.

5. Not required. Just type 0.

6. Retrieval Model ID. These are

........... (1) Normalized-TFIDF.

........... (2) OKAPI-BM25.

........... (3) Language Modeling (Bayesian Smoothing).

........... (4) Language Modeling (Jelinek Mercer).

........... (5) Language Modeling (Absolute Discounting).

........... (6) Language Modeling (TwoStage Smoothing).

........... (7) TFIDF.

........... (8) SMART Retrieval Model.

7. Parameter value of the retrieval model. In the case of Normalized-TFIDF, TFIDF, and the SMART retrieval model, just type 1, since these retrieval models do not need a parameter value. In the case of BM25, select a value of b between 0 and 1. In the case of Language Modeling (Bayesian Smoothing), select a value of \mu between 50 and 10000. In the case of Language Modeling (Jelinek Mercer), Language Modeling (Absolute Discounting), and Language Modeling (TwoStage Smoothing), select a value of \lambda between 0 and 1.

 

Download Code:

Example:

./code //my_doc_collection// 99 0 75000 0 2 .75

 

(C) Tool for Calculating Document Density based Features

 

This feature set is based on the density around the nearest neighbors of each document.
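One way to realize this idea, as a sketch: define a document's density as its average cosine similarity to its k nearest neighbors. This is an assumed definition for illustration; the tool's exact density measure may differ:

```python
from math import sqrt

# Cosine similarity between two sparse vectors ({term_id: weight}).
def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Assumed density measure: mean cosine similarity of document i to its
# k nearest neighbours in the collection.
def knn_density(vectors, i, k):
    sims = sorted(
        (cosine(vectors[i], v) for j, v in enumerate(vectors) if j != i),
        reverse=True,
    )
    top = sims[:k]
    return sum(top) / len(top)
```

A document sitting in a dense cluster of similar documents scores close to 1, while an outlier with no similar neighbors scores close to 0.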

Program Arguments:

The software accepts the following arguments for running the code.

1. Directory location where fullText99.txt, ItemsetProcessing99.txt, and Settings.txt are saved.

2. Unique ID of vector file. For example, if your vector file name is fullText99.txt, then just put 99.

3. Not required. Just type 0.

4. The total number of vectors in the fullText99.txt.

5. The maximum size of the vector for calculating the density.

6. Retrieval Model ID. These are

........... (1) Normalized-TFIDF.

........... (2) OKAPI-BM25.

........... (3) Language Modeling (Bayesian Smoothing).

........... (4) Language Modeling (Jelinek Mercer).

........... (5) Language Modeling (Absolute Discounting).

........... (6) Language Modeling (TwoStage Smoothing).

........... (7) TFIDF.

........... (8) SMART Retrieval Model.

7. Parameter value of the retrieval model. In the case of Normalized-TFIDF, TFIDF, and the SMART retrieval model, just type 1, since these retrieval models do not need a parameter value. In the case of BM25, select a value of b between 0 and 1. In the case of Language Modeling (Bayesian Smoothing), select a value of \mu between 50 and 10000. In the case of Language Modeling (Jelinek Mercer), Language Modeling (Absolute Discounting), and Language Modeling (TwoStage Smoothing), select a value of \lambda between 0 and 1.

 

Download Code:

Example:

./code //my_doc_collection// 99 0 75000 200 2 .75

 
