Tool for Calculating Document Retrievability with Query Expansion

September 2nd, 2011

 

The software can calculate the retrievability scores of documents with the following Query Expansion methods.

(1) Query Expansion based on language modeling approach (TS-LM) [Larkey02]: This method uses the top-n documents as the pseudo relevance feedback (PRF) set. The candidate expansion terms in the PRF set are ranked according to the sum of divergences between the documents in which they occur and the importance of the terms in the whole collection.

(2) Query Expansion based on Kullback-Leibler divergence (TS-KLD) [Croft00]: This method also uses the top-n documents for PRF selection. However, the candidate expansion terms in the PRF set are ranked according to the relative rareness of terms in the PRF set as opposed to the whole collection.

(3) (CCGen): A pseudo relevance feedback selection approach that selects documents for PRF on the basis of document clustering.
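To make the idea behind method (2) concrete, here is a minimal sketch of a Kullback-Leibler style term score. It is not taken from the tool's source code: the function name kld_term_scores, the toy vectors, and the use of unsmoothed maximum-likelihood probabilities are all assumptions made for illustration. Each term t in the PRF set gets the score p_PRF(t) * log(p_PRF(t) / p_C(t)), so terms that are frequent in the feedback documents but rare in the whole collection rank first.

```python
import math
from collections import Counter

def kld_term_scores(prf_docs, collection_docs):
    """Score PRF terms by p_prf(t) * log(p_prf(t) / p_coll(t)).

    Both arguments are lists of {term_id: term_frequency} dicts.
    Assumption: the PRF documents are part of the collection, so every
    PRF term has a non-zero collection probability.
    """
    prf_counts, coll_counts = Counter(), Counter()
    for doc in prf_docs:
        prf_counts.update(doc)
    for doc in collection_docs:
        coll_counts.update(doc)
    prf_total = sum(prf_counts.values())
    coll_total = sum(coll_counts.values())

    scores = {}
    for term, count in prf_counts.items():
        p_prf = count / prf_total
        p_coll = coll_counts[term] / coll_total
        scores[term] = p_prf * math.log(p_prf / p_coll)
    # Highest score first: the best expansion candidates.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data: one feedback document out of a two-document collection.
collection = [{1: 5, 2: 1}, {2: 9, 3: 5}]
ranked_terms = kld_term_scores(prf_docs=collection[:1], collection_docs=collection)
print(ranked_terms)
```

In this toy data, term 1 dominates the feedback document but is comparatively rare in the collection, so it is ranked above term 2.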

 

(A) Input File Format

The software requires three input files in order to run:

1. fullText99.txt

2. ItemsetProcessing99.txt

3. Settings.txt

 

1. fullText99.txt

This file contains the document vectors. 99 is the unique ID of the vector file. Each vector holds the term information of one document, with weights given as raw term frequencies. The term information consists of pairs of two fields: the numeric Term ID and the frequency of that term in the document. Each vector is terminated by the End Header Tag (-17 -17).

Example:

1 12 2 4 3 2 4 7 -17 -17
8 14 9 7 21 7 35 6 -17 -17
1 8 9 7 35 6 21 6 -17 -17

The above example contains three document vectors. Each Term ID is immediately followed by its term frequency in the document, so the first vector contains term 1 with frequency 12, term 2 with frequency 4, and so on. -17 -17 is the End Header Tag that closes each vector.
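The vector format can be read back with a few lines of Python. This is only a sketch for checking your own input files, not part of the tool; the function name parse_vectors is made up here, and it assumes that the file contains nothing but whitespace-separated integers.

```python
def parse_vectors(text):
    """Parse term-frequency vectors, each terminated by the pair -17 -17."""
    tokens = [int(tok) for tok in text.split()]
    vectors, current = [], {}
    # Tokens always come in (term ID, term frequency) pairs.
    for i in range(0, len(tokens) - 1, 2):
        term_id, freq = tokens[i], tokens[i + 1]
        if term_id == -17 and freq == -17:
            # End Header Tag: close the current vector.
            vectors.append(current)
            current = {}
        else:
            current[term_id] = freq
    return vectors

sample = """1 12 2 4 3 2 4 7 -17 -17
8 14 9 7 21 7 35 6 -17 -17
1 8 9 7 35 6 21 6 -17 -17"""

vectors = parse_vectors(sample)
print(len(vectors))
print(vectors[0])
```

Each vector becomes a dictionary that maps Term ID to term frequency, which makes it easy to verify that the file describes the documents you expect.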

 

2. ItemsetProcessing99.txt

This file contains the total number of documents and the highest Term ID in fullText99.txt. For the above example, the content of this file should look like this.

Example:

3
35

In the above, 3 indicates that there are three vectors in the fullText99.txt file, while 35 is the highest numeric Term ID in fullText99.txt.
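Since both values can be derived from the vector file, they can also be generated instead of written by hand. The following is a sketch under the assumption that the vector file holds only whitespace-separated integer pairs; the helper name itemset_processing_content is hypothetical and not part of the tool.

```python
def itemset_processing_content(full_text):
    """Return the two lines for ItemsetProcessing99.txt as one string."""
    tokens = [int(tok) for tok in full_text.split()]
    # Group the tokens into (term ID, term frequency) pairs.
    pairs = list(zip(tokens[0::2], tokens[1::2]))
    # Every -17 -17 pair closes exactly one vector.
    num_vectors = sum(1 for pair in pairs if pair == (-17, -17))
    # Highest term ID, ignoring the End Header Tags.
    max_term_id = max(term_id for term_id, _ in pairs if term_id != -17)
    return f"{num_vectors}\n{max_term_id}\n"

full_text = """1 12 2 4 3 2 4 7 -17 -17
8 14 9 7 21 7 35 6 -17 -17
1 8 9 7 35 6 21 6 -17 -17"""

content = itemset_processing_content(full_text)
print(content, end="")
```

Note that the sketch raises an error on an empty vector file, because there is no highest Term ID to report in that case.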

 

3. Settings.txt

This file contains the settings needed to run the code. It consists of the following 11 fields.

1. Total Number of Threads. (With a value of 2, the code runs as a single process only.)

2. Query Set File Path. The format of this file is given below.

3. Rank Cutoff Level 1.

4. Rank Cutoff Level 2.

5. Rank Cutoff Level 3.

6. Rank Cutoff Level 4.

7. Rank Cutoff Level 5.

8. Just write "not required".

9. Clusters Nearest Neighbors File Path. The format of this file is given below.

10. Just write "not required".

11. Just write "not required".

 

3.1. Query Set File

This file contains the queries that are used for calculating the retrievability scores. Each line represents one query and consists of two parts. The first part contains the query terms (numeric IDs), separated by spaces. The second part is the End Header Tag, represented by -17.

Example:

1 2 3 -17
9 21 35 -17
1 9 35 21 -17

The above example file contains three queries. -17 is the End Header Tag of each query, while the other numbers are the Term IDs of the query terms.
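A query file in this format can be validated with a short script before starting a long run. This is a sketch only; parse_queries is a made-up helper name, and the choice to skip blank lines is an assumption rather than documented behavior of the tool.

```python
def parse_queries(text):
    """Parse one query per line; each line must end with the tag -17."""
    queries = []
    for line in text.splitlines():
        tokens = [int(tok) for tok in line.split()]
        if not tokens:
            continue  # assumption: blank lines are ignored
        if tokens[-1] != -17:
            raise ValueError(f"query line lacks the -17 end tag: {line!r}")
        # Everything before the end tag is a query Term ID.
        queries.append(tokens[:-1])
    return queries

sample_queries = """1 2 3 -17
9 21 35 -17
1 9 35 21 -17"""

queries = parse_queries(sample_queries)
print(queries)
```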

 

3.2. Clusters Nearest Neighbors File

This file contains the nearest neighbors of the clusters. Each line describes one nearest neighbor of a cluster and is divided into three parts: cluster ID, nearest neighbor document ID, and the cosine similarity between them.

 

(B) Output File Format

Each line of the output file consists of 6 fields. The first five fields are the retrievability scores at the five rank cutoff levels given in the Settings.txt file. The last field is the total number of queries that retrieve the document when no rank cutoff level is applied.
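For readers unfamiliar with the measure, the sketch below shows how such per-document fields could be computed under the common cumulative definition of retrievability, where a document scores one point for every query that returns it at or above the rank cutoff. This is an assumption about the measure, not a description of the tool's internals; the function name retrievability and the toy rankings are invented for illustration.

```python
from collections import defaultdict

def retrievability(rankings, cutoffs):
    """Count, per document, the queries that retrieve it within each cutoff.

    rankings: one ranked list of document IDs per query (best first).
    Returns {doc_id: [score per cutoff..., count without any cutoff]}.
    """
    scores = defaultdict(lambda: [0] * (len(cutoffs) + 1))
    for ranked_docs in rankings:
        for rank, doc_id in enumerate(ranked_docs, start=1):
            for i, cutoff in enumerate(cutoffs):
                if rank <= cutoff:
                    scores[doc_id][i] += 1
            # Last field: retrieved at any rank, no cutoff applied.
            scores[doc_id][-1] += 1
    return dict(scores)

# Hypothetical result lists for three queries over documents 0, 1 and 2.
rankings = [[0, 1, 2], [1, 0], [2, 1, 0]]
scores = retrievability(rankings, cutoffs=[1, 2])
print(scores)
```

With five cutoffs instead of two, each document would end up with six values, matching the layout of the output file.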

 

(C) Program Arguments

The software accepts the following arguments for running the code.

1. Directory location where fullText99.txt, ItemsetProcessing99.txt, and Settings.txt are saved.

2. Unique ID of vector file. For example, if your vector file name is fullText99.txt, then just put 99.

3. Not required. Just type 0.

4. The total number of vectors in the fullText99.txt.

5. Not required. Just type 0.

6. Query Expansion Method ID. The available IDs are:

(11) Query Expansion based on language modeling approach (TS-LM) [Larkey02]: This method uses the top-n documents for PRF selection. The candidate expansion terms in the PRF set are ranked according to the sum of divergences between the documents in which they occur and the importance of the terms in the whole collection.

(12) Query Expansion based on Kullback-Leibler divergence (TS-KLD) [Croft00]: This method also uses the top-n documents for PRF selection. However, the candidate expansion terms in the PRF set are ranked according to the relative rareness of terms in the PRF set as opposed to the whole collection.

(10) (CCGen): A pseudo relevance feedback selection approach that selects documents for PRF on the basis of document clustering.

 

7. Parameter b of BM25, which is used for producing the initial ordering of documents. The value must be between 0 and 1; 0.75 is recommended.
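To show what parameter b controls, here is a minimal BM25 sketch over term-frequency vectors. It is not the tool's implementation: the value k1 = 1.2 and the IDF variant with +1 inside the logarithm are assumptions chosen for illustration, and bm25_score is a made-up name. With b = 0 the document length is ignored; with b = 1 term frequencies are fully normalized by document length.

```python
import math

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Score one document ({term_id: tf}) against a list of query term IDs."""
    n_docs = len(docs)
    avg_len = sum(sum(d.values()) for d in docs) / n_docs
    doc_len = sum(doc.values())
    # Length normalization: b = 0 disables it, b = 1 applies it fully.
    norm = 1.0 - b + b * doc_len / avg_len
    score = 0.0
    for term in query:
        tf = doc.get(term, 0)
        if tf == 0:
            continue
        df = sum(1 for d in docs if term in d)  # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        score += idf * tf * (k1 + 1.0) / (tf + k1 * norm)
    return score

# The three document vectors from the fullText99.txt example.
docs = [
    {1: 12, 2: 4, 3: 2, 4: 7},
    {8: 14, 9: 7, 21: 7, 35: 6},
    {1: 8, 9: 7, 35: 6, 21: 6},
]
query = [1, 2, 3]
bm25_scores = [bm25_score(query, doc, docs, b=0.75) for doc in docs]
print(bm25_scores)
```

For this query, the first document matches all three terms and scores highest, while the second document shares no term with the query and scores zero.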

 

Download Code:

Example:

./code //my_doc_collection// 99 0 75000 0 11 .75
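When the tool is started from a script, the seven arguments can be assembled and checked first. The helper below is a hypothetical sketch, not part of the distribution: build_command is an invented name, and the executable name ./code is simply taken from the example above.

```python
def build_command(directory, vector_id, num_vectors, method_id, b=0.75,
                  executable="./code"):
    """Assemble the argument list in the order described in section (C)."""
    if method_id not in (10, 11, 12):
        raise ValueError("method ID must be 10 (CCGen), 11 (TS-LM) or 12 (TS-KLD)")
    if not 0.0 <= b <= 1.0:
        raise ValueError("BM25 parameter b must be between 0 and 1")
    return [
        executable,
        directory,         # 1. directory with the three input files
        str(vector_id),    # 2. unique ID of the vector file
        "0",               # 3. not required
        str(num_vectors),  # 4. total number of vectors
        "0",               # 5. not required
        str(method_id),    # 6. query expansion method ID
        str(b),            # 7. BM25 parameter b
    ]

command = build_command("//my_doc_collection//", 99, 75000, 11, 0.75)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run from the Python standard library to launch the tool.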