Tool for Calculating Retrieval Model Effectiveness with Query Expansion and Selecting Pseudo-Relevance Feedback Documents Using Query Patent Similarity

September 2nd, 2011

 

Description of the method for selecting Pseudo Relevance Feedback using Query Patent Similarity

Prior-art search is an important task in patent retrieval. The success of this task relies on the selection of relevant search queries. Typically, terms for prior-art queries are extracted from the claim fields of query patents. However, because of the complex technical structure of patents, term mismatch, and vague terms, selecting relevant query terms is difficult. When we evaluate the retrievability coverage of prior-art queries generated from query patents, we observe a large bias toward a subset of the collection: a large number of patents either have a very low retrievability score or cannot be discovered by any query. To increase the retrievability of patents, we expand the prior-art queries generated from query patents using query expansion with pseudo-relevance feedback. Terms missing from query patents are discovered from feedback patents, and better patents for relevance feedback are identified using a novel approach that checks their similarity with the query patents. We specifically focus on how to automatically select better terms from query patents based on their proximity distribution with prior-art queries; the selected terms serve as features for computing similarity.

 

(A) Input File Format

The software requires three input files:

1. fullText99.txt

2. ItemsetProcessing99.txt

3. Settings.txt

 

1. fullText99.txt

This file contains the document vectors; 99 is the unique ID of the vector file. Each vector holds the term information of one document, with weights given as raw term frequencies. Each entry consists of two fields: the numeric term ID and the frequency of that term in the document. Each vector is terminated by the End Header Tag (-17 -17).

Example:

1 12 2 4 3 2 4 7 -17 -17
8 14 9 7 21 7 35 6 -17 -17
1 8 9 7 35 6 21 6 -17 -17

The above example contains three document vectors. Each term ID is immediately followed by its term frequency in the document; for instance, in the first vector, term 1 occurs 12 times. -17 -17 is the End Header Tag that closes each vector.
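
The vector format described above can be read with a few lines of Python. This is only a minimal sketch, not part of the tool itself: the function name parse_vectors is hypothetical, and it assumes the file is a whitespace-separated stream of (term ID, term frequency) pairs in which the pair -17 -17 closes each vector.

```python
from typing import Dict, List

END_TAG = -17  # a vector is closed by the pair "-17 -17"


def parse_vectors(text: str) -> List[Dict[int, int]]:
    """Return one {term_id: term_frequency} dict per document vector."""
    tokens = [int(tok) for tok in text.split()]
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of values (ID/frequency pairs)")

    vectors: List[Dict[int, int]] = []
    current: Dict[int, int] = {}
    for i in range(0, len(tokens), 2):
        term_id, frequency = tokens[i], tokens[i + 1]
        if term_id == END_TAG and frequency == END_TAG:
            vectors.append(current)  # end of this document vector
            current = {}
        else:
            current[term_id] = frequency
    if current:
        raise ValueError("last vector is not closed by -17 -17")
    return vectors
```

Applied to the three-line example, the function returns three dictionaries, the first of which maps term 1 to frequency 12.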

 

2. ItemsetProcessing99.txt

This file contains the total number of documents and the highest term ID in fullText99.txt. For the above example, the content of this file should look like this:

Example:

3
35

Here, 3 indicates that there are three vectors in fullText99.txt, and 35 is the highest numeric term ID in fullText99.txt.
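
Because both values are derived from the vector file, they can be computed rather than counted by hand. The following sketch shows one way to do that; the function name itemset_processing_content is hypothetical, and it assumes the two values are written on separate lines, as in the example.

```python
END_TAG = -17  # a vector is closed by the pair "-17 -17"


def itemset_processing_content(full_text: str) -> str:
    """Build the two-line file content from the text of the vector file."""
    tokens = [int(tok) for tok in full_text.split()]
    pairs = list(zip(tokens[0::2], tokens[1::2]))  # (term ID, frequency) pairs

    # Every "-17 -17" pair closes exactly one document vector.
    num_vectors = sum(1 for pair in pairs if pair == (END_TAG, END_TAG))
    # The highest term ID, ignoring the end tags.
    max_term_id = max(
        (term_id for term_id, _ in pairs if term_id != END_TAG), default=0
    )
    return f"{num_vectors}\n{max_term_id}\n"
```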

 

3. Settings.txt

This file contains the settings needed to run the code. It consists of the following 9 fields.

1. Query Topic File Name

2. Total number of topics in the Query Topic File.

3. Relevance Judgements of Query Topic.

4. Recall rank cutoff level.

5. Precision rank cutoff level.

6. Not used. Just write "not required".

7. Not used. Just write "not required".

8. Term Posting File. The format of this file is given below.

9. Classification Rules File for selecting the best terms from the Query Patent.

3.1. Query Topics File

This file contains the topics used for calculating the effectiveness of retrieval models. Each line represents one topic and has two parts. The first part is the topic text, represented by numeric term IDs. The second part is the End Header Tag, represented by -17.

Example:

1 2 3 -17
9 21 35 -17
1 9 35 21 -17

The above example file contains three queries. -17 is the End Header Tag of each query; the other numbers are query term IDs.
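
Reading the topics file is a line-by-line job. The sketch below is an illustration only; the function name parse_topics is hypothetical, and it assumes exactly one topic per line, closed by a single -17.

```python
from typing import List

END_TAG = -17  # each topic line ends with a single -17


def parse_topics(text: str) -> List[List[int]]:
    """Return the list of query term IDs for every topic, in file order."""
    topics: List[List[int]] = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        numbers = [int(tok) for tok in line.split()]
        if not numbers:
            continue  # tolerate blank lines
        if numbers[-1] != END_TAG:
            raise ValueError(f"line {line_no}: topic does not end with -17")
        topics.append(numbers[:-1])  # drop the end tag
    return topics
```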

 

3.2. Relevance Judgement of Topic Queries

This file contains the relevance judgements of the topic queries. Each line (a judged document of a topic query) holds the ID of a vector in fullText99.txt. A line containing -17 marks the end of the relevance judgements of the current topic.

Example:

6
10
7523
578
-17
98
42
68
-17

The above example file contains the relevance judgements of two topic queries. The positive numbers are vector IDs in fullText99.txt, and each -17 is the end tag of a topic's relevance judgements. The first topic has 4 relevance judgements (6, 10, 7523, 578) and the second topic has 3 (98, 42, 68).
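
The judgement file can be split into one set of relevant vector IDs per topic. This is a minimal sketch under the format described above; the function name parse_judgements is hypothetical.

```python
from typing import List, Set

END_TAG = -17  # a line holding -17 closes the judgements of one topic


def parse_judgements(text: str) -> List[Set[int]]:
    """Return one set of relevant vector IDs per topic, in file order."""
    topics: List[Set[int]] = []
    current: Set[int] = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        value = int(line)
        if value == END_TAG:
            topics.append(current)  # end of the current topic
            current = set()
        else:
            current.add(value)
    if current:
        raise ValueError("last topic is not closed by -17")
    return topics
```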

 

3.3. Classification Rules File

Download Term Classification Training Dataset for TREC Chemical Retrieval Track 2009 Dataset

This file contains the rules for classifying terms from Query Patents as good or bad. Supply the rules in the following style.

Feature1 = a
|   Feature2 <= 75: good (2.0)
|   Feature2 > 75: bad (3.0)
Feature3 = c: good (4.0)
Feature3 = d
|   Feature4 = TRUE: good (2.0)
|   Feature4 = FALSE: bad (3.0)

Download Trained J48 Rules for TREC Chemical Retrieval Track 2009 Dataset
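
To make the rule style concrete, the sketch below turns such a text into a flat list of rules and applies them to the features of one term. It is an illustration, not the tool's own parser: the names parse_rules and classify are hypothetical, one "|" is assumed per nesting level, nominal values are compared as strings (so pass "TRUE", not True), and the first rule whose conditions all hold decides the class.

```python
import operator
import re
from typing import Dict, List, Optional, Tuple

Condition = Tuple[str, str, str]    # (feature, operator, value)
Rule = Tuple[List[Condition], str]  # (conditions on the path, class label)

LINE_RE = re.compile(
    r"^(?P<bars>(?:\|\s*)*)"  # one '|' per nesting level
    r"(?P<feature>\w+)\s*(?P<op><=|>=|=|<|>)\s*(?P<value>[^:]+?)\s*"
    r"(?::\s*(?P<label>\S+)\s*(?:\(.*\))?)?\s*$"  # optional ': label (count)'
)
NUMERIC_OPS = {"<=": operator.le, ">=": operator.ge,
               "<": operator.lt, ">": operator.gt}


def parse_rules(text: str) -> List[Rule]:
    """Flatten the indented tree into (path conditions, label) rules."""
    rules: List[Rule] = []
    path: List[Condition] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        match = LINE_RE.match(line.strip())
        if match is None:
            raise ValueError(f"cannot parse rule line: {line!r}")
        depth = match.group("bars").count("|")
        del path[depth:]  # go back up to this line's parent
        path.append((match.group("feature"), match.group("op"),
                     match.group("value")))
        if match.group("label") is not None:  # leaf: emit a complete rule
            rules.append((list(path), match.group("label")))
    return rules


def _holds(condition: Condition, features: Dict[str, object]) -> bool:
    feature, op, value = condition
    if feature not in features:
        return False
    if op == "=":  # nominal test: plain string comparison
        return str(features[feature]) == value
    return NUMERIC_OPS[op](float(features[feature]), float(value))


def classify(rules: List[Rule], features: Dict[str, object]) -> Optional[str]:
    """Return the label of the first matching rule, or None if none match."""
    for conditions, label in rules:
        if all(_holds(cond, features) for cond in conditions):
            return label
    return None
```

For the rules shown above, a term with Feature1 = a and Feature2 = 80 falls under the second rule, since 80 > 75.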

 

3.4. Term Posting File

This file contains the postings of terms in the documents. It is needed to calculate the similarity between Query Patent terms and the terms of the queries.

 

(B) Effectiveness Measures

The effectiveness of retrieval models is calculated with the following measures:

1. Recall

2. Precision

3. Mean Average Precision

4. b-pref
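
The first three measures can be sketched with their standard textbook definitions. The code below is an illustration rather than the tool's implementation: all function names are hypothetical, a ranking is assumed to be a duplicate-free list of vector IDs ordered best first, and the relevant set is assumed to come from the relevance judgement file. b-pref is left out because it needs judged non-relevant documents, which the judgement format shown on this page does not list.

```python
from typing import List, Sequence, Set


def precision_at(ranking: Sequence[int], relevant: Set[int], cutoff: int) -> float:
    """Fraction of the top `cutoff` ranks that hold a relevant document."""
    hits = sum(1 for doc in ranking[:cutoff] if doc in relevant)
    return hits / cutoff


def recall_at(ranking: Sequence[int], relevant: Set[int], cutoff: int) -> float:
    """Fraction of all relevant documents found in the top `cutoff` ranks."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranking[:cutoff] if doc in relevant)
    return hits / len(relevant)


def average_precision(ranking: Sequence[int], relevant: Set[int]) -> float:
    """Mean of the precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)  # unretrieved relevant docs contribute 0


def mean_average_precision(rankings: List[Sequence[int]],
                           judgements: List[Set[int]]) -> float:
    """Average precision averaged over all topics."""
    scores = [average_precision(r, j) for r, j in zip(rankings, judgements)]
    return sum(scores) / len(scores)
```

For example, with the first topic's judgements (6, 10, 7523, 578) and a hypothetical ranking 6, 1, 10, 2, 578, precision at 5 is 3/5 and recall at 5 is 3/4.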

 

(C) Program Arguments

The software accepts the following arguments for running the code.

1. Directory where fullText99.txt, ItemsetProcessing99.txt, and Settings.txt are saved.

2. Unique ID of vector file. For example, if your vector file name is fullText99.txt, then just put 99.

3. Not required. Just type 0.

4. The total number of vectors in the fullText99.txt.

5. Not required. Just type 0.

6. Retrieval Model ID. These are

........... (21) The method that selects the PRF documents according to their similarity with the query patent via classified good terms.

7. The parameter \lambda of Language Modeling (Jelinek-Mercer smoothing), which is used to produce the initial ranking of documents. Use a value between 0 and 1; 0.7 is recommended.

8. Not required. Just type 0.
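
To illustrate what argument 7 controls, here is a minimal sketch of query-likelihood scoring with Jelinek-Mercer smoothing. This page does not say which side of the interpolation the parameter weights, so the sketch assumes the Zhai and Lafferty convention, in which it weights the collection model: p(t|d) = (1 - lam) * tf(t,d)/|d| + lam * cf(t)/|C|. The function names are hypothetical, and documents are assumed to be {term ID: term frequency} dictionaries as in fullText99.txt.

```python
import math
from typing import Dict, List, Sequence


def jm_score(query: Sequence[int], doc: Dict[int, int],
             collection_tf: Dict[int, int], collection_len: int,
             lam: float = 0.7) -> float:
    """Log query likelihood of one document under Jelinek-Mercer smoothing."""
    doc_len = sum(doc.values())
    score = 0.0
    for term in query:
        p_doc = doc.get(term, 0) / doc_len if doc_len else 0.0
        p_coll = collection_tf.get(term, 0) / collection_len
        # Assumption: lam weights the collection model (Zhai and Lafferty).
        p = (1.0 - lam) * p_doc + lam * p_coll
        if p > 0.0:  # terms unseen in the whole collection are skipped
            score += math.log(p)
    return score


def rank_documents(query: Sequence[int], docs: List[Dict[int, int]],
                   lam: float = 0.7) -> List[int]:
    """Return 0-based document indices, best score first."""
    collection_tf: Dict[int, int] = {}
    for doc in docs:
        for term, tf in doc.items():
            collection_tf[term] = collection_tf.get(term, 0) + tf
    collection_len = sum(collection_tf.values())
    scores = [jm_score(query, doc, collection_tf, collection_len, lam)
              for doc in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Under this convention, a value of 1 ignores the document entirely and gives every document the same score, while smaller values let the document's own term frequencies dominate.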

 

Download Code:

Example:

./code //my_doc_collection// 99 0 75000 0 21 .75 0
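
When the program is started from a script, the eight positional arguments can be assembled in one place so that the unused slots are always filled with 0. This is a sketch only: the helper name build_command is hypothetical, and the executable name ./code is taken from the example above.

```python
from typing import List


def build_command(executable: str, collection_dir: str, vector_file_id: int,
                  num_vectors: int, lam: float = 0.7,
                  model_id: int = 21) -> List[str]:
    """Assemble the argument list in the order described in section (C)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("the Jelinek-Mercer parameter must lie between 0 and 1")
    return [
        executable,
        collection_dir,       # 1. directory holding the three input files
        str(vector_file_id),  # 2. unique ID of the vector file, e.g. 99
        "0",                  # 3. not required
        str(num_vectors),     # 4. total number of vectors
        "0",                  # 5. not required
        str(model_id),        # 6. retrieval model ID
        str(lam),             # 7. Jelinek-Mercer parameter
        "0",                  # 8. not required
    ]
```

The resulting list can be passed directly to subprocess.run from the Python standard library.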