Wolfgang Tannebaum

Vienna University of Technology
Institute of Software Technology and Interactive Systems
Information and Software Engineering Group
 
Favoritenstrasse 9-11/188
A-1040 Wien
Austria

email: tannebaum@ifs.tuwien.ac.at


Team Member

Biography


In 2006 I received my Diploma in Patent Engineering from the University of Applied Sciences Amberg-Weiden (Germany). From 2006 until 2015 I worked in the patent department of W&H (Austria), one of the world's leading manufacturers of dental transmission instruments and devices. In 2015 I joined the legal department of medi (Germany), a leading, cutting-edge manufacturer of medical aids, sports textiles and fashion products that combine precision effectiveness with maximum comfort. There I am in charge of all patent and trademark topics. My main activities in patent management are patent monitoring, patent retrieval, patent analysis, drafting patent applications, and prosecuting patent applications.

Currently, I am a PhD student under the supervision of Professor Dr. Andreas Rauber at the Faculty of Informatics, Department of Software Technology and Interactive Systems of Vienna University of Technology (Austria).


Publications

Tannebaum, W., Mahdabi, P. and Rauber, A. 2015. Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching. In Proceedings of 6th International Conference of the CLEF Initiative (CLEF2015), Toulouse, France.

Tannebaum, W., Rauber, A. 2015. Learning Keyword Phrases from Query Logs of USPTO Patent Examiners for Automatic Query Scope Limitation in Patent searching. In World Patent Information, Volume 41.

Tannebaum, W., Rauber, A. 2015. PatNet: A lexical database for the patent domain. In Proceedings of the 37th European Conference on Information Retrieval (ECIR 2015), Vienna, Austria.

Tannebaum, W., Rauber, A. 2014. Using Query Logs of USPTO Patent examiners for automatic Query Expansion in Patent Searching. In Information Retrieval, Volume 17, Issue 5-6.

Tannebaum, W., Rauber, A. 2013. Mining Query Logs of USPTO Patent Examiners. In Proceedings of 4th International Conference of the CLEF Initiative (CLEF 2013), Valencia, Spain.

Tannebaum, W., Rauber, A. 2012. Acquiring lexical knowledge from Query Logs for Query Expansion in Patent Searching. In Proceedings of the 6th IEEE International Conference on Semantic Computing (IEEE ICSC 2012), Palermo, Italy.

Tannebaum, W., Rauber, A. 2012. Analyzing Query Logs of USPTO examiners to identify useful Query Terms in Patent Documents: A Preliminary Study. In Proceedings of the Information Retrieval Facility Conference (IRFC 2012), Vienna, Austria.

Tannebaum, W., Rauber, A. 2010. Query Expansion for Patent Retrieval using Domain Specific Thesaurus. In Proceedings of the 2010 Conference on the Interaction of Information Related Rights, Information Technology and Knowledge Management (KnowRight 2010), Vienna, Austria.

Research

In our research we investigate the problem of query term expansion (QTE) in the query generation step of patent searching. The goal is to suggest relevant expansion terms, in particular synonyms and equivalents, for a query term in a semi-automatic or fully automatic manner for Boolean retrieval. In our current work we analyze query logs of patent examiners of the United States Patent and Trademark Office (USPTO) to gain insights into the search behaviour and the characteristics of patent experts' queries.

Experimental Setup
Google began crawling the USPTO's public PAIR sites and provided free downloads of the patent applications and the query logs (http://www.google.com/googlebooks/uspto-patents.html). Since April 2013, Reed Technology, a contractor to the USPTO, has undertaken this task and hosts the data. For our experiments we selected fifteen classes, including topically related classes, completely disjoint classes, and classes with different numbers of query log files. We downloaded and preprocessed the 103,896 query logs available for these fifteen US classes, making this the largest collection of query logs used for experiments in the patent IR domain. The list of the patent application numbers used for the experiments is available here.
 
Approach used for knowledge extraction

We generate a list of download links for each class based on the download URL “http://storage.googleapis.com/uspto-pair/applications/APP_NUM.zip”, where we replace “APP_NUM” with the application numbers. We harvest the zip files via Wget*, a free software package for retrieving files from web servers. Next, we unzip the archives and filter the files by the file name ending “SRNT.pdf” to retrieve the query logs, called “Examiner's search strategy and results”. Then we convert the PDF files to TXT files by OCR using ABCocr**, a product for extracting text from images, on a Windows 7 platform. Subsequently, all terms are fed into the extraction process. After that, we generate n-grams, in particular 3-grams, 5-grams, 7-grams, and 9-grams, from the extracted text using AntConc***, a free n-gram extraction tool.
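The link generation, file filtering and n-gram steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the tooling used in the experiments (Wget, ABCocr and AntConc). The function names, the application number, the file name and the whitespace tokenization are assumptions made for the example, and the hyphen in "uspto-pair" is assumed as well.

```python
from typing import Iterable, List, Tuple

# URL template taken from the text; the hyphen in "uspto-pair" is an assumption.
URL_TEMPLATE = "http://storage.googleapis.com/uspto-pair/applications/{app_num}.zip"


def download_links(app_numbers: Iterable[str]) -> List[str]:
    """Build one download link per application number."""
    return [URL_TEMPLATE.format(app_num=number) for number in app_numbers]


def is_query_log(file_name: str) -> bool:
    """Keep only the search strategy PDFs, identified by their file name ending."""
    return file_name.endswith("SRNT.pdf")


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical inputs, for illustration only.
links = download_links(["12345678"])
tokens = "dental OR tooth ADJ brush".split()  # stands in for OCR output
trigrams = ngrams(tokens, 3)
```

A list produced by `download_links` could be written to a text file, one link per line, and passed to a download tool. The same `ngrams` function covers the 5-, 7- and 9-gram cases by changing `n`.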


Figure: General workflow of our proposed approach to acquire lexical knowledge from query logs of USPTO patent examiners.


To detect single term relations, in particular synonyms and keyword phrases, we then filter all 3-grams generated from the text queries that have the form “X b Y”, where b is the operator “OR” or “ADJ” and X and Y are query terms. We keep only relations with a frequency greater than 1 to exclude mismatches and misspellings. Further, to detect single term to phrase and phrase to phrase relations, we filter all 5-grams of the form “X b Y p Z” and “X p Y b Z”, and all 7-grams of the form “X p Y b Z p W”, where X, Y, Z and W are query terms, p is the proximity operator “ADJ” and b is the Boolean operator “OR”. Finally, to query the generated expansion lists, we load them into the open source thesaurus management software TheW32****.
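The pattern filtering described above might look roughly as follows in Python. This is a minimal sketch under stated assumptions, not the actual extraction code. The set of tokens treated as operators (`NON_TERMS`), the lowercasing of terms, the pattern strings such as "tbtpt" and all example n-grams are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Assumption: these upper-case tokens are operators, everything else is a query term.
NON_TERMS = {"OR", "ADJ", "AND", "NOT"}


def is_term(token: str) -> bool:
    return token.upper() not in NON_TERMS


def single_term_relations(trigrams: Iterable[Tuple[str, str, str]],
                          min_freq: int = 2) -> Dict[Tuple[str, str, str], int]:
    """Count 3-grams of the form "X b Y" (b is OR or ADJ) and keep frequency > 1."""
    counts = Counter()
    for x, b, y in trigrams:
        if b in ("OR", "ADJ") and is_term(x) and is_term(y):
            counts[(x.lower(), b, y.lower())] += 1
    return {rel: freq for rel, freq in counts.items() if freq >= min_freq}


def matches(ngram: Tuple[str, ...], pattern: str) -> bool:
    """Check an n-gram against a pattern: t = query term, b = "OR", p = "ADJ"."""
    if len(ngram) != len(pattern):
        return False
    for token, role in zip(ngram, pattern):
        if role == "b" and token != "OR":
            return False
        if role == "p" and token != "ADJ":
            return False
        if role == "t" and not is_term(token):
            return False
    return True


# Hypothetical 3-grams; the relation seen only once is dropped.
example = [("tooth", "OR", "teeth"), ("tooth", "OR", "teeth"),
           ("drill", "ADJ", "bit"), ("OR", "teeth", "ADJ")]
relations = single_term_relations(example)
# relations == {("tooth", "OR", "teeth"): 2}
```

With this encoding, the 5-gram forms “X b Y p Z” and “X p Y b Z” correspond to the patterns "tbtpt" and "tptbt", and the 7-gram form “X p Y b Z p W” corresponds to "tptbtpt".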

Lexical database PatNet
The source files used for generating the lexical database PatNet are available here. PatNet can be used to assist patent searchers in generating the invention diagram, in particular to expand the searchable features of an invention with additional query terms in a semi-automatic manner.

Example
Using PatNet for semi-automatic Query Term Expansion

The invention diagram includes in a first column the searchable features of the invention selected from a source document, particularly from a patent application or an invention report, and in a second column the corresponding expansion terms. For each searchable feature, PatNet suggests single terms (STR), keyword phrases (STPR), single terms that form a keyword phrase in combination with the query term, and finally synonym phrases (PPR).
Automatic QTE
First, we explored whether the query log-based query term expansion approach is useful for automatic QTE. In particular, we examined whether the performance of the approach depends on the training set and class size, and whether there are any advantages in using the US classification system to build US class-specific lexical databases. Further, we tested whether the query log-based approach outperforms standard dictionaries. The evaluation showed that recall and coverage increase with the availability of a larger set of query logs. On average, up to 8 out of 10 expansion terms (ETs) used by the examiners for query term expansion are suggested by the class-independent lexical databases. As expected, the class-specific and class-related databases achieve better precision scores than the class-independent databases. On average, 1 out of 20 ETs suggested by the class-independent databases was used by the examiners for QTE. This is similar to numbers achieved in related work for patent searching (about 5%). Further, the results show that the specific lexical databases drastically outperform the general-purpose source WordNet.
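The two ratios quoted above can be read as recall (the share of examiner-used ETs that were suggested) and precision (the share of suggested ETs that were used). The following Python sketch shows that reading on made-up data. The function name and the term sets are hypothetical and only chosen so that the numbers reproduce the "8 out of 10" and "1 out of 20" figures.

```python
from typing import Set, Tuple


def recall_precision(suggested: Set[str], used: Set[str]) -> Tuple[float, float]:
    """Recall: used ETs that were suggested. Precision: suggested ETs that were used."""
    hits = suggested & used
    recall = len(hits) / len(used) if used else 0.0
    precision = len(hits) / len(suggested) if suggested else 0.0
    return recall, precision


# Hypothetical data: 10 ETs used by an examiner, 160 ETs suggested, 8 in common.
used = {f"used_{i}" for i in range(10)}
suggested = {f"used_{i}" for i in range(8)} | {f"other_{i}" for i in range(152)}

recall, precision = recall_precision(suggested, used)
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.80 precision=0.05
```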
 
Query Term Expansion strategies
Further, we studied how the query log-based query term expansion model could be optimized to carry out effective QTE. We (1) used US class-specific and class-related ETs, (2) successively suggested ETs based on their frequency in the training set, and (3) suggested ETs based on the overlap of sense definitions. The experiments showed that the achieved precision scores (about 20%) significantly exceed the scores achieved in related work for patent searching (about 5%) and are comparable to numbers reported for professional academic search (about 17%). When the frequency of the extracted relations is considered and the highest ranked ETs are suggested successively, recall decreases only slightly (from 70% to 63%), while precision can be improved up to 22%. This expansion strategy fits very well with the recall-oriented patent search task and with query term expansion scenarios as they occur in patent searching, where search sessions extend over many queries that are gradually refined. To avoid time-consuming term selection from a complete list of ETs or an invention diagram, we recommend guiding users through the query expansion process instead of limiting the number of suggested ETs. Limiting the number had the effect that relevant ETs available in PatNet were not suggested.
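One possible way to suggest ETs successively by frequency, without cutting off the list, is to rank them once and hand them out in small batches. The Python sketch below illustrates this idea only. The function name, the batch size, the alphabetical tie-breaking and the example frequencies are assumptions and do not describe how PatNet is implemented.

```python
from typing import Dict, Iterator, List


def suggest_in_batches(et_freq: Dict[str, int], batch_size: int = 3) -> Iterator[List[str]]:
    """Yield ETs in batches, most frequent first; ties are broken alphabetically."""
    ranked = sorted(et_freq, key=lambda term: (-et_freq[term], term))
    for start in range(0, len(ranked), batch_size):
        yield ranked[start:start + batch_size]


# Hypothetical relation frequencies for the query term "tooth".
frequencies = {"teeth": 12, "molar": 4, "incisor": 4, "denture": 1}
batches = list(suggest_in_batches(frequencies, batch_size=2))
# batches == [["teeth", "incisor"], ["molar", "denture"]]
```

Because every ET eventually appears in some batch, low-frequency but relevant terms remain reachable, which a fixed cut-off would prevent.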
 
Effect of query log based QTE on document retrieval

Finally, we evaluated our log-based query term expansion approach on real patent searches. We used the CLEF-IP 2010 benchmark data set and measured the effect of synonymous query term expansion on retrieval effectiveness in patent searching. The experiments show that retrieval performance decreases or improves only slightly when PatNet is used for fully automatic QTE; no significant improvement was observed. Expanding the initial query terms with synonyms and equivalents radically widens the scope of the query topics, so many additional relevant and non-relevant documents are retrieved. Synonym expansion therefore has no general positive effect on retrieval performance. However, the analysis of the retrieval results shows that the query log-based QTE method does not generally have a negative effect on retrieval effectiveness either. Recall improves drastically for query topics where the baseline queries achieve, on average, only low recall values, but we have not detected any commonality that allows us to characterize these queries. We therefore recommend using synonyms for semi-automatic query term expansion in Boolean retrieval, where expanding query terms with synonyms and equivalents to widen the query scope is common practice.
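To make the scope widening concrete, the following Python sketch expands each query term into an OR group of its synonyms. It is an illustration under assumptions, not the query generation used in the experiments. The function name, the example synonym list and the choice to connect the original terms with AND are hypothetical.

```python
from typing import Dict, List


def expand_query(terms: List[str], synonyms: Dict[str, List[str]]) -> str:
    """Replace each query term by an OR group of the term and its synonyms."""
    groups = []
    for term in terms:
        alternatives = [term] + [s for s in synonyms.get(term, []) if s != term]
        if len(alternatives) == 1:
            groups.append(term)
        else:
            groups.append("(" + " OR ".join(alternatives) + ")")
    # Assumption: the original query terms are combined with AND.
    return " AND ".join(groups)


# Hypothetical synonym list, for illustration only.
query = expand_query(["tooth", "drill"], {"tooth": ["teeth"]})
print(query)  # (tooth OR teeth) AND drill
```

Each added synonym can only enlarge the set of matching documents, which is why recall may rise while many non-relevant documents are retrieved as well.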



*http://www.gnu.org/software/wget/
**http://www.websupergoo.com/abcocr-1.htm
***http://www.laurenceanthony.net/antconc_index.html
****http://publish.uwo.ca/~craven/freeware.htm