In 2006 I received my Diploma in Patent Engineering
at the University of Applied Science Amberg-Weiden (Germany). From 2006 until 2015 I worked
in the patent department of W&H (Austria), one of the world's leading manufacturers of dental
transmission instruments and devices. In 2015 I joined the legal department of medi (Germany),
a leading and cutting-edge manufacturer of medical aids,
sports textiles and fashion products that combine precision
effectiveness with maximum comfort. I am in charge of all patent and trademark topics. My
main activities in patent management are patent monitoring, patent retrieval,
patent analysis, drafting patent applications, and prosecuting patent applications.
Currently, I am a PhD student under the
supervision of Professor Dr. Andreas Rauber at the Faculty of Informatics,
Department of Software Technology and Interactive Systems of Vienna University
of Technology (Austria).
Tannebaum, W., Mahdabi, P., Rauber, A. 2015. Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching. In Proceedings of 6th International Conference of the CLEF Initiative (CLEF 2015), Toulouse, France.
Tannebaum, W., Rauber, A. 2015. Learning Keyword Phrases from Query Logs of USPTO Patent Examiners for Automatic Query Scope Limitation in Patent searching. In World Patent Information, Volume 41.
Tannebaum, W., Rauber, A. 2015. PatNet: A lexical database for the patent domain. In Proceedings of the 37th European Conference on Information Retrieval (ECIR 2015), Vienna, Austria.
Tannebaum, W., Rauber, A. 2014. Using Query Logs of USPTO Patent examiners for automatic Query Expansion in Patent Searching. In Information Retrieval, Volume 17, Issue 5-6.
Tannebaum, W., Rauber, A. 2013. Mining Query Logs of USPTO Patent Examiners. In Proceedings of 4th International Conference of the CLEF Initiative (CLEF 2013), Valencia, Spain.
Tannebaum, W., Rauber, A. 2012. Acquiring lexical knowledge from Query Logs for Query Expansion in Patent Searching. In Proceedings of the 6th IEEE Conference on Semantic Computing (IEEE ICSC 2012), Palermo, Italy.
Tannebaum, W., Rauber, A. 2012. Analyzing Query Logs of USPTO examiners to identify useful Query Terms in Patent Documents: A Preliminary Study. In Proceedings of the Information Retrieval Facility Conference (IRFC 2012), Vienna, Austria.
Tannebaum, W., Rauber, A. 2010. Query Expansion for Patent Retrieval using Domain Specific Thesaurus. In Proceedings of the 2010 Conference on the Interaction of Information Related Rights, Information Technology and Knowledge Management (KnowRight 2010), Vienna, Austria.
In our research we investigate the problem of query
term expansion (QTE) in the query generation step of patent searching. The
goal is to suggest relevant expansion terms, in particular synonyms and
equivalents, for a query term in a semi-automatic or fully automatic manner for
Boolean retrieval. In our current work we analyze query logs of patent
examiners of the United States Patent and Trademark Office (USPTO) to gain insights
into the search behaviour and the characteristics of patent experts' queries.
Experiment Setup
Google began crawling the USPTO's public PAIR
sites and provided free downloads of the patent applications and the query logs (http://www.google.com/googlebooks/uspto-patents.html). Since April 2013, Reed Technology, a contractor to
the USPTO, has undertaken this task and hosts the data. For our experiments we selected fifteen classes, covering topically
related classes, completely disjoint classes, and classes with different numbers
of query log files.
We downloaded and preprocessed the 103,896 query logs
available for the fifteen selected US classes, making this the largest collection of
query logs used for experiments in the patent IR domain. The list of the patent application numbers used for
the experiments is available here.
Approach used for knowledge extraction
We generate a list of download links for each class based on the
download URL “http://storage.googleapis.com/uspto-pair/applications/APP_NUM.zip”, where we replace “APP_NUM” in the URL
with the application numbers. We harvest the zip files via Wget,
a free software package for retrieving files from web servers. Next,
we unzip the files and filter them by the file name ending “SRNT.pdf” to
retrieve the query logs, called “Examiner's search strategy and
results”. Then we carry out OCR conversion on a Windows 7 platform, using the ABCocr
product to extract text from the images and
convert the PDF files to TXT files. Subsequently, all terms are fed
into the extraction process. We then generate n-grams, in
particular 3-grams, 5-grams, 7-grams, and 9-grams, from the extracted
text using AntConc,
a free n-gram extraction tool.
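The link generation, file filtering and n-gram steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling we used (Wget, ABCocr and AntConc did the actual work): the function names, the sample application number and the sample file name are hypothetical, and the hyphen in `uspto-pair` is an assumption about how the line-broken URL is joined.

```python
from typing import Iterable, List, Sequence, Tuple

# URL template quoted in the text; the hyphen in "uspto-pair" is an assumption
# about how the line-broken URL should be joined.
URL_TEMPLATE = "http://storage.googleapis.com/uspto-pair/applications/APP_NUM.zip"


def download_links(app_numbers: Iterable[str]) -> List[str]:
    """Build one download link per application number."""
    return [URL_TEMPLATE.replace("APP_NUM", number) for number in app_numbers]


def is_query_log(file_name: str) -> bool:
    """Keep only files whose name ends in "SRNT.pdf" (the query logs)."""
    return file_name.endswith("SRNT.pdf")


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    print(download_links(["12345678"]))        # hypothetical application number
    print(is_query_log("12345678-SRNT.pdf"))   # hypothetical file name
    print(ngrams("drill OR bore ADJ hole".split(), 3))
```

The resulting list of links could, for example, be written to a text file and handed to a download tool; the OCR step is not covered by this sketch.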
General workflow of our proposed approach to acquire lexical knowledge
from query logs of USPTO patent examiners.
To detect single term
relations, in particular synonyms and keyword phrases, we then filter all
3-grams generated from the text queries in the form “X b Y”, where b is the
Boolean operator “OR” or “ADJ” and X and Y are query terms. We keep only relations with a
frequency greater than 1 to exclude mismatches and
misspellings. Further, to detect single term to phrase and phrase to
phrase relations, we
filter all 5-grams in the form “X b Y p Z” and “X p Y b Z”, and all 7-grams
in the form “X p Y b Z p W”, where X, Y, Z and W are query terms, p is
the proximity operator “ADJ” and b is
the Boolean operator “OR”. Finally, to query the
generated expansion lists, we load them into the open source thesaurus
management software TheW32.
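As a rough illustration of the 3-gram filter, the following Python sketch counts relations of the form “X b Y” and keeps those seen more than once. It is a minimal sketch under assumptions: queries are taken to be whitespace-tokenized with upper-case operators, the `OPERATORS` set is a hypothetical list of tokens that must not be treated as query terms, and the sample queries are invented.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

RELATION_OPS = {"OR", "ADJ"}                # operators b named in the text
OPERATORS = {"OR", "ADJ", "AND", "NOT"}     # assumption: tokens that are never query terms

Relation = Tuple[str, str, str]


def single_term_relations(queries: Iterable[str], min_freq: int = 2) -> Dict[Relation, int]:
    """Count 3-grams "X b Y" and keep relations with frequency > 1 by default."""
    counts: Counter = Counter()
    for query in queries:
        tokens = query.split()
        for i in range(len(tokens) - 2):
            x, b, y = tokens[i:i + 3]
            # X and Y must be query terms, b must be "OR" or "ADJ".
            if b in RELATION_OPS and x not in OPERATORS and y not in OPERATORS:
                counts[(x.lower(), b, y.lower())] += 1
    return {rel: n for rel, n in counts.items() if n >= min_freq}


if __name__ == "__main__":
    sample = [                               # invented example queries
        "drill OR bore",
        "drill OR bore AND tooth",
        "data ADJ base",
        "drill OR drlil",                    # misspelling, seen only once
    ]
    print(single_term_relations(sample))
```

The frequency threshold is what removes the one-off misspelling in the sample; relations connected by “OR” would then be read as synonym candidates and those connected by “ADJ” as keyword phrase candidates.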
Lexical database PatNet
The source files used for generating the lexical database PatNet are available here.
PatNet can be used to assist patent searchers in generating
the invention diagram, in particular to
expand the searchable features of an invention with additional query terms in a semi-automatic manner.
Using PatNet for semi-automatic Query Term Expansion
The invention diagram lists in a first column the
searchable features of the invention selected from a source document,
particularly a patent application or an invention report, and in a second
column the corresponding expansion terms. For each searchable feature, PatNet suggests single terms (STR), keyword phrases (STPR), single terms that in
combination with the query term constitute a keyword phrase, and finally
synonym phrases (PPR).
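The two-column invention diagram can be pictured as a simple lookup structure. The sketch below is illustrative only: `TOY_LEXICON` is a hypothetical excerpt and does not reproduce real PatNet entries, and the dictionary layout keyed by STR, STPR and PPR is an assumption about how such a database might be held in memory.

```python
from typing import Dict, List, Sequence, Tuple

Lexicon = Dict[str, Dict[str, List[str]]]

# Hypothetical toy excerpt; real PatNet entries are not reproduced here.
TOY_LEXICON: Lexicon = {
    "handpiece": {
        "STR": ["instrument"],
        "STPR": ["dental handpiece"],
        "PPR": [],
    },
}


def invention_diagram(features: Sequence[str], lexicon: Lexicon) -> List[Tuple[str, List[str]]]:
    """Pair each searchable feature (column 1) with suggested expansion terms (column 2)."""
    rows = []
    for feature in features:
        entry = lexicon.get(feature.lower(), {})
        suggestions = [term for rel in ("STR", "STPR", "PPR") for term in entry.get(rel, [])]
        rows.append((feature, suggestions))
    return rows


if __name__ == "__main__":
    for feature, suggestions in invention_diagram(["handpiece", "turbine"], TOY_LEXICON):
        print(f"{feature:<12} | {', '.join(suggestions)}")
```

In a semi-automatic setting the searcher would pick from the second column rather than accept every suggestion.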
At first we explored whether the query log-based query
term expansion approach is useful for automatic QTE.
In particular, we tried to find out whether the
performance of our query term expansion approach depends on the training set
and class size, and whether there are any advantages in using the US classification
system to build US class-specific lexical databases. Further, we evaluated
whether the query log-based query term expansion approach outperforms standard
dictionaries. The evaluation has shown that recall and coverage
measures increase with the availability of a larger set of query logs. On
average, up to 8 out of 10 expansion terms (ETs)
used by the examiners for query term expansion are suggested by the
class-independent lexical databases. As expected, the class-specific and class-related databases
achieve better precision scores than the class-independent databases. On
average, 1 out of 20 ETs
suggested by the class-independent databases was used by the
examiners for QTE.
This is similar to numbers achieved in related work for
patent searching (about 5%). Further, the results of the evaluation show that
the specific lexical databases drastically outperform the general-purpose dictionaries.
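To make the two measures concrete, the sketch below computes them for a single query term. It assumes that recall is the share of examiner-used expansion terms that were suggested, and precision the share of suggested terms that the examiner actually used; the function name and the sample terms are hypothetical.

```python
from typing import Iterable, Tuple


def recall_precision(suggested: Iterable[str], used: Iterable[str]) -> Tuple[float, float]:
    """Return (recall, precision) of suggested expansion terms against the terms actually used."""
    suggested_set, used_set = set(suggested), set(used)
    hits = suggested_set & used_set
    recall = len(hits) / len(used_set) if used_set else 0.0
    precision = len(hits) / len(suggested_set) if suggested_set else 0.0
    return recall, precision


if __name__ == "__main__":
    # Invented example: four suggestions, two terms used by the examiner.
    print(recall_precision(["bore", "hole", "aperture", "slot"], ["bore", "opening"]))
```

Under these definitions, "8 out of 10" corresponds to a recall of 0.8 and "1 out of 20" to a precision of 0.05.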
Query Term Expansion
Further, we studied how the query log-based query
term expansion model could be optimized to carry out effective QTE. We (1) used patent US class-specific and
class-related ETs, (2) successively
suggested ETs based on their frequency
in the training set, and (3) suggested
ETs based on overlap of sense definitions.
The results of the experiments showed that the
achieved precision scores (about 20%) significantly exceed the scores achieved
in related work for patent searching (about 5%) and are comparable to numbers
reported for professional academic search (about 17%). Only a minor decrease in recall (from 70% to 63%)
was observed when considering the frequency of the extracted relations and
successively suggesting the highest-ranked
ETs, while precision improved to up to 22%. This expansion strategy
fits very well with the recall-oriented patent search task and with query term
expansion scenarios as they occur in patent searching, where search sessions
extend over many queries that are gradually refined. To avoid time-consuming term selection from a
complete list of ETs or invention
diagram, we recommend guiding users through the query expansion process
instead of limiting the number of suggested
ETs. Limiting the number had the effect that relevant ETs (available in PatNet)
were not suggested.
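Successive suggestion by frequency can be sketched as a generator that hands out the highest-ranked ETs first and the remaining ones in later rounds, so that no ET is cut off. This is a minimal sketch, not our evaluation code: the function name, the batch size and the sample frequencies are hypothetical, and ties are broken alphabetically as an arbitrary choice.

```python
from typing import Dict, Iterator, List


def suggest_in_batches(et_frequency: Dict[str, int], batch_size: int) -> Iterator[List[str]]:
    """Yield expansion terms in batches, most frequent first, until all are offered."""
    ranked = sorted(et_frequency, key=lambda term: (-et_frequency[term], term))
    for start in range(0, len(ranked), batch_size):
        yield ranked[start:start + batch_size]


if __name__ == "__main__":
    frequencies = {"bore": 5, "hole": 9, "aperture": 2}   # invented training-set counts
    for round_no, batch in enumerate(suggest_in_batches(frequencies, 2), start=1):
        print(round_no, batch)
```

A user who is satisfied after the first round can stop early, while low-frequency ETs remain reachable in later rounds.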
Effect of query log based QTE on document retrieval
Finally, we evaluated our log-based query term
expansion approach on real patent searches. We used the CLEF-IP 2010 benchmark data set and
measured the effect of synonymous query term expansion on retrieval
effectiveness in patent searching.
The experiments show that the retrieval performance
decreases or can only be slightly improved when using PatNet for fully automatic QTE. No significant improvement can be
observed. Expanding the initial query terms
with synonyms and equivalents radically expands the query scope of the query topics, so that
many additional relevant and non-relevant documents are retrieved.
Thus, synonym expansion generally has no positive effect on retrieval
effectiveness.
However, the analysis of the retrieval results shows
that the query log-based QTE method
does not generally have a negative effect on retrieval effectiveness.
Recall is drastically improved for query topics where the baseline queries
achieve, on average, only low recall values. We have not, however, detected any
commonality that allows us to characterize these queries. We therefore recommend using synonyms for semi-automatic
query term expansion in Boolean retrieval, where expanding query terms with
synonyms and equivalents with the aim of expanding the query scope is a common
practice.
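For illustration, synonym expansion of a Boolean query can be sketched as follows. The sketch assumes that the original query terms are combined with “AND” and that each term is widened with its selected synonyms via “OR”; the function name and the sample synonyms are hypothetical and do not come from PatNet.

```python
from typing import Dict, List, Sequence


def expand_boolean_query(terms: Sequence[str], synonyms: Dict[str, List[str]]) -> str:
    """Widen each query term with its synonyms (OR) and join the groups with AND."""
    groups = []
    for term in terms:
        alternatives = [term] + [s for s in synonyms.get(term, []) if s != term]
        if len(alternatives) > 1:
            groups.append("(" + " OR ".join(alternatives) + ")")
        else:
            groups.append(term)          # no synonyms selected: keep the term as is
    return " AND ".join(groups)


if __name__ == "__main__":
    selected = {"drill": ["bore", "burr"]}   # invented synonyms chosen by the searcher
    print(expand_boolean_query(["drill", "tooth"], selected))
```

In the semi-automatic scenario the `synonyms` mapping would contain only the terms the searcher accepted, which keeps the broadening of the query scope under the searcher's control.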