Data Sources in Patent Data Mining

In a scientific context, patent retrieval was first introduced in the NII's NTCIR campaigns (2002 to 2007). Several initiatives that included patent retrieval as a research topic followed, e.g. the Dutch-Belgian Information Retrieval workshop, the International Workshop on Advances in Patent Information Retrieval (AsPIRe), Patent Information Retrieval (PaIR), the TExt REtrieval Conference Chemical track (TREC-Chem, 2009-2011) and the Information Retrieval Facility Symposium and Conference (2008-2014).

The largest academic research impact in Europe has been made by the CLEF-IP tracks (2009-2013), which were part of the Cross-Language Evaluation Forum (CLEF). The CLEF-IP track started in 2009 with the Prior Art Candidate Search task, and in 2012 paragraph retrieval was introduced in the Claims to Passage task.

Benchmarking data and test collections are available at:

  • CLEF-IP: Retrieval in the Intellectual Property Domain (website)
  • TREC-Chem: TREC Chemical IR Track (website)
  • NII's NTCIR: Evaluation of Information Access Technologies (website)


WPI Test Collection:

The World Patent Information (WPI) journal aims to publish new research and insights covering a broad spectrum of intellectual property information retrieval and patent analytics practices and methods. To support these objectives, the WPI journal editors, together with the team at IFI CLAIMS® Patent Services, have put together a patent research collection that is publicly available free of charge. It is intended to foster good scientific practice: comparability, reproducibility, transparency and repeatability of experiments and results. For this reason the WPI collection is static and will not be updated with new data. Otherwise, changes in experimental results could not be traced with confidence to the new methods being proposed rather than to changes in the data set.

  • The test collection is available here.
  • Further information: Mihai Lupu. (2019). WPI Test Collection (Version 1) [Data set]. World Patent Information.


Linked Open EP Data:

Linked Open EP Data uses Uniform Resource Identifiers (URIs) to identify patent applications, publications and other resources present in patent data. The URIs make it possible to link the data to other data sets. The data set covers the most relevant, but not all, available bibliographic data elements for patents. It also includes references to the full-text publications in PDF, HTML and XML format, which are stored on the European Publication Server.

  • The data is accessible here.
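
To illustrate why shared URIs make linking possible, the following minimal sketch joins two small in-memory record sets on a URI key. All URIs, field names and records are hypothetical placeholders chosen for illustration; they are not actual identifiers or properties from Linked Open EP Data.

```python
# All URIs and field names below are hypothetical placeholders,
# not real identifiers from Linked Open EP Data.
patent_records = {
    "https://example.org/publication/EP/0000001": {"title": "Example publication A"},
    "https://example.org/publication/EP/0000002": {"title": "Example publication B"},
}

# A second, independent data set that refers to the same resources by URI.
external_records = [
    {"patent_uri": "https://example.org/publication/EP/0000001", "note": "cited in report X"},
    {"patent_uri": "https://example.org/publication/EP/0000003", "note": "no match in first set"},
]


def link_by_uri(patents, external):
    """Merge external rows into patent rows wherever the URI matches."""
    linked = []
    for row in external:
        patent = patents.get(row["patent_uri"])
        if patent is not None:
            linked.append({"uri": row["patent_uri"], **patent, "note": row["note"]})
    return linked


linked = link_by_uri(patent_records, external_records)
print(linked)
```

Because both data sets use the same URI for the same resource, no fuzzy matching on titles or numbers is needed; rows without a matching URI are simply left out of the result.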


EP Full-Text Data for Text Analytics:

This is a new data set that the EPO has made available especially for data analysis. It contains basic bibliographic information and the full text of each European patent publication. The data set can be used under the Creative Commons Attribution 4.0 International Public License and can be accessed via the Google Cloud platform. For more information, see the data set description page here.


Google Patent Services:

In October 2017, Google launched several patent-related data collections and services. Google provides the Google Patents Public Datasets on BigQuery, a collection of publicly accessible, connected database tables for empirical analysis of the international patent system. The data sets make it possible to formulate and answer search-oriented questions using SQL. For instance, it is possible to formulate questions such as "what percentage of the patents have more than one inventor?" or "what funding does the government provide to promote innovation in certain patent areas?"

  • The data sets are available here.
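
As a rough illustration of the multiple-inventor question mentioned for the Google data, the sketch below shows how such a query might be phrased and what it computes. The table name `patents-public-data.patents.publications` and the repeated `inventor` column are assumptions about the public schema and should be verified before running anything on BigQuery. The query counts publication records rather than distinct patents. The Python function applies the same calculation to a few hypothetical sample rows so the logic can be checked locally without a BigQuery account.

```python
# Sketch only: the table and column names are assumptions about the
# public data set's schema and should be checked before use on BigQuery.
MULTI_INVENTOR_SQL = """
SELECT
  100 * COUNTIF(ARRAY_LENGTH(inventor) > 1) / COUNT(*) AS pct_multi_inventor
FROM `patents-public-data.patents.publications`
"""


def pct_multi_inventor(publications):
    """Return the percentage of publications listing more than one inventor."""
    if not publications:
        return 0.0
    multi = sum(1 for pub in publications if len(pub["inventor"]) > 1)
    return 100.0 * multi / len(publications)


# Hypothetical sample rows, used only to illustrate the calculation locally.
sample = [
    {"publication_number": "XX-1", "inventor": ["A"]},
    {"publication_number": "XX-2", "inventor": ["A", "B"]},
    {"publication_number": "XX-3", "inventor": ["C", "D", "E"]},
    {"publication_number": "XX-4", "inventor": []},
]

print(pct_multi_inventor(sample))
```

Two of the four sample rows list more than one inventor, so the local function reports 50.0.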