Last year, the Image2Structure task of TREC-CHEM invited participants to provide chemical structure files given a raster image of a chemical compound. The results were very positive, so this year we go one step further to make the task more useful for practitioners.
The 2012 CLEF-IP chemical image task gives you a full patent document as a PDF or TIFF file and asks you to (1) extract the locations of the chemical images in this document and (2) convert the images recognized in this way into structure files.
TIFF images containing full patent documents.
So that participants who may not have the resources to extract chemical images from the full-page images can also take part, we will also make available a set of already extracted chemical images, as the Image-to-Structure task did in 2011.
US-20010014694-A1,1,1505,1085,295,52
US-20010014694-A1,4,1505,1151,295,52
US-20010014694-A1,5,290,2486,328,257
...
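The records above can be read and written with a few lines of Python. This is only a sketch: the page does not name the six fields, so the interpretation used here (patent identifier, page number, x, y, width, height) is an assumption inferred from the file names of the extracted clips, and `BoundingBox`, `parse_box` and `format_box` are hypothetical names. The README shipped with the evaluation program remains the authority on the format.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    # Field meanings are assumed, not confirmed by the task description.
    patent: str
    page: int
    x: int
    y: int
    width: int
    height: int


def parse_box(line: str) -> BoundingBox:
    """Parse one comma-separated record such as
    'US-20010014694-A1,4,1505,1151,295,52'."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    patent, *numbers = fields
    page, x, y, width, height = (int(n) for n in numbers)
    return BoundingBox(patent, page, x, y, width, height)


def format_box(box: BoundingBox) -> str:
    """Write a box back in the same comma-separated layout."""
    return ",".join(str(value) for value in box)


if __name__ == "__main__":
    box = parse_box("US-20010014694-A1,4,1505,1151,295,52")
    print(box)
    print(format_box(box))
```

Keeping the record as a named tuple makes it easy to sort boxes by page or to write them back unchanged.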
For the set of already extracted images that we provide as input (technically, this should be a subset of what participants extract in the first phase), the output is the corresponding MOL files.
US20010014694A1_p0004_x1505_y1151_c00001.tif:
US20010014694A1_p0005_x0290_y2486_c00003.tif:
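The clip file names appear to encode where each image came from: the patent identifier without hyphens, then the page (`p`), the x and y position (`x`, `y`), and a counter (`c`). That reading is an assumption based on the examples here; the meaning of the `c` field in particular is not stated. Under that assumption, a minimal sketch for taking the names apart and matching them against hyphenated patent identifiers could look like this (`ClipName`, `parse_clip_name` and `same_patent` are hypothetical names):

```python
import re
from typing import NamedTuple

# Assumed layout: <patent>_p<page>_x<x>_y<y>_c<counter>.tif
CLIP_NAME = re.compile(
    r"^(?P<patent>[^_]+)_p(?P<page>\d+)_x(?P<x>\d+)"
    r"_y(?P<y>\d+)_c(?P<counter>\d+)\.tif$"
)


class ClipName(NamedTuple):
    patent: str
    page: int
    x: int
    y: int
    counter: int  # meaning of this field is not documented


def parse_clip_name(name: str) -> ClipName:
    """Split a clip file name into its (assumed) components.
    A trailing colon, as in the listing, is tolerated."""
    match = CLIP_NAME.match(name.strip().rstrip(":"))
    if match is None:
        raise ValueError(f"unrecognized clip file name: {name!r}")
    return ClipName(
        match["patent"],
        int(match["page"]),
        int(match["x"]),
        int(match["y"]),
        int(match["counter"]),
    )


def same_patent(clip: ClipName, hyphenated_id: str) -> bool:
    """Compare against an identifier written like 'US-20010014694-A1'."""
    return clip.patent == hyphenated_id.replace("-", "")


if __name__ == "__main__":
    clip = parse_clip_name("US20010014694A1_p0004_x1505_y1151_c00001.tif")
    print(clip, same_patent(clip, "US-20010014694-A1"))
```

With this reading, the first file name corresponds to the bounding-box record for page 4 at position (1505, 1151) of the same patent.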
Here you have a set of 30 patents and manually extracted image clips.
Here you have a set of patent files for which you need to extract the bounding boxes of all chemicals.
Here you have a set of tif files for which you need to provide MOL structure files, similarly to last year's image-to-structure recognition task at TREC-CHEM.
The segmentation results evaluation program is also available here. It includes a single .java file, a README.txt that explains everything (I hope) and a tiny sample ground truth and evaluation set file just to check that it has compiled and is working properly. Thanks to Alan Sexton for writing it.
The structure recognition task uses a Perl script and the Chemistry::OpenBabel module to identify the structure your method returns and compare it against the existing MOL file. However, this method works only for those structures for which an InChI can be generated. For the training data, this means 94 out of the 133 pairs we provided. For the rest, the verification has to be done manually. Here is the script.
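The decision logic described above can be sketched in Python as follows. This is not the organizers' Perl script, only an illustration of the idea: `to_inchi` is a hypothetical callable that you would have to supply (for example, a wrapper around a cheminformatics toolkit) and that returns an InChI string, or `None` when no InChI can be generated. Treating a missing InChI for the submitted structure as a mismatch is also an assumption.

```python
from typing import Callable, Optional

# Hypothetical converter type: MOL file content in, InChI string
# (or None if no InChI can be generated) out.
InchiConverter = Callable[[str], Optional[str]]


def compare_structures(
    predicted_mol: str,
    reference_mol: str,
    to_inchi: InchiConverter,
) -> str:
    """Return 'match', 'mismatch', or 'manual'.

    'manual' means the reference structure has no InChI, so the pair
    cannot be checked automatically and needs a human look.
    """
    reference = to_inchi(reference_mol)
    if not reference:
        return "manual"
    predicted = to_inchi(predicted_mol)
    # Assumption: no InChI for the prediction counts as a mismatch.
    return "match" if predicted == reference else "mismatch"


def summarize(results: list) -> dict:
    """Count how many pairs fall into each outcome."""
    counts = {"match": 0, "mismatch": 0, "manual": 0}
    for outcome in results:
        counts[outcome] += 1
    return counts
```

Separating the converter from the comparison keeps the sketch independent of any particular toolkit and makes the "manual" cases easy to list for later inspection.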
For questions and anything else regarding this task, please contact the organizers at clef-ip-chem at ifs.tuwien.ac.at.