Wednesday 6 May 2009

The WikiChem challenge

We've got lots of problems at Wikipedia Chemistry: you all know that! Let me just tell you about one of them…

We have roughly 25,000 structure diagrams of molecules at Wikimedia Commons – that's my guesstimate, we don't even have exact statistics on the number – and the only way we can find out what we have is to index them by hand. The vast majority of these structure diagrams were created using specialized software, the kind that KNOWS that a carbon atom doesn't have to be labelled with a "C" and that it usually has a valency of four. But then they were exported into an image file where all chemical information was lost.

Now 25,000 structure images is only a tiny fraction of the chemical structure diagrams that are out there on the web, but you can be certain that all of them have been shorn of their chemical relevance for any but trained human eyes. To a search engine, or a spider, they are simply .jpg or .gif or .png…

What we would like is for such images to have embedded metadata to identify the compound to non-human readers. Let's start by suggesting both flavours of InChI and both their hashed keys. We would like this metadata to be added automagically by the software that creates the image in the first place: after all, it is only the very same data that allowed the image to be created!
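As a rough illustration of what such embedding might look like for PNG, here is a minimal Python sketch that stores an InChI and its InChIKey in generic `tEXt` chunks and reads them back. The keywords `InChI` and `InChIKey` are assumptions made up for this example, not part of any agreed standard, and every function name is hypothetical. The compound used is ethanol.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png: bytes):
    """Yield (type, data) for every chunk in a PNG byte string."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        yield chunk_type, data
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def add_text(png: bytes, keyword: str, text: str) -> bytes:
    """Return a copy of the PNG with a tEXt chunk inserted right after IHDR."""
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    out = [PNG_SIGNATURE]
    for chunk_type, data in iter_chunks(png):
        out.append(make_chunk(chunk_type, data))
        if chunk_type == b"IHDR":
            out.append(make_chunk(b"tEXt", payload))
    return b"".join(out)


def read_text(png: bytes) -> dict:
    """Collect all tEXt chunks as a keyword -> text dictionary."""
    found = {}
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            key, _, value = data.partition(b"\x00")
            found[key.decode("latin-1")] = value.decode("latin-1")
    return found


def tiny_png() -> bytes:
    """A 1x1 white grayscale PNG, standing in for a real structure diagram."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\xff")  # filter byte 0, then one white pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


# Hypothetical keywords; no public standard fixes these names.
ETHANOL_INCHI = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ETHANOL_KEY = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"

tagged = add_text(tiny_png(), "InChI", ETHANOL_INCHI)
tagged = add_text(tagged, "InChIKey", ETHANOL_KEY)
print(read_text(tagged))
```

Writing the bytes is the easy part. Nothing in this sketch tells a search engine which keyword to look for, or stops other software from dropping the chunk when the image is re-saved.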

Sounds simple? Of course it isn't! Wikipedia chemists have had various contacts with chemical software companies over the last two years, but it seems that we're still no nearer the goal. One problem is that not all file formats can easily support metadata: we will, in effect, have to create not just one new industry standard but three or four! Another problem is how to treat reaction schemes, or molecular fragments etc – let's leave that to one side for the minute, and concentrate on how this could work for discrete molecules.
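For a discrete molecule in a format that does have a place for metadata, the idea can be sketched quite briefly. The Python example below adds a `<metadata>` element to an SVG drawing and reads the identifiers back. The `chem` namespace URI and the element names `inchi` and `inchikey` are placeholders invented for this sketch. Choosing the real ones is exactly what a public standard would have to settle.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
CHEM_NS = "http://example.org/chem-metadata"  # hypothetical namespace, not a standard

# Keep SVG as the default namespace when serializing.
ET.register_namespace("", SVG_NS)
ET.register_namespace("chem", CHEM_NS)


def embed_identifiers(svg_text: str, inchi: str, inchikey: str) -> str:
    """Return the SVG with a <metadata> block holding the two identifiers."""
    root = ET.fromstring(svg_text)
    meta = ET.Element(f"{{{SVG_NS}}}metadata")
    ET.SubElement(meta, f"{{{CHEM_NS}}}inchi").text = inchi
    ET.SubElement(meta, f"{{{CHEM_NS}}}inchikey").text = inchikey
    root.insert(0, meta)  # metadata first, drawing elements after
    return ET.tostring(root, encoding="unicode")


def read_identifiers(svg_text: str) -> dict:
    """Collect every element in the hypothetical chem namespace."""
    prefix = f"{{{CHEM_NS}}}"
    root = ET.fromstring(svg_text)
    return {el.tag[len(prefix):]: el.text
            for el in root.iter() if el.tag.startswith(prefix)}


# A one-line drawing standing in for a real structure diagram.
plain_svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">'
    '<line x1="0" y1="0" x2="10" y2="10" stroke="black"/></svg>'
)
tagged_svg = embed_identifiers(
    plain_svg,
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",   # ethanol
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
)
print(read_identifiers(tagged_svg))
```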

At Wikipedia, we have amassed a certain amount of experience in the online publishing of chemical data, especially images. This experience is freely available to anyone who wishes to tackle the problem of chemical metadata in structure diagrams, so long as the solution is one that is open to the rest of the chemical community. Any takers?

3 comments:

  1. While at ACD/Labs we did the work to embed InChIs into structure images for Wikipedia... over 2.5 years ago. Is anyone using them on Wikipedia? If not, is there a reason? Is it the issue of SVG versus other formats that we seem to hit regularly?

    On ChemSpider we can easily pass you a set of PNG images for a set of structures, certainly all of those that have Wikipedia links associated... about 5400 compounds. If you want us to embed InChIs and other metadata, we can. But will people use them? Based on the discussions to date there are other challenges in the way... primarily the format has to match the ACS display format and the images have to be SVG. I think these are actually bigger issues than the technology of adding metadata, and here you are looking for a collective agreement. How to get it done???

  2. A long time ago, people also showed how to embed chemistry files, e.g. MDL molfiles and CML, in PNG images.

    Additionally, tools like Strigi-Chemistry can handle and extract this information again... even if the image is part of a PDF.

  3. Esteemed colleagues, I don't doubt you in the slightest. I know that Martin Walker had discussions with ACD/Labs at least two years ago on this very subject, and I can believe that ACD/Labs did what was necessary (although I never heard of the result). Similarly, I can quite believe Egon Willighagen when he says that the concept has been proved: the contrary would be ridiculous, as it's not that hard to do.

    The problem comes in the practical application. The current PNG standard provides only generic text chunks, with nothing standardized for chemical metadata, so any application which is PNG compliant can strip the add-ons without mercy… and they do. Without a public standard, how are the public supposed to know what to look for? Can either of you give me an example of a PNG image created under your protocols and which is accessible to Google through its metadata?

    As for SVG, the problem is simpler. BKChem already adds CML metadata to its images. The question then arises, which data should be added? There is a need for a public standard on these matters, just as ACS parameters are a de facto standard for structure drawing at normal resolutions. THAT is the WikiChem challenge.
