We've got lots of problems at Wikipedia Chemistry: you all know that! Let me just tell you about one of them…
We have roughly 25,000 structure diagrams of molecules at Wikimedia Commons – that's my guesstimate, as we don't even have exact statistics on the number – and the only way we can find out what we have is to index them by hand. The vast majority of these structure diagrams were created using specialized software, the kind that KNOWS that a carbon atom doesn't have to be labelled with a "C" and that it usually has a valency of four. But then they were exported to an image file, and all the chemical information was lost.
Now 25,000 structure images is only a tiny fraction of the chemical structure diagrams that are out there on the web, but you can be certain that all of them have been shorn of their chemical relevance for any but trained human eyes. To a search engine, or a spider, they are simply .jpg or .gif or .png…
What we would like is for such images to carry embedded metadata that identifies the compound to non-human readers. Let's start by suggesting both flavours of InChI (standard and non-standard) and the hashed key of each (the InChIKey). We would like this metadata to be added automagically by the software that creates the image in the first place: after all, it is the very same data that allowed the image to be created!
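To make the idea concrete, here is a rough sketch of what embedding could look like for one format, PNG, which allows keyword/value text chunks (`tEXt`) inside the file. This is only an illustration under assumptions, not a proposed standard: the keywords `InChI` and `InChIKey` are invented here and are not registered PNG keywords, the function names are hypothetical, and real drawing software would write the chunks at export time rather than patch the file afterwards. Methane is used as the example compound.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def add_text_chunks(png: bytes, metadata: dict) -> bytes:
    """Return a copy of the PNG with one tEXt chunk per metadata entry.

    The keywords are whatever the caller passes in; "InChI" and "InChIKey"
    below are assumptions for illustration, not an agreed convention.
    """
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    # IHDR must be the first chunk and is always 13 data bytes long:
    # 8 (signature) + 4 (length) + 4 (type) + 13 (data) + 4 (CRC) = 33.
    ihdr_end = 33
    chunks = b"".join(
        make_chunk(b"tEXt", key.encode("latin-1") + b"\x00" + value.encode("latin-1"))
        for key, value in metadata.items()
    )
    return png[:ihdr_end] + chunks + png[ihdr_end:]


def read_text_chunks(png: bytes) -> dict:
    """Walk the chunk list and collect every tEXt keyword/value pair."""
    found = {}
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        length, chunk_type = struct.unpack(">I4s", png[pos:pos + 8])
        data = png[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            key, _, value = data.partition(b"\x00")
            found[key.decode("latin-1")] = value.decode("latin-1")
        pos += 12 + length  # length field + type + data + CRC
    return found


def tiny_png() -> bytes:
    """A 1x1 greyscale PNG, standing in for a real structure diagram."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


if __name__ == "__main__":
    methane = {
        "InChI": "InChI=1S/CH4/h1H4",
        "InChIKey": "VNWKTOKETHGBQD-UHFFFAOYSA-N",
    }
    tagged = add_text_chunks(tiny_png(), methane)
    print(read_text_chunks(tagged))
```

A spider that understood this convention would only need the equivalent of `read_text_chunks` to identify the compound, without ever having to interpret the picture itself. JPEG and GIF would each need a different mechanism, which is part of the difficulty.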
Sounds simple? Of course it isn't! Wikipedia chemists have had various contacts with chemical software companies over the last two years, but it seems that we're still no nearer the goal. One problem is that not all file formats can easily support metadata: we will, in effect, have to create not just one new industry standard but three or four! Another problem is how to treat reaction schemes, molecular fragments and the like – let's leave that to one side for the minute, and concentrate on how this could work for discrete molecules.
At Wikipedia, we have amassed a certain amount of experience in the online publishing of chemical data, especially images. This experience is freely available to anyone who wishes to tackle the problem of chemical metadata in structure diagrams, so long as the solution is one that is open to the rest of the chemical community. Any takers?