Wednesday, 20 May 2009

Attribution and moral rights

I was a little snide about the Science Commons protocol yesterday, so double thanks to John Wilbanks for getting back to me so promptly about my queries. There is one point in which the protocol is definitely novel compared with other open-access declarations: unfortunately, it is also the project's Achilles Heel.

The “Protocol for Implementing Open Access Data” states in its point 4.1:
to facilitate data integration and open access data sharing, any implementation of this protocol […] MUST NOT apply any obligations on the user of the data or database such as “copyleft” or “share alike”, or even the legal requirement to provide attribution. Any implementation SHOULD define a non-legally binding set of citation norms in clear, lay-readable language.

I'll come back to the question of how much of a database can be copyrighted in a second – for the moment, we simply need to assume that some aspect of the database can be copyrighted. If that were not the case, there would be no need for protocols such as that prepared by Science Commons, or my "Protocol X" to go with the Panton Principles!

Now the justification for banning contractual requirements of attribution is that it raises the transaction costs for reusers of the data. Very true, users might have to record where they got their data from, which takes time and effort. So Science Commons recommends "that authors simply waive attribution, which does create legal certainty and provides freedom to operate to the data user."

Herein lies the problem. I am a Spanish resident, so all my works are covered by Spanish copyright law. Under Spanish law, I cannot "simply waive attribution": if I pretend to do so in a clause in a contract, that waiver will be null and void. The right to be identified as the author of a work is one of the "moral rights" of copyright law in civil law countries, and it is deemed to be a part of the author and so untransferable except mortis causa. Even after my death, my heirs will be able to insist on attribution of my works for seventy years. That's just Spanish law: if I lived in France (which I did for several years), my moral rights would be eternal (in theory, at least!).

I cannot honestly waive my legal rights to attribution, and the same goes for many, many scientists. If I were to pretend to do so, it would create exactly the sort of legal uncertainty that data-sharing protocols are meant to avoid. My rights to attribution are not enforceable in the United States unless I insist on them in a copyright license, but insisting on an attribution license is creating legal certainty worldwide: I am only licensing what I am able to license, I am not pretending to license something that my local law will not allow me to, nor binding my heirs when that is self-evidently impossible.

So which parts of a database can I claim attribution to? Moral rights are a part of copyright law (article 6-bis of the Berne Convention for the legal geeks), so we need to ask how much of a database can be considered to be covered by copyright (European database rights don't enter into this problem, thank your-favourite-Deity). The current international statement is contained in article 10.2 of the TRIPS agreement:
Compilations of data or other material, whether in machine readable or other form, which by reason of the selection or arrangement of their contents constitute intellectual creations shall be protected as such. Such protection, which shall not extend to the data or material itself, shall be without prejudice to any copyright subsisting in the data or material itself.
To put it in layman's terms, the author of a database only has copyright over those parts that he or she has actually thought about and considered, not the bits which relate to simple data entry. Unfortunately, this "threshold of originality", as it is technically known in copyright law, varies from jurisdiction to jurisdiction! To simplify things outrageously, the threshold is lowest in the UK and highest in Germany, with the U.S. being somewhere in between. Obviously, nobody expects professional scientists to be expert international copyright lawyers, so it is best to set the protocol at the lowest threshold of originality, where pretty much everything except the facts themselves is subject to copyright.

What is the response of the Science Commons protocol to this question? It can be found in section 5.2, where the ban on "share-alike" clauses is justified:
a user would be able to extract the entire contents (to the extent those contents are uncopyrightable factual content) and republish those contents without observing the copyleft or share-alike terms.
I don't agree with the strength of the statement given in the Science Commons protocol, but I will accept that, in practice, that would be the case. If it were not the case, it would violate the pivotal concept that you cannot copyright the facts of nature: a few people would like to do that, but I still have confidence that judges would uphold a principle that has been generally accepted worldwide since at least 1883.

On the other hand, the Science Commons protocol falls into the trap of internal inconsistency. Share-alike clauses are banned because they are effectively unenforceable for databases, but attribution clauses are banned because it would cost far too much to comply with them, and because reusers might face actions for copyright infringement.

Everyone concerned with open-access publication has the greatest respect for the efforts that Creative Commons and its daughter projects have put in to providing legal security for our aspirations. This doesn't mean that mistakes are never made, as John Wilbanks himself points out. I don't wish to knock the efforts that have obviously gone into preparing the Science Commons protocol, merely to try to make it workable outside of the Fifty States.

Tuesday, 19 May 2009

Protocol X

I might have misunderstood the reference to "Protocol X" in the Panton Principles. If people want a real protocol, here is my contribution:

  • All data MUST be obtained in such a way that it can be communicated to other people if necessary, even if, by its nature or by local legal restrictions, it cannot be published.
  • All data, including "negative" data, MUST be recorded in a permanent form as soon as is reasonably possible. The record SHOULD, as a minimum, include the date on which the data was collected and (if different) the date on which it was recorded.
  • All data MUST be made available to external assessors, so long as any necessary conditions of confidentiality are met, and subject to local legal restrictions.
  • Publication of a scientific paper implies making all the data needed to prepare the paper available to external assessors, under conditions of confidentiality where necessary. It also implies making all non-confidential data available to the general public for any use.
  • It is implicit that scientific practice forbids the publication of partial data (and the concurrent interpretation) when the authors have data which would contradict that interpretation.
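
For anyone who prefers to see a protocol as a data structure, the points above can be sketched as a record type. This is only an illustration: the class name, the field names and the two audiences ("assessor" and "public") are assumptions made for the example, not part of any agreed protocol, and the access rule only models the situation after a paper has been published.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)  # frozen: a record is permanent once written
class DataRecord:
    """Hypothetical record following the Protocol X sketch above."""

    description: str
    collected_on: date
    recorded_on: Optional[date] = None  # None means "same day as collection"
    negative: bool = False              # "negative" data is recorded too
    confidential: bool = False

    @property
    def effective_recorded_on(self) -> date:
        """Date of recording, falling back to the collection date."""
        return self.recorded_on or self.collected_on

    def visible_to(self, audience: str) -> bool:
        """Assessors may see every record; the public only non-confidential ones."""
        if audience == "assessor":
            return True
        if audience == "public":
            return not self.confidential
        raise ValueError(f"unknown audience: {audience!r}")


if __name__ == "__main__":
    tar = DataRecord("brown-tar reaction product", date(2009, 5, 18), negative=True)
    print(tar.effective_recorded_on, tar.visible_to("public"))
```

Making the record immutable is a design choice that mirrors the "permanent form" requirement; corrections would be new records rather than edits to old ones.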

Panton Principles

Last week, Peter Murray-Rust of the University of Cambridge and Cameron Neylon of the University of Southampton met with some colleagues in a pub (the Panton Arms pictured, just round the corner from the Chemistry Department in Cambridge). I don't know if they sampled the IPA or the Abbot, but it must have been good, as they came up with a concise plan for open access science which has been baptised the "Panton Principles". The currently accepted statement of the Panton Principles is as follows:

  1. A simple statement is required along the forms of “best practice in data publishing is to apply protocol X”. Not a broad selection of licenses with different effects, not a complex statement about what the options are, but “best practice is X”.
  2. The purpose of publishing public scientific data and collections of data, whether in the form of a paper, a patent, data publication, or deposition to a database, is to enable re-use and re-purposing of that data. Non-commercial terms prevent this in an unpredictable and unhelpful way. Share-alike and copyleft provisions have the potential to do the same under some circumstances.
  3. The scientific research community is governed by strong community norms, particularly with respect to attribution. If we could successfully expand these to include share-alike approaches as a community expectation that would obviate many concerns that people attempt to address via licensing.
  4. Explicit statements of the status of data are required and we need effective technical and legal infrastructure to make this easy for researchers.
It might not be clear from a first reading – I am making this analysis based on the blog comments of the people involved – but there are two important points here that, up until now, have been stumbling blocks in the discussion of open scientific data:
  • The separation of the decision to publish from the question of open access to published data. Not all data can be published, for example data which identifies a specific person in clinical research. The scientific process knows how to deal with this, usually by making such data available to a couple of trusted outsiders (referees), on request and on the basis of confidentiality, and letting the referees vouch for its veracity or verisimilitude.
  • The idea that "best practices" might be different in different domains. This is related to the point above, but also allows a healthy diversity in approaches adapted to different circumstances. Does a chemist really have to run (and publish) an NMR spectrum of every brown-tar reaction product, or will a photo suffice?!
A third, more technical point is that the Panton Principles eschew "non-commercial" and "share-alike" restrictions on licences. I agree with the authors' arguments on this one, but I fear that we've not heard the last of that argument.

So, where now? PMR (as ever!) has launched a challenge: can we (scientists committed to open access science) condense this into a single paragraph that anyone can understand? Actually, PMR gives the example of the Budapest Open Access Initiative, which is neither a single paragraph nor really quite as comprehensible to mere mortals as all that… Mind you, this blog is always up for a challenge, so here goes:

This data has been obtained and made public for the benefit of Society as a whole. Anyone may use it for any purpose so long as the source is acknowledged.
These two sentences could be preceded by a reference to a Code of Practice from a learned society or funding body (e.g., the BBSRC data sharing policy), or could be completed with a reference to a specific licence, e.g. CC0 or the PDDL. And all of this needs to be fitted in with the parallel process at Science Commons, however much that process appears to be reinventing the wheel…

Wednesday, 13 May 2009

Wikipedia and CAS

*** OFFICIAL ***
The community of chemists who contribute to Wikipedia is happy to announce a novel collaboration with Chemical Abstracts Service, Inc. (CAS), a division of the American Chemical Society. Wikipedia is one of the top-ten sources of online information; CAS is an acknowledged world-leader in the provision of chemical information to professionals.

CAS has provided Wikipedia with access to some of its most widely used data – its CAS Registry Numbers®, which are recognized throughout the world as the most commonly used identifiers of chemical substances. The collaboration between CAS and Wikipedia provides a free and, more importantly, verified dataset of CAS Registry Numbers® for common substances for all users.

CAS has published 7800 CAS Registry Numbers®, along with a wide selection of synonyms for chemical names, on a new site, www.commonchemistry.org – Wikipedia will continue in its role as a source of information for a wide audience (including professionals). The links between the two complementary systems will help to ensure a high quality of data for users of both sites, and both sides hope that their number will increase over the coming months.

The work leading up to this formal announcement has been going on for more than a year now. During that time, Wikipedia chemists have been able to audit the accuracy of the chemical data we present: our error rate before correction was comparable with printed compendiums of chemical data. Our aim is not to be authoritative, but to present a snapshot of knowledge in an accessible manner. We are sure that this new collaboration will help us make that snapshot even more accurate.

Wednesday, 6 May 2009

The WikiChem challenge

We've got lots of problems at Wikipedia Chemistry: you all know that! Let me just tell you about one of them…

We have roughly 25,000 structure diagrams of molecules at Wikimedia Commons – that's my guesstimate, we don't even have exact statistics on the number – and the only way we can find out what we have is to index them by hand. The vast majority of these structure diagrams were created using specialized software, the kind that KNOWS that a carbon atom doesn't have to be labelled with a "C" and that it usually has a valency of four. But then they were exported into an image file where all chemical information was lost.

Now 25,000 structure images is only a tiny fraction of the chemical structure diagrams that are out there on the web, but you can be certain that all of them have been shorn of their chemical relevance for any but trained human eyes. To a search engine, or a spider, they are simply .jpg or .gif or .png…

What we would like is for such images to have embedded metadata to identify the compound to non-human readers. Let's start by suggesting both flavours of InChI and both their hashed keys. We would like this metadata to be added automagically by the software that creates the image in the first place: after all, it is only the very same data that allowed the image to be created!

Sounds simple? Of course it isn't! Wikipedia chemists have had various contacts with chemical software companies over the last two years, but it seems that we're still no nearer the goal. One problem is that not all file formats can easily support metadata: we will, in effect, have to create not just one new industry standard but three or four! Another problem is how to treat reaction schemes, or molecular fragments etc – let's leave that to one side for the minute, and concentrate on how this could work for discrete molecules.
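
To make the idea concrete for one format that does support metadata, here is a minimal sketch that stores an InChI and its InChIKey as PNG text chunks and reads them back. It uses only the Python standard library. The keyword names "InChI" and "InChIKey" are an assumption for the sake of the example (no such convention has been agreed), the helper names are hypothetical, and methane stands in for a real structure diagram.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Serialise one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png: bytes):
    """Yield (type, data) for every chunk in a PNG byte string."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def add_text_chunks(png: bytes, fields: dict) -> bytes:
    """Return a copy of the PNG with one tEXt chunk per field, placed before IEND."""
    out = [PNG_SIGNATURE]
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"IEND":
            for key, value in fields.items():
                # tEXt payload: Latin-1 keyword, NUL separator, Latin-1 text
                payload = key.encode("latin-1") + b"\x00" + value.encode("latin-1")
                out.append(_chunk(b"tEXt", payload))
        out.append(_chunk(chunk_type, data))
    return b"".join(out)


def read_text_chunks(png: bytes) -> dict:
    """Collect every tEXt chunk into a {keyword: text} dictionary."""
    found = {}
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            key, _, value = data.partition(b"\x00")
            found[key.decode("latin-1")] = value.decode("latin-1")
    return found


def tiny_png() -> bytes:
    """A 1x1 greyscale PNG, here only so that the example is self-contained."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one grey pixel
    return PNG_SIGNATURE + _chunk(b"IHDR", ihdr) + _chunk(b"IDAT", idat) + _chunk(b"IEND", b"")


if __name__ == "__main__":
    methane = {"InChI": "InChI=1S/CH4/h1H4", "InChIKey": "VNWKTOKETHGBQD-UHFFFAOYSA-N"}
    tagged = add_text_chunks(tiny_png(), methane)
    print(read_text_chunks(tagged))
```

A spider that finds such a file only needs the equivalent of read_text_chunks to learn which compound is drawn. The harder part, as noted above, is getting the drawing software to write the chunks at export time, and finding an equivalent slot in the other image formats.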

At Wikipedia, we have amassed a certain amount of experience in the online publishing of chemical data, especially images. This experience is freely available to anyone who wishes to tackle the problem of chemical metadata in structure diagrams, so long as the solution is one that is open to the rest of the chemical community. Any takers?

Sunday, 19 April 2009

A little light relief

The video is nearly three years old now, but it's a wonderful parody: thanks to the hard-working team at Nature Chemistry/The Sceptical Chymist for sacrificing part of their Friday afternoon to rediscover it… and of course to Scott Johnsgard Jr. for creating the video!

Saturday, 18 April 2009

New Tool from ChemSpider

Wikipedia has been working with ChemSpider for some time now, and they have just launched a new feature on their site to help Wikipedia editors:



We have the Wikipedia article lead in thousands of records on ChemSpider now. They are updated regularly as Wikipedia itself expands. One of the areas we have been focused on since the inception of the work was getting correct structures in place with the associated data. This includes the molecular formula, molecular weight, SMILES, InChI String, InChIKey, systematic name and so on. In order to help the process of expanding Wikipedia with new records and to provide a lot of these data automatically we have set about providing a Wikipedia Service so that Wikipedians can use ChemSpider as the source of the chemical structures of interest and generate the DrugBox and ChemBox content from ChemSpider. It’s a rather simple process…
(ChemSpider Blog, Mar 2009)



The way it works is that you go to a ChemSpider page (like this one on musk xylene), then click on the "Wikibox" link in the top right-hand corner. A new window opens with a facility for downloading a molecular structure image and some code which can be copy-pasted into a Wikipedia article to start the infobox on that compound.


The information which ChemSpider fills in automagically is fairly limited for the moment. This is because ChemSpider has similar data curation problems to Wikipedia – they need to be sure that their data is correct! In fact, the issue is one of the big points of our collaboration, but I won't shout about it too much in public until we have some concrete results rather than just good ideas.


As is often the case with our external collaborations, this new feature at ChemSpider raises as many issues for Wikipedia as it resolves. Should we just have it available in English, or should we translate the feature into other languages? How should the WikiProjects in other languages get involved with ChemSpider (if they're even interested)? What information should Wikipedia be providing in tabular format, and which should be explained in prose?


But for the time being, the new ChemSpider tool is certainly helpful, and a nice visible reminder of the important work that both sites do for the chemical information community as a whole.