Access for Tax-Funded Research Bill in US Senate

June 29, 2009

(From: GenomeWeb News, 29 June 2009)


The US Senate will consider a bill that would require that researchers funded by a number of government agencies submit electronic versions of their papers within six months after they have been published in peer-reviewed journals.

The legislation, which was introduced yesterday, is similar to the open access policy already in place at the National Institutes of Health, and would apply to federal agencies with annual extramural research budgets of $100 million or more.

The new bill, the Federal Research Public Access Act, introduced by Senator Joseph Lieberman (I – Conn.) and co-sponsored by Sen. John Cornyn (R – Texas), would require that researchers funded wholly or partially by these agencies submit electronic copies of their manuscripts, which would be preserved in a digital repository and made available for public access. Each agency affected by the law would either maintain its own repository or select a suitable one that permits free public access, interoperability, and long-term preservation.

The policy would not apply to materials such as lab notes, preliminary data analyses, author notes, phone logs, or other information used in the manuscript.

It would cover the research funded by the Department of Energy; Department of Health and Human Services; National Science Foundation; Department of Agriculture; Department of Commerce; Department of Defense; Department of Education; Department of Homeland Security; Department of Transportation; Environmental Protection Agency; and the National Aeronautics and Space Administration.


Progress report

June 26, 2009

Recap of CWA goals and ambitions:

  • CWA is to focus initially on the Life Sciences.
  • CWA will largely build on existing institutions as a not-for-profit organization with tax-exempt status in all regions. It will be organized in such a way as to allow for funding to be routed through the CWA to its partners and other contract research collaborators.
  • CWA aims at forming an ever-growing Scientific Alliance starting with the faculty already committed after the inaugural meeting.
  • A specific task group has been formed to discuss strategies to grow the Scientific Alliance and propose strategies for maintaining and growing scientific credibility.
  • CWA will mainly serve as a coordinating and administrative body to further the stated goals.
  • CWA will have a Steering Committee elected by participating Founding Members and an Executive Committee with a mandate from the steering committee to run daily operations.
  • CWA aims at having at least one physical location on each continent, preferably, and where possible, associated with a leading university or organization that is also a Founding Member.
  • Founding and Associate member organizations (the latter are organizations joining CWA after its initial establishment) will be allowed to grant eligible faculty members a formal CWA affiliation in addition to their university or institution affiliations.
  • The Steering Committee may, at its discretion, also grant formal affiliation to individual scientists whose home institution is not a CWA member.
  • A dedicated set of working groups (and sub-task groups) has been formed.

Road map and operational process

During its inaugural meeting, the CWA core group decided to delegate preparation and execution of next steps to the Executive Committee under supervision of the Steering Committee. An inventory has been produced of all activities in need of attention and discussion. Based upon that inventory, approximately 20 task groups have been formed, grouped into three major categories:

Category 1: organizational and operational
1.1    Governance
1.2    Policies
1.3    Organizational structure
1.4    Legal and licensing
1.5    Capacity building
1.6    Commercialization / valorization
1.7    Sustainability / public fundraising
Category 2: Scientific and Technical
2.1    Content capture
2.2    Tool development
2.3    Storage and maintenance
2.4    Unified persistent ID
2.5    Quality control
2.6    Triple model / format
2.7    Attribution (micro- and nano-credits)
2.8    Content acquisition
2.9    Triple Browser/reasoning
2.10  Multilingual issues
Category 3: Strategy/Advocacy
3.1    Scientific credibility
3.2    WikiProfessional
3.3    Advocacy
3.4    Conference 2010

For each category and task group, timelines, deliverables and milestones have been proposed, and the Executive Committee has been put in charge of managing the process, with a final delivery date of December 31, 2009. Chairs of these groups (selected from initial signatories who have expressed an interest in actively participating) have been invited, and they will be responsible for both the timing and the deliverables of their task group. Each task group will hold a number of (virtual) meetings, resulting in a proposal per task group, each of which will be “peer reviewed” by experts before it is delivered to the Executive Committee. The Executive Committee will summarize all proposals and present them to the Steering Committee for guidance, approval and execution. It is expected that the first of these task group meetings will be held imminently.

The commercialization and valorization working group is tasked with delivering a plan (due September 2009) on how to build up CWA’s ‘trusted party’ role towards both the public and the private sector, while building the structure and partnerships needed to serve commercial users with realizable services based on the ‘triple store’, thereby contributing to the long-term sustainability of CWA.

“Ensuring Persistence” (conference)

June 18, 2009

The 2009 International DOI Foundation open meeting will be held in San Francisco on 7th October, following the successful series of annual meetings in previous years. Participants are welcomed from a wide range of communities and interests.


Vocabulary Mapping Framework

June 18, 2009

The DOI foundation just alerted us to this:

A new initiative, the Vocabulary Mapping Framework (VMF), has been announced by a consortium of partners. This will create an extensive and authoritative mapping of vocabularies from nine major content metadata standards, creating a downloadable tool to support interoperability across communities. The mapping will also be extensible to other standards. The work builds on the principles of interoperability established in the indecs Content Model, and is an expansion of the existing RDA/ONIX Framework into a comprehensive vocabulary of resource relators and categories, which will be a superset of those used in major standards from the publisher/producer, education and bibliographic/heritage communities.

For further information see: VMF project announcement, June 15, 2009 (pdf download) and DOI News.

There is something fascinating about science*

June 10, 2009

*Draft of an article on concepts and triples, written for relative non-specialists, to appear soon in the journal Serials, which has a broad audience of librarians and publishers.

“There is something fascinating about science. One gets such wholesale returns of conjecture out of such a trifling investment of fact.” (Mark Twain, Life on the Mississippi).

Mark Twain was very witty. But he lived in a different age, and his witticism would certainly not apply to modern science. The problem of today is precisely the opposite: far too few returns in terms of usable knowledge out of such an overwhelming investment of fact. That being the case, we are running up against two problems. The first is that a lot of information is deeply hidden, and the second is that information is increasingly and overwhelmingly abundant, so that ‘connecting the dots’ is ever more difficult, because there are so many dots to connect. This is rather a paradoxical situation. Wouldn’t we actually increase the information over-abundance were we to find all the relevant information that exists? Of course we would. However, limiting information, or ceasing to gather facts, cannot be a solution. We must approach it from the other side, by finding ways to deal with the over-abundance of information. We must think of the situation not as information overload, but perhaps as something like ‘organizational underload’ of information – a lack of sufficient conceptual structure. We have to do this, because the amount of relevant information is not going to diminish. Instead, its abundance and availability are bound to increase, quite possibly exponentially. Not only in terms of the peer-reviewed literature, but also in terms of raw data, and of informal literature, such as science blogs, wikis, and the like. A kind of Moore’s law (after Gordon Moore, the co-founder of Intel) seems to be at work, with the amount of scientific information doubling every so often (Katy Börner, in a 2006 chapter, quotes ‘Scholtz, J. 2000. DARPA/ITO Information Management Program Background’ – I haven’t been able to locate Scholtz’s article itself – as reporting a doubling of the volume of data every 18 months in some areas, and that was in 2000).

When Homo sapiens was at the beginning of his evolutionary development, he didn’t really have any use for water other than to drink it, and perhaps swim in it. Large bodies of water, such as lakes or seas, effectively formed the end of his range. He had to wait until a bright spark developed rafts and boats, and then he started using water also to navigate and go to places hitherto unreachable. The rest is history. Empires were built that way and the world was conquered. My sense is that we are at a similar point in our evolutionary development with regard to information. Now, we mainly take it in by reading and consulting databases (the equivalent of drinking in the water analogy), and we haven’t yet got the means to ‘navigate’ the existing information effectively. We are making good progress with searching. But searching, although often called ‘navigating’, isn’t the same thing and has its own drawbacks. After all, if you want to search, you must already know what for, and you have to formulate your search argument. Finding what you didn’t even know you should be searching for – a form of serendipity, if you wish – is hardly given a chance, even though serendipitous findings are often the stuff of scientific breakthroughs. True navigation is different. It is about using information to ‘carry’ you from one place to the next. It is about discovering that connecting information that seems far apart, semantically speaking, can lead to insights otherwise extremely difficult to attain.

In order to be able to navigate ‘oceans of information’ we must first release information from its silos and make it compatible. One of the ways in which that can be done is by breaking information down into its smallest elements: semantic triples. Briefly, triples consist of three elements (as the name implies): a source concept, a target concept, and a relationship between the two. The general format is this: <concept1><relationship><concept2>, for instance, <this article><is published in><the journal Serials>. This example also makes clear that the relationship can be directional (though it isn’t always). The triple <this article><is published in><the journal Serials> is not the same as <the journal Serials><is published in><this article>, should the latter be a valid statement in the first place (in the event, it is a nonsensical one). A non-directional triple is, for instance, <triple><co-occurs in the same sentence with><statement>, which would be valid in either direction. The relationship between the concepts in the latter example is of course much vaguer than the relationship in the first example. Though I may have made triples look simple here, they aren’t quite that simple. In order to make vague triples – any triples – more meaningful, you need to qualify them. Triples have attributes, such as provenance, date and time, conditions (e.g. true in certain circumstances only), identifiers for the concepts, and potentially many others. We can, for instance, indicate where the sentence occurs from which the triple was mined, who wrote it, when, et cetera. If the same conceptual triple occurs in the literature often enough, and is written down by enough different people, then there may well be added significance in the numbers.
Another reason to qualify triples and to add provenance labels, date labels, digital identifiers, and the like, is the desirability of crediting, or at least acknowledging, the author (and journal, publisher, database, et cetera), as the whole social fabric of scientific knowledge depends on acknowledgement.
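The idea of a qualified triple – the bare statement plus attributes such as provenance, date and conditions – can be sketched in a few lines of code. This is only an illustration: the class names, field names and field values below are invented for the example and do not correspond to any particular triple-store standard.

```python
# A minimal sketch of a qualified triple: the core statement plus
# attributes such as provenance and date. All names are illustrative.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str

@dataclass
class QualifiedTriple:
    triple: Triple
    source: str        # where the sentence was mined from
    author: str        # who wrote it
    date: str          # when
    conditions: list = field(default_factory=list)  # e.g. "in vitro only"

t = Triple("this article", "is published in", "the journal Serials")
q = QualifiedTriple(t, source="Serials (hypothetical issue)",
                    author="A. Author", date="2009")

# Counting how often the same conceptual triple occurs across sources
# gives the "added significance in the numbers" mentioned above.
occurrences = [q.triple,
               Triple("this article", "is published in",
                      "the journal Serials")]
print(occurrences.count(t))  # prints 2: both statements assert the same triple
```

Because the core triple is a frozen (hashable, comparable) value, identical statements from different sources can be tallied while their differing provenance attributes are kept alongside.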

I mentioned ‘semantic triples’ above, but the examples I gave do not yet demonstrate what semantic means in this context. The triple <this article><is published in><the journal Serials> could also be written as <this paper><appears in><Serials>. The meaning would be exactly the same. As word triples they are different, because they use different words, but as semantic triples they are identical, because they mean the same. In order to get semantic triples, the words need to be disambiguated: synonyms and homonyms need to be resolved. Synonyms are relatively easy to deal with. The difficulty lies in disambiguating homonyms, which requires careful analysis of the context. ‘Paper’ could mean ‘article’, but it could also mean ‘newspaper’, or the material on which either is printed. ‘Serials’ could mean the journal, or periodicals in general, and quite possibly other things as well. Disambiguating homonyms is sometimes nigh impossible with automated contextual analysis alone. Imagine an article about turkey farming in Turkey, or a piece describing the little statuette of a jaguar on the bonnet of a Jaguar. In cases like these, human intervention is needed. If one wants to get a reasonable overview of a field of knowledge via the use of triples, one needs a reasonable degree of disambiguation and conversion into semantic triples before drawing any conclusions at all, though some fuzziness is always likely to remain in the results. Some fuzziness perhaps doesn’t matter too much, as long as one is aware that reasoning with triples is just a tool – albeit an important one – to discover new knowledge and to identify the most promising areas of research, those with the best chance of yielding significant insights; it is not the new knowledge itself.
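The synonym side of this can be illustrated with a toy lookup: two different word triples mapping onto one and the same semantic triple. The concept identifiers and the synonym table below are invented for the example; a real system would draw on curated vocabularies, and would also need contextual analysis for homonyms, which this sketch deliberately ignores.

```python
# Sketch: turning word triples into semantic triples by mapping
# synonymous words onto canonical concept identifiers. The identifiers
# and the table are invented for illustration only.

CANONICAL = {
    "this article": "C:ARTICLE-1",    "this paper": "C:ARTICLE-1",
    "is published in": "R:PUBLISHED-IN", "appears in": "R:PUBLISHED-IN",
    "the journal Serials": "C:SERIALS",  "Serials": "C:SERIALS",
}

def to_semantic(word_triple):
    """Map each element of a word triple to its canonical concept."""
    return tuple(CANONICAL[w] for w in word_triple)

a = to_semantic(("this article", "is published in", "the journal Serials"))
b = to_semantic(("this paper", "appears in", "Serials"))
print(a == b)  # prints True: different word triples, one semantic triple
```

Once statements are reduced to canonical identifiers like this, counting, linking and reasoning over them no longer depends on which of many synonymous phrasings an author happened to use.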

What does this all mean for publishing? There are currently at least two areas in which publishers can take advantage of the semantic triple methodology. First, there is the actual publishing of triples. The material that is being published, in journals, books, and databases, can be turned into triples, as long as it is available in electronic form, of course. Such triple collections can be valuable complementary formats, enabling the research community to potentially get more out of the published knowledge. The currently prevailing formats – print, pdf, and even html – still require the material to be read. Of course, pdf and html make distribution easier and cheaper than print, and html also increases findability enormously, especially in combination with metadata, but the actual usage of published information remains, on the whole, limited to traditional methods of taking the knowledge in at source. Meta-analyses are being carried out in some areas, but they are usually text-based, confined to narrow fields of research, and their usefulness suffers from the major drawback presented by the widespread ambiguity of most texts. Were the literature available in the form of triples, however – in particular semantically disambiguated triples – it could be put to much wider use: not only for meta-analyses, but also for pointing to – and linking to – topics of research that are more promising than others, or topics that might easily be missed if the literature is only taken in by reading, and for generating new hypotheses. Properly constructed triples that adhere to common models, such as rich RDF (Resource Description Framework), and best practices, and that are disambiguated, delivered and packaged in a convenient way, are potentially worth significant amounts.

Secondly, the ability to match concepts and triples with high precision is also very useful in an early manifestation of this semantic triple methodology: helping readers of scientific articles to find new and less obvious information related to what they are studying. The technology already exists to semantically index pages on the fly and to recognise the concepts in the text – not just keywords, but concepts, so that e.g. the word ‘cancer’ is recognised as the concept ‘malignant neoplasm’, the preferred term in the Unified Medical Language System (UMLS). These pages can subsequently be made to highlight, in one form or another, the concepts recognised, and the concepts, when clicked, can be made to show a whole host of other relevant information. The obvious example is showing synonyms (as with cancer and malignant neoplasm – but really greatly helpful in the case of, for instance, proteins and genes, which often have many synonyms). Beyond that, there could be definitions, equivalents in other languages, links to literature (books, journal articles), links to experts or other researchers on the topic, and links to highly specific laboratory materials (such as antibodies, cell lines or fungus cultures), deep into suppliers’ catalogues, removing a lot of bother for the researcher.
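A bare-bones version of such concept recognition might look like the sketch below. The two-entry CONCEPTS dictionary is a stand-in for a full vocabulary like the UMLS; its contents and the function name are illustrative only, and real on-the-fly indexers are of course far more sophisticated than whole-word matching.

```python
# Toy sketch of concept recognition in a page of text: keywords are
# mapped to preferred concept terms, and each hit carries extra
# information (synonyms here) that a reader-facing tool could display.
# Everything in CONCEPTS is a made-up stand-in for a real vocabulary.

import re

CONCEPTS = {
    "cancer": {"preferred": "malignant neoplasm",
               "synonyms": ["cancer", "malignancy"]},
    "p53":    {"preferred": "tumor protein p53",
               "synonyms": ["p53", "TP53"]},
}

def recognise(text):
    """Return (keyword, concept-record) pairs found in the text."""
    hits = []
    for word, record in CONCEPTS.items():
        if re.search(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE):
            hits.append((word, record))
    return hits

page = "The role of p53 in cancer is well studied."
for word, record in recognise(page):
    print(f"{word!r} -> concept {record['preferred']!r}")
```

In a reader-facing deployment, each recognised span would become a highlight whose pop-up shows the preferred term, synonyms, and onward links, rather than a printed line.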

The semantic approach also makes it possible to link to other, related concepts (including the nature of the relationship) and, interestingly for publishers, to the places elsewhere in their portfolios where the concept in question, or related concepts, occur. Apart from convenience for the user, this also increases the likelihood that the user stays on the publisher’s site longer, which must be good from a publisher’s point of view. All these instances re-introduce a measure of serendipity into the scientific literature – a way of stumbling upon connections that are not, and would never be, obvious. And yet, as already mentioned above, serendipitous findings are the stuff of unexpected insights, even breakthroughs, in science. Serendipity, and the chance of finding what you didn’t even realise you were looking for, is lost to an extent with the advent of search functionalities, which require that you at least formulate the search argument. This nudges you in the direction of seeking out more of what you already know – a form of homophily, if you wish, the ‘birds of a feather’ syndrome – which impedes out-of-the-box thinking, or at least makes it more difficult. Semantic concept technology may not quite be the search engine that presents you with exactly the opposite of what you are looking for – though that might be extremely interesting and beneficial for research – but it certainly has the potential to take you into areas where you would not normally go to discover knowledge.

There are other reasons why a publisher might deploy this technology. The embedding of highlights and links, which can be made invisible until the reader moves the cursor over them, makes it possible to have a wealth of information at hand without disturbing the user’s reading experience, and without the need to leave the page and re-type the concept or keyword in the search box of some other site. More knowledge is ‘served up’, as it were, in the page the user is already reading, increasing the information density of a scientific text without making it unreadable, and effectively transforming traditional texts into efficient gateways to other relevant information.

The examples mentioned are just the beginning. The over-abundance of information will increasingly force us to find ways around having to read all relevant scientific articles. Paradoxically, the role that publishers can – and ought to – play is one of helping researchers who go out of their way to avoid reading scientific articles to do just that: get the essence of scientific information without having to read too many full articles.

Jan Velterop