This story was originally written in Portuguese, and published to the website of Revista Questão de Ciência. It appears here with permission.
“I have only seen this far because I have stood on the shoulders of giants.” This phrase, generally attributed to the English polymath Isaac Newton (1642-1726) – himself one of the greatest giants in the history of science – illustrates well what we might call a “knowledge production chain”, in which previous discoveries pave the way for new advances.
In modern science, this usually happens through the publication of peer-reviewed articles in indexed scientific journals. When later studies cite these articles, they are called “references”, which are essential when the author of a study makes a claim based on something they did not do, observe or verify through experiments, for example.
Until recently, these references were basically “ink on paper” – newspapers, magazines, books or other editorial products, generally stored and accessed in libraries. The advent of digital technologies and the Internet, however, has radically changed this. Not only have new publications migrated to digital, but old references have also begun to be digitised and made available on the Internet.
This process, however, is far from perfect, as shown in a recent study published (digitally) in the Journal of Librarianship and Scholarly Communication. In it, Martin Paul Eve, Professor of Literature, Technology and Publishing at Birkbeck College, University of London, analysed the preservation and digital accessibility of a sample of around 7.5 million articles referenced by the Crossref indexing service through its DOI (digital object identifier) system.
A DOI is a persistent identifier for documents or other online files that resolves to an internet address, and it allows the final destination to be changed. Thus, for example, if a publisher of scientific journals goes bankrupt or is bought by another and its internet address disappears, the articles published by it will continue to be accessible in other repositories using the same DOI address generated at the time of original publication.
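The indirection described here can be sketched in a few lines of Python. This is only a toy model, not how the DOI system is actually implemented: the `DoiRegistry` class, the DOI `10.1234/example.5678` and both URLs are hypothetical and exist purely for illustration.

```python
from typing import Dict, Optional


class DoiRegistry:
    """Toy stand-in for a DOI resolver: a fixed identifier maps to a changeable URL."""

    def __init__(self) -> None:
        self._targets: Dict[str, str] = {}

    def register(self, doi: str, url: str) -> None:
        """Record the original location of a document."""
        self._targets[doi] = url

    def redirect(self, doi: str, new_url: str) -> None:
        """Point an existing DOI at a new location, e.g. an archived copy."""
        if doi not in self._targets:
            raise KeyError(f"unknown DOI: {doi}")
        self._targets[doi] = new_url

    def resolve(self, doi: str) -> Optional[str]:
        """Return the current location, or None if the DOI is unknown."""
        return self._targets.get(doi)


registry = DoiRegistry()
doi = "10.1234/example.5678"  # hypothetical identifier

# The publisher registers the article at its own address.
registry.register(doi, "https://publisher.example/articles/5678")
original = registry.resolve(doi)

# The publisher disappears; the same DOI is pointed at an archived copy.
registry.redirect(doi, "https://archive.example/preserved/5678")
current = registry.resolve(doi)

print(original)
print(current)
```

The citation never changes, only the target behind it. The redirect, however, is only possible if an archived copy exists somewhere to point to.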
Eve’s survey revealed that just over 2 million of the articles in his sample – 27.64%, or more than one in four – did not have digital copies in the main repositories of these types of work. In other words, more than a quarter of the scientific production that uses the Crossref system is at risk of disappearing or having its access hindered in the event of problems with the original digital storage, that is, becoming “ghost DOIs”.
The challenges of digital preservation
Digital preservation involves several activities, ranging from the production of digital documents themselves to maintaining their availability for the necessary time. However, in the case of scientific literature, this time is ideally indefinite, given the need to preserve the chain of knowledge so that claims can be checked and verified. This is also why there is a need for extra copies of the works, with storage in “dark archives”, such as CLOCKSS (Controlled Lots of Copies Keeps Stuff Safe), LOCKSS (Lots of Copies Keeps Stuff Safe) and Portico, among others. These initiatives store hidden copies of academic materials that can be retrieved and become new destinations for DOI addresses.
One problem is that there is no consensus on who or which institutions should be responsible for preserving scientific literature in the digital age. Eve cites some studies that assume that this continues to be the responsibility of academic libraries, just as it was when libraries physically stored works. In fact, he points out, the LOCKSS system operates on a network whose nodes are academic libraries.
On the other hand, it is in the interest – and therefore the responsibility – of academic publishers to ensure that “their” content is preserved, as well as the legacy of copyright transfer on which the subscription-based access model, still predominant in scientific communication, depends. So much so that Eve points out that, according to the terms of the DOI system user agreement, the members (the publishers, among other content producers) undertake to:
make their best efforts to contract with an external archive or other content repository (an ‘Archive’) … so that this Archive preserves the member’s content and, in the event that the member ceases to store its content, to ensure that this content continues to be available via the permanent link.
Literature at risk
In light of this, Eve’s research focuses on the digital preservation policies and actions of the members of the Crossref DOI system. To this end, he created a scale centered on redundancy, in which members who had at least 75% of their content digitally preserved in three or more of the main dark archives achieved a “gold” standard; those who had at least 50% of their content stored in two or more of these “dark archives” received a “silver” standard; those who had at least 25% of their content in one or more of them received a “bronze” standard; and all members who did not fit into any of these categories were “unclassified”.
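As a rough illustration, the scale can be written as a small Python function. This is a sketch under one reading of the thresholds, not Eve's actual code: the function name `classify_member` and the input format (one archive count per sampled document) are assumptions made here.

```python
from typing import Sequence


def classify_member(archive_counts: Sequence[int]) -> str:
    """Classify a member on the gold/silver/bronze scale.

    archive_counts[i] is the number of dark archives in which the member's
    i-th sampled document was found (an assumed input format).
    """
    total = len(archive_counts)
    if total == 0:
        return "unclassified"

    def share_with_at_least(n: int) -> float:
        # Fraction of the member's documents held in n or more archives.
        return sum(1 for count in archive_counts if count >= n) / total

    if share_with_at_least(3) >= 0.75:
        return "gold"
    if share_with_at_least(2) >= 0.50:
        return "silver"
    if share_with_at_least(1) >= 0.25:
        return "bronze"
    return "unclassified"


# Four sampled documents per member, with the number of archives holding each.
print(classify_member([3, 3, 4, 0]))  # three of four documents in 3+ archives
print(classify_member([2, 2, 0, 0]))  # half of the documents in 2+ archives
print(classify_member([1, 0, 0, 0]))  # a quarter of the documents in 1 archive
print(classify_member([0, 0, 0, 0]))  # nothing found in any archive
```

The tiers are checked from strictest to loosest, so a member that qualifies for more than one receives the highest.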
Eve then collected samples of documents with a DOI from the more than 20,000 members of the Crossref system – reaching a thousand documents in the case of the richest in content, and proportionally fewer in the smallest ones – totalling 7,438,037 documents. With the help of an automated system, Eve searched for these papers in a selection of the main dark archives, including CLOCKSS, LOCKSS and Portico, plus the Brazilian Rede Cariniana, HathiTrust, the Internet Archive/FATCAT, the Public Knowledge Project PLN and the Scholars Portal.
By cross-referencing these data, Eve found that only 0.96% of Crossref members (204) were observed preserving more than 75% of their content in three or more of the archives consulted, achieving the “gold” classification. A slightly higher proportion, 8.5% (1,797) preserved more than 50% of their content in two or more archives, being classified as “silver”, and just over half – 57.7% (12,257) – achieved the minimum level of preservation, “bronze”, with 25% of their material stored in a single archive. Almost a third of Crossref members (6,982, or 32.9%), however, did not maintain any digital preservation action, going against the recommendations of the Digital Preservation Coalition.
As for the nearly 7.5 million documents themselves, the survey detected nearly 6 million “preservation instances”, a term that denotes the number of copies stored. Thus, an article preserved in three archives has three “preservation instances”. Treating the documents separately, just over 4.3 million of the articles in the sample (58.38%) had at least one “preservation instance”, leaving 2 million works (27.64%) apparently without any preservation efforts. The remaining 13.98% were excluded from the survey because they were published too recently, were not academic articles, or lacked sufficient metadata to have their sources identified.
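The arithmetic behind these figures can be checked with a few lines of Python. The percentages and the sample size are the ones reported above. The document counts are derived from rounded percentages and are therefore approximate, and the labels and the helper `approx_count` are introduced here only for illustration.

```python
SAMPLE_SIZE = 7_438_037  # documents in the sample, as reported

# Shares reported in the article, in percent.
shares = {
    "at least one preservation instance": 58.38,
    "no preservation found": 27.64,
    "excluded from the survey": 13.98,
}


def approx_count(percent: float, total: int = SAMPLE_SIZE) -> int:
    """Convert a reported percentage into an approximate document count."""
    return round(total * percent / 100)


counts = {label: approx_count(pct) for label, pct in shares.items()}
total_percent = sum(shares.values())

for label, n in counts.items():
    print(f"{label}: about {n:,} documents")
print(f"shares add up to {total_percent:.2f}%")
```

The three shares account for the whole sample, and the derived counts match the article's “just over 4.3 million” preserved and “2 million” unpreserved documents.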
“Our entire epistemology of science and research depends on a chain of footnotes. If you can’t verify what someone else said at a given time, you’re just blindly trusting things you can’t read for yourself,” lamented Eve, in an interview with Nature magazine.
For the University of London professor, his findings also call into question the academic culture of “publish or perish” – which could well be replaced by “publish and perish”: “Everyone thinks about the immediate gains that come from having a paper published somewhere, but what we should really be thinking about is the long-term sustainability of the research ecosystem. After you’ve been dead for 100 years, will people be able to access the things you worked on?”
The post Scientific publication is now fully digital – so who is responsible for preserving our archives? appeared first on The Skeptic.
Now that journals have moved away from paper publication, our access to our ongoing history of discovery and innovation relies entirely on digital archives