The National Library of Spain (BNE) has announced that the full text of the public domain publications in its Digital Newspaper Library can be downloaded in open, free and reusable formats, and at no cost.
The institution's website has a page listing the public domain titles whose full text can be downloaded. The texts are produced by optical character recognition (OCR), so their quality may vary depending on the typeface and condition of the original document.
The texts can be used freely for analysis, processing or reuse
More than 2,000 press titles in the public domain
The Digital Newspaper Library holds thousands of press titles, more than 2,000 of them in the public domain, whose issues are now offered as downloadable files so that they can be used freely “for analysis, processing or reuse,” the BNE explains.
“Having these texts allows the application of natural language processing technologies and other new tools typical of the so-called digital humanities, whose use is increasingly widespread.”
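As a rough illustration of the kind of processing the BNE has in mind, the sketch below counts the most frequent words in a piece of text. It is only a minimal example under stated assumptions: the sample sentence is invented, the function name `word_frequencies` is hypothetical, and a real downloaded file would first have to be read from disk, with OCR errors likely to affect the counts.

```python
import re
from collections import Counter


def word_frequencies(text, min_length=4):
    """Count lowercase word tokens that have at least min_length letters."""
    # [^\W\d_]+ matches runs of letters only, including accented ones,
    # which matters for Spanish-language press.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(token for token in tokens if len(token) >= min_length)


# Invented sample text standing in for the contents of a downloaded TXT file.
sample = "La prensa de Madrid informa: la prensa libre es la voz de Madrid."

top_words = word_frequencies(sample).most_common(2)
print(top_words)  # [('prensa', 2), ('madrid', 2)]
```

The `min_length` filter is a crude stand-in for a stop-word list; a serious analysis would probably use a dedicated natural language processing library instead.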
The initiative is part of the BNE's general strategy to promote research on and reuse of its digital heritage. More specifically, it belongs to the part of that roadmap that aims to analyze, open and publish the data the institution generates, in open and reusable formats and in line with public sector information reuse policies and standards.
The large data sets generated and released by the National Library of Spain include adaptations to the JSON, CSV, ODS, TXT and XML formats. “The initiative is proposed as an activity open to collaboration, a starting point to find lines of experimentation, work and exploitation of these data, as a valuable resource in fields such as natural language processing, academic research or the development of software,” they say.
An earlier version of this article was published in 2020.