Open Research Tools and Technologies

DataLad

Perpetual decentralized management of digital objects for collaborative open science
AW1.126
Michael Hanke
Contemporary sciences are heavily data-driven, but today's data management technologies and sharing practices lag at least a decade behind their software ecosystem counterparts. Merely providing file access is insufficient for a simple reason: data are not static. Data often do (and should!) continue to evolve: file formats can change, bugs will be fixed, new data are added, and derived data need to be integrated. While (distributed) version control systems are a de-facto standard for open source software development, a comparable level of tooling and culture is not present in the open data community.

The lecture introduces DataLad, a software package that aims to address this problem by providing a feature-rich API (command line and Python) for the joint management of all digital objects of science: source code, data artifacts (as well as their derivatives), and essential utilities, such as container images of the employed computational environments. A DataLad dataset represents a comprehensive and actionable unit that can be used privately, or be published on today's cyberinfrastructure (GitLab, GitHub, Figshare, S3, Google Drive, etc.) to facilitate large- and small-scale collaborations. In addition to essential version control tasks, DataLad aids data discovery by supporting a plurality of evolving metadata description standards. Moreover, DataLad is able to capture data provenance information in a way that enables programmatic re-execution of computations, and as such provides a key feature for the implementation of reproducible science. DataLad is extensible and customizable to fine-tune its functionality to specific domains (e.g., a field of science or organizational requirements).

DataLad is built on a few key principles:

DataLad only knows about two things: datasets and files. A DataLad dataset is a collection of files in folders, and a file is the smallest unit any dataset can contain. At its core, DataLad is a completely domain-agnostic, general-purpose tool to manage data.

A dataset is a Git repository. All features of the version control system Git also apply to everything managed by DataLad.

A DataLad dataset can manage and version control arbitrarily large data. To do this, it has an optional annex for (large) file content. Thanks to this annex, DataLad can track files that are terabytes in size (something Git alone cannot do), which allows you to restore previous versions of data, transform and work with it while capturing all provenance, or share it with whomever you want. At the same time, DataLad does all of the magic necessary to make this important feature work quietly in the background: the annex is set up automatically, and the tool git-annex manages it all under the hood.

DataLad follows the social principle of minimizing custom procedures and data structures. DataLad will not transform your files into something that only DataLad or a specialized tool can read. A PDF file (or any other type of file) stays a PDF file, whether it is managed by DataLad or not. This guarantees that users will not lose data or data access if DataLad vanished from their system, or even from the face of the Earth. Using DataLad thus does not require or generate data structures that can only be used or read with DataLad -- DataLad does not tie you down, it liberates you.
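A minimal sketch of these principles in action, using DataLad's Python API (every call has a command line equivalent); the dataset path, file names, and the toy shell command here are hypothetical, chosen purely for illustration:

    import datalad.api as dl

    # Create a dataset: a plain Git repository whose annex for
    # (arbitrarily large) file content is set up automatically.
    ds = dl.create(path="my-dataset")

    # Files stay ordinary files; DataLad merely version controls them.
    (ds.pathobj / "notes.txt").write_text("raw observations\n")
    ds.save(message="Add first notes")

    # Capture provenance: the command, and the dataset state it ran
    # on, are recorded in the Git history.
    ds.run("wc -l notes.txt > stats.txt", message="Count lines")

    # Programmatic re-execution of the recorded computation.
    ds.rerun("HEAD")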
Furthermore, DataLad is developed for complete decentralization. No central server or service is required to use DataLad, and hence no central infrastructure needs to be maintained (or paid for) -- your own laptop is a perfect home for your DataLad project, as is your institution's web server, or any other computational infrastructure you might be using. At the same time, DataLad aims to maximize the (re-)use of existing third-party data resources and infrastructure. Users can employ existing central infrastructure should they want to: DataLad works with anything from GitHub to Dropbox, Figshare, or institutional repositories, enabling users to harvest all of the advantages of their preferred infrastructure without tying anyone down to central services.
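A hypothetical sketch of this decentralized (re-)use, again via the Python API; the URL, file path, and repository name below are made up:

    import datalad.api as dl

    # Clone a published dataset from any URL. This is lightweight:
    # large file content is fetched on demand, not up front.
    ds = dl.clone(source="https://example.org/some-dataset.git",
                  path="some-dataset")

    # Retrieve the actual content of one annexed file when needed.
    ds.get("data/recording.nii.gz")

    # Publish to existing third-party infrastructure, e.g. GitHub
    # (requires credentials, hence commented out here):
    # ds.create_sibling_github("some-dataset")
    # ds.push(to="github")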

Additional information

Type: devroom

More sessions

2/1/20
Open Research Tools and Technologies
Jan Grewe
AW1.126
The reproducibility crisis has shocked the scientific community. Numerous papers describe this issue, and the scientific community has taken steps to improve on it. For example, several initiatives have been founded to foster openness and standardisation in different scientific communities (e.g. the INCF[1] for the neurosciences). Journals encourage sharing of the data underlying the presented results; some even make it a requirement. What is the role of open source solutions in this respect? ...
2/1/20
Open Research Tools and Technologies
Julia Sprenger
AW1.126
The approaches used in software development in an industry setting and in a scientific environment exhibit a number of fundamental differences. In industry, modern team development tools and methods (version control, continuous integration, Scrum, ...) are used to develop software in teams, with a focus on the final software product. In contrast, in a scientific environment a large fraction of scientific code is produced by individual scientists lacking thorough ...
2/1/20
Open Research Tools and Technologies
Aniket Pradhan
AW1.126
NeuroFedora is an initiative to provide a ready-to-use, Fedora-based Free/Open Source software platform for neuroscience. We believe that, like Free software, science should be free for all to use, share, modify, and study. The use of Free software also aids reproducibility, data sharing, and collaboration in the research community. By making the tools used in the scientific process easier to use, NeuroFedora aims to take a step towards enabling this ideal.
2/1/20
Open Research Tools and Technologies
AW1.126
Health data is traditionally held and processed in large and complex mazes of hospital information systems. The market is dominated by vendors offering monolithic and proprietary software, due to the critical nature of the supported processes and - in some cases - due to legal requirements. The “digital transformation”, “big data”, and “artificial intelligence” are some of the hyped trends that demand improved exchange of health care data in routine health care and medical research alike. ...
2/1/20
Open Research Tools and Technologies
Lilly Winfree
AW1.126
Generating insights and conclusions from research data is often not a straightforward process. Data can be hard to find, archived in difficult-to-use formats, poorly structured, and/or incomplete. These issues create “friction” and make it difficult to use, publish, and share data. The Frictionless Data initiative (https://frictionlessdata.io/) at the Open Knowledge Foundation (http://okfn.org) aims to reduce friction in working with data, with the goal of making it effortless to transport data among ...
2/1/20
Open Research Tools and Technologies
Mateusz Kuzak
AW1.126
ELIXIR is an intergovernmental organization that brings together life science resources across Europe. These resources include databases, software tools, training materials, cloud storage, and supercomputers.
2/1/20
Open Research Tools and Technologies
Antoine Fauchié
AW1.126
As an editor for WYSIWYM text, Stylo is designed to change the entire digital editorial chain of scholarly journals in the field of the human sciences. Stylo (https://stylo.ecrituresnumeriques.ca) simplifies the writing and editing of scientific articles in the humanities and social sciences. It is intended for authors and publishers engaged in high-quality scientific publishing. Although the structuring of documents is fundamental for digital distribution, this aspect is currently delayed ...