Open Research Tools and Technologies

Journalists are researchers like any others

We are not journalists, but we are developers working for journalists. When we receive leaks, we are flooded by a huge amount of documents and by the many questions journalists have while trying to dig into the leak. Among others:

* Where to begin?
* How many documents mention "tax avoidance"?
* How many languages are in this leak?
* How many documents are in CSV?

Journalists have more or less the same questions as researchers! So to help them answer all these questions, we developed Datashare. In a nutshell, Datashare is a tool to answer all your questions about a corpus of documents: just like Google, but without Google and without sending information to Google. It extracts content and metadata from all types of documents and indexes them. Then it detects people, locations, organizations and email addresses. The web interface exposes all of that to give you a complete overview of your corpus and let you search through it. Plus, Datashare lets you star and tag documents. We didn't want to reinvent the wheel, so we used assets that have proven to work well.

How did we end up with Datashare from a heterogeneous environment? Initially we had:

- a command-line tool to extract text from huge document corpora
- a proof of concept of NLP pipelines in Java
- a shared index based on Blacklight / RoR and Solr
- open source tools and frameworks

Issues we had to fix:

- UX
- scalability of Solr with millions of documents
- integration of all the tools in one
- maintainability and robustness of a growing code base
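The pipeline sketched above (extract text, detect entities such as email addresses, index everything, then search) can be illustrated with a minimal toy. This is not Datashare's actual implementation, which builds on dedicated extraction, NLP, and search engines; the mini-corpus, the regex-based email detection, and the inverted index here are assumptions made purely for illustration.

```python
import re
from collections import defaultdict

# Naive email pattern for illustration only; real entity detection
# in Datashare relies on proper NLP pipelines.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def build_index(docs):
    """Toy inverted index: term -> sorted-able set of doc ids,
    plus per-document detected email 'entities'."""
    index = defaultdict(set)
    entities = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
        for email in EMAIL_RE.findall(text):
            entities[doc_id].add(email)
    return index, entities

def search(index, term):
    """Return doc ids mentioning the term (case-insensitive)."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical mini-corpus standing in for a leak's documents
docs = {
    "doc1": "Offshore accounts and tax avoidance; contact shell@example.com",
    "doc2": "Quarterly report, nothing on tax matters",
}
index, entities = build_index(docs)
print(search(index, "tax"))      # doc ids mentioning "tax"
print(sorted(entities["doc1"]))  # email addresses found in doc1
```

A real system would swap the tokenizer for a content-extraction library, the regex for named-entity recognition, and the dictionary for a search index that scales to millions of documents.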

Additional information

Type devroom

More sessions

2/1/20
Open Research Tools and Technologies
Jan Grewe
AW1.126
The reproducibility crisis has shocked the scientific community. Different papers describe this issue and the scientific community has taken steps to improve on it. For example, several initiatives have been founded to foster openness and standardisation in different scientific communities (e.g. the INCF[1] for the neurosciences). Journals encourage sharing of the data underlying the presented results, some even make it a requirement. What is the role of open source solutions in this respect? ...
2/1/20
Open Research Tools and Technologies
Julia Sprenger
AW1.126
The approaches used in software development in an industry setting and in a scientific environment exhibit a number of fundamental differences. In industry, modern team development tools and methods (version control, continuous integration, Scrum, ...) are used to develop software in teams with a focus on the final software product. In a scientific environment, by contrast, a large fraction of scientific code is produced by individual scientists lacking thorough ...
2/1/20
Open Research Tools and Technologies
Aniket Pradhan
AW1.126
NeuroFedora is an initiative to provide a ready-to-use, Fedora-based Free/Open Source software platform for neuroscience. We believe that, like Free software, science should be free for all to use, share, modify, and study. The use of Free software also aids reproducibility, data sharing, and collaboration in the research community. By making the tools used in the scientific process easier to use, NeuroFedora aims to take a step towards this ideal.
2/1/20
Open Research Tools and Technologies
AW1.126
Health data is traditionally held and processed in large and complex mazes of hospital information systems. The market is dominated by vendors offering monolithic and proprietary software due to the critical nature of the supported processes and, in some cases, due to legal requirements. "Digital transformation", "big data" and "artificial intelligence" are some of the hypes that demand improved exchange of health care data in routine health care and medical research alike. ...
2/1/20
Open Research Tools and Technologies
Michael Hanke
AW1.126
Contemporary sciences are heavily data-driven, but today's data management technologies and sharing practices fall at least a decade behind their software ecosystem counterparts. Merely providing file access is insufficient for a simple reason: data are not static. Data often do (and should!) continue to evolve; file formats can change, bugs will be fixed, new data are added, and derived data need to be integrated. While (distributed) version control systems are a de-facto standard for open source ...
2/1/20
Open Research Tools and Technologies
Lilly Winfree
AW1.126
Generating insight and conclusions from research data is often not a straightforward process. Data can be hard to find, archived in difficult-to-use formats, poorly structured and/or incomplete. These issues create "friction" and make it difficult to use, publish and share data. The Frictionless Data initiative (https://frictionlessdata.io/) at Open Knowledge Foundation (http://okfn.org) aims to reduce friction in working with data, with a goal to make it effortless to transport data among ...
2/1/20
Open Research Tools and Technologies
Mateusz Kuzak
AW1.126
ELIXIR is an intergovernmental organization that brings together life science resources across Europe. These resources include databases, software tools, training materials, cloud storage, and supercomputers.