On Friday, June 28, Utrecht University will host a Hackathon where the week's participants can work on scoping the next project, a quick integration between different systems, or some policy examples. Lunch will be included.
Current Agenda (pdf)
More than Just Load Balancing iRODS Using HAProxy
CyVerse, University of Arizona
The iRODS community has demonstrated that HAProxy works well for horizontally scaling iRODS catalog providers, but HAProxy can provide other functionality as well. This talk will demonstrate how HAProxy can access information about the user and client application and use it to control quality of service by throttling access per user, providing fast lanes for certain applications, and routing data transfer sessions to specific catalog providers. The talk will also show how HAProxy can be configured to filter out port scanners. Finally, the talk will show how to configure HAProxy to allow a single canonical host name for iRODS and other services like davrods.
An authentication solution for iRODS based on the OpenID Connect protocol
Claudio Cacciari, Giusy Muscianisi, Michele Carpenè, Mattia D'Antonio, and Giuseppe Fiameni
We describe an authentication solution for iRODS based on the OpenID Connect (OIDC) protocol. In the context of European data infrastructures, like EUDAT, and projects, like EOSC-hub, iRODS must interoperate with other services that support the OIDC and OAuth2 protocols. The typical usage workflows encompass both direct user interaction, via icommands and other clients, and service-to-service interaction on behalf of the user. In the first case we can rely on the existing iRODS OpenID plugin; in the second case this is not possible, for two main reasons. The first is that in a service-to-service process the user is not asked to generate a token for iRODS; instead, iRODS must be able to re-use an existing token from another service. The second is that the other service needs to access iRODS using multiple authentication protocols dynamically, without fixing one of them in the configuration. For example, we have instances of DavRODS that allow users to log in either with a plain username and password or with an OIDC token. To achieve this, we implemented a Pluggable Authentication Module (PAM) that allows iRODS to accept an OIDC token in the password parameter of the PAM-based authentication, validate it against an Authentication Service, and, once the token has been validated, map the user to a local account based on the attributes returned by the Authentication Service. Given the flexibility of the PAM approach, it is possible to stack multiple PAM modules together, enabling a single iRODS instance to support multiple OIDC providers and even to create local accounts dynamically, without any pre-configured mapping.
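The token-in-password flow with stacked providers can be illustrated with a minimal Python sketch. This is not the authors' PAM module: the function name `map_token_to_account`, the `Validator` type, and the choice of the standard OIDC claim `preferred_username` as the local account name are all assumptions made for illustration. Each validator stands in for one Authentication Service and returns the user's attributes, or `None` if it rejects the token.

```python
from typing import Callable, Mapping, Optional, Sequence

# Hypothetical type: a validator receives a token and returns the user
# attributes reported by its authentication service, or None if the
# token is not valid for that service.
Validator = Callable[[str], Optional[Mapping[str, str]]]


def map_token_to_account(
    password_field: str,
    validators: Sequence[Validator],
    account_attribute: str = "preferred_username",  # assumed mapping claim
) -> Optional[str]:
    """Treat the password field as an OIDC token and map it to a local account.

    Validators are tried in order, mimicking a stack of modules where each
    one represents a different OIDC provider.
    """
    for validate in validators:
        attributes = validate(password_field)
        if attributes is None:
            continue  # this provider rejected the token, try the next one
        account = attributes.get(account_attribute)
        if account:
            return account
    return None  # no provider accepted the token
```

In a real deployment each validator would contact its provider, for example through a token introspection or userinfo request, rather than being a local function.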
Integration of iRODS data workflows in an extensible HTTP REST API framework
Mattia D'Antonio, Claudio Cacciari, Giuseppa Muscianisi, Michele Carpené, and Giuseppe Fiameni
We developed a set of HTTP REST APIs on top of iRODS to help users from different communities automate both data ingestion and data retrieval workflows. We built a common REST API layer implementing basic functionality, including the interaction with iRODS, within an extensible framework (RAPyDO: Rest Apis with Python on Docker) that we developed and adopted to build community-specific REST APIs.
In more detail, we are collaborating with the EUropean DATa infrastructure EUDAT; European projects like EOSC-hub and SeaDataNet; and national initiatives in collaboration with Telethon (a non-profit organization for genetic diseases research) and SIGU (Italian Society for Human Genomics).
All endpoints are written in Python using the Flask framework and served by a uWSGI web server deployed within a Docker container.
We created a wrapper around the Python iRODS Client (PRC) to let both the core framework and the community-specific APIs interact easily with iRODS, supporting all main authentication protocols: native passwords, Pluggable Authentication Modules (PAM), and the Grid Security Infrastructure (GSI) of the Globus Toolkit. To support all required authentication methods, we also contributed to the PRC by developing authentication modules for both GSI and PAM.
Most of the iRODS-based functions that we developed map to corresponding icommands such as ils, iget, iput, imv, icp, imeta, irule, and iticket, but we have also implemented more complex functionality, for instance streamed read/write operations from/to network sockets.
To be able to execute data-intensive and complex workflows, we also introduced an asynchronous layer implemented on Celery, a task management queue based on distributed message passing.
iRODS in context: Exploring integrations between iRODS and OwnCloud
Hylke Koers, Stefan Wolfsheimer, Arthur Newton, and Sharif Islam
Within the Netherlands, iRODS is gaining substantial traction with universities and other research institutes as a tool to help manage large amounts of heterogeneous research data. In this context, iRODS is usually used as middleware, providing value through data virtualization, metadata management and/or rule-driven workflows. This is then typically combined with other tools and technology to fully support the diverse needs of researchers, data stewards, IT managers, etc.
While integrations with other RDM tools are facilitated by iRODS' flexibility, a significant amount of work is usually still required to develop and test them with users in their specific context. For this reason, SURF, as the collaborative ICT organisation for Dutch education and research, sees a role for itself in spearheading the development of such integrations: pooling resources lowers the collective development cost and accelerates the pace of adoption.
In this contribution, we will focus on a recent project undertaken by SURF to explore the integration between OwnCloud and iRODS. OwnCloud is an open-source, “sync and share” solution to manage data as an individual or as a research team. OwnCloud is the technology behind two successful existing SURF products: SURFdrive and Research Drive. Offering a GUI, versioning, off-line sync and link-based sharing, its functionality is in many ways complementary to iRODS. This makes integrating the two technologies attractive, yet there are several challenges in terms of file inventory synchronization, metadata management, and access control. In the demo, we'd like to share how we have addressed these challenges and discuss a proposed way forward.
As an outlook into future work, this integration could be extended to support seamless publication of research data in trusted, long-term data repositories. Existing data publication workflows have many common tasks, but also significant variance in the “details” of how these tasks are strung together and how they need to be operationalized. To address this balance, we are exploring an approach that essentially abstracts data publication tasks into an overarching workflow framework, so as to allow for flexibility yet also benefit from standards and common patterns.
SODAR - the iRODS-powered System for Omics Data Access and Retrieval
Mikko Nieminen, Manuel Holtgrewe, Mathias Kuhring, and Dieter Beule
Berlin Institute of Health
In the past years, a growing number of high-throughput omics assays in the areas of genomics, proteomics, and metabolomics have become widespread in life science research. This creates increasing demand for handling the large amounts of data as well as models for the complex experimental designs. Further challenges include the FAIR principles for making the data findable, accessible, interoperable, and reusable. Collaboration between multiple institutes further complicates data management.
Here we present SODAR (System for Omics Data Access and Retrieval), our effort to fulfill these requirements. The modular system allows for the curation of complex studies with the required metadata as well as for the storage of large bulk data. To facilitate effective and efficient data management workflows, SODAR provides project-based access control, a web-based graphical user interface for data management, programmatic data access, ID management for study objects, and various tools for omics data management.
The system is based on open source solutions. iRODS is used for large data storage, while Davrods provides access through the widely supported HTTP(S) protocol (e.g., for integration with the IGV software). Graphical interfaces and APIs are implemented in Python using the Django framework. Our data model is based on the ISA-Tools data model, and ISA-tab is used as the metadata file exchange format. A transaction subsystem integrates activities spanning both data and metadata. Core parts of SODAR are available as reusable libraries for creating project-based data management systems that share access control with SODAR.
We will demonstrate our flagship rare disease genetics use case, from bulk data import to browsing the study design and metadata and interacting with the data through IGV.
A beta version of SODAR is currently deployed in our institutes. The system will be made available as open source under a permissive license.
iRODS and use case of Bristol-Myers Squibb to manage genomics data
Bristol-Myers Squibb Company
This presentation will discuss how iRODS helps Bristol-Myers Squibb manage petabytes of NGS data by synchronizing it between different on-premise locations and AWS cloud storage, and the challenges this presents.
In the days of high-speed internet and cloud computing, the old paradigms in drug discovery have gone through significant changes. Shifts in the IT landscape and in the economics of medicine force “big pharma” companies to seek better routes to innovation and efficiency. Since DNA has gone digital, scientific communities around the globe routinely run multiple tests, take high-resolution medical images, and use big data in health research on a daily basis.
New cloud computing infrastructure contributes to a swift increase in research partnerships in bioanalysis through collaboration consortia, which ultimately leads to data fragmentation in terms of sources, data types, and storage. In addition, this data should also be securely stored, well maintained throughout its lifetime, and access-controlled in accordance with the latest local regulations and data compliance requirements.
Asynchronous file handling with iRODS tape resources
Arthur Newton, Stefan Wolfsheimer, Sharif Islam, Hylke Koers, and Matthew Saum
Tiered storage systems comprised of a disk cache staging area and a tape library are cost-effective solutions ideal for long-term storage.
Last year, we presented a way to build a tiered storage system that employs tape storage in the backend, transparent to iRODS. Since our tape archive system already has a disk cache, we explicitly did not use an iRODS compound resource for the integration, as that would have required an additional cache layer.
Tape storage is inherently asynchronous, meaning that data can reside in different states: online on the disk cache, or offline on tape only. If the data is offline, it is not readily available and needs to be staged to disk. Our solution in iRODS can (automatically) trigger state transitions between offline and online. However, the user still experienced the asynchronicity of the different states of data, which left room for improvement in the user-friendliness of handling such data.
To fill this gap, we implemented a set of command line applications that make it easier for the user to download, upload, and retrieve information about the state of data. The tool is built on the iRODS Python client, which removes the need for icommands or the DMF tape tools and thereby broadens compatibility across systems. The application is split into a set of CLI tools and a daemon-like application that handles requests and file transfers in the background. The daemon is automatically spawned as a non-root process upon the first request and stopped after being idle for a specified time.
The command line tool can be extended to other types of storage resources with similar asynchronous staging of data. Additionally, performance could be improved by allowing parallel transfer of data.
A GA4GH Data Repository Service for native iRODS
This paper and presentation will premiere an implementation of a GA4GH Data Repository Service (formerly the GA4GH Data Object Service) that can run on a base iRODS server. The Data Repository Service implementation uses standard iRODS collections to house a Data Bundle, and utilizes Attribute-Value-Unit metadata to mark data objects and bundles with auxiliary information.
Using this service allows iRODS to integrate with workflow and data access services in the area of genomics and bio-sciences that follow the emerging standards of the GA4GH consortium. These standards have been identified in the NIH Commons effort as an important interoperability standard.
The development approach of this implementation sets a very low barrier of entry and allows any genomic data set stored in iRODS to be exposed via the GA4GH Data Repository Service API. A presentation, paper, and source code release are planned for the User Group Meeting.
Providing validated, templated and richer metadata using a bidirectional conversion between JSON and iRODS AVUs
J. Paul van Schayck, Ton Smeele, Daniël Theunissen, and Lazlo Westerhof
Maastricht UMC+ - DataHub; Utrecht University - ITS/RDM
iRODS allows attaching any number of attribute, value, unit (AVU) triples to any iRODS object. Very few limits or restrictions are placed on the content and usage of these AVU triples. However, a frequently recurring question in research data management is to structure metadata according to some standard and to provide end-user interfaces for this. Moreover, the interoperability and reusability guidelines of the FAIR principles state that metadata should use controlled vocabularies and meet community standards. In order to start addressing these questions in a generic way, we have developed a new method for storing richer and templated metadata in AVUs.
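As a rough illustration of how nested metadata can be carried in AVU triples, the following Python sketch flattens a JSON object into (attribute, value, unit) tuples and rebuilds it. It is deliberately simplified and is not the authors' conversion method: the dotted attribute paths, the `"json"` unit marker, and the function names `json_to_avus` and `avus_to_json` are assumptions made for this example, and keys that themselves contain a dot are not supported.

```python
import json


def json_to_avus(doc, prefix=""):
    """Flatten a JSON object into a list of (attribute, value, unit) triples.

    Nested objects become dotted attribute paths. Every leaf value is stored
    as JSON text so that its type (string, number, list, null) survives.
    """
    avus = []
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            avus.extend(json_to_avus(value, path))
        else:
            # The unit "json" is an assumed marker for triples from this scheme.
            avus.append((path, json.dumps(value), "json"))
    return avus


def avus_to_json(avus):
    """Rebuild the JSON object from triples produced by json_to_avus."""
    doc = {}
    for attribute, value, unit in avus:
        if unit != "json":
            continue  # leave unrelated AVUs alone
        *parents, leaf = attribute.split(".")
        node = doc
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = json.loads(value)
    return doc
```

Because leaf values are kept as JSON text, a document made of nested objects, lists, and scalars is returned unchanged by `avus_to_json(json_to_avus(doc))`. A scheme that keeps plain attribute/value pairs readable in iRODS would have to encode the structure differently, for instance in the unit field.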
We have chosen to work with metadata in a JSON representation and use JSON-schema for metadata structuring and validation. The associated JSON-schema also provides the template for metadata forms for end-users. JSON is a flexible and easy to use format for serializing (nested) data, while maintaining human and developer readability. We have developed and implemented a bidirectional conversion between JSON and AVUs. This conversion method strives to (a) be a bijection between JSON and AVUs, (b) be as lean as possible and (c) keep attribute/value pairs the same between JSON and AVUs. The conversion method is used as part of two Python rules that allow users to set and retrieve, respectively, AVU metadata on an iRODS object using a JSON structure. Optionally, a policy can be installed and activated to validate metadata updates against the JSON-schema that governs the object.
We hope that this work will provide other developers with a generic method for converting between JSON and AVUs. We encourage others to use JSON-schema for building dynamic metadata forms and for making iRODS easier to use with a metadata standard of choice.
Rodinaut: A tool for metadata management
For scientists who need to ensure compliance with data security/privacy, and find information in iRODS, Rodinaut is a web application that enables viewing and managing metadata. Unlike the existing command-line tool, our product is self-explaining, easy and fast to use, and improves user experience with iRODS.
iRODS at KTH and SNIC - Status and Prospects
KTH Royal Institute of Technology
The current state of iRODS operations at KTH PDC Center for High Performance Computing is presented with a two-pronged approach. Firstly, the status of our national iRODS-based data infrastructure is laid out with respect to both the deployment at KTH PDC and our collaboration with our partner center NSC at Linköping University. We are currently a distributed operation between two centers, with the possibility of more Swedish HPC centers joining in the future. Secondly, our development efforts for an HPC-adjacent iRODS deployment at our center are discussed. The focus here is to provide access to several tiers of iRODS storage environments in the high performance computing environments available at our center, in a secure, efficient, low-latency and high-throughput configuration. In addition to providing iRODS clients in our compute cluster to enable access to (federated) iRODS grids while using (federated) Kerberos authentication between the administrative domains, we are also collaborating with the iRODS Consortium on the testing of a new iRODS Lustre interface. Eventually this would enable us to provide Lustre-backed iRODS resources for storage tiering within the local iRODS grid, for the staging of HPC data into and especially out of the compute filesystem. This, combined with federated access to several tiers of storage resources and the ability to publish data out of iRODS, possibly referenced by persistent identifiers in the future, gives a working solution for research data lifecycle management.
Surgical Critical Care Initiative: Leveraging iRODS to Accomplish Multi-Site Data Collection, Harmonization, and Analytics to Generate Clinical Decision Support Tools
Andy MacKelfresh and Justin James
Duke University; iRODS Consortium, RENCI at UNC-Chapel Hill
To support the development of Clinical Decision Support Tools (CDSTs) in both civilian and military health systems, the Surgical Critical Care Initiative (SC2i), a Department of Defense funded consortium of Federal and non-Federal institutions, leverages iRODS to harmonize clinical, laboratory, and bio-bank data and centralize 30+ million high-quality data elements to enhance complex decision making in acute and trauma care.
Migrating data when decommissioning PetaBytes of storage
Wellcome Sanger Institute
Wellcome Sanger Institute has ~18PB of genomic data in 399 resources on 76 resource servers across six Zones. This is the story of what happened when we needed to retire some of the servers.