iRODS

iRODS User Group Meeting 2019

Utrecht University
Utrecht, Netherlands
June 25 - June 28, 2019

Original Agenda (pdf) (png)

Photos and Videos

Conference Videos hosted on LectureNet (Utrecht University)

Conference Videos hosted on YouTube

Articles

iRODS UGM 2019 Proceedings (PDF)

Providing validated, templated, and richer metadata using a bidirectional conversion between JSON and iRODS AVUs
J. Paul van Schayck, Daniël Theunissen – Maastricht University
Ton Smeele, Lazlo Westerhof – Utrecht University

An authentication solution for iRODS based on the OpenID Connect protocol
Claudio Cacciari, Giuseppa Muscianisi, Michele Carpené, Mattia D'Antonio, Giuseppe Fiameni – CINECA

NFSRODS: Presenting iRODS as NFSv4.1
Kory Draughn, Terrell Russell, Alek Mieczkowski, Jason Coposky – iRODS Consortium
Mike Conway – NIH/NIEHS

Integration of iRODS data workflows in an extensible HTTP REST API framework
Mattia D'Antonio, Claudio Cacciari, Giuseppa Muscianisi, Michele Carpené, Giuseppe Fiameni – CINECA

iRODS S3 Resource Plugin: Cacheless and Detached Mode
Justin James, Terrell Russell, Jason Coposky – iRODS Consortium

Training (June 25)

iRODS Training: Beginner

"Getting Started with iRODS" covered iRODS vocabulary, mental models, capabilities, and basic interactions.

Overview
Four Core Competencies
Planning an iRODS Deployment
Installation
LUNCH
iCommands and Cloud Browser
Virtualization
Basic Metadata
Basic Rule Engine

iRODS Training: Administration

"iRODS Care and Feeding" will cover scoping an iRODS installation, deployment, various integrations, and how they work.

Resource Hierarchies and Composition
Rule Engine Plugins
Data to Compute
Compute to Data
Automated Ingest
Storage Tiering
Auditing
Q&A: Troubleshooting / Bottlenecks / Performance

iRODS Training: Policy

"Data Commons in a Day" will cover how to create a system that provides ingest to analysis, publication, and tiering out to archive.

Overview
Automated Ingest
Storage Tiering
Indexing
Compute to Data
Publishing

Presentations (June 26)

Welcome [UU video] [YouTube video]
Jason Coposky - iRODS Consortium

iRODS UGM 2019 Keynote [slides] [UU video] [YouTube video]
Folkert-Jan de Groot - Utrecht University

Consortium Update [slides] [UU video] [YouTube video]
Jason Coposky - iRODS Consortium

Technology Update [slides] [UU video] [YouTube video]
Terrell Russell - iRODS Consortium
Kory Draughn - iRODS Consortium
Alan King - iRODS Consortium
Jaspreet Gill - iRODS Consortium

Providing validated, templated and richer metadata using a bidirectional conversion between JSON and iRODS AVUs [slides] [paper] [UU video] [YouTube video]
J. Paul van Schayck - Maastricht University

A frequently recurring question in research data management is how to structure metadata according to a standard and how to provide a corresponding user interface for it. This has only become more urgent since the introduction of the FAIR principles, which state that metadata should use controlled vocabularies and meet community standards.

The iRODS data grid technology is well positioned as a core layer within an infrastructure to manage research data. One of its strengths is the ability to attach any number of attribute, value, unit (AVU) triples as metadata to any iRODS object. This makes iRODS adaptable to very diverse use cases in research data management. However, the challenge of working with more structured metadata is not being addressed by the default capabilities of iRODS. Our aim is to develop a new method for storing richer, templated and validated metadata in AVUs.

JSON is a popular, flexible and easy-to-use format for serializing (nested) data, while maintaining human and developer readability. Furthermore, a JSON Schema can be used to validate a JSON structure, and it can also be used to obtain a dynamically generated form on the basis of this schema. This combination of functionalities makes it an excellent format for metadata. We have therefore designed and implemented a bidirectional conversion between JSON and AVUs. The conversion method has been implemented as Python iRODS rules that allow setting and retrieving AVU metadata on an iRODS object using a JSON structure. Optionally, a policy can be installed to validate metadata entry and updates against the JSON Schema that governs the object.

With this work we provide other iRODS developers with a generic method for conversion between JSON and AVUs. We are encouraging others to use the conversion method in their deployments.
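As a rough illustration of what a bidirectional JSON-to-AVU conversion involves, the sketch below flattens a nested JSON object into (attribute, value, unit) triples and rebuilds the object from them. The encoding is an assumption made for this example and is not the scheme described in the paper: the attribute holds the JSON path, the value holds the JSON-encoded leaf, and the unit holds a hypothetical namespace (`root`) that separates these triples from unrelated AVUs. The function names are hypothetical as well.

```python
import json


def json_to_avus(doc, namespace="root"):
    """Flatten a JSON object into (attribute, value, unit) string triples."""
    if not isinstance(doc, dict):
        raise TypeError("top-level metadata must be a JSON object")
    avus = []

    def walk(node, path):
        if isinstance(node, dict) and node:
            for key, child in node.items():
                walk(child, path + [key])
        elif isinstance(node, list) and node:
            for index, child in enumerate(node):
                walk(child, path + [index])
        else:
            # Leaf (scalar, or an empty list/object): store the path and the
            # value as JSON text so that types survive the round trip.
            avus.append((json.dumps(path), json.dumps(node), namespace))

    walk(doc, [])
    return avus


def avus_to_json(avus, namespace="root"):
    """Rebuild the JSON object from triples produced by json_to_avus."""
    tree = {}
    for attribute, value, unit in avus:
        if unit != namespace:
            continue  # ignore AVUs that belong to other metadata
        path = ["$"] + json.loads(attribute)
        node = tree
        for step in path[:-1]:
            node = node.setdefault(step, {})
        node[path[-1]] = json.loads(value)

    def finalize(node):
        # Containers indexed by integers were lists in the original document.
        if isinstance(node, dict) and node:
            if all(isinstance(key, int) for key in node):
                return [finalize(node[index]) for index in sorted(node)]
            return {key: finalize(child) for key, child in node.items()}
        return node

    return finalize(tree).get("$", {})


metadata = {"title": "Sample", "creators": [{"name": "A"}, {"name": "B"}]}
triples = json_to_avus(metadata)
restored = avus_to_json(triples)
```

Validation against a JSON Schema would operate on the rebuilt object (`restored` here) before the triples are written; that step is left out because it depends on the schema library in use.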

iRODS at KTH and SNIC - Status and Prospects [slides] [UU video] [YouTube video]
Ilari Korhonen - KTH Royal Institute of Technology

The current state of iRODS operations at KTH PDC Center for High Performance Computing is presented with a two-pronged approach. Firstly, the status of our national iRODS-based data infrastructure is laid out with respect to both the deployment at KTH PDC and our collaboration with our partner center NSC at Linköping University. We are currently a distributed operation between two centers, with the possibility of more Swedish HPC centers joining in the future. Secondly, our development efforts for an HPC-adjacent iRODS deployment at our center are discussed. The focus here is to provide access to several tiers of iRODS storage environments in the high performance computing environments available at our center, in a secure, efficient, low-latency and high-throughput configuration. In addition to providing iRODS clients in our compute cluster to enable access to (federated) iRODS grids while using (federated) Kerberos authentication between the administrative domains, we are also collaborating with the iRODS Consortium on the testing of a new iRODS Lustre interface. Eventually this would enable us to provide Lustre-backed iRODS resources for storage tiering within the local iRODS grid, for the staging of HPC data into and especially out of the compute filesystem. This, combined with federated access to several tiers of storage resources and the ability to publish data out of iRODS, possibly in the future referenced by persistent identifiers, gives a working solution for research data lifecycle management.

An authentication solution for iRODS based on the OpenID Connect protocol [slides] [paper] [UU video] [YouTube video]
Michele Carpené - CINECA

We are going to describe an authentication solution for iRODS based on the OpenID Connect (OIDC) protocol. In the context of European data infrastructures, like EUDAT, and projects, like EOSC-hub, iRODS must interoperate with other services, which support OIDC and OAuth2 protocols. The typical usage workflows encompass both direct user interaction via icommands and other clients and service-to-service interaction on behalf of the user. While in the first case we can rely on the already existing iRODS OpenID plugin, in the second one this is not possible for two main reasons. The first is that the service-to-service process implies that the user is not requested to generate a token for iRODS, but that iRODS is able to re-use an existing token from another service. The second is that the other service needs to get access to iRODS using multiple authentication protocols in a dynamic way, not fixing one of them in the configuration. For example, we have instances of DavRODS that allow logging in via plain username and password or via OIDC token. In order to achieve those results, we implemented a Pluggable Authentication Module (PAM), which allows iRODS to accept an OIDC token by re-using the password parameter of PAM-based authentication, validate it against an Authentication Service and, once the token has been validated, map the user to a local account relying on the attributes provided back by the Authentication Service. Given the flexibility of the PAM approach, it is possible to stack multiple PAM modules together, enabling a single iRODS instance to support multiple OIDC providers and even to create local accounts dynamically, without any pre-configured mapping.
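The control flow described above (token passed in the password field, validated by one of several stacked providers, then mapped to a local account) can be sketched in a few lines. This is only an illustration under stated assumptions, not the CINECA module: each provider is modelled as a hypothetical callable that returns the token's claims or `None`, the mapping uses the standard OIDC `sub` claim, and dynamic account creation is reduced to adding an entry to a dictionary.

```python
from typing import Callable, Dict, Optional, Sequence

# A validator stands in for a call to an OIDC provider: it returns the
# token's claims, or None when the provider does not recognise the token.
Validator = Callable[[str], Optional[Dict[str, str]]]


def authenticate(password_field: str,
                 providers: Sequence[Validator],
                 accounts: Dict[str, str],
                 auto_create: bool = False) -> Optional[str]:
    """Return the local account for a token, or None if access is denied."""
    for validate in providers:
        claims = validate(password_field)
        if claims is None:
            continue  # not this provider's token: try the next one in the stack
        subject = claims["sub"]
        if subject in accounts:
            return accounts[subject]
        if auto_create:
            # Dynamic account creation, reduced here to a dictionary entry.
            local_name = claims.get("preferred_username", subject)
            accounts[subject] = local_name
            return local_name
        return None  # valid token, but no mapping and no dynamic creation
    return None  # no provider accepted the token


# Two fake providers (hypothetical) that each know exactly one token.
def provider_a(token: str) -> Optional[Dict[str, str]]:
    if token == "token-a":
        return {"sub": "a-123", "preferred_username": "alice"}
    return None


def provider_b(token: str) -> Optional[Dict[str, str]]:
    if token == "token-b":
        return {"sub": "b-456", "preferred_username": "bob"}
    return None
```

Trying the providers in order mirrors the stacking of PAM modules: a token rejected by one provider simply falls through to the next.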

Asynchronous file handling with iRODS tape resources [slides] [UU video] [YouTube video]
Arthur Newton - SURFsara

Tiered storage systems comprised of a disk cache staging area and a tape library are cost-effective solutions ideal for long-term storage.

Last year, we presented a way to build a tiered storage system which employs tape storage in the backend transparent to iRODS. Since our tape archive system already has a disk cache, we explicitly did not make use of an iRODS compound resource for integration which would have required an additional cache layer.

Tape storage is inherently asynchronous, meaning that data can reside in different states: online on disk cache, or offline only on tape. If the data is offline, it is not readily available and needs to be staged to disk. Our solution in iRODS can (automatically) trigger state transitions between offline and online. However, the user still experienced the asynchronicity of the different states of data, which left room for improvement in the user-friendliness of handling such data.

To fill this gap, we implemented a set of command line applications that make it easier for the user to download, upload, and retrieve information about the state of data. The iRODS Python client provides the base of the tool, which removes the need for icommands or the DMF tape tools and as such also broadens compatibility across different systems. The application is split into a set of CLI tools and a daemon-like application that handles requests and file transfers in the background. The daemon is automatically spawned as a non-root process upon the first request and stopped when idle for a specific time.

The command line tool can be extended to other types of storage resources with similar asynchronous staging of data. Additionally, the performance can possibly be improved by allowing for parallel transfer of data.
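The split between CLI tools and a background daemon that starts on the first request and stops when idle can be sketched with a worker thread and a queue. This is a minimal sketch under stated assumptions, not the SURFsara tool: the class name, the in-memory `states` dictionary, and the instant "staging" step are all hypothetical stand-ins for real iRODS and tape operations.

```python
import queue
import threading


class StagingDaemon:
    """Background worker that brings offline files online on request."""

    def __init__(self, states, idle_timeout=0.05):
        self.states = states              # path -> "online" or "offline"
        self.idle_timeout = idle_timeout  # seconds without work before exit
        self.requests = queue.Queue()
        self.lock = threading.Lock()
        self.worker = None

    def request(self, path):
        """Queue a staging request, spawning the worker if none is running."""
        with self.lock:
            self.requests.put(path)
            if self.worker is None:
                self.worker = threading.Thread(target=self._run, daemon=True)
                self.worker.start()

    def _run(self):
        while True:
            try:
                path = self.requests.get(timeout=self.idle_timeout)
            except queue.Empty:
                with self.lock:
                    if self.requests.empty():
                        self.worker = None  # idle: stop until the next request
                        return
                continue
            if self.states.get(path) == "offline":
                self.states[path] = "online"  # stand-in for a real tape recall
            self.requests.task_done()


states = {"/archive/run1.dat": "offline"}
daemon = StagingDaemon(states)
daemon.request("/archive/run1.dat")
daemon.requests.join()  # block until the request has been handled
```

Taking the lock both when queueing a request and when the worker decides to exit avoids the case where a request arrives just as the idle worker shuts down and is never processed.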

A GA4GH Data Repository Service for native iRODS [slides] [UU video] [YouTube video]
Mike Conway - NIH/NIEHS

This paper and presentation will premiere an implementation of a GA4GH Data Repository Service (formerly the GA4GH Data Object Service) that can run on a base iRODS server. The Data Repository Service implementation uses standard iRODS collections to house a Data Bundle, and utilizes Attribute-Value-Unit metadata to mark data objects and bundles with auxiliary information.

Using this service allows iRODS to integrate with workflow and data access services in the area of genomics and bio-sciences that follow the emerging standards of the GA4GH consortium. These standards have been identified in the NIH Commons effort as an important interoperability standard.

The development approach of this implementation sets a very low barrier of entry and allows any genomic data set stored in iRODS to be exposed via the GA4GH Data Repository Service API. A presentation, paper, and source code release are planned for the User Group Meeting.

SPONSOR MESSAGE [slides] [UU video] [YouTube video]
Niels Rotert - Cloudian

SODAR - the iRODS-powered System for Omics Data Access and Retrieval [slides] [UU video] [YouTube video]
Mikko Nieminen - Berlin Institute of Health

In the past years, a growing number of high-throughput omics assays in the areas of genomics, proteomics, and metabolomics have become widespread in life science research. This creates increasing demand for handling the large amounts of data as well as models for the complex experimental designs. Further challenges include the FAIR principles for making the data findable, accessible, interoperable, and reusable. Collaboration between multiple institutes further complicates data management.

Here we present SODAR (System for Omics Data Access and Retrieval), our effort to fulfill these requirements. The modular system allows for the curation of complex studies with the required metadata as well as for the storage of large bulk data. To facilitate effective and efficient data management workflows, SODAR provides project-based access control, a web-based graphical user interface for data management, programmatic data access, ID management for study objects as well as various tools for omics data management.

The system is based on open source solutions. iRODS is used for large data storage while Davrods allows for providing access through the widely supported HTTP(S) protocol (e.g., for integration with the IGV software). Graphical interfaces and APIs are implemented in Python using the Django framework. Our data model is based on the ISA-Tools data model, and ISA-tab is used as the metadata file exchange format. A transaction subsystem integrates activities spanning both data and metadata. Core parts of SODAR are available as reusable libraries for creating project-based data management systems that share access control with SODAR.

We will demonstrate our flagship rare disease genetics use case, starting from bulk data import to the browsing of study design and metadata and interacting with the data through IGV.

A beta version of SODAR is currently deployed in our institutes. The system will be made available as open source under a permissive license.

iRODS in context: Exploring integrations between iRODS and OwnCloud [slides] [UU video] [YouTube video]
Hylke Koers - SURFsara

Within the Netherlands, iRODS is gaining substantial traction with universities and other research institutes as a tool to help manage large amounts of heterogeneous research data. In this context, iRODS is usually used as middleware, providing value through data virtualization, metadata management and/or rule-driven workflows. This is then typically combined with other tools and technology to fully support the diverse needs of researchers, data stewards, IT managers, etc.

While integrations with other RDM tools are facilitated by iRODS' flexibility, a significant amount of work is usually still required to develop and test them with users in their specific context. For this reason, SURF – as the collaborative ICT organisation for Dutch education and research – sees a role for itself to spearhead the development of such integrations as that effectively means pooling of resources which lowers the collective development cost and accelerates the pace of adoption.

In this contribution, we will focus on a recent project undertaken by SURF to explore the integration between OwnCloud and iRODS. OwnCloud is an open-source, "sync and share" solution to manage data as an individual or as a research team. OwnCloud is the technology behind two successful existing SURF products: SURFdrive and Research Drive. Offering a GUI, versioning, off-line sync and link-based sharing, its functionality is in many ways complementary to iRODS. This makes integrating the two technologies attractive, yet there are several challenges in terms of file inventory synchronization, metadata management, and access control. In the demo, we'd like to share how we have addressed these challenges and discuss a proposed way forward.

As an outlook into future work, this integration could be extended to support seamless publication of research data in trusted, long-term data repositories. Existing data publication workflows have many common tasks, but also significant variance in the "details" of how these tasks are strung together and how they need to be operationalized. To address this balance, we are exploring an approach that essentially abstracts data publication tasks into an overarching workflow framework, so as to allow for flexibility yet also benefit from standards and common patterns.

iRODS and use case of Bristol-Myers Squibb to manage genomics data [slides] [UU video] [YouTube video]
Oleg Moiseyenko - Bristol-Myers Squibb Company

This presentation will discuss how iRODS helps Bristol-Myers Squibb manage petabytes of NGS data by synchronizing it from different on-premise locations with AWS cloud storage, and the challenges this presents.

In the days of high-speed internet and cloud computing, the old paradigms in drug discovery have gone through significant changes. The shifts in the IT landscape and medicine economics force "big pharma" companies to seek better routes for innovation and efficiencies. Since DNA has gone digital, various scientific communities around the globe routinely run multiple tests, take high-resolution medical images, and use big data in health research on a daily basis.

New cloud computing infrastructure contributes to swift increases in research partnerships in bioanalysis via collaboration consortia, which in the end leads to data fragmentation in terms of sources, data types, and storage. In addition, this data should also be securely stored, well maintained throughout its lifetime, and access-controlled in accordance with the latest local regulations and data compliance requirements.

Presentations (June 27)

Welcome [UU video] [YouTube video]
Jason Coposky - iRODS Consortium

iRODS UGM 2019 Keynote [slides] [UU video] [YouTube video]
Ton Smeele - Utrecht University

iRODS Capabilities: Indexing and Publishing [slides] [UU video] [YouTube video]
Jason Coposky - iRODS Consortium

This presentation shares the fourth and fifth iRODS Capabilities with working code. Indexing and Publishing are very similar in that they instrument the iRODS Zone to watch for particular metadata and then engage external API endpoints to populate external systems. The candidate code demonstrates indexing via Elasticsearch and publishing via Data.World.
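The pattern named above, watching for particular metadata and then calling an external API, can be illustrated with a small hook. This is a sketch under stated assumptions rather than the Consortium's candidate code: the trigger attribute `example::index`, the endpoint URL, and the function name are hypothetical, and the hook only builds the HTTP request (an Elasticsearch-style `PUT /<index>/_doc/<id>`) without sending it.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

INDEX_ATTRIBUTE = "example::index"    # hypothetical trigger attribute
SEARCH_URL = "http://localhost:9200"  # hypothetical search endpoint


def on_metadata_added(logical_path: str,
                      attribute: str,
                      value: str) -> Optional[urllib.request.Request]:
    """Build an indexing request when the trigger attribute is applied."""
    if attribute != INDEX_ATTRIBUTE:
        return None  # unrelated metadata: nothing to do
    document = json.dumps({"logical_path": logical_path}).encode("utf-8")
    # The logical path doubles as the document id, so it must be URL-encoded.
    doc_id = urllib.parse.quote(logical_path, safe="")
    return urllib.request.Request(
        url=f"{SEARCH_URL}/{value}/_doc/{doc_id}",
        data=document,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = on_metadata_added("/tempZone/home/alice/a.txt", "example::index", "papers")
```

In this sketch the AVU value names the target index; passing the returned object to `urllib.request.urlopen` would perform the call against a running service.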

NFSRODS: Presenting iRODS as NFSv4.1 [slides] [paper] [UU video] [YouTube video]
Kory Draughn - iRODS Consortium

NFSRODS is a new iRODS Client that presents iRODS as NFSv4.1. This release handles multi-owner mapping to POSIX permissions and has been deployed within an enterprise environment.

SPONSOR MESSAGE [slides] [UU video] [YouTube video]
Tom D'Hont - SUSE

Integration of iRODS data workflows in an extensible HTTP REST API framework [slides] [paper] [UU video] [YouTube video]
Mattia D'Antonio - CINECA

We developed a set of HTTP REST APIs on top of iRODS to help users from different communities automate both ingestion and retrieval data workflows. We built a common REST API layer implementing basic functionalities, including the interaction with iRODS, within an extensible framework (RAPyDO: Rest Apis with Python on Docker) that we developed and adopted to build community-specific REST APIs.

In more detail, we are collaborating with the EUropean DATa infrastructure EUDAT; European projects like EOSC-hub and SeaDataNet; and national initiatives in collaboration with Telethon (a non-profit organization for genetic diseases research) and SIGU (Italian Society for Human Genomics).

All endpoints are written in Python using the Flask framework and served by a uWSGI web server deployed within a Docker container.

We created a wrapper of the Python iRODS client (PRC) to let both the core framework and community-specific APIs easily interact with iRODS by supporting all main authentication protocols (native passwords, Pluggable Authentication Modules (PAM), and Grid Security Infrastructure (GSI) by the Globus Toolkit). To be able to support all required authentication methods, we also contributed to the PRC by developing authentication modules for both GSI and PAM.

Most of the iRODS-based functions that we developed can be mapped to corresponding icommands like ils, iget, iput, imv, icp, imeta, irule, and iticket, but more complex functionalities have also been realized, for instance streamed read/write operations from/to network sockets.

To be able to execute data intensive and complex workflows, we also introduced an asynchronous layer implemented on Celery, a task management queue based on distributed message passing.
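The asynchronous layer follows a common pattern: an endpoint submits a long-running job, returns a task identifier immediately, and the client polls for the result. The sketch below shows that pattern with the standard library's `ThreadPoolExecutor` as a stand-in for Celery, so it is not the authors' implementation; the `TaskQueue` class, its methods, and the state names are assumptions made for this example.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Optional


class TaskQueue:
    """Minimal submit-and-poll task layer (thread pool instead of Celery)."""

    def __init__(self, workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._tasks: Dict[str, Future] = {}

    def submit(self, func: Callable[..., Any], *args: Any) -> str:
        """Start the job in the background and return its task id at once."""
        task_id = uuid.uuid4().hex
        self._tasks[task_id] = self._pool.submit(func, *args)
        return task_id

    def status(self, task_id: str) -> Dict[str, Any]:
        """Report the state of a task without blocking."""
        future = self._tasks[task_id]
        if not future.done():
            return {"state": "PENDING"}
        error = future.exception()
        if error is not None:
            return {"state": "FAILURE", "error": str(error)}
        return {"state": "SUCCESS", "result": future.result()}

    def wait(self, task_id: str, timeout: Optional[float] = None) -> Any:
        """Block until the task finishes and return (or raise) its outcome."""
        return self._tasks[task_id].result(timeout=timeout)


tasks = TaskQueue()
job = tasks.submit(sum, [1, 2, 3])  # stand-in for a long data workflow
```

A REST endpoint would return `job` to the caller right away and expose `status(job)` through a second endpoint, which keeps the HTTP request short even when the underlying transfer is not.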

Rodinaut: A tool for metadata management [slides] [UU video] [YouTube video]
Othmar Weber - Bayer

For scientists who need to ensure compliance with data security/privacy and find information in iRODS, Rodinaut is a web application that enables viewing and managing metadata. Unlike the existing command-line tool, our product is self-explanatory, easy and fast to use, and improves the user experience with iRODS.

More than Just Load Balancing iRODS Using HAProxy [slides] [UU video] [YouTube video]
Tony Edgin - University of Arizona

The iRODS community has demonstrated that HAProxy works well for horizontally scaling iRODS catalog providers, but HAProxy can provide other functionality. This talk will demonstrate how HAProxy can access information about the user and client application and use it to control quality of service through throttling per user access, providing fast lanes for certain applications, and routing data transfer sessions to specific catalog providers. The talk will also show how HAProxy can be configured to filter out port scanners. Finally, the talk will show how to configure HAProxy to allow a single canonical host name for iRODS and other services like davrods.
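The routing decisions mentioned above depend on reading the user and client application from the start of a connection. The sketch below shows that decision logic in Python rather than as HAProxy configuration, and it rests on loudly stated assumptions: that the opening bytes of a connection carry XML-style fields, that the tag names `clientUser` and `option` identify the user and the application, and that the lane names and the example user and application sets are hypothetical.

```python
import re
from typing import Dict

# Assumed tag names for the user and the client application.
TAG = re.compile(rb"<(clientUser|option)>([^<]*)</\1>")

FAST_LANE_APPS = {"ingest-pipeline"}  # hypothetical application name
THROTTLED_USERS = {"anonymous"}       # hypothetical user to rate-limit


def inspect_startup(payload: bytes) -> Dict[str, str]:
    """Extract the user and application fields from the first payload."""
    return {m.group(1).decode(): m.group(2).decode()
            for m in TAG.finditer(payload)}


def choose_lane(payload: bytes) -> str:
    """Pick a lane: reject, fast, throttled, or default."""
    info = inspect_startup(payload)
    if not info.get("clientUser"):
        return "reject"  # no user field at all, e.g. a port scanner
    if info.get("option") in FAST_LANE_APPS:
        return "fast"
    if info["clientUser"] in THROTTLED_USERS:
        return "throttled"
    return "default"


startup = (b"<StartupPack_PI><clientUser>alice</clientUser>"
           b"<option>ingest-pipeline</option></StartupPack_PI>")
lane = choose_lane(startup)
```

Rejecting payloads that lack a user field is one simple way to model the port-scanner filtering, since a scanner is unlikely to send a well-formed opening message.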

Migrating data when decommissioning PetaBytes of storage [slides] [UU video] [YouTube video]
John Constable - Wellcome Sanger Institute

Wellcome Sanger Institute has ~18PB of genomic data in 399 resources on 76 resource servers across six Zones. This is the story of what happened when we needed to retire some of the servers.

Surgical Critical Care Initiative: Leveraging iRODS to Accomplish Multi-Site Data Collection, Harmonization, and Analytics to Generate Clinical Decision Support Tools [slides] [UU video] [YouTube video]
Andy MacKelfresh - Duke University

To support the development of Clinical Decision Support Tools (CDSTs) in both civilian and military health systems, the Surgical Critical Care Initiative (SC2i), a Department of Defense funded consortium of Federal and non-Federal institutions, leverages iRODS to harmonize clinical, laboratory, and bio-bank data and centralize 30+ million high-quality data elements to enhance complex decision making in acute and trauma care.

iRODS UGM 2020 Announcement [slides] [UU video] [YouTube video]
Jason Coposky - iRODS Consortium

We are headed back to the University of Arizona in 2020.

iRODS S3 Resource Plugin: Cacheless and Detached Mode [slides] [paper] [UU video] [YouTube video]
Justin James - iRODS Consortium

The iRODS S3 Resource Plugin will soon be able to be configured without being a child of a compound resource and without a sibling cache resource. This standalone functionality will significantly reduce administrative overhead.

Lightning Talk - Monitoring iRODS [UU video] [YouTube video]
John Constable - Wellcome Sanger Institute

Lightning Talk - iRODS CI Demo [UU video] [YouTube video]
Jaspreet Gill - iRODS Consortium

Lightning Talk - iRODS Install on OLPC schoolserver (Intel NUC) [UU video] [YouTube video]
Tony Anderson - Care4Kids

Lightning Talk - Why Uploading tar Files Is Terrible [UU video] [YouTube video]
John Constable - Wellcome Sanger Institute

Lightning Talk - Introducing DataHog [UU video] [YouTube video]
Tony Edgin - University of Arizona

Lightning Talk - NetCDF Header Extraction [UU video] [YouTube video]
Daniel Moore - iRODS Consortium

Lightning Talk - Parallel Transfer Engine [UU video] [YouTube video]
Kory Draughn - iRODS Consortium

Lightning Talk - iRODS in Cloudy Cluster [UU video] [YouTube video]
Boyd Wilson - Omnibond