iRODS

iRODS User Group Meeting 2024

Amsterdam, Netherlands
Hosted by SURF
May 28 - May 31, 2024

Original Agenda (pdf)

Group Photo

Videos

Presentations (May 29)

iRODS UGM 2024 Keynote [slides] [video]
Erik van den Bergh - Wageningen University / Yoda Consortium

iRODS Consortium Update [slides] [video]
Terrell Russell - iRODS Consortium

iRODS Technology Update [slides] [video]
Kory Draughn - iRODS Consortium

iRODS replication management and our future iRODS world at Wellcome Sanger Institute [slides] [video]
Sai Poomdla - Wellcome Sanger Institute

At Wellcome Sanger Institute, we manage 20 petabytes of genomic data using iRODS. We currently run iRODS 4.2.7 in production, and we will share our tales of upgrading it to a later version. In various scenarios, such as a hiccup in the network connection or corruption of a data object in transit, the replica creation process fails, leaving data objects with single, dirty, or triple replicas. To mitigate these issues, we have developed in-house solutions using the Python iRODS Client. We are also evolving into a new iRODS world, in which we want to replace our storage servers with a single large file system and allow our users to read directly from it instead of downloading files into a high-performance file system.
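As a rough illustration of the kind of check such in-house tooling might perform, the sketch below classifies the replicas of one data object. It is not the Sanger implementation: the `Replica` record, the `is_good` flag, and the expected count of two replicas are all assumptions made for this example, and a real tool would read the replica list from the iRODS catalog through a client library.

```python
from collections import namedtuple

# Hypothetical, simplified replica record (assumption for this sketch).
Replica = namedtuple("Replica", ["number", "resource", "is_good"])

# Assumption: policy calls for exactly two replicas per data object.
EXPECTED_REPLICAS = 2


def classify(replicas, expected=EXPECTED_REPLICAS):
    """Return a list of problem labels for one data object's replicas."""
    problems = []
    if len(replicas) < expected:
        problems.append("under_replicated")   # e.g. a single replica
    elif len(replicas) > expected:
        problems.append("over_replicated")    # e.g. three replicas
    if any(not r.is_good for r in replicas):
        problems.append("dirty")              # at least one replica not marked good
    return problems
```

For example, `classify([Replica(0, "resc_a", True)])` reports the object as under-replicated, which a repair job could then act on.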

Python package for metadata schemas [slides] [video]
Mariana Montes - KU Leuven
Ronny Moreas - KU Leuven

In this talk we will present a Python package to validate, write and read structured metadata based on schemas. The schemas are described in a JSON file following the specifications of the metadata schema manager developed in the context of the ManGO portal, but this implementation is independent of the web application. Given a schema, users can provide hierarchical metadata in a Python dictionary, validate its contents and add it to an iRODS data object or collection as AVUs with the appropriate namespacing. The same package can read the AVUs back and parse them into a Python dictionary that preserves the original hierarchy.
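The round trip between a nested dictionary and namespaced AVUs can be sketched as follows. This is a minimal illustration, not the package's actual API: the function names, the dot-separated naming scheme, and the `demo.book` prefix are assumptions, validation against a schema is omitted, and values are stored as strings.

```python
def to_avus(metadata, prefix):
    """Flatten a nested dict into (attribute, value, unit) tuples."""
    avus = []
    for key, value in metadata.items():
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            # Nested fields extend the namespace of their parent.
            avus.extend(to_avus(value, name))
        else:
            avus.append((name, str(value), ""))
    return avus


def from_avus(avus, prefix):
    """Rebuild the nested dict from AVU tuples that share the prefix."""
    result = {}
    for name, value, _unit in avus:
        if not name.startswith(prefix + "."):
            continue  # AVU belongs to another namespace
        parts = name[len(prefix) + 1:].split(".")
        node = result
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return result


metadata = {"title": "Sample", "author": {"name": "Ada", "email": "ada@example.org"}}
avus = to_avus(metadata, "demo.book")
restored = from_avus(avus, "demo.book")
```

This naive scheme assumes that field names contain no dots and that each field occurs once; repeated fields would need extra bookkeeping.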

iRODS-based system turbocharged next-gen sequencing analysis during pandemic and beyond [slides] [video]
Robert Verhagen - Dutch National Institute for Public Health and the Environment (RIVM)
Erwin van Wieringen - Dutch National Institute for Public Health and the Environment (RIVM)

The National Institute for Public Health and the Environment (RIVM) has numerous projects in various scientific domains that generate next generation sequencing data. Bioinformatics plays an important role in analysing and interpreting this sequencing data. To support these analyses we developed a platform that consists of a High Performance Compute (HPC) cluster, a Linux Scientific Workspace for software development and a Data Management System (DMS) based on iRODS. On top of this DMS, we also created a Job Engine: a tightly integrated process automation tool that manages the automated analyses of sequencing data on the HPC.

The development of this robust and scalable platform started two years prior to the COVID-19 pandemic. As the pandemic unfolded, the RIVM found itself tasked with extensive analyses of vast amounts of COVID-19 surveillance data. We were able to quickly and seamlessly scale the system up to meet this increased demand.

For better automation and user experience, we created additional components within and alongside iRODS. The aforementioned Job Engine and a tool to import data from a variety of internal and external sources are used for automation purposes, while an extensive web interface with collection and document viewer functionalities, a data lineage viewer and (meta)data search capabilities were added for easy user interaction. We also created rules to pack and archive iRODS collections to an iRODS consumer at SURF (collaborative organisation for IT in Dutch education and research). With the DMS and these additional components, we have made the data analysis component of many vital processes within the RIVM reproducible.

By leveraging the iRODS software, both the efficiency and the quality of data analyses within the RIVM have been improved significantly. The platform has proven to be crucial, notably during the pandemic, for processing otherwise unimaginable amounts of data.

iRODS Build and Test v9: Automation via GitHub and Kubernetes [slides] [video]
Phil Owen - Renaissance Computing Institute (RENCI)
Terrell Russell - iRODS Consortium
Kory Draughn - iRODS Consortium
Alan King - iRODS Consortium

A reliable and efficient testing environment is vital to foster the quality of iRODS products and enhance developer productivity. It has become a moderately complex and time-consuming undertaking to create, maintain, and execute testing suites across numerous iRODS environment variants in order to comprehensively test iRODS source code.

Typically, unit and topology testing of iRODS components begins with building packages from modified source code. Next, iRODS packages are installed on various target environments that are defined within a matrix of versioned operating systems and DBMS types. Finally, test suites are executed in each environment where log files of all test results are compiled and analyzed.

In this talk we will discuss our newest approach (and future vision) of automating iRODS testing using GitHub and Kubernetes. This discussion will cover some details on how developers will request testing runs, how the build/test process works, and the forensics performed on the test results.
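The environment matrix described above can be pictured with a small sketch. The operating system and database names below are placeholders chosen for illustration, not the Consortium's actual test targets, and the helper functions are hypothetical.

```python
from itertools import product

# Hypothetical platform and database lists, for illustration only.
OPERATING_SYSTEMS = ["ubuntu-22.04", "debian-12", "rockylinux-9"]
DATABASES = ["postgres", "mysql"]


def build_matrix(operating_systems, databases):
    """Return one environment description per (OS, DBMS) combination."""
    return [
        {"os": os_name, "dbms": db}
        for os_name, db in product(operating_systems, databases)
    ]


def summarize(results):
    """Group environments by outcome, given {(os, dbms): passed} results."""
    summary = {"passed": [], "failed": []}
    for (os_name, db), passed in sorted(results.items()):
        summary["passed" if passed else "failed"].append(f"{os_name}/{db}")
    return summary


matrix = build_matrix(OPERATING_SYSTEMS, DATABASES)
```

Each entry of `matrix` would correspond to one environment in which packages are installed and the test suites are run; `summarize` stands in for the analysis of the collected results.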

Sharing data in a multi-system multi-role environment centered on iRODS [slides] [video]
Claudio Cacciari - SURF
Eduard Klapwijk - Erasmus University Rotterdam
Simone Mulder - Erasmus University Rotterdam
Mark Mulder - Erasmus University Rotterdam
Stephan Heunis - Erasmus University Rotterdam
Frans van der Zijde - Erasmus University Rotterdam
Wouter Timmer - Erasmus University Rotterdam

SURF, the cooperative association of Dutch educational and research institutions, offers data infrastructure and services to the research communities. Some of its services are based on iRODS and are often used as building blocks for data platforms. One increasingly common architectural component in those platforms is a web portal where researchers can discover data using project-specific queries. Once the data are found, they are made available to the researcher either directly, for example with a download link, or indirectly, by triggering a copy to a computing environment where they are analyzed. Implementing such a workflow is time-consuming. Its long-term maintenance is often jeopardized by the limited support available within the project, and design choices tailored too closely to one use case make its adoption by other organizations difficult. We think that it is possible to model that workflow generically, as a reusable modular component that is flexible enough to support even the more stringent requirements associated with sensitive data. The component relies on iRODS and links together multiple web portals and repositories through an API layer based on FastAPI. We present here a proof of concept developed within the GUTS project, in collaboration with the project’s data management team and the research support.

iRODS Security Challenges Within an Enterprise Environment [slides] [video]
Ryan Blome - Dow
Simon Cook - Dow

Dow's focus on data security necessitates a tailored approach for our internal users, leading to the development of the Scientific Data Management System (SDMS) Query Tool (SQT), a user-friendly tool designed to facilitate secure access to specific datasets. The current gap with Metalnx is that it gives general users too much control over modifying data and collections. Additionally, it is difficult to synchronize iRODS users with our existing Azure Security groups for permission management. This talk outlines the development of a querying tool that uses the iRODS C++ API as a backend to communicate with iRODS. The talk will highlight the need for a robust security architecture for enterprise-scale applications and where we hope to take the project in the future (HTTP API, etc.).

Testing iRODS-based applications: Experiences with Yoda [slides] [video]
Sirjan Kaur - Utrecht University
Claire Saliers - Utrecht University
Lazlo Westerhof - Utrecht University
Sietse Snel - Utrecht University

Yoda is a research data management system based on iRODS that enables researchers to deposit, share, publish and preserve data. It is developed at Utrecht University, and used at multiple universities, as well as other organizations. In this contribution, we share our approach for testing Yoda and planned future work.

Our testing approach primarily revolves around automated scripted testing, encompassing unit tests, integration tests, API tests and UI tests. Additionally, we have explored the use of scriptless testing and fuzzing to supplement scripted tests.

Planned future work includes improving test coverage, introducing accessibility tests and exploring ways to improve reliability of UI tests.

Drawing from our experiences, we share recommendations regarding how to improve testability and reliability of iRODS-based applications.

iRODS Build and Packaging: 2024 Update [slides] [video]
Markus Kitsinger - iRODS Consortium

This talk will provide an update on our journey to 'Normal and Boring' with regard to CMake, libstdc++, and building for various platforms.

iRODS HTTP API v0.3.0 with OpenID Connect [slides] [video]
Kory Draughn - iRODS Consortium
Martin Flores - iRODS Consortium

The new iRODS HTTP API was introduced last year as an idea to increase developer accessibility to an iRODS namespace, along with early research into how OpenID Connect would fit within the iRODS ecosystem. This talk will share updates through the first three releases, including optimizations and having the iRODS server be an OpenID Connect Protected Resource.

Safeguard your sensitive data in iRODS using data encryption feature available in GoCommands [slides] [video]
Illyoung Choi - CyVerse / University of Arizona
Edwin Skidmore - CyVerse / University of Arizona
Tony Edgin - CyVerse / University of Arizona
Nirav Merchant - CyVerse / University of Arizona

iRODS has been utilized across various scientific domains for storing and sharing research data. However, certain domains, such as health sciences, require stringent confidentiality measures before they can adopt iRODS as their data management solution. Compliance with regulations such as HIPAA (Health Insurance Portability and Accountability Act) necessitates end-to-end encryption, meaning that data must be encrypted for both storage and transmission. While iRODS provides encryption in transit using SSL (Secure Sockets Layer), the responsibility for encrypting data at rest lies with the iRODS infrastructure provider, which presents a significant limitation. Manually encrypting data before upload and decrypting it after download adds steps and complexity.

To address these challenges and to meet the requirement for encrypted data storage, we have simplified the process of encrypting and decrypting data by making it a built-in capability of GoCommands. GoCommands is a cross-platform command-line utility for iRODS, offering data access and management capabilities similar to iCommands. The encryption-at-rest algorithm employs AES-256-CTR to encrypt both file names and content before uploading, thereby ensuring that data remains encrypted within iRODS. When encrypted files are listed or downloaded, they are automatically decrypted using the encryption key provided by the user. Since GoCommands and WinSCP share the same encryption algorithm, users who are not familiar with the command-line interface can use WinSCP, a GUI-based tool, instead. This end-to-end encryption functionality requires no server-side changes to iRODS, ensuring compatibility with existing iRODS servers.

This new data encryption feature will empower researchers to utilize iRODS for storing and accessing their sensitive data and meeting regulatory compliance, thereby facilitating broader adoption of iRODS across various research domains.

Presentations (May 30)

ManGO Portal (iRODS PRC based web UI): Feature updates, including extension points and workflows [slides] [video]
Paul Borgermans - KU Leuven
Mariana Montes - KU Leuven
Ingrid Barcena Roig - KU Leuven

As our Research Data Management user base is growing, so is our in-house software stack, including ManGO, the iRODS PRC based web portal. Over the past year, more extension points were added in order to tailor the functions and user experience to the various sets of requirements. A more generalised base for building workflows and data processing pipelines was also added; these can be triggered by events inside iRODS as well as by external drivers (including ad hoc workflows).

Besides these rather fundamental aspects that make ManGO usable for many iRODS installations, many smaller features and improvements were added as well.

Optimizing Data Management for AI: The Synergistic Role of Metadata-Hub and iRODS [slides] [video]
David Cerf - GRAU DATA

Metadata-Hub, when combined with iRODS, becomes a transformative solution for managing escalating volumes of machine-generated and unstructured data. It places embedded metadata at the heart of storage management and data preparation, offering a more efficient, cost-effective, and scalable approach. At the time of data creation, Metadata-Hub extracts embedded metadata, allowing raw data files to be immediately archived to affordable storage solutions. This drastically cuts the need for expensive primary storage. Metadata-Hub becomes the repository for all metadata and makes it readily available. It reduces data preparation time for computational applications, analytics, and artificial intelligence (AI), accelerating the transition from data acquisition to insight derivation. Furthermore, the integration of Metadata-Hub into iRODS provides a robust, contextually-aware framework for automated and intelligent data workflows driven by embedded metadata, including archiving, replication, and distribution.

RSpace - iRODS integration: Update and next steps [slides] [video]
Rory Macneil - Research Space
Ander Astudillo - SURF
Tilo Mathes - Research Space

The Stage 1 integration, presented at the 2023 UGM, ensures that link integrity is maintained when external files linked to RSpace documents change location, solving the ‘broken links’ problem and enhancing the reliability of the research record maintained in RSpace. This year we will present Stage 2 of the integration, which enables export of datasets and associated metadata from RSpace to iRODS. This will open the possibility of associating data from RSpace with a potentially vast range of other data, e.g. all the data tracked by iRODS and captured in a university’s data storage facilities, or all the data in a multi-university or national project such as a biodiversity initiative. The RSpace data in iRODS will be discoverable via the associated metadata, enabling meaningful association with related data from other research resources, and re-use and repurposing of the data via, e.g., AI engines.

We will also introduce an idea for a Stage 3 integration that will enable RSpace to become a 'research data management front end' for iRODS. The working idea for this concept is as follows: The RSpace Gallery would act as a front end for iRODS to allow upload of data to iRODS (possibly even defaulting to storage in iRODS when files are added to the Gallery). Previewing/inline viewing of certain data types from iRODS should be possible, for example for image files (mirroring the current functionality for files stored in the RSpace Gallery).

Finally, in order to allow retrieval of data from iRODS (outside of RSpace), RSpace should be able to update a specific metadata field in the iRODS catalogue when files are linked to experiment entries in RSpace, so that querying the metadata catalogue for a particular document identifier could retrieve all data associated with that document (similar to seeing the linked documents in RSpace itself).

iRODS S3 API v0.2.0 with Multipart [slides] [video]
Justin James - iRODS Consortium

The new iRODS S3 API can now present iRODS via the S3 protocol. This talk will share details about the first two releases, the implementation of various endpoints, and the state of Multipart transfers.

Streamlining iRODS: Kafka-based Data Pipelines [slides] [video]
Peter Verraedt - KU Leuven
Jo Wijnant - KU Leuven

We sketch how changes in the iRODS catalog can be captured in real time into Apache Kafka, and how we materialize those master records as a read-only 'view' in OpenSearch using the Apache Flink stream processor. We show that the same tools can be used to continuously monitor project usage within iRODS. As a last application, we sketch how audit logs can be collected and processed, and show future challenges.
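The idea of materializing a stream of catalog changes into a read-only view can be illustrated without any streaming infrastructure. In the sketch below, a plain list stands in for the Kafka topic and a dictionary stands in for the OpenSearch index; the event shape (`op`, `path`, `fields`) and the path layout used to derive a project name are assumptions made for this example, not the speakers' schema.

```python
def apply_event(view, event):
    """Apply one change event to the view, keyed by logical path."""
    path = event["path"]
    if event["op"] == "delete":
        view.pop(path, None)
    else:
        # "create" and "update" both upsert the latest field values.
        view.setdefault(path, {}).update(event["fields"])
    return view


def materialize(events):
    """Replay events in order to build the current view."""
    view = {}
    for event in events:
        apply_event(view, event)
    return view


def usage_by_project(view):
    """Sum sizes per project, assuming paths like /zone/home/<project>/..."""
    usage = {}
    for path, record in view.items():
        project = path.split("/")[3]
        usage[project] = usage.get(project, 0) + record.get("size", 0)
    return usage


events = [
    {"op": "create", "path": "/zone/home/projA/a.txt", "fields": {"size": 10}},
    {"op": "create", "path": "/zone/home/projB/b.txt", "fields": {"size": 5}},
    {"op": "update", "path": "/zone/home/projA/a.txt", "fields": {"size": 30}},
    {"op": "create", "path": "/zone/home/projA/c.txt", "fields": {"size": 2}},
    {"op": "delete", "path": "/zone/home/projB/b.txt"},
]
view = materialize(events)
usage = usage_by_project(view)
```

Because the view is derived only from the event log, it can be rebuilt at any time by replaying the events, and the same stream can feed a usage monitor like `usage_by_project`.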

DAViDD: Initial data management solution for UNC's READDI AViDD Center [slides] [video]
Terrell Russell - iRODS Consortium

The Rapidly Emerging Antiviral Drug Discovery Initiative AViDD Center (READDI-AC) is an NIH-funded public-private partnership focused on developing effective antiviral drugs to combat emerging viruses. The READDI-AC is one of nine Antiviral Drug Discovery (AViDD) Centers funded by the National Institute of Allergy and Infectious Diseases (NIAID) at the National Institutes of Health.

Data management, including preparation, storage, and analysis, is handled with DAViDD, a combination of iRODS, the HTTP API, and a custom Angular application. This talk will discuss the development of laboratory and researcher use cases; designing, building, and staging the application; and its early production use.

The LEXIS Platform V2 and using iRODS for large scale data management [slides] [video]
Martin Golasowski - IT4Innovations / VSB-TU Ostrava
Mohamad Hayek - Leibniz Supercomputing Centre

The LEXIS Platform allows easy access to HPC and Cloud resources with complex workflow orchestration and distributed data management based on iRODS. In our talk we present the improved version of the LEXIS Platform, including a centralised metadata index, direct staging from iRODS zones and application container support. Extensions of the LEXIS data management features are planned within the Horizon Europe EXA4MIND project, which focuses on handling large amounts of data between various data sources, such as databases or object storage, and HPC infrastructure, including data publication features following FAIR principles. We also describe various use cases for iRODS in the HPC environment, including using iRODS as a data transfer gateway and preservation service based on EUDAT B2SAFE, and testing of new iRODS interfaces such as the S3 and HTTP APIs.

A Rust Library Crate for iRODS [slides] [video]
Phillip Davis - Appalachian State University
Best Student Technology Award Winner

Rust has a number of features that make it desirable for an iRODS client. For example, its strong type system and the borrow checker provide compile-time guarantees against unexpected runtime errors. I present an iRODS client library written in Rust. Connections are presently coupled with their connection pool, implemented using the deadpool library, but could be used independently with slight additions. However, deadpool allows connections to make limited use of async. Connection pools are generic over iRODS protocol encoding (currently only XML is supported), transport protocol, and authentication strategy. Message serialization and deserialization will only allocate heap memory if users explicitly opt in by using message types which own their data. Finally, TLS/SSL is delegated to operating system implementations using the native-tls library.

Integration of iRODS in a Federated IT Service through HTTP and Python API [slides] [video]
Gautier Debaecker - CC-IN2P3
Mathia Pagani - CC-IN2P3

The Federated IT Service (FITS) project, a collaborative endeavor between the IN2P3 (Institut national de physique nucléaire et de physique des particules) computing center and the French national HPC center IDRIS (Institut du développement et des ressources en informatique scientifique), addresses the challenge of managing the escalating data volumes generated by research infrastructures. The project aims to consolidate computing and storage resources while maintaining control over hosting expenses and minimizing the ecological footprint of digital technologies.

Within the FITS project, iRODS was selected as the storage pooling solution, leveraging its established use within the IN2P3 Computing Centre. This implementation enables project users to seamlessly access their data without being aware of its physical location. The project outlines three primary access interfaces:

1. A 'simple' utilization through iCommands for streamlined data sending and processing via the command line interface.

2. A Python API-based graphical interface for local data management and processing.

3. A graphical/web interface utilizing the HTTP API, accessible through a user portal, facilitating user authentication through identity managers such as Indigo IAM or Keycloak. Initial functionalities will include data viewing and collection creation, with plans for expanded features such as group creation, iRODS rule configurations, and more in subsequent phases. While initial constraints of the HTTP API limit direct browser-based data transfers, the system will enable essential data interactions, marking the beginning of an evolving suite of functionalities for enhanced user accessibility and data management within the FITS project framework.

Update: The Intersection Between Policy-Based Data Management and Emerging Health Science Data Standards [slides] [video]
Deep Patel - NIH / NIEHS
Mike Conway - NIH / NIEHS

iRODS is a domain-agnostic platform. The flexibility of iRODS flows from the fact that policy-based data management is at its core. By encoding metadata standards and management policies appropriate to a research domain into an iRODS data grid, an organization can create a powerful tool for data-driven research.

The Health Science domain is increasingly characterized by the co-location of data and compute, often 'bringing the compute to the data' and increasingly in a hybrid environment that spans cloud providers and bridges premises and cloud computing. This overview will look at GA4GH standards that intersect with the policy-based data management capabilities of iRODS for compute-to-data, access control, and federated search.

iRODS Metadata Templates Working Group: Building Blocks and Lessons Learned [slides] [video]
Terrell Russell - iRODS Consortium

With starts and stops over more than six years, the working group's efforts have spanned a number of ideas and projects. This talk will cover that history, the group's decision making, and some reusable code.

Lightning Talk - Helping users keep filesystems clean [slides] [video]
Emyr James - Centre for Genomic Regulation (CRG)

Lightning Talk - iRODS Dataverse Integration [slides] [video]
Danai Kafetzaki - KU Leuven

Lightning Talk - Bridging iRODS and supercomputing for RDM-driven data-to-compute [slides] [video]
Mher Kazandjian - SURF

Lightning Talk - iRODS: A Financial Perspective [video]
Jan de Graaf - Netherlands Cancer Institute (NKI)

Lightning Talk - iRODS Testing K8s Demo [video]
Phil Owen - Renaissance Computing Institute (RENCI)

Lightning Talk - Metadata and Data Landscape Report - Demo [video]
David Cerf - GRAU DATA

Lightning Talk - iBridges Update and Demo [slides] [video]
Raoul Schram - Utrecht University