Utrecht University
Utrecht, Netherlands
June 13 - June 15, 2017
Conference Videos hosted on LectureNet (Utrecht University)
Conference Videos hosted on YouTube
iRODS UGM 2017 Proceedings (PDF)
Beginner Training
This hands-on workshop demonstrated how to plan and deploy an iRODS 4.2 installation. We explored storage resource composition, metadata operations, and rule development using graphical and command line interfaces.
Advanced Training
The iRODS development team focused on advanced data management topics, design patterns, and system administration and reporting.
Radio astronomy, in particular when using interferometric instruments, has long been one of the most data-intensive research areas, pushing the boundaries of data- and compute-intensive IT infrastructure. The LOFAR telescope, operated by ASTRON, generates 7 PB of science data each year. Its archive is built on EGI infrastructure, and many users are challenged by the complicated technology this introduces. For the Apertif telescope, a major upgrade of the WSRT telescope also operated by ASTRON, we have looked into alternative solutions to improve the usability and user experience of the data management layer, which led us to the iRODS system. The Apertif Long Term Archive (ALTA) is a distributed system responsible for the distribution and long-term storage of both raw and science-ready data products. The data products are generated by the Apertif instrument as well as by science teams that will be processing the data into data cubes, source catalogs, and other derived products. One site will host a petabyte-scale online storage resource to support the observation and processing workflows that are running on connected systems. Another site will provide long-term storage of all data and an alternative distribution point for archived data. By the end of the planned survey, in five years' time, ALTA will be hosting a data volume of 20 PB and tens of millions of data products. Apertif, like LOFAR, is a demonstrator facility for the next-generation radio telescope SKA (Square Kilometre Array), which will bring radio astronomical data handling into the exascale regime. Its technologies and operational models will be evaluated for applicability to the SKA.
Life science depends more and more on the collection and analysis of comprehensive datasets. Concomitantly, most life science research is performed in small, short-term project groups where the same individuals are responsible for both data collection and data management. The widespread call for Open Science has made it increasingly important to disclose data together with a scientific publication. However, the acceptance and usage of digital repositories is still relatively low within life science. This has been attributed to the lack of reward and incentives for depositing one's data (Kidwell et al., 2016). It has also been speculated that the ability to exploit one's own data to the fullest is a prerequisite to sharing them (Borgman, 2012). Therefore, Maastricht UMC+ DataHub aims to provide early help and services in research data management to facilitate the process towards the goals set by Open Science.
In this work we present our design choices and lessons learned in building a central institutional research data management infrastructure across Maastricht UMC+ and Maastricht University, the Netherlands. The services provided by our infrastructure are project data organization and findability, metadata modelling based on the FAIR principles, secure data storage, and the ability to share data easily with external collaborators. All services have been designed to work with federated authentication and with the modern-day data volumes of life science in mind. iRODS plays a key role in this infrastructure as the central namespace for data storage and management. The iRODS rule engine is the central hub connecting all third-party applications, such as source systems, ETL tooling, and the search index.
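This hub role can be pictured with a small policy hook. The sketch below is written in the style of the iRODS Python rule engine plugin and is not the DataHub implementation; the index endpoint, the choice of PEP, and the argument positions are illustrative assumptions (the exact PEP argument layout varies by iRODS version).

```python
# Minimal sketch, in the style of the iRODS Python rule engine plugin:
# after each upload, forward the object's logical path to a search
# index. Endpoint and argument layout are assumptions.
import json
import urllib.request  # assumes a Python 3 rule engine

SEARCH_INDEX_URL = 'http://search-index.local/ingest'  # hypothetical

def pep_api_data_obj_put_post(rule_args, callback, rei):
    # rule_args[2] carries the data-object input for a put operation;
    # its objPath member is the logical path just written.
    obj_path = str(rule_args[2].objPath)
    body = json.dumps({'logical_path': obj_path}).encode()
    try:
        urllib.request.urlopen(SEARCH_INDEX_URL, data=body)
        callback.writeLine('serverLog', 'indexed ' + obj_path)
    except Exception as e:
        callback.writeLine('serverLog', 'indexing failed: ' + str(e))
```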
Providing early help in research data management through a central infrastructure carries the risk of failing to adapt and connect to the local practice of researchers. Therefore, one of the project's goals is to evaluate the infrastructure's ability to entice researchers to start research data management. To this end, the major design goal was to be easily extensible and to build services using a microservice approach. The first version of our infrastructure is currently being evaluated in several research groups, and we aim to improve it continuously based on user feedback.
Research data management (RDM) and the FAIR principles (Findable, Accessible, Interoperable, Reusable) are widely promoted as the basis for a shared research data infrastructure. Nevertheless, researchers involved in next generation sequencing (NGS) still lack adequate RDM solutions. NGS metadata is generally not stored together with the raw NGS data, but kept by individual researchers in separate files. This situation complicates RDM practice. Moreover, the (meta)data does not meet the FAIR principles. Consequently, a central FAIR-compliant repository is required to support genomics research. We have selected iRODS (integrated Rule-Oriented Data System) to implement a sequencing data repository for our organization because it allows data and metadata to be stored together. iRODS serves as scalable middleware to access different storage facilities in a centralized and virtualized way, and it supports different types of clients. This repository will be part of an ecosystem of RDM solutions that cover complementary phases of the research data life cycle in our organization. We also selected Virtuoso to enrich the metadata from iRODS. Virtuoso is a hybrid database engine for the management of relational databases as well as a triplestore for linked data. The metadata in the iCAT and the ontology in Virtuoso are kept synchronized through the enforcement of strict data manipulation policies. We have implemented a prototype for a first use case: preserving raw sequencing data for one research group. Three iRODS client interfaces are used for different purposes: Davrods for data and metadata ingestion and data retrieval; Metalnx-web for administration, data curation, and repository browsing; and iCommands for advanced users in all tasks. Different user profiles are defined (principal investigator, data curator, repository administrator), with diverse access rights. New data is ingested by copying raw sequence files and the corresponding metadata sheet (sample sheet) to the "landing" collection on iRODS. An iRODS rule, triggered by the sample sheet file, extracts the metadata and registers it in the iCAT as Attribute, Value, and Unit. Ontology files are registered in Virtuoso. The files are automatically renamed to a unique name based on metadata and copied to the "persistent" collection. All steps are recorded in a report file, which enables monitoring and tracking of progress and faults. In this presentation we describe the design and implementation of the prototype, and discuss the initial assessment results based on the use case. Initial results indicate that the proposed solution is acceptable and fits the researchers' workflow well.
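To make the ingestion step concrete, the following sketch mimics the rule's work from a client using the python-irodsclient: it reads a sample sheet, registers each field as an AVU, and moves the renamed file to the persistent collection. The zone, paths, column names, and naming scheme are assumptions; the production workflow runs as a server-side rule.

```python
# Illustrative sketch only: the real workflow is a server-side iRODS
# rule; zone, paths, and sample-sheet columns are assumed.
import csv
from irods.session import iRODSSession

LANDING = '/seqZone/landing'        # hypothetical "landing" collection
PERSISTENT = '/seqZone/persistent'  # hypothetical "persistent" collection

with iRODSSession(host='irods.example.org', port=1247, user='curator',
                  password='secret', zone='seqZone') as session:
    with open('SampleSheet.csv') as sheet:
        for row in csv.DictReader(sheet):
            obj = session.data_objects.get('%s/%s' % (LANDING, row['file_name']))
            # Register every sample-sheet field as an AVU in the iCAT.
            for attr, value in row.items():
                obj.metadata.add(attr, value)
            # Rename to a unique, metadata-derived name and move the
            # object to the persistent collection.
            unique = '%s_%s.fastq.gz' % (row['sample_id'], row['run_id'])
            session.data_objects.move(obj.path, '%s/%s' % (PERSISTENT, unique))
```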
The European project EUDAT built a data e-infrastructure, called the Collaborative Data Infrastructure (CDI), connecting 16 data and computing centres to support over 50 research communities spanning many different scientific disciplines. One of the main challenges in implementing such an infrastructure was to enable users to manage their data in the same way across the different data centres, because each centre has its own peculiarities at the hardware, software, and policy level. Therefore, EUDAT adopted iRODS to deal with this heterogeneity, relying on its capabilities:
- To define a common abstraction layer on top of the different storage systems.
- To provide a shared set of software interfaces and clients to perform data management operations.
- To enforce a common set of policies.
- To federate different administrative regions.
On the other hand, each community has its own characteristics and often requires specific customizations to cope with its data life cycle. Hence, beyond this common horizontal layer, through iRODS, EUDAT can offer the flexibility of a vertical integration with the community's tools and policies. In order to implement those policies and functions, the project extended iRODS with a set of rules and scripts, which form, together with the underlying software stack, the B2SAFE service. It allows the replication of data collections across different iRODS zones, assigns a unique identifier to each data object and collection, logs every failed transfer, and stores a minimal set of metadata together with the data themselves. The unique identifiers are stored in a centralized registry, called B2HANDLE, which makes them globally resolvable and persistent. We will introduce the B2SAFE architecture and highlight the integration between iRODS and the B2HANDLE system and the corresponding workflows.
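As a heavily simplified illustration of this bookkeeping, the sketch below uses the python-irodsclient to keep an identifier and a replica location with a data object; B2SAFE itself implements this as server-side rules, and the handle prefix, paths, and AVU names here are assumptions.

```python
# Simplified illustration of B2SAFE-style bookkeeping; the real
# service runs as server-side rules. Prefix, paths, and AVU names
# are assumptions.
import uuid
from irods.session import iRODSSession

HANDLE_PREFIX = '11100'  # hypothetical B2HANDLE prefix

def mint_pid():
    """Stand-in for a call to the B2HANDLE registration service."""
    return '%s/%s' % (HANDLE_PREFIX, uuid.uuid4())

with iRODSSession(host='irods.example.org', port=1247, user='b2safe',
                  password='secret', zone='sourceZone') as session:
    obj = session.data_objects.get('/sourceZone/community/data.nc')
    # Keep the globally resolvable identifier with the data object.
    obj.metadata.add('PID', mint_pid())
    # Record where the replica lives in the partner zone, so the PID
    # record can point at both copies.
    obj.metadata.add('REPLICA', '/remoteZone/community/data.nc')
```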
Research Data Management (RDM) aims to improve efficiency and transparency in the scientific process and to fulfill the requirements of funding agencies and (local) regulations. Failures in reproducing some key empirical phenomena have resulted in the research process being questioned. As a consequence, the development of RDM is becoming urgent for many research institutions. RDM in the neuroscience and neuroimaging domains is confronted with challenges in managing diverse types of data, which furthermore contain sensitive information. At the Donders Institute (DI), we have developed an iRODS-based research data repository, an essential component for realising an RDM workflow that spans the whole research lifecycle. The objectives of this workflow are: (1) long-term data preservation for internal reuse, (2) documenting the analysis pipeline, allowing for reuse, collaboration, and reproduction of the published results, and (3) easy sharing of data and analysis pipelines with colleagues around the world.
The Euro-Mediterranean Center on Climate Change (CMCC) is a non-profit research institution whose mission is to investigate and model the climate system and its interactions with society and the environment, and to develop science-driven adaptation and mitigation policies in a changing climate. In order to support CMCC activities, such as data archiving and retrieval, operational services, collaborative environments, and scientific portal gateways, a user-friendly and easy-to-use environment for data-centric application provisioning, named CLIMA, has been developed. CLIMA provides flexible and highly available iRODS-based data services through the use of cloud computing (OpenNebula, Amazon AWS) and container management platforms (Rancher). Each data service consists of the iRODS data management platform extended and integrated with other suitable components (such as THREDDS Data Servers, NetCDF analysis tools, Solr indexing, etc.), according to user requirements and needs. As an example of a CLIMA data service, the CMCC data portal will be presented and discussed, addressing integration and performance issues.
The integrated Rule Oriented Data System (iRODS) is open source data management software used by research organizations worldwide. iRODS virtualizes data storage resources and has a plug-in architecture that supports microservices, storage systems, authentication, networking, databases, rule engines, and an extensible API. Here, we demonstrate two use cases of iRODS as a data management platform to impact science: BRAIN-I and SC2i.
BRAIN-I is an archive of three-dimensional images of the brain. Understanding the three-dimensional structure of the brain at high resolution is critical to advance our understanding of the nervous system. Recent innovations in sample preparation and imaging methods have made it possible to capture images of the intact brain at a subcellular level, transforming research in neuropsychiatric illnesses such as Alzheimer's disease and autism. However, at this time, we lack the infrastructure for neuroscientists to work easily with these high-resolution, three-dimensional images. The images are very large, the analysis methods used to study them are computationally intensive, and specialized methods are needed for viewing them. Furthermore, scientists need to work collaboratively to advance brain research by sharing images, analysis methods, and research results. BRAIN-I will create the infrastructure to enable neuroscientists to share, discover, analyze, and visualize three-dimensional brain images by utilizing iRODS for knowledge management and data curation.
iRODS rules and microservices provide a number of critical functions for BRAIN-I, such as owner-specified access control for individual data sets, automatic migration of data between computer systems, automatic backup and replication of data sets, searchable metadata to locate and discover data sets, and creation of workflows for data analysis. The BRAIN-I data ingestion process will be designed to give the scientist who collects the data complete flexibility in controlling when the data is moved to BRAIN-I, who has access to the data while the study is underway, and when the data is published, assigned a Digital Object Identifier (DOI), and made available to other researchers. Once the data is published, BRAIN-I will define two levels of metadata that can be used for data discovery: system metadata and science metadata. System metadata refers to system information about the data files and other objects maintained in BRAIN-I; for example, the ownership of a data file, or the date it was last accessed. Additionally, BRAIN-I will provide a means for federation with other BRAIN archives by using the APIs exposed by the different archives to create a unified data grid across them. Finally, BRAIN-I will provide for remote replication and backup of all data in the archive, as data replication, archiving, and backup is one of the most prevalent use cases for iRODS.
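As a taste of the discovery layer, such metadata searches can already be expressed through the iRODS general query interface. The sketch below uses the python-irodsclient; the host, zone, and science-metadata attribute are illustrative assumptions, not the BRAIN-I schema.

```python
# Sketch of metadata-driven discovery with the python-irodsclient;
# zone and science-metadata attribute are assumptions.
from irods.session import iRODSSession
from irods.models import Collection, DataObject, DataObjectMeta

with iRODSSession(host='irods.example.org', port=1247, user='alice',
                  password='secret', zone='brainiZone') as session:
    # Find all image volumes whose science metadata says 'mouse'.
    query = session.query(Collection.name, DataObject.name) \
                   .filter(DataObjectMeta.name == 'species') \
                   .filter(DataObjectMeta.value == 'mouse')
    for row in query:
        print('%s/%s' % (row[Collection.name], row[DataObject.name]))
```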
A second use case for iRODS is the Surgical Critical Care Initiative (SC2i), which develops clinical decision support tools to improve medical outcomes for critically ill patients. A joint effort among the Henry M. Jackson Foundation, Walter Reed Medical Center, the Uniformed Services University, Duke, and Emory, SC2i brings together cutting-edge science and evidence-based clinical data. Supporting this research and development agenda are state-of-the-art data management technologies that comprise SC2i's technology platform.
The core of the SC2i platform is the Central Data Repository (CDR), which provides a hub for SC2i's data collection and analysis capabilities. The CDR consists of a number of integrated technology components, with iRODS as a key piece of the puzzle. For the CDR, iRODS provides a secure point of ingress for research data from SC2i's institutional partners, a set of configurable rules and policies for transforming this incoming data, and a staging area for the data's ingest into the CDR's data model.
Due to the sensitive nature of the data used by SC2i, security and privacy are important requirements for the CDR platform. iRODS's configurable access control, customizable rules and policies, and secure user management features fulfill these requirements. Additionally, the integration features of iRODS enable secure integration with other technologies used in the platform such as LDAP, SAS, and relational databases.
We describe the current use of iRODS at University College London as a live data store for researchers across all disciplines. We explain the interfaces presented, our approach to integration with the university's identity management system and the challenges of cache management for a Compound Resource.
Modern system applications often need to interact with data from multiple, heterogeneous metadata stores. An ad hoc solution that aggregates metadata from multiple stores by issuing an individual query in the query language of each database runs the risk of semantic incompatibilities. This paper describes QueryArrow, generic software that provides a semantically unified query and update interface to a wide range of metadata stores. QueryArrow has an algebra-based language called QueryArrow Language (QAL), which can be partially translated to different database languages. We describe the design of QueryArrow, the syntax and semantics of QAL, and how QAL is translated to different database languages, and we demonstrate its applications.
In September 2015, the Wellcome Trust Sanger Institute had an iRODS deployment consisting of six federated zones managing 12 PB of genomic data, running iRODS 3.3.1. We took the decision to upgrade to what was then 4.1.7. This talk will cover how we did it, how we prepared, the time taken, what went wrong, what we've learned, how long we thought it would take, the fact that 4.1.8, 4.1.9, and 4.1.10 are kind of our fault, and that paying our Consortium membership was well worth it.
Background
The Digital Farming department at Bayer Crop Science generates massive amounts of field analytics data. These large data sets consist of diverse content types such as text, CSV, JSON, XML, shapefiles, and images, spread across geographies (Germany, France, Brazil, etc.). A major drawback when dealing with these files is their lack of associated metadata useful for information retrieval, exploratory analysis, and analytical tasks in general.
iRODS is a distributed big data management system that serves as a powerful open source solution for storing and retrieving structured and unstructured data and associating metadata with it. A major challenge for any data analyst or scientist is to expose this data store to analytics platforms like R for finding insights, performing statistical analyses, and running downstream data processing operations. However, exposing a data store to the R environment is a multistep process, and preserving the security and integrity of the exposed data typically involves substantial compromises.
Results
Our approach addresses this issue by implementing a platform-independent R-iRODS package that provides access, modification, stateful navigation, and CRUD operations on iRODS collections, data objects, and metadata, as well as metadata-based search of collections and data objects, from the R language. The R-based commands function similarly to the iRODS icommands and are integrated via the iRODS REST services. The R-iRODS package has been engineered to have semantics equivalent to the icommands, with business-defined access rights preserving iRODS ACLs, and can easily be used as a basis for further customization.
iMetaExploreR is a Shiny-based rich web interface that supports (a) iRODS file system interaction, (b) file system- and metadata-based search integrated with a text mining word cloud, (c) views of recently modified and frequently accessed metadata, and (d) spatial metadata views. The prospects for more advanced future developments to be discussed are fully extensible support for data and visual analytics, and powerful regex-based search and metadata associations.
I work as a volunteer supporting OLPC deployments in Africa and Asia. These deployments use the XO laptop as the client and in some cases also have a local server to supplement the limited storage of the XO (1 GB on the XO-1).
Sugar keeps all user work in a Journal, in which each entry consists of a file and associated metadata. Sadly, this work can be lost if the laptop runs out of storage.
The goal of this project is to maintain an archive of the Journal on the local server, which has adequate storage (1 TB). iRODS, with its ability to store metadata, simplifies setting up this archive.
One tricky aspect of OLPC deployments is that the user may only have access to the server while at school. This means that current work must be copied to the laptop.
The implementation allows the user to click a radio button to mark an object to be archived (and not kept locally) or to request a local copy of a file needed for current work.
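A minimal sketch of the archive step with the python-irodsclient is shown below; the server name, zone, paths, and Journal metadata keys are assumptions for illustration, not the deployed code.

```python
# Illustrative sketch: upload one Journal entry to the school server
# and keep its Sugar metadata with it. Names and keys are assumed.
from irods.session import iRODSSession

def archive_entry(session, local_file, meta):
    dest = '/schoolZone/home/%s/journal/%s' % (meta['learner'], meta['title'])
    session.data_objects.put(local_file, dest)
    obj = session.data_objects.get(dest)
    # Store the Journal metadata as AVUs so entries stay searchable.
    for key, value in meta.items():
        obj.metadata.add(key, value)
    return dest

with iRODSSession(host='schoolserver.local', port=1247, user='xo-user',
                  password='secret', zone='schoolZone') as session:
    archive_entry(session, '/home/olpc/essay.odt',
                  {'learner': 'amina', 'title': 'essay.odt',
                   'activity': 'Write', 'mtime': '2017-05-02T10:15:00'})
```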
Integrative research requires extensive multi-level approaches to enrich and expose data and workflows so that informatics infrastructures can process them effectively. The Grassroots Infrastructure is developed at the Earlham Institute (EI) to consolidate data and analyses, facilitating consistent approaches to generating, processing, and disseminating public datasets in the plant sciences across a distributed set of interoperable servers.
The Grassroots lightweight reusable software stack comprises: an iRODS data management layer to provide structure to unstructured filesystems; an Apache web server layer to deliver content and provide access to public programmatic interfaces; analytical services such as BLAST to search multiple databases across different sites; and user-facing libraries to build larger web services, e.g. a GeoJSON mapping tool to show spatial data. The stack can be run locally or packaged in virtual containers and deployed on a variety of hardware, thus representing a decentralised system that allows information generators to retain control over their resources while allowing interconnected resources to access each other consistently. Separate web servers running the Grassroots Infrastructure can be connected together so that their services and data are shared seamlessly and appear as a single unified instance from a user's perspective.
The iRODS data management layer is an integral component of the Grassroots Infrastructure. Libraries are included within Grassroots for performing common file management tasks such as adding, editing, and deleting data objects as well as accessing metadata within an iRODS instance. This allows any Grassroots Infrastructure service to use iRODS instances transparently, accessing data objects through lightweight web-friendly interfaces to share or use in downstream analyses. We have included an iRODS metadata search service using both the iCAT metadata querying API and an ElasticSearch integration, to search against all of the metadata AVUs and present them in a user-friendly way.
We provide an Apache module based on mod_davrods [1] to expose iRODS data as WebDAV, extended [2] to allow metadata display, modularised searching, and skinnable listings. This module allows the management and dissemination of files that are the results of scientific experiments: they are automatically added to iRODS along with a set of metadata AVUs auto-generated from the experiments, plus extra values to aid institutional data management, such as accession-based searching and downloading of datasets, thus avoiding duplication and reusing pre-existing curation metadata.
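Because davrods exposes the iRODS namespace over WebDAV, ordinary HTTP tooling suffices on the client side. The sketch below fetches a result file with Python's requests library; the mount-point URL, path, and credentials are hypothetical.

```python
# Plain WebDAV access to iRODS data exposed via davrods; the URL,
# path, and credentials below are hypothetical.
import requests

DAVRODS_URL = 'https://grassroots.example.org/dav'

# Download a result file over WebDAV with a plain HTTP GET.
resp = requests.get(DAVRODS_URL + '/experiments/run42/results.csv',
                    auth=('reader', 'secret'))
resp.raise_for_status()
with open('results.csv', 'wb') as f:
    f.write(resp.content)
```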
Grassroots is open source and available on GitHub. More information can be found at https://wheatis.tgac.ac.uk
[1] https://github.com/UtrechtUniversity/davrods
[2] https://github.com/billyfish/davrods
Research Data Management is a hot topic at Dutch research institutes, universities, and university medical centers, many of which are struggling with it at the moment. Whether it is to promote scientific integrity or to ensure data is available for future research, RDM is a fact of life for funders and researchers. SURF, the collaborative ICT organization for Dutch education and research, has taken the initiative to combine efforts in the area of RDM together with our member organizations. The goal is to create a sustainable Dutch Research Data Infrastructure by making use of our combined knowledge and by connecting local and national facilities to each other.
Many member organizations are considering iRODS as the underlying technology, and three have already chosen to develop their RDM solution based on iRODS. In addition, a large astronomical research project is investigating the use of iRODS to manage a dataset that will grow by 4 PB a year. This presentation will provide insight into RDM development in the Netherlands and the use of iRODS, outline the challenges that have to be overcome, from the political to the technical, and give insight into the solution roadmap we foresee.
Outline
- SURF structure & governance
- Status RDM in the Netherlands
- Dutch Research Data Infrastructure
- High-level iRODS developments in NL
- Challenges
- Future outlook
Traditionally, fault-tolerant iRODS data grids have been constructed using dedicated physical or virtual servers for each service component, with data and service availability in the event of failure often maintained through replicas and active-passive service configurations. When replicating and distributing the iCAT, its backing relational database, iRES resource servers, and the corresponding managed data, it has been difficult to balance efficient use of storage media and server hardware while simultaneously achieving high availability and performance. One of iRODS' key strengths is its ability to fit flexibly into nearly any storage architecture, but that strength has become a weakness for organizations new to iRODS who are seeking a reference design that provides performance and high availability within a minimal initial hardware footprint. This talk will present a new reference system-level architecture that distributes and parallelizes the iRODS resource server workload across a system backed by a high-performance clustered filesystem. This architecture is able to maintain a coherent view of single-replica data across iRODS resources with as little as a single iRODS resource server instance available. Furthermore, failover and failback of resource servers, iCAT servers, and the backing relational database have been automated. An example real-world deployment will be presented in which the distributed parallel resource server and highly available iCAT service configuration have been virtualized and deployed onto a single active-active storage controller pair along with an embedded clustered filesystem.
iRODS is a powerful core component of data management systems. By harnessing the ideas of data cataloging, workflow automation, and abstraction, iRODS provides developers a comprehensive set of features that act as a framework for intricate, automated data storage systems. However, the same attributes that give iRODS its power and flexibility also add to its complexity and steep learning curve. In today's constantly evolving, fast-paced software landscape, developers expect a certain level of accessibility and a low barrier to entry when learning new technologies. Typically, new developers would not associate these attributes with iRODS or C++. This is because a paradigm shift has taken place in how computing, new ideas, methodologies, and architectures are being developed and shared. Golang and its surrounding community are at the epicenter of this advancement. Usage of Golang in production deployments has been growing at an unprecedented pace, because the language fills the niche of being a simple, fast, scalable, and productive language to write code in. Golang's simple, opinionated, and idiomatic design allows common patterns and principles to be shared between software of distinct domains. By combining the benefits of Golang and its developer community with the extendable power of iRODS, developers may find themselves happier and more productive. This talk will introduce the audience to several efforts, completed and underway, that harness the power of iRODS with Golang, and will conclude with a demonstration of how easy it is to write efficient code when building iRODS-based data systems using Golang.
The Swedish National Infrastructure for Computing (SNIC) has decided to invest in a large-scale distributed iRODS-based storage infrastructure to complement its national storage offering for academic research. Currently the SNIC Swestore national storage service relies on dCache as its storage solution and has grown into a petascale operation. However, many users and research groups have expressed interest in access methods and functionalities (such as multiple different user authentication methods and metadata management) beyond what can easily be accommodated with dCache. This prompted the investigation of extending the national storage service with different storage technologies. SNIC commenced a project for an iRODS-based distributed scalable storage service complementing Swestore. The SNIC iRODS project has now been concluded; the resulting system is being installed into a production environment, and integrations are being put in place with other SNIC services. Our accomplishments include, but are not limited to: a model for the deployment of a geo-replicated iRODS iCAT over two administrative domains with a DNS-based failover mechanism; a model for the deployment of existing and future distributed storage resources within SNIC with iRODS; a novel iRODS interface for tape resources written against the IBM Spectrum Protect (TSM) API; improvements to iRODS logging capabilities with a syslog forwarder; contributions for iRODS user authentication via an alternative PAM authenticator; automated provisioning of iRODS grids and associated services with Ansible, optionally provisioning clusters of VMs with Vagrant for testing; and integration of the SNIC iRODS storage service with the SNIC User and Project Repository (SUPR) for provisioning of iRODS users and groups for approved proposals. This solution enables the easy integration of local HPC storage solutions as well as EUDAT, which delivers data to the HPC, HTC, and cloud services in Europe.
The Dutch universities and associated medical centers are developing research data management environments built on iRODS to support their scientists. The underlying storage is currently primarily located at the premises, and under the control, of said institutes. However, some local storage systems offer too little capacity. Moreover, there is a need for a variety of storage systems to offer efficient and cost-effective data storage solutions that may differ per use case. Because the requirements for the storage backend overlap between individual research institutes, a national approach can add significant value.
We present a use case study of how such a scenario can be supported using iRODS. In our use case scenario, SURFsara, the national HPC and data centre, provides storage resources connecting local data to European infrastructures such as EUDAT, EGI, and PRACE. We highlight the infrastructural aspects and which data policies can be supported. The scenarios are substantiated with performance tests of the underlying transfer protocols to the different storage systems.
The iRODS middleware provides federation, virtualization, metadata integration, and policy-oriented data management for static files. Real-time data streams (RTDS) from sensors and other sources pose a different challenge compared to static files. They are infinite in length and timeline, consist of discrete packets of information which are time-specific, and the concept of byte-oriented I/O is not at all suited for accessing and managing them. Moreover, because of the nature of the sources, the volume and velocity of the flow can range from very low (temperature data) to very high (HDTV). They are also time-sensitive in two ways: one has to capture the data right away or it will be lost forever; and in many cases, the analysis has to be done immediately, as the decision making can be time-sensitive (e.g., earthquake detection). We have developed new features in the iRODS system to capture, store, and archive RTDS. Our model captures RTDS into a continuum of discrete files and archives them in a few different standardized formats. It also provides packet-based and time-oriented access for replay of sensor data. By folding the management of RTDS into iRODS, we have extended the four main functionalities (federation, virtualization, metadata integration, and policy-oriented data management) to real-time data.
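The capture model can be pictured with the following sketch, which folds an infinite packet stream into time-bounded files and registers each chunk's time window as AVUs for later time-oriented replay. This is not the authors' implementation; the chunk length, paths, and attribute names are assumptions.

```python
# Conceptual sketch of folding a real-time stream into a continuum
# of discrete, time-indexed files in iRODS. Not the authors' code;
# chunk length, paths, and AVU names are assumed.
import time
from irods.session import iRODSSession

CHUNK_SECONDS = 60  # archive one file per minute of stream

def capture(session, packets, collection):
    start = time.time()
    buf = open('/tmp/chunk.bin', 'wb')
    for packet in packets:  # 'packets' is an endless iterator of bytes
        buf.write(packet)
        if time.time() - start >= CHUNK_SECONDS:
            buf.close()
            end = time.time()
            dest = '%s/stream_%d.bin' % (collection, int(start))
            session.data_objects.put('/tmp/chunk.bin', dest)
            obj = session.data_objects.get(dest)
            # Time-oriented replay tools can query these AVUs to pick
            # the chunks covering a requested time interval.
            obj.metadata.add('stream_start_epoch', str(int(start)))
            obj.metadata.add('stream_end_epoch', str(int(end)))
            start = end
            buf = open('/tmp/chunk.bin', 'wb')
```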
Users seek real-time responses to expressed needs. This can be a challenge when it involves privileged operations typically executed by designated staff for reasons of process quality or security. Utrecht University's implementation of iRODS aims to put users in control of their data.
We present a solution for empowering users based on new microservices that we have named "Sudo" microservices. Combined with the application of iRODS programmable policies, they facilitate fine-grained delegation of authority.
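For flavour, here is a hedged sketch of what such delegation could look like from a client: an ordinary user invokes a server-side rule that, after policy checks, performs a privileged ACL change on their behalf. The rule body, microservice name and signature, and paths are assumptions about the approach, not Utrecht University's actual code.

```python
# Hedged sketch: microservice name, signature, and rule invocation
# details are assumptions, not the actual Utrecht implementation.
from irods.rule import Rule
from irods.session import iRODSSession

# A server-side rule that, after policy checks (omitted here), uses a
# privileged "Sudo" microservice to grant a collaborator read access.
RULE_BODY = '''
grantRead {
    msiSudoObjAclSet("", "read", *user, *path, "");
}
'''

with iRODSSession(host='irods.example.org', port=1247, user='alice',
                  password='secret', zone='uuZone') as session:
    rule = Rule(session, body=RULE_BODY,
                params={'*user': '"bob"',
                        '*path': '"/uuZone/home/alice/project"'},
                output='ruleExecOut')
    rule.execute()
```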
IBM Cloud Object Storage (COS) is a software platform designed to provide massive scale for unstructured, object-based data storage. IBM COS's decentralized, shared-nothing architecture is an ideal complement to iRODS data management capabilities. Using IBM Cloud Object Storage as a storage resource, in the cloud or in clients' data centers, iRODS users have an information lifecycle management (ILM) solution with outstanding scalability, performance, security, availability, and cost benefits.
iRODS stores its catalog in a relational database management system (RDBMS) for a variety of historical reasons. Chief among these was that there were few other options at the time that could provide the functionality of a mature RDBMS, functionality that was required to provide the strong assurances of policy enforcement across multiple iRODS servers and locations. This meant that the iRODS catalog was hosted in a single location, usually on site or very near to most of its userbase.
The increasing demand for geo-distributed data sharing infrastructure over the Wide Area Network (WAN) calls for novel scalable approaches to amortize the network latency observed between the user and the catalog server in such environments. One such approach is to distribute the iRODS catalog through the use of clustered database technology, thus providing local authentication and improving metadata read performance while still offering locality of reference for data at rest.
This talk describes the use of MariaDB Galera Cluster to provide a multi-master, distributed iRODS catalog. It will cover the code changes required, the test setup and results, as well as future work and implications.
Only iRODS provides the mechanisms needed to meet ISO 16363 requirements for versioning, logging of actions, and auditing of operations while managing description, provenance, and representation information.