iRODS User Group Meeting 2016


Kenan Center, 300 Kenan Center Drive
Chapel Hill, North Carolina, USA
June 7 - June 9, 2016

Photos and Videos

Conference Photos hosted on Flickr

Conference Videos hosted on YouTube

Articles

iRODS UGM 2016 Proceedings (PDF 9.1MB)

  • iRODS Audit (C++) Rule Engine Plugin and AMQP
    Terrell Russell, Jason Coposky, Justin James - RENCI at UNC Chapel Hill
  • Speed Research Discovery with Comprehensive Storage and Data Management
    HGST | PANASAS | iRODS
  • Integrating HUBzero and iRODS
    Rajesh Kalyanam, Robert A. Campbell, Samuel P. Wilson, Pascal Meunier, Lan Zhao, Elizabett A. Hillery, Carol Song - Purdue University
  • An R Package to Access iRODS Directly
    Radovan Chytracek, Bernhard Sonderegger, Richard Coté - Nestlé Institute of Health Sciences
  • Davrods, an Apache WebDAV Interface to iRODS
    Ton Smeele, Chris Smeele - Utrecht University
  • NFS-RODS: A Tool for Accessing iRODS Repositories via the NFS Protocol
    D. Oliveira, A. Lobo Jr., F. Silva, G. Callou, I. Sousa, V. Alves, P. Maciel - UFPE
    Stephen Worth - EMC Corporation
    Jason Coposky - iRODS Consortium
  • Academic Workflow for Research Repositories Using iRODS and Object Storage
    Randall Splinter - DDN
  • Application of iRODS Metadata Management for Cancer Genome Analysis Workflow
    Lech Nieroda, Martin Peifer, Viktor Achter, Janna Velder, Ulrich Lang - University of Cologne
  • Status and Prospects of Kanki: An Open Source Cross-Platform Native iRODS Client Application
    Ilari Korhonen, Miika Nurminen - University of Jyväskylä

Presentations (June 8-9)


The iRODS Consortium in 2016 [slides] [video]
Jason Coposky, iRODS Consortium

iRODS 4.2 Overview [slides] [video]
Terrell Russell, iRODS Consortium

Auditing with the Pluggable Rule Engine [slides] [video]
Terrell Russell, iRODS Consortium

iRODS 4.2 introduces a new rule engine plugin interface, which makes possible rule engines that support iRODS rules written in a variety of languages. This paper introduces an audit plugin that emits a single AMQP message for every policy enforcement point (PEP) within the iRODS server. We illustrate both the breadth and depth of these messages as well as some introductory analytics. This plugin may prove useful for anything from instrumenting a production iRODS installation to debugging confusing emergent distributed rule engine behavior.
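The per-PEP messages described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a JSON-like message shape; the field names and PEP strings here are hypothetical and do not reflect the plugin's actual wire format.

```python
import json

# Hypothetical shape of one audit record; the real plugin's AMQP payload
# fields may differ -- this structure is illustrative only.
def make_audit_message(pep_name, user, obj_path):
    """Build a JSON-serializable record for one policy enforcement point."""
    return {
        "rule_name": pep_name,      # e.g. "pep_resource_open_pre"
        "user_name": user,
        "logical_path": obj_path,
    }

def filter_by_pep(messages, prefix):
    """Select messages whose PEP name starts with the given prefix,
    e.g. to pull out only resource-plugin events for analysis."""
    return [m for m in messages if m["rule_name"].startswith(prefix)]

# Two example events, as a downstream AMQP consumer might see them.
messages = [
    make_audit_message("pep_resource_open_pre", "alice",
                       "/tempZone/home/alice/a.txt"),
    make_audit_message("pep_database_gen_query_pre", "bob",
                       "/tempZone/home/bob/b.txt"),
]
resource_events = filter_by_pep(messages, "pep_resource_")
payload = json.dumps(resource_events[0])
```

A consumer attached to the AMQP queue could apply a filter like this to build the kind of introductory analytics the paper describes.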


A Geo-Distributed Active Archive Tier [slides] [video]
Earle Philhower, III, Western Digital

iRODS makes it incredibly easy to preserve and share data generated by researchers, but as data volumes increase, the costs of maintaining all that data on primary storage become prohibitive. We present an advanced architecture that enables long term data retention and high availability in iRODS using a two-tiered design composed of primary storage and a geographically distributed, object store-based HGST Active Archive System for easy active archival. This model employs an existing high performance (expensive) primary storage system, coupled with an affordable, ultra-high capacity HGST Active Archive System back-end. Thanks to the iRODS abstraction layer and rule engine, this data tiering is automated and completely transparent to end users. We will discuss the solution architecture, provide a brief description of active archives in general and the HGST Active Archive System specifically, including its synchronous geographic replication capabilities, and present performance statistics for the available archived data.
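The transparent tiering described above is driven by iRODS rules. As a rough sketch in the style of the iRODS 4.2 Python rule engine, a post-ingest rule could replicate each new object to the archive tier. `msiDataObjRepl` is a real iRODS microservice, but the resource name "archiveResc" and the mock callback are assumptions made so the sketch can run standalone, without a server.

```python
# Sketch of automated tiering in the style of the iRODS 4.2 Python rule
# engine. The resource name "archiveResc" is assumed; a real deployment
# would name its own archive resource.
def replicate_to_archive(logical_path, callback):
    """Replicate a newly ingested object to the archive tier."""
    callback.msiDataObjRepl(logical_path, "destRescName=archiveResc", 0)

# Stand-in for the server-provided callback object, used here only so the
# sketch can be exercised without a running iRODS server.
class MockCallback:
    def __init__(self):
        self.calls = []

    def msiDataObjRepl(self, path, params, status):
        self.calls.append((path, params))

cb = MockCallback()
replicate_to_archive("/tempZone/home/alice/data.bam", cb)
```

In a real deployment this logic would be wired to a put-related policy enforcement point, so users never see the archive copy being made.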


Testing Object Storage Systems with iRODS at Bayer [slides] [video]
Othmar Weber, Bayer Business Systems

Advancing the Life Cycle of iRODS for Data [slides] [video]
David Sallak, Panasas

Having it Both Ways: Bringing Data to Computation & Computation to Data with iRODS [slides] [video]
Nirav Merchant, University of Arizona

Integrating HUBzero and iRODS [slides] [video]
Rajesh Kalyanam, Purdue University

Geospatial data is now increasingly used with tools in diverse fields such as agronomy, hydrology, and sociology to gain a better understanding of scientific data. Funded by the NSF DIBBS program, the GABBS project seeks to create reusable building blocks that help researchers add geospatial data processing, visualization, and curation to their tools. GABBS leverages the HUBzero cyberinfrastructure platform and iRODS to build a web-based collaborative research platform with enhanced geospatial capabilities. HUBzero is unique in its rapid tool development kit, which simplifies web-enabling existing tools. Its support for dataset DOI association enables citable tool results. In short, it provides a seamless path from data collection to simulation and publication, and can benefit from iRODS data management at each step. Scientific tools often require and generate metadata with their outputs. Given the structured nature of geospatial data, automatic metadata capture is vital in avoiding repetitive work. iRODS microservices enable this automation of data processing, metadata capture, and indexing for searchability. They also allow for similar offline ingestion of external research data. The iRODS FUSE filesystem is mounted directly onto the hub, enabling tools to refer to local file paths and simplifying development. We will discuss our work integrating iRODS with HUBzero in the GABBS project and share our experience and lessons learned with the iRODS user community.


iRODS Data Integration with CloudyCluster Cloud-Based HPC [slides]
Boyd Wilson, Omnibond

This talk, with interactive Q&A, is presented in anticipation of integrating iRODS with CloudyCluster to add simplified data management to CloudyCluster's easy, self-service, on-demand, public, cloud-based HPC provisioning. An overview of CloudyCluster will be provided with goals of the pending integration. We will also seek feedback from the community to help direct the integration. The end goal is to provide advanced computational and data management resources to the long tail of science and those without easy access to computational resources.

This presentation will focus on our efforts to develop a comprehensively secure cyberinfrastructure including iRODS, addressing issues from the datacenter level through to iRODS auditing, to provide a perspective on the effort required and the areas of most concern when developing secure infrastructure.

The challenges of using iRODS to support a broad community with data privacy levels ranging from HIPAA to open access will be discussed, and techniques for data segregation and auditing will be presented, to address a range of potential use cases. We will also present on the policies and rules used to support data management generally and HIPAA specifically in distributed iRODS installations.


Getting R to talk to iRODS
Bernhard Sonderegger, Nestlé Institute of Health Sciences

The R language is an environment with a large and highly active user community in the field of data science. At NIHS we have developed the R-irods package, which allows user-friendly access to iRODS data objects and metadata from the R language. Information is passed to the R functions as native R objects (e.g. data frames) to facilitate integration with existing R code and to allow data access using standard R constructs.

To maximize performance and maintain a simple architecture, the implementation heavily relies on the icommands C++ code wrapped using Rcpp bindings.

The R-irods package has been engineered to have semantics equivalent to the icommands and can easily be used as a basis for further customization. At the NIHS we have created an ontology aware package on top of R-irods to ensure consistent metadata annotations and to facilitate query construction.


Davrods, an Apache WebDAV Interface to iRODS [slides] [video]
Chris Smeele & Ton Smeele, Utrecht University

Utrecht University has developed a WebDAV-compliant interface to iRODS 4.1 to facilitate drag-and-drop movement of data in and out of iRODS using the operating system's native interface. The presentation highlights the solution's design principles and the resulting architecture. Davrods builds on Apache's mod_dav capabilities. We will share benchmark data that we have collected and conclude with a demonstration.
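A WebDAV client listing a Davrods-served collection issues a PROPFIND request and receives a multistatus response. The sketch below builds such a request body and parses a sample response using only Python's standard library; no request is actually sent, and the paths shown are hypothetical.

```python
import xml.etree.ElementTree as ET

DAV = "{DAV:}"  # the WebDAV XML namespace

# Build the body of a WebDAV PROPFIND request, as a client listing a
# collection would send it (the request is not actually sent here).
propfind = ET.Element(DAV + "propfind")
prop = ET.SubElement(propfind, DAV + "prop")
ET.SubElement(prop, DAV + "displayname")
ET.SubElement(prop, DAV + "getcontentlength")
body = ET.tostring(propfind, encoding="unicode")

# Parse a sample multistatus response and extract the listed paths;
# the href value is a made-up example.
sample_response = """<?xml version="1.0"?>
<D:multistatus xmlns:D="DAV:">
  <D:response><D:href>/tempZone/home/alice/a.txt</D:href></D:response>
</D:multistatus>"""
hrefs = [e.text for e in ET.fromstring(sample_response).iter(DAV + "href")]
```

Because Davrods speaks standard WebDAV, any such client (including the operating system's built-in one) can browse iRODS collections this way.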


iRODS 4.3 [slides]
Terrell Russell, iRODS Consortium

Bidirectional Integration of Multiple Metadata Sources [slides] [PDF]
Hao Xu, DICE Group

We describe a generalized query language that allows us to integrate multiple types of data sources. The language provides both query and update operations, along with customizable data policies. We demonstrate its use in iRODS, with applications such as graph-based metadata, indexing, and metadata access. We also show it can be proven that relational and graph database backends provide the same behavior.

DFC architecture & An iRODS Client for Mobile Devices [slides]
Jonathan Crabtree, Odum Institute
Mike Conway, DICE Group
Matthew Krause, DICE Group

In a collaboration between CyVerse, DataNet Federation Consortium, and Odum Institute developers, the CyVerse Discovery Environment has been ported as a general infrastructure to support data-driven research. This is the first step towards a broader community effort to standardize and adopt these tools, extending iRODS as a full service data management and computation environment.

The session will include a review of recent extensions to Jargon to power Virtual Collections, Metadata Templates, and other facilities that are powering DataNet Federation interfaces.


NFS-RODS: A Tool for Accessing iRODS Repositories via the NFS Protocol [slides] [video]
Danilo Oliveira, UFPE

Data center and data evolution has been dramatic in the last few years with the advent of cloud computing and the massive increase of data due to the Internet of Everything. The Integrated Rule-Oriented Data System (iRODS) helps in this changing world by virtualizing data storage resources regardless of where the data is stored.

This paper presents a tool for accessing iRODS repositories through the NFS protocol. It integrates NFS with the iRODS server, allowing common operating system commands to operate on a remote iRODS repository.


MetaLnx: An Administrative and Metadata UI for iRODS [slides] [video]
Stephen Worth, EMC

Academic Workflow for Research Repositories [slides]
Randy Splinter, DDN

Traditionally, the sharing and retention of research data has been a contentious issue. Sharing data over WANs has been limited by the available storage technologies. NAS solutions, while excellent for sharing data over a LAN, have never had the same success over WANs. The successful implementation of object storage solutions has opened the door to sharing data over WAN links.

Coupling that ability to share objects over a WAN with middleware like iRODS provides the research community with more stringent controls over the data, including:

  • Better control of ACLs, including:
    • Implementing data retention policies to meet regulatory requirements
    • Preventing loss of IP due to faulty access controls
  • Virtualization of multiple storage silos under a single namespace
  • Extensive metadata tags and searching of those tags
  • Extensible rules engine to implement functionality such as:
    • HSM style functionality between storage devices
      • Data migration based upon set criteria
Some of the advantages of this approach include:
  • Ease of administration: once rules are tested and in place, the system can be managed with minimal administrative overhead
  • Automated workflows that guarantee consistency and reproducibility in the science that is produced
  • Ease of auditing, both for usage and back-charging and for maintaining data security compliance
  • Straightforward remote replication when using storage platforms like DDN WOS

Application of iRODS Metadata Manager for Cancer Genome Analysis Workflow [slides]
Lech Nieroda, University of Cologne

NGS is an increasingly cost-efficient and reliable method to provide whole genomes or exomes in a relatively short time.

The massive amounts of resulting data pose challenges during various stages of its lifecycle: organizing and storing of input data, high throughput processing and analysis in an HPC Cluster and effective reviewing and secure sharing of the results.

Traditional file systems quickly meet their limits when content based metadata handling is required.

As a computing center that has been driving NGS workflows for many years, we are constantly looking for ways to optimize these workflows to maximize output and quality. We have decided to use iRODS, a comprehensive data management system that allows customized metadata attributes, fine-grained protection rules, and a query system to quickly organize and review the results.

In this paper we describe our design and experiences with the integration of iRODS into an automated pipeline, developed as part of our participation in the BMBF-funded project SMOOSE to optimize relevant cancer study workflows for clinical use. The workflow was taken from the Department of Translational Genomics at the University of Cologne. The focus of the workflow lies in sequencing and analysis of cancer genomes with the goal of identifying novel and potentially clinically relevant alterations. The gained insights can lead to personalized therapy with higher efficacy and reduced toxicity.


Status and Prospects of Kanki: An Open Source Cross-Platform Native iRODS Client Application [slides] [video]
Ilari Korhonen, University of Jyväskylä

The current state of development of project Kanki is discussed, and some prospects for its future development are laid out. Kanki is an open source cross-platform native iRODS client application which was introduced to the iRODS community at the 7th Annual iRODS Users Group Meeting in 2015. Project Kanki was subsequently released as open source under a 3-clause BSD license in September 2015. Since then, 9 releases have been made, of which the latest 6 have been available, in addition to the source code, as pre-built binary packages for x86-64 CentOS Linux 6/7 and OS X 10.10+. The Kanki build environment at the University of Jyväskylä runs on Jenkins continuous integration for both of the aforementioned platforms, and the Linux builds are currently executed in disposable containers instantiated from pre-built Docker images. This provides an excellent framework for (regression) testing of the client suite. The immediate development goals to be discussed are: stability, testing, ease of installation and use, and a complete iRODS basic feature set as a graphical alternative to the icommands. The prospects for more advanced future development to be discussed are: a fully extensible modular metadata editor with pluggable attribute editor widgets, a fully extensible modular search user interface with pluggable condition widgets, and data grid analytics and visualization with VTK integration.

DAM Secure File System [slides] [video]
Paul Evans, Daystrom Technology Group

iRODS Feature Requests and Discussion [PDF]
Reagan Moore, DICE Group

Training (June 7)

8:00 – 9:00 AM: Registration & Breakfast
9:00 AM – 5:00 PM: Training
5:30 – 6:30 PM: Members Meet & Greet, Kenan Center Terrace
5:30 – 6:00 PM: Omnibond Birds of a Feather, Room 204

Beginner Training

This hands-on workshop taught how to plan and deploy an iRODS 4.2 installation and explored storage resource composition, metadata operations, and rule development using graphical and command line interfaces.

  • Introduction / Overview
  • Four Pillars
  • Planning an iRODS Deployment
  • Installation
  • LUNCH
  • iCommands and Cloud Browser
  • Virtualization
  • Basic Metadata
  • Basic Rule Engine

Advanced Training

In-depth experience with iRODS 4.2. The iRODS development team at RENCI guided students through advanced topics such as using multiple rule engine plugins, PAM authentication, federation, and configuration for high availability.
Prerequisites (4.1.9, installed via ftp on EC2 via slipofpaper).

  • Upgrades
  • PAM/SSL Configuration
  • LUNCH
  • Plugins
    • Composable Resource Plugins
      • Object store compound example
      • Delayed replication to archive
      • Cache purging rule
      • Load balancing example
    • Rule Engine Plugins
      • Native
      • Python
      • C++
    • Microservice Plugins
    • API Plugins
  • High Availability
  • Troubleshooting / Bottlenecks / Monitoring / Performance / Extras