William and Ida Friday Center
Chapel Hill, North Carolina, USA
June 9 - June 11, 2015
A recap of the past year and plans for the next year, including the announcement of several new initiatives. Consortium membership has more than doubled over the past year, and we have several ongoing support and services contracts that we will discuss in general terms. We will evaluate our year-over-year progress with respect to community participation and describe our plans to broaden the user community. And we will announce two new programs that will make it easier to discover iRODS code examples (e.g., rules and microservices) and iRODS service providers.
This talk will cover the changes and features from 4.0.2 up to the latest iRODS release, 4.1.0.
This talk and demonstration will cover the iRODS Consortium's ongoing efforts to provide confidence in the hardened capabilities of iRODS. The Consortium has developed new technology for automated topological testing of iRODS in a distributed deployment. The Consortium will demonstrate this new open source cloud infrastructure and testing framework running at RENCI.
EMC MetaLnx is an intuitive graphical tool that enables easy administration of iRODS users, groups, collections, and resources. The same interface supports iRODS users with tools for automated genome metadata extraction, metadata templates to help standardize metadata annotation, and metadata editing and search tools to assist genomic research. While the design targeted genomics research, the tool is fully applicable to general iRODS use. This talk provides details of the design concepts behind MetaLnx and a demonstration of its operational capabilities.
Background: The bioinformatics department at Bayer HealthCare generates massive amounts of genomic sequencing data, creating an emerging need to store, categorize, and work with data at various levels of the sequencing pipeline. Data security, privacy policies, and procedures must be preserved across dynamic, heterogeneous, decentralized, distributed data systems in accordance with organizational policies.
Results: We describe the design and use (from both system and user viewpoints) of iRODS as a data management system that enables efficient data storage, access, and use while ensuring compliance and maintaining data provenance. Implemented user requirements are: a) ensure data security and integrity, b) replicate data across distributed systems, c) manage iRODS user groups with LDAP-to-iRODS synchronization and access control, d) enable biologists to define and set their own metadata, with customized bulk upload tools, and e) query the metadata to find and track data. Details are given about the problems faced and the solutions found when iRODS was implemented, including managing the data integration with the existing high-performance compute cluster. The next step is to expose the iRODS virtual file system to the application layer while preserving ACLs.
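The bulk metadata tagging and query workflow described above can be sketched in miniature. The sketch below is illustrative only (it is not Bayer's actual tooling): iRODS attaches attribute-value-unit (AVU) triples to data objects, and a simple in-memory stand-in for the iCAT catalog is used here in place of a real iRODS connection.

```python
# Illustrative sketch of AVU-style metadata tagging and querying.
# CatalogStandIn, the paths, and the attribute names are all
# hypothetical; a real deployment would talk to the iCAT instead.
from collections import defaultdict

class CatalogStandIn:
    """Minimal in-memory stand-in for the iCAT's AVU metadata tables."""
    def __init__(self):
        # logical path -> list of (attribute, value, unit) triples
        self.avus = defaultdict(list)

    def add_avu(self, path, attr, value, unit=""):
        self.avus[path].append((attr, value, unit))

    def bulk_tag(self, paths, avu_rows):
        # Apply the same AVUs to many objects, as a bulk upload
        # tool might after parsing a sample sheet.
        for path in paths:
            for attr, value, unit in avu_rows:
                self.add_avu(path, attr, value, unit)

    def query(self, attr, value):
        # Find every data object carrying a matching AVU,
        # analogous to a metadata-driven catalog search.
        return sorted(
            path for path, rows in self.avus.items()
            if any(a == attr and v == value for a, v, _ in rows)
        )

catalog = CatalogStandIn()
catalog.bulk_tag(
    ["/zone/home/lab/run42/sample_a.fastq",
     "/zone/home/lab/run42/sample_b.fastq"],
    [("organism", "human", ""), ("pipeline_stage", "raw", "")],
)
print(catalog.query("pipeline_stage", "raw"))
```

The same pattern, with the catalog calls replaced by real iRODS metadata operations, lets biologists tag whole sequencing runs at once and later track data by pipeline stage.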
In this presentation we will report on a collaboration with a major global pharma company whose goal is to establish a comprehensive, scalable translational-research ecosystem that accelerates patient profiling processes.
As part of the project we have integrated our enterprise software platform, Genedata Profiler, for patient and compound profiling with the pharma company's data management infrastructure, which is based on iRODS. The integrated system allows the pharma company to process, manage, and analyze next-generation sequencing (NGS) and related omics data from patient samples in a research setting while ensuring patient privacy and complying with the highest quality standards.
We will detail our latest enhancements to Genedata Profiler, including direct communication with iRODS, audit trailing, method management, and fine-grained access control. In particular, we will describe how we have established the mandatory chain of custody in high-performance computing environments and integrated metadata management throughout the entire data lifecycle.
In summary, we will show how our end-to-end solution combines a high-performance NGS raw data processing pipeline and unparalleled data visualization with advanced iRODS core functionalities.
We will present an iRODS implementation connected to the supercomputers of the University of Grenoble (the "CIMENT" HPC center). They are linked together by a local computing grid built on the CiGri middleware, providing a local data grid environment.
We will then present a range of applications from many scientific domains using this architecture, such as the Whisper project, a data-intensive application involving a large number of files and a total volume of more than 200 TB.
Bruno Bzeznik is the technical manager of the HPC center, the main developer of the CiGri middleware, and the main iRODS system administrator at Université Joseph Fourier-Grenoble.
Briand Xavier is the computer engineer of the Whisper project.
Publicly funded research is increasingly coming with strings attached regarding the preservation and dissemination of associated data and papers. For example, the UK EPSRC now requires that institutions maintain research data online for 10 years from publication, allowing the relevant scrutiny and access by the public and other researchers. A scalable, highly automated system that integrates into the researcher's workflow and fits within an enterprise storage environment is required. We discuss practical implementations with reference to examples.
The Texas Advanced Computing Center operates a large iRODS installation on the Corral storage resource and is a collaborator on the Wrangler data-intensive compute system, the iPlant Collaborative and other projects utilizing iRODS as a component in data infrastructure. These resources serve a large and diverse community, including researchers in biomedicine and other fields for which the security and confidentiality of data represent a paramount concern.
This presentation will focus on our efforts to develop a comprehensively secure cyberinfrastructure including iRODS, addressing issues from the datacenter level through to iRODS auditing, to provide a perspective on the effort required and the areas of most concern when developing secure infrastructure.
The challenges of using iRODS to support a broad community with data privacy levels ranging from HIPAA to open access will be discussed, and techniques for data segregation and auditing will be presented, to address a range of potential use cases. We will also present on the policies and rules used to support data management generally and HIPAA specifically in distributed iRODS installations.
The iPlant Data Commons will be a layer of additional services resting on top of the iPlant Collaborative's iRODS-based Data Store. The goals of the Data Commons are to empower users with tools that facilitate the dissemination, discovery, and reproduction of high-value biological datasets and to allow users to move beyond data management to project management. To achieve this, it will use iRODS rules, resource hierarchies, and other features to provide the following capabilities. It will provide a project-oriented, rather than data-oriented, interface to the Data Store. While the research and analysis portions of a project are under way, the Data Commons will provide non-intrusive shepherding through a flexible metadata collection process, semi-automated provenance tracking, and suggestions of related projects. When the results of a project are ready for publication, the Data Commons will support, through a staging area, the assignment of persistent identifiers, such as DOIs, and deposition in canonical repositories like the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI). Finally, once the results are published, the Data Commons will increase their discoverability through search and suggestion services. It will also aid in reproducing results by allowing others to regenerate the published data by applying the generating workflows to the same raw data, or to apply novel workflows to the original raw data.
A brief status report on iRODS usage at CC-IN2P3: where we are heading with iRODS, what we need, what is missing, and possible limitations in terms of scalability.
YODA, developed by Utrecht University in the Netherlands, is an infrastructure based on iRODS to safely store, manage, and provision sensitive research datasets. Initially, it is deployed to support interdisciplinary, inter-university longitudinal research on the development of 3,000 infants and 3,000 adolescents, storing millions of files. YODA will be redeployed to support more research areas and is shared with other universities through the U2CONNECT initiative (www.u2connect.eu).
Our presentation discusses some of the challenges we have addressed, such as ingesting and enriching primary data, providing intuitive, role-based access to datasets, enabling researchers to manage dynamic groups of iRODS users, and connecting YODA with data warehouse facilities for dataset selection, with high performance computing environments, and with EUDAT, the European common data infrastructure.
Existing components reused in YODA include, among others, PAM/RADIUS, IRODS-WEBDAV, PRODS/PHP, EUDAT-B2SAFE, and EPIC-PID. The use of composable resources is planned.
This talk will cover the architectural changes and new features planned for iRODS 4.2 and later releases. This will cover changes supporting the development of the new API, new plugin interfaces, and other features in support of the 5.0 release.
This session will present prototypes and high fidelity representations of the new and improved iRODS web interface, which is currently in development. The session will explain the planned capabilities and interaction style, and will be a vehicle to solicit feedback from attendees throughout the conference that can inform the final design of the system.
We will discuss major use cases, and the status and roadmap of this important project.
This session will be a hands-on tutorial and demonstration for Java developers on how to obtain, build, run unit tests on, and develop iRODS-based applications using the Jargon API. We will demonstrate all of these through simple, clear, interactive scripting sessions.
The session will include a review of recent extensions to Jargon to power Virtual Collections, Metadata Templates, and other facilities that are powering DataNet Federation interfaces.
This talk introduces a native cross-platform iRODS client application with a rich graphical user interface. The client application is fully native and is built from a single C++ codebase on all of the platforms on which iRODS 4.0 is supported. The application thus has a very responsive UI with a native look and feel on all supported platforms and enables drag-and-drop integration with the desktop. This is made possible by building the client application on top of the Qt 5 framework and an object-oriented C++ framework for iRODS that is being developed alongside the client application. This object-oriented framework wraps the native C/C++ client API library bundled with iRODS 4.0 and provides object-oriented interfaces to iRODS protocol operations. For example, a fully object-oriented iRODS General Query (GenQuery) interface has been implemented in this C++ framework, and many features are in turn built on top of the GenQuery interface. The rich features of the Qt framework enable rapid development of graphical user interfaces that harness the underlying framework. Last but not least, by developing on top of the native C/C++ iRODS API library we can fully leverage the plugin architecture of iRODS 4.0 for authentication and network transport modules. This means that we can integrate with, for example, Kerberos authentication and SSL transports without any additional complexity.
In this talk, we describe a new development in the next release of iRODS: the pluggable rule engine architecture, which allows us to create new rule engines as plugins and to run multiple rule engines side by side. This architecture makes rule engine plugin code easy to implement and maintain, and offers significant performance gains in some use cases. It enables modular incorporation of features from other programming languages, allows efficient auditing of interactions between user-defined rules and the iRODS system, and supports full interoperability between rules and libraries written in different languages. This design allows us to easily incorporate libraries written in different programming languages, for example Python and C++, into policy sets, significantly enhancing the capabilities of iRODS without syntactic overhead. The new architecture enables a wide range of important applications, including auditing, indexing, and modular distribution of policies. We demonstrate how to create the Python rule engine plugin and how to create user-defined policy plugins.
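With a Python rule engine plugin of the kind described above, server-side policy can be written as plain Python functions invoked at policy enforcement points (PEPs). The sketch below is hedged: the function name, argument shapes, and callback interface are illustrative of the approach, not the plugin's final API, and a fake callback is used so the policy can be exercised outside a server.

```python
# Hedged sketch of a Python policy function fired after a data
# object is put. The PEP name and callback interface here are
# illustrative assumptions, not a documented plugin API.
def pep_api_data_obj_put_post(rule_args, callback, rei):
    """Audit every successful put by logging the logical path."""
    logical_path = str(rule_args[0])
    callback.writeLine("serverLog", "object ingested: " + logical_path)

# A tiny stand-in for the server-provided callback object,
# used only so the policy can be tested in isolation.
class FakeCallback:
    def __init__(self):
        self.lines = []
    def writeLine(self, target, message):
        self.lines.append((target, message))

cb = FakeCallback()
pep_api_data_obj_put_post(["/zone/home/alice/data.csv"], cb, rei=None)
print(cb.lines)
```

Because the policy is ordinary Python, it can import any Python library, which is the interoperability benefit the pluggable architecture is after.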
Leveraging advances in genomic sequencing, high-definition non-invasive imaging techniques, and biometric and other wearable devices, researchers and clinicians today can gain a more complete, science-based insight into the patient's body. These new data sets allow clinicians to provide a more straightforward and successful treatment pathway, typically labeled personalized medicine. Compared to traditional screening techniques, the data sets delivered by these instruments and modalities are massive. In this presentation we will compare and contrast how traditional healthcare and life sciences datasets are managed with emerging rules-based, orchestrated data management architectures, and we will discuss how these big data sets, coupled with predictive analytics techniques, can yield very successful treatment outcomes. We will also discuss how modern software-defined, converged storage architectures, coupled with a high-throughput file system and an integrated predictive analytics engine, can be deployed to solve most of these data management problems.
The evolution of the data center and of data itself has been dramatic in the last few years with the advent of cloud computing and the massive increase of data due to the Internet of Everything. The Integrated Rule-Oriented Data System (iRODS) helps in this changing world by virtualizing data storage resources regardless of where the data is stored. This paper explains and demonstrates a library that extends the Qt filesystem management package to provide direct access to the iRODS data system from within the Qt framework. Qt is widely used for developing graphical user interface applications that are display-platform agnostic. This library will allow Qt developers to implement applications that access the iRODS data system as a standard Qt tree-like structure.
RECOVER (Rehabilitation Capability Convergence for Ecosystem Recovery) is a site-specific decision support system (DSS) that automatically brings together in a single analysis environment the information necessary for post-fire rehabilitation decision-making. RECOVER is the result of a close collaboration between NASA's Applied Sciences Program, Idaho State University’s GIS Research and Training Center, the Bureau of Land Management (BLM), and Idaho Department of Lands (IDL). Our initial target audience for RECOVER has been BLM's Emergency Stabilization and Rehabilitation (ESR) Teams. RECOVER uses the rapid resource allocation capabilities of cloud computing to automatically collect Earth observational data, derived decision products, and historic biophysical data so that when the fire is contained, wildfire management agencies will have at hand a complete and ready-to-use RECOVER dataset and GIS analysis environment that is customized for the target wildfire. RECOVER is transforming this information-intensive process by reducing from days to a matter of minutes the time required to assemble and deliver crucial wildfire-related data.
The RECOVER DSS is made up of a RECOVER Server and a RECOVER Client. The RECOVER Server is a specialized Integrated Rule-Oriented Data System (iRODS) data grid server deployed in the Amazon Elastic Compute Cloud (EC2). The RECOVER Client is a full-featured Adobe Flex Web Map GIS analysis environment. When provided a wildfire name and geospatial extent, the RECOVER Server aggregates site-specific data from pre-designated, geographically distributed data archives. It then performs the transformations and re-projections required for the data to be used by the RECOVER Client. It exposes the tailored collection of site-specific data to the RECOVER Client through web services residing on the Server. This automatic aggregation can take place in a matter of minutes.
Last year, we successfully demonstrated RECOVER's feasibility and identified new areas where RECOVER could be useful. Over the next three years, we will deploy RECOVER into operational use in the Western US and will expand participation to include the USGS, the US Forest Service, and Burned Area Emergency Response (BAER) Teams. We will focus on enabling four key work processes: pre-fire, active-fire, and post-fire decision making, and long-term recovery monitoring. We believe the RECOVER project is making a highly leveraged, high-value contribution to our national wildfire efforts. Although difficult to quantify, we believe that RECOVER provides an even larger return on investment when viewed from a broader perspective that takes into account societal benefits, improved ecosystem services, and more effective natural resource management.
This presentation will cover software and frameworks being developed and integrated with iRODS to power the DataNet Federation Consortium. This includes an explanation of the new message based indexing framework, and work in progress on integrating the iPlant Discovery Environment with iRODS.
HydroShare is a collaborative environment under development for the sharing of hydrologic data and models. It supports sharing of and collaboration around "resources", which are defined to include both data and models in a structured, standardized format. It also supports web services and server/cloud-based computation operating on resources for the execution of hydrologic models and the analysis and visualization of hydrologic data. HydroShare uses iRODS as its network file system. Collaboration is enabled by casting datasets and models as "social objects". Social functions for datasets and models include private and public sharing, formation of collaborative groups, and value-added annotation of digital content. The HydroShare web interface and social media functions are being developed using the Django web application framework coupled to iRODS. HydroShare is being developed by a multi-institution NSF collaborative project comprising RENCI, eight universities, and the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI). This presentation will introduce the HydroShare functionality developed to date and describe how iRODS supports it.
The traditional iRODS mechanisms for file sharing, including user groups, often require some form of iRODS administrative privilege. In the HydroShare project for enabling hydrology research, we perceived a need for more flexible file sharing, including unprivileged creation and management of user groups, according to policies quite distinct from the Linux/Unix policies that initially motivated iRODS protections. This is enabled by a policy database in PostgreSQL, with management API written in Python, that runs in parallel to iCAT. Innovations in iRODS 4.1 allow us to implement access control based upon this PostgreSQL database rather than the default iCAT server, by interposing access control code before the access event using Policy Enforcement Points. The result is an access control mechanism that closely matches scientific needs for file sharing, and brings "dropbox-like" sharing semantics to the filesystem level.
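The core idea above, an access check consulting an external policy database rather than the iCAT, can be sketched as follows. This is not the HydroShare implementation: the schema, function, and paths are hypothetical, and Python's built-in sqlite3 stands in for the PostgreSQL policy store so the sketch is self-contained.

```python
# Hedged sketch: an access-control decision backed by an external
# policy database, of the kind that could be interposed at a
# pre-access policy enforcement point. sqlite3 is an in-memory
# stand-in for the real PostgreSQL policy store.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE share (path TEXT, grantee TEXT, perm TEXT)")
# An unprivileged user "shares" a file with a self-managed group,
# no iRODS administrative privilege required.
conn.execute("INSERT INTO share VALUES (?, ?, ?)",
             ("/zone/home/alice/flow.csv", "hydro-group", "read"))
conn.commit()

def allowed(path, user, groups, perm):
    """Grant access if the user, or any group they belong to,
    holds the requested permission on the path."""
    grantees = [user] + list(groups)
    placeholders = ",".join("?" * len(grantees))
    row = conn.execute(
        "SELECT 1 FROM share WHERE path = ? AND perm = ? "
        "AND grantee IN (" + placeholders + ")",
        [path, perm] + grantees).fetchone()
    return row is not None

print(allowed("/zone/home/alice/flow.csv", "bob", ["hydro-group"], "read"))
print(allowed("/zone/home/alice/flow.csv", "bob", [], "write"))
```

In the real system, a Policy Enforcement Point would call a check like this before the access event and deny the operation when it returns false, which is what yields the "dropbox-like" sharing semantics.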
EMC has created two iRODS resource drivers: one for the EMC Isilon and one for the EMC ECS Appliance. The Isilon driver unlocks the potential of EMC Isilon scale-out storage, supporting HDFS access, improved performance, and load distribution among the nodes of a clustered storage system, an improvement over the current NFS-based access. The ECS driver supports direct connection to this geo-replicated, highly available object store. In this talk we demonstrate the use of these resource drivers and explain their advantages and usage details. The drivers were developed by EMC's Skolkovo R&D Center and EMC's ECS Solutions Engineering team.
Getting Started with iRODS 4.1
The iRODS Consortium is pleased to announce our first iRODS training session, Getting Started with iRODS 4.1, to be held at the 2015 iRODS User Group Meeting on June 9th. This training is a day-long, hands-on workshop where attendees will learn how to:
This training is primarily targeted toward those who are new to iRODS or who have limited experience with iRODS but want to learn more. However, users at all levels of iRODS experience are encouraged to attend.
What to Expect
Students will leave this training with an installation of iRODS, a foundational understanding of the overall technical structure and policy capabilities of iRODS, and the ability to execute a variety of core commands.
IMPORTANT: Preparation for the Training
Before you come to the training, you will need to install a VM and a machine image on the laptop you plan to bring to the training. Please follow the instructions so you are prepared when you arrive for training.
Course Content and Goals
The training will cover the following five modules:
Planning an iRODS Deployment
Attendees will design an iRODS deployment, including:
Attendees will install iRODS on their computers. This will entail gaining an understanding of the distinction between the relational database (iCAT) and the iRODS server software, and user roles and privileges in iRODS and the host operating system.
Attendees will gain an understanding of the concept of data virtualization and how it manifests in iRODS, including:
Students will gain an understanding of what metadata is, why it is useful, and how it is represented in iRODS. Students will be able to ingest a sample data set, create metadata for the data objects in that data set, and perform metadata-driven searches for data objects.
Students will gain an understanding of: