This Technology Roadmap documents the state of the technical direction chosen for the iRODS data management software. It is the product of community feedback, our collective experience, and the guidance of the iRODS Consortium Technology Working Group.
If you have an opinion about where iRODS should go next, I encourage you to participate in this open software development process. Join the iRODS chat Google Group. Raise issues on GitHub. And if you are a Consortium Member, get involved in the Technology Working Group.
The iRODS Consortium
(Last Updated March 2017)
The focus for iRODS 4.3 will be on two things: the logging framework, and hardening the existing features and new work introduced in 4.2.
Alongside this work, but not part of the core codebase, the Consortium will be working on supporting features and functionality provided by additional external tools and features delivered as plugins.
The logging in iRODS needs to be modernized and standardized with regards to best practices around log rotation, timestamps, and human and machine readable formats. We plan to investigate a variety of logging frameworks and select one that delivers flexibility powerful enough to afford easy aggregation of logs from our distributed system.
The everpresent backlog of bugfixes and edge cases will continue to be addressed. This release will mark the first of a new tick-tock release pattern, meaning one release will focus on features, and the next will focus more on hardening. 4.2 was the feature release, and 4.3 will be the hardening release.
As file sizes continue to grow, moving them around monolithically becomes subject to a set of failure modes that are more gracefully handled with the use of smaller files. iRODS will implement multipart data objects whereby large files are broken up into parts before transfer over the network, and then reconstituted on the other end.
Each part of the file will be transferred and tracked as a regular data object within the iRODS protocol. Any individual part that does not transfer correctly can be retransmitted. This will allow iRODS to more gracefully overcome network instability, and affords cleaner restarts of large files. This work will also improve transport speeds by eventually serving as the backbone for multi-target transfers into iRODS and multi-source transfers out of iRODS.
Because each part of a larger file will now be addressable, iRODS will be able to provide POSIX-like semantics for object storage technologies. This means that these object stores can be connected to iRODS in a cache-free configuration.
iRODS strives to serve as the glue that can tie together many existing storage technologies with a unified namespace, discovery mechanism, and policy engine. Usually, when an organization decides they need to incorporate iRODS into their infrastructure, they already have a significant amount of data, usually in disparate physical systems.
iRODS needs a way to keep track of changes occurring on filesystems, where existing scripts, tools, and instruments are making changes outside of iRODS. This can be done for moderate-sized systems with a pattern we have named the "Landing Zone" whereby iRODS watches a particular location in the filesystem and takes action when new files appear. For much larger systems handling a very large number of files concurrently, this approach may become an unreasonable bottleneck.
We are working with Intel's Lustre group to investigate and then implement best practices around keeping the iRODS catalog up-to-date for such high-velocity systems.
iRODS metadata has traditionally been stored within the relational database known as the iCAT. Permissions on that metadata was originally designed to be readable by everyone because the primary use case was discoverability. As use cases have become more diverse and sophisticated, the metadata is being used to drive policy decisions as well as store additional information that should not be made available to all users of the system.
QueryArrow has been developed to help alleviate this restriction and allow iRODS to store and query metadata in a variety of other database or indexing technologies.
We plan to release the QueryArrow database plugin which will afford this abstraction of metadata.
The long term vision of the Consortium for iRODS is to create a hardened core broker which invokes services across groups of plugins accessed through a suite of thin and easy-to-use client APIs. iRODS and its source code should focus on data management. A hardened core broker will ultimately allow iRODS to grow in any direction the Membership and Community may envision. To that end, we envision a set of capabilities necessary for iRODS to reach this level of enterprise readiness.
An iRODS Zone should be able to communicate to a client the capabilities of the Zone, and which servers or groups of servers may provide the best service for the given API request and data object or collection in question. A client may request a transformation on a particular data set, a replication of that data to a computation resource or the refresh of an index of the data. The iRODS Zone needs to be able to provide the existence and location of these requested capabilities to the clients or federations in order to best service the request.
iRODS, originally designed around POSIX storage semantics, needs to move to a higher level of storage abstraction. With the emergence of requirements to treat object stores as first class resources, iRODS needs to further encapsulate the underlying storage behavior behind a more abstract storage interface.
As iRODS continues to move further toward a full implementation of the plugin model, the plugins themselves will come to depend on features of other plugins. In order to fully realize the ability to ship a set of plugins as a new feature, these plugins need the ability to declare their dependencies to the data grid administrator. Plugins should have the ability to be packaged together as a single unit, and declare to the system both their requirements as well as capabilities.
iRODS plugins, when installed, need a place to record several things: their state, their capabilities, and their dependencies. We envision a registry within the iCAT catalog where plugins may centralize this information. This feature will provide the ability for the Zone to introspect and report its capabilities to the client and other Zones. A plugin may also determine if all dependencies are met and whether it is capable of performing its operations, at which point an administrator may enable or disable the plugin.
The existing iRODS API relies heavily on the client for various operations, such as recursive collection manipulation. This has caused unnecessary network traffic as well as additional burden on the client API developer, which has historically led to various client APIs in differing states of support. Further, the existing C API is the only client capable of utilizing the binary protocol. In order to bring the support of the various client APIs into lockstep, we intend to standardize on a single serialization and messaging technology which is supported by all target languages. From this point we may begin building a new iRODS API alongside the existing API and bring the support for this new API up to date for all client implementations. We will also be able to reduce the set of exposed storage operations, more in line with object stores, as we push as much code back from the client into the server implementations.
During the creation of the next generation API, it is our intention to further abstract the transportation of data within iRODS. There is a wealth of existing technologies, such as Aspera and other parallel I/O APIs provided by various storage systems, that can benefit iRODS. One can envision an implementation in which, during a put or get, iRODS simply brokers the operation and selects the best transport mechanism available for the operation. This would permit HPC clusters storing data in iRODS on high performance parallel file systems to take full advantage of the native speed of their storage technologies.
For iRODS 4.2, our focus shifted from hardening the core to hardening the software engineering infrastructure and enhancing integration with external services. We chose a standard build system and a compiler across all supported platforms, and we moved the system to a dynamic linking strategy, once and for all allowing for the separation of client and server libraries.
For integration with external services, we worked with existing implementations from our collaborators in our membership and community to capture best practices and standardize message formats and payloads. The resulting Messaging Framework leveraged the new Rule Engine Plugin Framework, which allows for the use of new programming languages as well as special purpose plugins.
The messaging framework is a means for iRODS to asynchronously reach out to external services via an AMQP message bus. Several messaging implementations currently exist within the community, namely at CyVerse and DFC. As this technology has become increasingly important, we intend to gather best practices and standardize message formats, serialization technology, topics, and the services expected to be available for those topics.
The message bus is typically reached through iRODS via policy enforcement points (PEPs) in the rule engine. With the addition of the new Rule Engine Plugin Framework, we now have a fine-grained and fast system for message dispatch.
The iRODS Consortium provides and supports reference implementations across a variety of use cases and vertical integrations. The first three were auditing, indexing, and metadata storage.
Together, these reference implementations provide a glimpse into the power of iRODS as a cohesive system to track every operation and to provide custom search across arbitrarily detailed amounts of metadata and fulltext.
This framework allows the definition of rules in languages other than the Rule Language, such as Python or C++. It also provides the ability to run multiple rule engine plugins and rule bases simultaneously for a given data management operation, which enables the creation of special purpose Rule Engines. Coupled with the use of dynamic policy enforcement points (PEPs), this creates a powerful and versatile system for a variety of applications. In concert with the new Messaging Framework, a special rule engine plugin may publish any and all policy enforcement points and associated data to the AMQP message bus for ingest or action.
iRODS continues to improve under the hood due to newer software engineering techniques and C++ features being introduced. With the addition of the Rule Engine Plugin framework, new C++ features are now required, and they needed to be included in the infrastructure used to build and deploy iRODS. The new build system has been rewritten using CMake. The build system standardizes on a C++14-capable compiler across all supported platforms. All of the setup, upgrade, and control scripts are standardized on Python. The dynamic libraries for the server and the client have been separated and versioned, and the C API cleanly defined and isolated. This was a major departure from the prior bash and make-based build system but now provides unprecedented flexibility, power, and control as features are added in 4.2+.
Each release of iRODS begins with a general philosophy which shapes the decisions about the overall goals as well as particular features to include. As of the iRODS Consortium 4.0 release, marking the merge between Enterprise iRODS and the Community Edition, the Consortium has been focused on making this product a stable, reliable, Enterprise Ready data management system. For the 4.1 release this objective limited our scope to activities around hardening the code, fixing bugs, and building Enterprise quality infrastructure around our core product: the iRODS Server.