- CC-IN2P3 is hosting 80M files on iRODS at a peak access rate of 800k files per day.
- The iPlant Collaborative uses iRODS to serve over 110M files, nearly 1 PB, to over 20k users.
- The Wellcome Trust Sanger Institute uses iRODS to host over 24 PB of data, replicated to make nearly 48 PB of data.
The institutions above have made specific adaptations to their iRODS configurations in order to operate at scale. To understand how these adaptations work, consider the following diagram of a small iRODS deployment.
Potential bottlenecks exist wherever a workload is not distributed over multiple services. Most efforts to prepare iRODS for scalability focus on distributing the load on the ICAT-enabled server (IES) or the ICAT database. Three methods have been identified to provide this load distribution: zone segmentation, database load balancing, and round-robin DNS.
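Zone Segmentation
One way to lessen the load on an IES and ICAT database is to divide the deployment into multiple administrative zones. Each zone has its own IES and its own ICAT database, so catalog traffic is split among the zones instead of converging on a single server. Because iRODS can connect multiple zones through federation, users logged into zoneA can access files stored on zoneB using their zoneA login credentials.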
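To make the zoneA/zoneB example concrete, the following minimal sketch shows how the zone can be read from an iRODS logical path, whose first component names the zone, and how a federated user is conventionally written as `user#zone`. The helper names (`split_zone`, `is_remote`, `federated_user`) and the sample user and paths are hypothetical and are not part of any iRODS client library.

```python
from typing import NamedTuple


class LogicalPath(NamedTuple):
    zone: str  # first path component, e.g. "zoneB"
    rest: str  # remainder of the path inside that zone


def split_zone(logical_path: str) -> LogicalPath:
    """Split an absolute iRODS logical path into its zone and the remainder."""
    if not logical_path.startswith("/"):
        raise ValueError("iRODS logical paths are absolute and start with '/'")
    parts = logical_path.strip("/").split("/", 1)
    if not parts[0]:
        raise ValueError("path does not name a zone")
    rest = "/" + parts[1] if len(parts) > 1 else "/"
    return LogicalPath(zone=parts[0], rest=rest)


def is_remote(logical_path: str, home_zone: str) -> bool:
    """Return True when the path lives in a zone other than the home zone."""
    return split_zone(logical_path).zone != home_zone


def federated_user(user: str, home_zone: str) -> str:
    """Name under which a remote zone refers to a federated user (user#zone)."""
    return f"{user}#{home_zone}"


if __name__ == "__main__":
    path = "/zoneB/home/alice/data.txt"  # hypothetical path
    print(split_zone(path).zone)          # zone that owns the catalog entry
    print(is_remote(path, "zoneA"))       # request crosses the federation
    print(federated_user("alice", "zoneA"))
```

In this sketch, a request from a zoneA user for a `/zoneB/...` path is classified as remote, which is the case where the zoneB catalog, rather than the zoneA catalog, does the lookup.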
Drawbacks: Federation is not completely seamless. Separate zones can have separate sets of policies and separate sets of users. This is extraordinarily flexible, because federation is meant to permit completely separate organizations to share data. But it also increases the complexity of the system, particularly if one person is responsible for administering all the zones. Another limitation of this approach is that metadata is not transferred when files are copied between zones.
For more about iRODS federation, see the iRODS manual.
Database Load Balancing
The ICAT database is critically important for the operation of a zone. All three of the presently supported database management systems (DBMS), namely PostgreSQL, MySQL, and Oracle, have load balancing capabilities available, through pgpool, HAProxy, or Oracle RAC, respectively. A load-balanced database appears the same to iRODS as an ordinary database, so no additional configuration of iRODS is necessary.
Consult the support resources for the load balancing system of your specific DBMS for more information.
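The claim that a load-balanced database looks like an ordinary one can be illustrated with a toy model. The sketch below is not how pgpool, HAProxy, or Oracle RAC are implemented. The `Database` and `LoadBalancedDatabase` classes and the `catalog_lookup` function are hypothetical stand-ins, and the model ignores replication and the routing of writes, which a real balancer must handle.

```python
import itertools
from typing import Sequence


class Database:
    """Stand-in for a single database backend (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.queries_served = 0

    def execute(self, sql: str) -> str:
        self.queries_served += 1
        return f"{self.name} ran: {sql}"


class LoadBalancedDatabase:
    """Offers the same execute() interface, but rotates over several backends."""

    def __init__(self, backends: Sequence[Database]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._next_backend = itertools.cycle(backends)

    def execute(self, sql: str) -> str:
        # Hand each query to the next backend in turn.
        return next(self._next_backend).execute(sql)


def catalog_lookup(db) -> str:
    """Client code: it cannot tell a single server from a balancer."""
    return db.execute("SELECT 1")


if __name__ == "__main__":
    backends = [Database("db1"), Database("db2")]
    balanced = LoadBalancedDatabase(backends)
    for _ in range(4):
        catalog_lookup(balanced)
    print([b.queries_served for b in backends])
```

The point of the design is that `catalog_lookup` is identical in both cases: the client holds one connection target, and the distribution happens behind it.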
Round-Robin DNS
A third technique used to scale iRODS deployments is to use round-robin DNS with multiple IES instances. In this configuration, the IES instances are replicas of one another, and they are all connected to the same ICAT database, which may be configured with its own load balancing scheme. The DNS server is configured to resolve a single hostname to each IES address in turn, so that requests are distributed among the IES instances.
Consult the documentation for your DNS server (or an alternative, similar high availability solution) for more information about this approach.
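The rotation behind round-robin DNS can be sketched as a small simulation. It assumes a resolver that returns all address records for a name and rotates their order after every query, with clients connecting to the first address returned. The `RoundRobinResolver` class, the hostname `ies.example.org`, and the 192.0.2.x addresses (a range reserved for documentation) are hypothetical. Real deployments are also affected by client-side and resolver caching, which this sketch ignores.

```python
from collections import Counter, deque
from typing import Dict, List


class RoundRobinResolver:
    """Toy resolver that rotates the address list after every query."""

    def __init__(self, records: Dict[str, List[str]]) -> None:
        self._records = {name: deque(addrs) for name, addrs in records.items()}

    def resolve(self, name: str) -> List[str]:
        addrs = self._records[name]
        answer = list(addrs)  # order seen by this client
        addrs.rotate(-1)      # next client sees a different first address
        return answer


def simulate(resolver: RoundRobinResolver, name: str, requests: int) -> Counter:
    """Count how many simulated client connections each address receives."""
    hits: Counter = Counter()
    for _ in range(requests):
        first_address = resolver.resolve(name)[0]
        hits[first_address] += 1
    return hits


if __name__ == "__main__":
    resolver = RoundRobinResolver(
        {"ies.example.org": ["192.0.2.11", "192.0.2.12", "192.0.2.13"]}
    )
    print(simulate(resolver, "ies.example.org", 9))
```

With three addresses and nine requests, each simulated IES receives three connections, which is the even spread the technique aims for when clients do not cache answers.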
Future iRODS Architecture
The solutions discussed above are techniques that have been proven to work when deploying iRODS at scale. The iRODS Consortium is also investigating the use of nontraditional (e.g., NoSQL) databases as a means of scaling the ICAT database to handle a massive number of records. However, significant architectural changes are required to make this possible.
Alternative database technologies and other iRODS improvements are topics of active discussion at iRODS Consortium Technology Working Group meetings, which bring together Consortium members each month to discuss technical issues and roadmaps for iRODS.