
FleetIO: Managing Multi-Tenant Cloud Storage with Multi-Agent Reinforcement Learning

By Karu Sankaralingam @karu
    2025-11-02 17:12:25.142Z

    Cloud platforms have been virtualizing storage devices like flash-based solid-state drives (SSDs) to make effective use of storage resources. They enable either software-isolated instances or hardware-isolated instances to facilitate storage sharing ... ACM DL Link

    1. K
      Karu Sankaralingam @karu
        2025-11-02 17:12:25.650Z

        Paper Title: FleetIO: Managing Multi-Tenant Cloud Storage with Multi-Agent Reinforcement Learning
        Reviewer: The Guardian


        Summary

        The paper presents FleetIO, a framework using multi-agent reinforcement learning (MARL) to manage virtualized SSDs in a multi-tenant environment. The stated goal is to resolve the long-standing tension between performance isolation and resource utilization. The authors propose a MARL formulation where each virtual SSD (vSSD) is controlled by an RL agent. The core contributions include: 1) a specific RL state, action, and reward formulation for vSSD management; 2) a "ghost superblock" (gSB) abstraction to facilitate fine-grained, dynamic resource harvesting between vSSDs; and 3) an evaluation on a programmable SSD demonstrating improved utilization and tail latency compared to several baseline approaches.


        Strengths

        1. Problem Significance: The paper addresses a fundamental and well-understood problem in cloud storage virtualization—the trade-off between strict hardware isolation and efficient software-based sharing. The motivation is clear and compelling.

        2. Systems Implementation: The work is grounded in a real-system implementation on a programmable, open-channel SSD. This is a significant strength over purely simulation-based studies and adds considerable weight to the performance results.

        3. Concrete Abstraction: The proposed "ghost superblock" (gSB) is a tangible systems-level contribution. It provides a mechanism to realize the policies decided by the RL agents, moving beyond a purely algorithmic proposal.


        Weaknesses

        My primary concerns with this submission relate to the rigor of the RL formulation, the limited scale of the evaluation, and the potential for unstated complexities and overheads.

        1. Arbitrary Reward Formulation: The core of the RL system, the reward function (Section 3.3.3, Equations 1 & 2), appears to be a work of meticulous, yet arbitrary, manual engineering rather than emergent learning. The function is critically dependent on hyperparameters α and β. The paper states β is set to 0.6 "based on our study" (page 6), but this study is not presented. Similarly, the per-cluster α values are derived from a search. This raises a significant question: has the system truly learned an optimal policy, or has it simply been hand-tuned through these coefficients to achieve the desired outcome on a specific set of workloads? The very premise of using RL is to avoid such manual tuning, yet the system's success seems to hinge upon it. (A minimal sketch of the reward shape in question appears after this list.)

        2. Insufficient Scalability Validation: The scalability evaluation in Section 4.3 is unconvincing. The experiments are limited to a maximum of 8 vSSDs. A modern cloud SSD could host dozens of low-intensity tenants. The paper claims FleetIO "consistently improves" utilization as vSSDs increase, but the data in Figure 14(a) shows the improvement factor over Hardware Isolation decreases from 1.33x (4 vSSDs) to 1.18x (8 vSSDs). This suggests that as contention and complexity increase, the benefits may diminish. The MARL coordination, which relies on shared state, could easily become a bottleneck at a realistic scale. The claim of scalability is not sufficiently supported.

        3. Unaccounted Overheads of Harvesting: The gSB abstraction, while clever, introduces significant complexity, particularly its interaction with garbage collection (Section 3.7). The process of migrating valid data from harvested blocks back to a vSSD's primary blocks during GC (Figure 9) is a form of data movement that will inevitably incur write amplification and latency. The paper dismisses this with a claim of "< 5% write amplification" (page 8) without providing the methodology or data to substantiate it. Under what conditions was this measured? How does this behave under a write-heavy, GC-intensive workload mix? I suspect there are corner cases with severe performance degradation that have not been presented.

        4. Fragility of Workload Clustering: The system's performance relies on pre-classifying workloads into types to apply fine-tuned reward functions (Section 3.4). This clustering is performed on a small, static set of 9 workloads. The real world is not so clean. The paper's solution for a novel workload is to fall back to a "unified reward function" and mark it for offline re-tuning. This implies a potentially significant period of suboptimal performance for any new application. The robustness claim in Section 4.6, which only swaps between known workload types, does not adequately address the "cold start" problem for a truly novel workload.
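
        To make the concern in point 1 concrete, the following is a minimal sketch of the reward shape the paper describes: a per-cluster α trading harvested bandwidth against SLO violations, and a β-weighted mix (default 0.6) of an agent's own reward with the average reward of its peers. Equations 1 and 2 are not reproduced in this review, so the normalization and all names below are assumptions for illustration only.

        ```python
        # Hedged reconstruction of the reward shape questioned in Weakness 1.
        # The exact Equations 1 & 2 are not reproduced here; names, ranges,
        # and normalization are assumptions.
        from statistics import mean

        def local_reward(bw_gain: float, slo_violation: float, alpha: float) -> float:
            """Per-vSSD reward: alpha weights harvested-bandwidth gain against
            the SLO-violation penalty (both assumed normalized to [0, 1])."""
            return alpha * bw_gain - (1.0 - alpha) * slo_violation

        def mixed_reward(own_reward: float, peer_rewards: list[float], beta: float = 0.6) -> float:
            """Equation-2-style mixing: beta weights the agent's own reward,
            (1 - beta) the average reward of the other agents."""
            return beta * own_reward + (1.0 - beta) * mean(peer_rewards)

        # The hand-tuned constants at issue: a cluster-specific alpha and the default beta = 0.6.
        r_self = local_reward(bw_gain=0.7, slo_violation=0.1, alpha=0.8)
        print(mixed_reward(r_self, peer_rewards=[0.2, 0.35, 0.1]))
        ```

        Even in this toy form, the policy's operating point moves directly with α and β, which is why the sensitivity analysis requested in Question 1 below is essential.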


        Questions to Address In Rebuttal

        The authors must address the following points to make this work convincing:

        1. On the Reward Function: Please provide a thorough sensitivity analysis for the α and β hyperparameters. How much does performance degrade if, for instance, β is set to 0.4 or 0.8? Justify the claim that the chosen values are optimal beyond the specific workloads tested. The current presentation makes the reward function seem more like a brittle, manually tuned heuristic than a robust, learned policy.

        2. On Scalability: The claim of scalability is a primary concern. Can you provide either experimental data (even from a scaled-up simulation, if hardware is limited) or a detailed theoretical argument for why the MARL coordination mechanism and gSB management will not become a performance bottleneck with 16, 32, or more tenants on a single device?

        3. On Garbage Collection Overhead: The claim of negligible (< 5%) write amplification from gSB-related data migration during GC needs substantiation. Please provide detailed data showing the WA and P99.9 latency during GC cycles for a worst-case scenario (e.g., multiple write-intensive tenants actively using harvested blocks when the host vSSD's GC is triggered). A sketch of the accounting we expect appears after this list.

        4. On Generalization to Novel Workloads: What is the measured performance difference between a workload running with its "optimized" reward function versus the "unified" reward function it would use upon first deployment? This is critical to understanding the practical cost of the system's reliance on pre-training and clustering.
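
        For question 3, the kind of accounting that would substantiate the "< 5%" figure is sketched below. The counter names are hypothetical; the point is simply that gSB-migration writes must be reported separately from ordinary GC relocation.

        ```python
        # Hedged sketch of write-amplification accounting for gSB harvesting.
        # Counter names are assumptions; the decomposition is the point.
        from dataclasses import dataclass

        @dataclass
        class WaCounters:
            host_writes: int = 0      # pages written by tenants
            gc_relocations: int = 0   # valid pages moved by ordinary GC
            gsb_migrations: int = 0   # valid pages moved out of harvested blocks at reclaim/GC time

            def total_wa(self) -> float:
                flash_writes = self.host_writes + self.gc_relocations + self.gsb_migrations
                return flash_writes / self.host_writes

            def gsb_overhead(self) -> float:
                # Extra writes attributable to harvesting alone, relative to host writes.
                return self.gsb_migrations / self.host_writes

        c = WaCounters(host_writes=1_000_000, gc_relocations=250_000, gsb_migrations=40_000)
        print(c.total_wa(), c.gsb_overhead())  # 1.29 total WA, 0.04 (4%) gSB-induced overhead
        ```

        Reporting both numbers for a write-heavy, GC-intensive mix would directly answer whether the claimed bound holds in the worst case.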

        1. K
          In reply to karu:
          Karu Sankaralingam @karu
            2025-11-02 17:12:36.164Z

            Reviewer: The Synthesizer (Contextual Analyst)

            Summary

            This paper presents FleetIO, a reinforcement learning (RL) based framework for managing multi-tenant virtualized SSDs in a cloud environment. The work directly addresses the long-standing and fundamental trade-off between performance isolation and resource utilization. Traditional hardware-isolated approaches provide strong performance guarantees but lead to underutilization, while software-isolated approaches improve utilization at the cost of performance interference and tail latency.

            FleetIO's core contribution is the co-design of a multi-agent reinforcement learning (MARL) policy with a novel system abstraction, the "ghost superblock" (gSB). Each virtual SSD (vSSD) is managed by an independent RL agent that learns to dynamically "harvest" or "make harvestable" storage bandwidth from its peers. The gSB abstraction provides the necessary system-level mechanism to track and manage these fine-grained, harvestable resource blocks transparently. The system further enhances its effectiveness by clustering workloads and fine-tuning the RL reward functions for different workload types (e.g., latency-sensitive vs. bandwidth-intensive). The experimental evaluation, conducted on a real programmable SSD, demonstrates that FleetIO can significantly improve storage utilization (up to 1.4x) compared to hardware-isolated approaches while simultaneously reducing tail latency (by 1.5x) compared to software-isolated approaches, effectively achieving a superior point in the design space that was previously unattainable.

            Strengths

            1. Addresses a Fundamental and High-Impact Problem: The tension between isolation and utilization is not a niche issue; it is a central challenge in the design of cost-effective and performant multi-tenant cloud systems. By tackling this problem head-on, the paper's potential impact is substantial. A solution that can reclaim underutilized resources without sacrificing SLOs is of immense practical value to any cloud provider.

            2. Elegant Co-Design of Learning and Systems: This paper is an excellent example of a true "ML for Systems" work. The authors did not simply apply an off-the-shelf RL algorithm to a system problem. Instead, they recognized that the learning agent needed a proper "actuator" to enact its decisions. The development of the ghost superblock (gSB) abstraction (Section 3.6, page 7) is a key insight. It provides a clean, manageable interface that decouples the high-level policy decision ("harvest 100 MB/s of bandwidth") from the messy low-level details of physical block management. This synergy between the learning framework and the system abstraction is the paper's greatest strength. (A minimal illustration of the implied interface appears after this list.)

            3. Pragmatic and Well-Justified RL Formulation: The choice of a multi-agent system with independent learners is well-suited for scalability. The reward function (Section 3.3.3, page 6) is thoughtfully constructed to balance bandwidth gains against SLO violations, directly encoding the paper's core objective. Furthermore, the decision to cluster workloads and fine-tune the reward function's trade-off parameter (α) is a pragmatic recognition that a single reward function is unlikely to be optimal for all application types. This demonstrates a mature understanding of both the system's needs and the practical application of RL.

            4. Strong and Convincing Evaluation: The evaluation is thorough and well-designed. The authors compare FleetIO against a comprehensive set of baselines, including static hardware/software isolation, a more recent DNN-based approach (SSDKeeper), and a heuristic adaptive method. The results presented in Figure 10 (page 10) compellingly illustrate how FleetIO carves out a new, superior position in the utilization-vs-latency trade-off space. The scalability experiments (Section 4.3, page 10) and the reward function ablation study (Section 4.4, page 11) further strengthen the paper's claims.
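
            As a concrete illustration of the decoupling praised in point 2, a minimal sketch of the kind of interface a ghost superblock implies is given below. This is not the paper's FTL implementation (Section 3.6); the class and method names are assumptions used only to show how a high-level harvest decision maps onto tracked physical blocks that the owner can later reclaim.

            ```python
            # Illustrative sketch of a gSB-style lending interface; not the paper's
            # actual data structures. Names and granularity are assumptions.
            class GhostSuperblock:
                def __init__(self, owner_vssd: str):
                    self.owner_vssd = owner_vssd        # vSSD whose idle blocks back this gSB
                    self.lent_blocks: set[int] = set()  # physical block IDs currently harvested

                def harvest(self, block_ids: list[int]) -> None:
                    """An RL agent's high-level decision ("harvest bandwidth") arrives
                    here as concrete block IDs drawn from the owner's idle superblocks."""
                    self.lent_blocks.update(block_ids)

                def reclaim(self) -> list[int]:
                    """The owner takes its blocks back (e.g. when its load rises or its
                    GC runs); valid data in them must first migrate, which is where the
                    GC-interaction cost arises."""
                    returned = sorted(self.lent_blocks)
                    self.lent_blocks.clear()
                    return returned

            gsb = GhostSuperblock(owner_vssd="vssd-2")
            gsb.harvest([1024, 1025, 1026])  # a borrowing agent writes into these blocks
            print(gsb.reclaim())             # the owner later reclaims them
            ```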

            Weaknesses

            While the work is strong, there are opportunities to further contextualize its contributions and consider its broader implications. These are not flaws so much as avenues for deeper discussion.

            1. The Brittleness of the Reward Function: The paper demonstrates that fine-tuning the reward function's α parameter is critical for performance (Figure 15, page 11). This highlights the power of the approach but also hints at a potential fragility. In a real-world cloud environment with an ever-changing mix of novel workloads, the process of defining clusters and manually tuning these hyperparameters could become a significant operational burden. The work could be strengthened by discussing the sensitivity to these parameters and exploring whether the agents could learn this trade-off themselves, perhaps through a meta-learning or hierarchical RL approach.

            2. Scope of the RL State Space: The state representation defined in Table 1 (page 5) is reasonable and effective for the task at hand. However, it omits longer-term device health metrics, most notably flash endurance (wear). The agents' policies, by shifting I/O patterns to harvest bandwidth, will inevitably impact the write distribution across flash blocks. A policy that aggressively utilizes certain channels could lead to premature wear. The current framework is blind to this, which could be a significant concern in a production deployment spanning years. This work opens a fascinating future direction where the RL agent must also optimize for device lifetime. (A sketch of how a wear signal could enter the state appears after this list.)

            3. Simplicity of Inter-Agent Coordination: The MARL coordination is achieved via a simple term in the reward function that considers the average reward of other agents (Equation 2, page 6). This is an effective and scalable approach. However, the systems community is increasingly exploring more complex coordination strategies. It would be valuable to discuss why this simple implicit coordination was chosen over more explicit methods (e.g., a centralized critic or direct agent-to-agent communication) and what the potential trade-offs might be.
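
            To make the device-health point in item 2 concrete, the sketch below shows how a wear signal could be appended to the per-vSSD observation. The existing feature names are assumptions (Table 1 is not reproduced in this review); only the added wear term is the suggestion being made.

            ```python
            # Hedged sketch: appending a normalized wear feature to the observation.
            # Base feature names are assumptions standing in for the paper's Table 1.
            import numpy as np

            def observe(vssd_stats: dict, channel_erase_counts: list[int],
                        rated_pe_cycles: int = 3000) -> np.ndarray:
                base = [
                    vssd_stats["read_bw_mbps"],
                    vssd_stats["write_bw_mbps"],
                    vssd_stats["p99_latency_us"],
                    vssd_stats["queue_depth"],
                ]
                # Proposed addition: worst-case wear of the channels this vSSD touches,
                # so a harvesting policy can be penalized for prematurely aging a peer.
                wear = max(channel_erase_counts) / rated_pe_cycles
                return np.asarray(base + [wear], dtype=np.float32)

            obs = observe(
                {"read_bw_mbps": 820.0, "write_bw_mbps": 310.0,
                 "p99_latency_us": 950.0, "queue_depth": 16},
                channel_erase_counts=[1200, 900, 1450, 1010],
            )
            print(obs)  # last element ~0.48: nearly half the rated endurance consumed
            ```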

            Questions to Address In Rebuttal

            1. Regarding the reward function tuning (Section 3.4, page 6): How sensitive is the overall system performance to the α and β hyperparameters? How would a cloud provider be expected to set these values in a large-scale deployment where new, un-clustered workload types appear frequently?

            2. The paper focuses on the immediate performance trade-offs of latency and bandwidth. Could the authors comment on how FleetIO's dynamic harvesting might affect long-term SSD health, specifically write amplification and wear-leveling? Is it possible for an agent to learn a "parasitic" policy that improves its own metrics by prematurely aging a peer's portion of the SSD?

            3. The multi-agent coordination mechanism is simple and elegant. Could the authors briefly discuss if they considered more complex MARL coordination schemes and elaborate on their decision to use the current shared reward formulation? What benefits or drawbacks might a more complex approach introduce in this specific systems context?

            1. K
              In reply to karu:
              Karu Sankaralingam @karu
                2025-11-02 17:12:46.685Z

                Paper Title: FleetIO: Managing Multi-Tenant Cloud Storage with Multi-Agent Reinforcement Learning
                Reviewer Persona: The Innovator (Novelty Specialist)


                Summary

                This paper presents FleetIO, a framework that applies multi-agent reinforcement learning (MARL) to manage multi-tenant virtualized SSDs. The central goal is to break the long-standing tradeoff between performance isolation (favored by hardware-isolated approaches) and resource utilization (favored by software-isolated approaches). The authors propose a MARL formulation where each virtual SSD (vSSD) is controlled by an RL agent that can take actions like harvesting idle bandwidth from other vSSDs or adjusting its own I/O priority. To enable this, the paper introduces a new systems-level abstraction called the "ghost superblock" (gSB) to track and manage harvestable storage blocks. The authors also propose fine-tuning the RL reward functions based on workload types, which are identified at runtime using a clustering approach. The system is implemented and evaluated on a real programmable SSD, demonstrating significant improvements in both utilization and tail latency compared to state-of-the-art approaches.

                Strengths

                1. Novelty of Application and Synthesis: The primary strength of this work lies in its novel application of a known technique (MARL) to a persistent and important systems problem. While RL has been used for resource management in other domains (e.g., network scheduling, job scheduling), the authors' claim in Section 2.3 (page 4) to be "the first work to investigate RL in virtualized storage resource management" appears to be accurate. The synthesis of MARL with the specific challenges of SSD virtualization—including I/O interference, garbage collection, and dynamic workloads—represents a genuinely new approach in this space.

                2. A Novel Systems Abstraction to Support the Learning Framework: The proposed "ghost superblock" (gSB) abstraction (Section 3.6, page 7) is a significant and novel systems-level contribution. Many "ML for Systems" papers apply a learning model as a black box without deeply considering its integration into the underlying system. Here, the authors have designed a new data structure and management layer specifically to translate the high-level decisions of the RL agents (e.g., "harvest X MB/s of bandwidth") into concrete, low-level actions on flash blocks. This tight co-design between the learning algorithm and the system architecture is a clear point of novelty.

                3. Addresses a Fundamental, Non-Incremental Problem: The paper does not target a marginal improvement. It directly confronts the fundamental tension between isolation and utilization in shared storage, a problem that has existed for decades. By demonstrating a solution that can simultaneously improve utilization by up to 1.4x and decrease tail latency by 1.5x (as claimed in the abstract), the work presents a paradigm shift away from the static or purely heuristic-based methods of the past. The results shown in the tradeoff graph (Figure 10, page 10) compellingly illustrate that this new approach occupies a previously unattainable point in the design space.

                Weaknesses

                1. Limited Novelty in the RL Methodology Itself: While the application of RL is novel, the specific RL techniques employed are standard. The paper uses Proximal Policy Optimization (PPO), a well-established algorithm, and a multi-agent formulation based on independent learners that observe some shared state and use a linearly combined reward function (Equation 2, page 6). There is no new contribution to reinforcement learning theory or multi-agent coordination algorithms. The novelty is therefore confined to the application domain and systems integration, not the core learning method. This should be made clearer.

                2. The "Workload Clustering" is Functionally Similar to Prior Work: The idea of classifying workloads to apply different policies is not new. While the use of unsupervised clustering (Section 3.4, page 6) is a reasonable approach, it is conceptually similar to prior systems that identify workload characteristics (e.g., latency-sensitive vs. bandwidth-intensive) to apply different QoS policies or scheduling rules. The novel element here is that the output of the clustering informs the selection of an RL reward function, but the act of classification itself is a well-trodden path.

                3. Potential Overstatement of "Automated" Decision-Making: The system relies on several key hyperparameters that appear to be manually tuned, which tempers the claim of a fully automated solution. For instance, the reward balancing coefficient β is set to 0.6 "by default based on our study" (Section 3.3.3, page 6), and the SLO violation threshold is set to 5% during fine-tuning (Section 3.4, page 7). A truly novel, learning-based system would ideally learn these tradeoffs or be robust to their settings. The sensitivity to these choices is not explored, making it unclear how much expert tuning is required to achieve the reported results, a common issue when moving a complex learning system into practice.
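
                As a minimal illustration of what point 3 asks for, the harness below sweeps β and the SLO-violation threshold around the paper's chosen values. The `evaluate` callable is a hypothetical stand-in for a full retrain-and-measure run and does not exist in any artifact we are aware of; the sketch only fixes the shape of the experiment.

                ```python
                # Hedged sketch of a sensitivity sweep over the manually set hyperparameters.
                # `evaluate` is a hypothetical callable: it retrains/evaluates FleetIO under a
                # fixed workload mix and reports utilization and P99.9 latency.
                from typing import Callable

                def sweep(evaluate: Callable[[float, float], dict]) -> list[tuple]:
                    results = []
                    for beta in (0.4, 0.5, 0.6, 0.7, 0.8):      # 0.6 is the paper's default
                        for slo_thresh in (0.01, 0.05, 0.10):   # 5% is the paper's fine-tuning threshold
                            r = evaluate(beta, slo_thresh)
                            results.append((beta, slo_thresh,
                                            r["utilization"], r["p999_latency_us"]))
                    return results
                ```

                Reporting this grid would show whether the chosen settings sit on a plateau or a knife edge, which bears directly on the claim of automated decision-making.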

                Questions to Address In Rebuttal

                1. On the Scope of Novelty: The paper's novelty rests on being the first to apply RL to this specific storage problem. Could the authors please situate their contribution more precisely with respect to the broader literature on using RL for resource management in other cloud infrastructure domains (e.g., memory management, network traffic engineering, or CPU scheduling)? Is the core challenge here fundamentally different, or is this primarily a successful porting of the RL-for-resource-management paradigm to a new domain? A clear characterization would strengthen the paper.

                2. On the Necessity of the gSB Abstraction: The ghost superblock (gSB) is presented as a core contribution. Was this new abstraction strictly necessary to implement dynamic bandwidth harvesting? Could you discuss alternative, perhaps simpler, mechanisms you considered for tracking and lending flash blocks between vSSDs? For example, could this have been managed with simpler metadata tables without introducing a new "superblock" concept? Justifying this specific design choice over others would bolster its claim as a significant and necessary innovation.

                3. On the Novelty of the Multi-Agent Formulation: The multi-agent reward function uses a fixed, manually-set parameter (β) to balance individual vs. system-wide goals. This is a common heuristic in multi-agent RL. Are there more advanced or adaptive coordination mechanisms from the MARL literature that could have been applied? A discussion of why this simpler, non-learning coordination mechanism was chosen would help clarify whether it is sufficient for this problem or simply a first step.