# ADSTS-Ceph自动调参

- ICPP，4/9日

## DAC Review

```
============================================================================
DAC 2022 Reviews for Submission #885
============================================================================
Title: ACSTS: Automatic Cloud Storage Tuning System via Parameter Identifcation and Reinforcement Learning
Authors: Kai Lu, Jiguang Wan, Nannan Zhao, Wei Zhao and Ruixiang Ma
============================================================================
REVIEWER #1
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Clarity / Writing Style (1-5): 3
Originality / Innovativeness (1-5): 3
Impact of Ideas and/or Results (1-5): 3
OVERALL RECOMMENDATION (1-5): 2
Summarize shortly the contributions of the paper in your own words.
---------------------------------------------------------------------------
This work presents an RL-based framework for automatic cloud storage parameter tuning to achieve system configurations which optimize system latency. It proposes a multi-step approach which comprises of a parameter selection step to focus auto-tuning on the highest impact parameters followed by a double-DQN offline training protocol with experience reply and then online tuning to recommend the optimal parameter configurations characterized by “cautious” stage-wise training.
---------------------------------------------------------------------------
Strengths
---------------------------------------------------------------------------
+ Real and relevant problem
+ Reasonable proposal of an RL formulation of the auto-tuning problem
---------------------------------------------------------------------------
Weaknesses
---------------------------------------------------------------------------
- Lack of motivation for various design decisions (ex. Number of tunable parameters, K; parameter boundary search space initialization)
- Action space is severely limited, without motivation
- Paper organization
---------------------------------------------------------------------------
Main Discussion of Paper
---------------------------------------------------------------------------
The key contribution of this paper appears to be the proposal of the RL framework. The first step of parameter processing and selection appears very similar to prior approach as described by Sapphire [6]
The paper presents a number of important design decisions, for example the number of tunable parameters K; the parameter search space boundary initialization; “performance satisfaction” for tuning stopping criteria. These decisions are not well motivated nor is the process well described. For example,
- The tunable parameter number is possibly the most important design decision as it defines the performance and runtime of the auto-tuning system. No discussion is provided on how to select the optimal number, how the proposed number of 32 was arrived at, or the shortcomings of setting a fixed number. It seems an important discussion point and also a very relevant extension would be to learn the optimal number as part of the learning process.
- A dynamic boundary is used to handle the continuous search space; however the initialize scheme for such an approach is not described.
- Tuning stopping criteria is not defined nor is it defined whether it is something which should be set by a given user or automatically set by the auto-tune system
The parameter tree tree-based indexing structure is critical to the system’s ability to easily prune the first level of irrelevant parameters and quickly narrow the search space. It seems the parameter tree may vary for different cloud storage systems. The method for building the tree is not described, as such it is not clear how robust the method is to generalizing across different systems or if this tree construction must be done by hand for a given system. The later would detract from the automation of the proposed work.
The work describes that “an action is to increase or decrease all 𝐾 tunable parameter values at a time”. It seems sub-optimal that all parameters should only ever move in the same direction. There is no discussion or defense why the action space was limited in such a way and what the recognized limitations may be with such an approach.
Writing/clarity: The paper is not well organized. Ideas and architecture components are introduced, and the re-introduced in repetition, each time with a bit more depth. This results in a feeling of backtracking and losing place of where one is at with respect to the architecture. It also contains a set of minor typos.
---------------------------------------------------------------------------
Most prominent Strength or Weakness
---------------------------------------------------------------------------
This recommendation is primarily motivated by the lack of motivation or description of the many design decisions involved in the process. Some baked in assumptions about the design process e.g. the parameter tree tree-based index structuring may limit the automation of the proposed work.
---------------------------------------------------------------------------
============================================================================
REVIEWER #2
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Clarity / Writing Style (1-5): 3
Originality / Innovativeness (1-5): 1
Impact of Ideas and/or Results (1-5): 2
OVERALL RECOMMENDATION (1-5): 2
Summarize shortly the contributions of the paper in your own words.
---------------------------------------------------------------------------
This paper introduces a cloud storage tuning system based on reinforcement learning. Techniques like pre-training feature pruning and stage-wise training are used to shorten training time.
---------------------------------------------------------------------------
Strengths
---------------------------------------------------------------------------
+ 1.5x-2.5x better performance compared to previous work
+ Cautious evaluation: longer evaluation time for each sample for stable state result
---------------------------------------------------------------------------
Weaknesses
---------------------------------------------------------------------------
- No obvious innovation, out-of-box algorithm and common ML techniques used
- No discussion on how the proposed method solves main problems that other methods face
- Experiment method is simple
---------------------------------------------------------------------------
Main Discussion of Paper
---------------------------------------------------------------------------
This paper introduces a cloud storage tuning system based on reinforcement learning. It uses TD3 as the tuning algorithm. Also, techniques like parameter pruning/identification and stagewise training is used to speed up training. To avoid evaluation pitfalls, the authors proposed longer evaluation period for each sample (waiting for the system to reach stable state). Compared to manual tuning and past work (SAPPHIRE), the proposed method has 1.5x-2.5x better performance in terms of bandwidth and average latency.
There are several fields that are not clear or not clearly discussed:
1. Will the proposed method solve or alleviate the three common problems listed as observations in section 1, especially the first two? If so, why?
2. Is there any reason of choosing Lasso regularization over the more popular Ridge?
3. On the last page the authors said there are four workload mixes but only 3 are shown in table 1 and figure 6. 3 mixes are too few experiment samples and it would need better reasoning on why they are sufficient.
4. The selected TD3 algorithm is somewhat highlighted. Is it expected to deliver better recommendation, or faster training, or both?
---------------------------------------------------------------------------
Most prominent Strength or Weakness
---------------------------------------------------------------------------
Weakness: this paper mainly uses off-the-shelf ML algorithm and training technique. The most important adaption the authors make is cautious evaluation (increasing evaluation runtime for each sample)
---------------------------------------------------------------------------
============================================================================
REVIEWER #3
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Clarity / Writing Style (1-5): 4
Originality / Innovativeness (1-5): 3
Impact of Ideas and/or Results (1-5): 3
OVERALL RECOMMENDATION (1-5): 2
Summarize shortly the contributions of the paper in your own words.
---------------------------------------------------------------------------
According to the observation of this paper, it can be known that parameter is one of the main reasons affecting performance, and reinforcement learning is used for the cloud storage tunning system to improve performance.
---------------------------------------------------------------------------
Strengths
---------------------------------------------------------------------------
+ This paper has a good observation to find the main challenge of cloud storage, and the author proposed the system architecture with clear strategies for the dynamic workloads.
+ The paper presents a complete architecture for the cloud storage system.
+ The system overview and parameter model are organized well.
---------------------------------------------------------------------------
Weaknesses
---------------------------------------------------------------------------
- The pronouns in the picture cannot be explained clearly, and it is impossible for people to know what the comparison standard is behind it, including Figure 2 and Fiugre3.
- Due to unclarity in observation 2, the problems described in the introduction are not convincing.
---------------------------------------------------------------------------
Main Discussion of Paper
---------------------------------------------------------------------------
The paper points the main challenge in the cloud storage system. Along this direction, the author defines parameter in the model and propose an impressive architecture to improve the issues. Although the paper touches upon almost all implementation aspects, I still have the following concerns.
1. Many recent works [1, 2] proposed related research in this field; The authors need to provide more previous work and compare it.
2. The authors should include one section to discuss the related works about the could storage tuning system development.
3. The authors only run three workloads in the experiments. The reviewer thinks it is not sufficient. The authors should include more workloads and some workloads in real applications.
4. The authors did not mention the overhead analysis. If the authors can analyze the computing overhead and memory footprint of the proposed method, it is appreciated.
5. As this paper focuses on cloud storage tuning systems, it seems to be a little out of scope.
[1] Haoyu Wang, Haiying Shen, Qi Liu, Kevin Zheng, and Jie Xu. 2020. A Reinforcement Learning Based System for Minimizing Cloud Storage Service Cost. In 49th International Conference on Parallel Processing - ICPP (ICPP '20). Association for Computing Machinery, New York, NY, USA, Article 30, 1–10. DOI:https://doi.org/10.1145/3404397.3404466
[2] R. R. Noel, R. Mehra and P. Lama, "Towards Self-Managing Cloud Storage with Reinforcement Learning," 2019 IEEE International Conference on Cloud Engineering (IC2E), 2019, pp. 34-44, doi: 10.1109/IC2E.2019.000-9.
In addition to the major concerns above, the paper also has language/grammar issues. Here are a few.
6. it takes nearly 4 hours to reach the optima and using 16 parameters will lead to suboptimal performance. -> ‘’…the optima, and using 16 parameters…’’
7. In conclusion: “ACSTS outperforms existing auto-tuning strategies” -> “ACSTS outperforms existing auto-tuning strategies.”
---------------------------------------------------------------------------
Most prominent Strength or Weakness
---------------------------------------------------------------------------
The paper does not compare the proposed solution with some recent studies.
---------------------------------------------------------------------------
============================================================================
REVIEWER #4
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Clarity / Writing Style (1-5): 3
Originality / Innovativeness (1-5): 3
Impact of Ideas and/or Results (1-5): 3
OVERALL RECOMMENDATION (1-5): 2
Summarize shortly the contributions of the paper in your own words.
---------------------------------------------------------------------------
The authors developed ACSTS to solve the limitations of classical auto-tuning systems with their proposed strategies. ACSTS is with a Parameter Model to identify important parameters and adopts DRL method to find optimal configurations under dynamic workloads
---------------------------------------------------------------------------
Strengths
---------------------------------------------------------------------------
+ The performance improvement is impressive;
+ This paper is well organized and understandable;
---------------------------------------------------------------------------
Weaknesses
---------------------------------------------------------------------------
- The idea lacks novelty;
- The stability of the DRL algorithm needs further discussion;
---------------------------------------------------------------------------
Main Discussion of Paper
---------------------------------------------------------------------------
The authors developed ACSTS to solve the limitations of classical auto-tuning systems with their proposed strategies. The experimental results demonstrate the effectiveness of ACSTS in terms of performance (Bandwitth/Latency) and search cost. However, the novelity of this paper needs to be improved. The algorithm of DRL may face the problem of stability, which should be discussed in detail.
---------------------------------------------------------------------------
Most prominent Strength or Weakness
---------------------------------------------------------------------------
The algorithm of DRL may face the problem of stability, which should be discussed in detail.
---------------------------------------------------------------------------
```

### Reviewer 1

缺乏各种设计决策的动机（例如可调参数的数量，K；参数边界搜索空间初始化）

参数处理和选择的第一步似乎与 Sapphire [6] 描述的先前方法非常相似。本文提出了一些重要的设计决策，例如可调参数 K 的数量；参数搜索空间边界初始化；调整停止标准的“性能满意度”。这些决定没有很好的动机，也没有很好地描述过程。例如，

- 可调参数编号可能是最重要的设计决策，因为它定义了自动调整系统的性能和运行时间。没有讨论如何选择最佳数字，建议的数字 32 是如何得出的，或者设置固定数字的缺点。这似乎是一个重要的讨论点，也是一个非常相关的扩展，将学习最佳数字作为学习过程的一部分。
- 动态边界用于处理连续搜索空间；但是没有描述这种方法的初始化方案。
- 调谐停止标准没有定义，也没有定义它是应该由给定用户设置还是由自动调谐系统自动设置

没有描述构建树的方法

基于参数树的索引结构对于系统轻松修剪第一级不相关参数和快速缩小搜索空间的能力至关重要。似乎参数树可能因不同的云存储系统而异。没有描述构建树的方法，因此不清楚该方法对于跨不同系统的泛化有多健壮，或者对于给定的系统是否必须手动完成这种树的构建。后者会减损拟议工作的自动化。

行动空间严重受限，没有动力

该工作描述了“一个动作是一次增加或减少所有 𝐾 可调参数值”。似乎所有参数都应该只朝同一个方向移动，这似乎不是最理想的。没有讨论或辩护为什么行动空间以这种方式受到限制，以及这种方法可能存在哪些公认的限制。

论文组织

论文组织得不好。引入了想法和架构组件，并在重复中重新引入，每次都有更多的深度。这导致了一种回溯的感觉，并失去了一个人在架构方面所处的位置。它还包含一组小错别字。

### Reviewer 2

- 没有明显的创新、开箱即用的算法和常用的机器学习技术
- 有什么理由选择 Lasso 正则化而不是更流行的 Ridge？
- 选择的TD3算法有些突出。是否期望提供更好的推荐或更快的培训，或两者兼而有之？

没有讨论所提出的方法如何解决其他方法面临的主要问题

所提出的方法能否解决或缓解第 1 节中作为观察结果列出的三个常见问题，尤其是前两个问题？如果是这样，为什么？

实验方法简单

在最后一页上，作者说有四种工作负载组合，但表 1 和图 6 中只显示了 3 种。3 种组合的实验样本太少，需要更好地推理为什么它们足够。

### Reviewer 3

图中的代词无法解释清楚，人们也不可能知道其背后的对比标准是什么，包括图2和Fiugre3。

由于观察 2 不明确，引言中描述的问题并不令人信服。

对比先前的工作

近期许多著作[1, 2]提出了该领域的相关研究；作者需要提供更多以前的工作并进行比较。

[1] 王浩宇，沉海英，刘琦，郑凯文，徐杰。 2020. 一种基于强化学习的系统，用于最大限度地降低云存储服务成本。第 49 届并行处理国际会议 - ICPP (ICPP ‘20)。 [2] RR Noel、R. Mehra 和 P. Lama，“Towards Self-Management Cloud Storage with Reinforcement Learning”，2019 年 IEEE 国际云工程会议 (IC2E)，2019 年，第 34-44 页，doi: 10.1109/IC2E .2019.000-9。

作者应包括一节讨论有关the could storage tuning system development的相关工作。

实验太少

- 作者在实验中只运行了三个工作负载。审稿人认为还不够。作者应该在实际应用程序中包含更多工作负载和一些工作负载。
- 作者没有提到开销分析。如果作者可以分析所提出方法的计算开销和内存占用，将不胜感激。
- 由于本文侧重于云存储调优系统，因此似乎有点超出范围。

除了上述主要问题外，该论文还存在语言/语法问题。这里有几个。

- 达到最优需要将近 4 个小时，使用 16 个参数会导致性能欠佳。 -> ‘‘…最优值，并使用 16 个参数…’’
- 总之：“ACSTS 优于现有的自动调整策略”->“ACSTS 优于现有的自动调整策略。”

### Reviewer 4

- 这个想法缺乏新意
- DRL算法的稳定性需要进一步讨论

## Abstract

Modern distributed storage systems with the immense number of configurations, unpredictable workloads and difficult performance evaluation pose higher requirements to configuration tuning. Providing an automatic configuration tuning solution for distributed storage systems is in demand. However, traditional auto-tuning strategies expose great limitations in the face of these requirement, including local optimum, poor adaptability and incurring subtle pitfalls in evaluation. In this paper, we present and evaluate the ADSTS, which is an \underline{a}utomatic \underline{d}istributed \underline{s}torage \underline{t}uning \underline{s}ystem based on deep reinforcement learning (RL). ADSTS constructs Parameter Model to identifies important configuration parameters that may become performance bottlenecks to reduce the parameter spaces. Besides, ADSTS utilizes a reinforcement learning-based adaption mechanism to find optimal configurations under dynamic workloads. Finally, ADSTS adopts a careful performance evaluation strategy to avoid pitfalls. ADSTS is implemented on Park and is used in the real-world system Ceph. The evaluation results show that ADSTS can recommend optimal configurations and improve system performance by $1.5\times \thicksim 2.5\times$.

\keywords{Distributed Storage, Auto-tuning, Reinforcement Learning, Parameter Identification}

## Introduction

Emerging distributed storage has become one of the most widely acceptable infrastructure services by providing cost-effective, highly scalable and reliable platforms for storing large scale data\cite{ceph,dynamo}.

Examples include distributed file systems (GFS, HDFS, Lustre, MooseFS, ect.), distributed object storage systems (Swift, Ceph, etc.) and distributed block storage systems (Sheepdog, Ursa, etc.).

分布式存储系统可以分为分布式文件系统，例子包括GFS，HDFS，Lustre，MooseFS；对象存储系统Swift，ceph，块存储系统Sheepdog，Ursa，其中Ceph是通一，得到广泛应用

Construction of a complex distributed storage system comes with many design choices, producing a large number of tunable configuration parameters \cite{sapphire,capes}. Superior configuration settings can provide significant gains in performance and high quality of service (QoS). However, with the explosive growth of data, distributed storage systems are becoming more and more complex, and it’s difficult for distributed operators to change the storage configurations to better support different user cases. Providing an automatic configuration tuning solution for distributed storage systems is in demand.

Modern distributed storgae systems

Modern distributed storage systems have the following characteristics, which bring higher requirements and challenges to determine the near-optimal configurations. \textbf{1) Larger parameter spaces.} Distributed storage systems often have multiple layered and highly modular software architectures, span large networks and distributed environments consisting of heterogeneous storage devices. There are hundreds or even thousands of tunable parameters that control module behavior \cite{sapphire,black}. For example, as a representative distributed storage system, Ceph\cite{ceph} comes with 1536 parameters in the latest version, near three times larger than the original version.

Worse, these parameters have many value constraints (affect each other) and can take on a wide variety of values, including continuous, discrete, non-numerical and so on. \textbf{2) More unpredictable workloads.} In a modern storage environment, workloads are often dynamic and unpredictable because of the large number of users and data. \textbf{3) More difficult evaluation.} Evaluation results often contain stochastic noises that become much noticeable in distributed environments. Taking the average of multiple tests is the conventional approach to deal with noise \cite{sapphire,black}. But in distributed storage systems, evaluating a single configuration can take many minutes or even hours, making such approaches unbearably time-consuming. \label{cha} % based on statistical reasoning and machine learning technology.

% \begin{table*} % \centering % \renewcommand\arraystretch{1.3} % \caption{Comparison of atuo-tuning systems.} % % \vspace{2.2mm} % \scalebox{1}{ % \begin{tabular}{ccccccc} % \hline % System & \textbf{Algorithm} & \textbf{\makecell[c]{Parameter\Space}} & % \textbf{\makecell[c]{Tuning\Result }} & % \textbf{Adaptivity} & % \textbf{\makecell[c]{Tuning\Time }} & % \textbf{\makecell[c]{Limitation }} \ %Usage Look-up % \hline % % \makecell[c]{Consistent\Hashing } % \textbf{Cao et al.} & Black-box & $\times$ & $\checkmark$ & $\checkmark$ & $\times$ & $\times$ % \hline % \textbf{Sapphire} & Bayesian & $\checkmark$ & $\checkmark$ & $\semichecked$ & $\times$ & $\checkmark$ % \hline % \textbf{Kinesis}& Random Forest & $\checkmark$ & $\checkmark$ & $\semichecked$ & $\checkmark$ & $\times$ % \hline % \textbf{Capes} & DQN & $\times$ & $\checkmark$ & $\semichecked$ & $\checkmark$ & $\semichecked$ % \hline % \textbf{ARM} & Q-learning & $\semichecked$ & $\checkmark$ & $\times$ & $\semichecked$ & $\checkmark$ % \hline % \textbf{ADSTS} & TD3 & $\checkmark$ & $\checkmark$ & $\checkmark$ & $\checkmark$ & $\checkmark$ % \hline % \multicolumn{7}{l}{$\checkmark$: Good; $\semichecked$: Moderate; $\times$: Poor.} % \end{tabular}}%} % \label{tab1} % \end{table*}

In recent years, there are many researches on automatic tuning of storage system. The most popular include: \textbf{1) Black-box Optimization.} Examples include Genetic Algorithms, Simulated Annealing, Bayesian Optimization (BO) \cite{carver}, Random Search, etc. Cao et al. \cite{black} provided a comparative study of applying multiple black-box optimization algorithms to auto-tune storage systems. Sapphire \cite{sapphire} uses BO to find near-optimal configurations for Ceph. \textbf{2) Reinforcement Learning} (RL). The researches on RL-based auto-tuning for storage system are just in the exploratory stage. Capes \cite{capes} adopts Deep Q-Networks (DQN) in optimizing performance for Lustre. DQN is a discrete-oriented control algorithm, which means the actions of output are discrete. Unfortunately, configuration combinations in real-world system are high-dimensional and many values are continuous.

% One of the more popular ones is black-box auto-tuning \cite{sapphire,carver,black} due to its obliviousness to a system’s internals. Examples include Genetic Algorithms (GA), Simulated Annealing (SA), Bayesian Optimization (BO), etc. % Examples include Genetic Algorithms, Simulated Annealing, Bayesian Optimization (BO) \cite{carver}, Deep Q-Networks \cite{capes}, Random Search, etc. Sapphire \cite{sapphire} uses BO to find near-optimal configurations for Ceph. Capes \cite{capes} adopts Deep Q-Networks (DQN) in optimizing performance for Lustre. However, These traditional solutions show great limitations in the cloud storage environment. In the experiments of automatic configuration tuning using Bayesian optimization in Ceph (Sapphire), three observations on auto-tuning of cloud storage are obtained, which can well illustrate the problems. All experimental settings and workloads can be found in Sec. \ref{eva}.

% In the experiments of automatic configuration tuning using Bayesian Optimization \cite{carver} in Ceph

\begin{figure*}[htbp] \begin{minipage}[t]{0.33\textwidth} \centering \includegraphics[width=5.45cm]{fig//Ob2.pdf} \caption{Non-linear performance effect} \label{Ob1} \end{minipage} \begin{minipage}[t]{0.33\textwidth} \centering \includegraphics[width=5.3cm]{fig//Ob1.pdf} \caption{Poor adaptability} \label{Ob2} \end{minipage} \begin{minipage}[t]{0.33\textwidth} \centering \includegraphics[width=5.3cm]{fig//Ob3.pdf} \caption{Evaluation pitfalls} \label{Ob3} \end{minipage} \end{figure*}

% 有些参数直接动态调整是达不到效果的，或者需要跑足够长时间的 workload 才能看出效果 %如果我们的评估时间小于十五分钟，我们会得出cache对于读性能没有影响的结论，然而结果恰恰相反。

%我们增大了cache的大小， %可以看到，1）两组实验

To address these problems, \textbf{ADSTS}, a new \underline{a}utomatic \underline{c}loud \underline{s}torage \underline{t}uning \underline{s}ystem is proposed in this paper. In ADSTS, \textbf{ 1) Parameter Model} % proposes general preprocessing guidelines to deal with large parameter space. Parameter Model identifies important parameters according to their impact on performance using an adaptive Lasso algorithm to reduce the parameter space. These important parameters reflect system performance \textit{bottlenecks} and provides guidance for optimizing the system. %识别重要的参数 \textbf{2) Deep Reinforcement Learning} is used to tune the values of these important parameters for optimal performance. RL model explores more optimal configurations that never tried, reducing the possibility of trapping in local optimum. And RL model can dynamically recommend suitable configurations for dynamic workloads. % due to the strong learning ability of RL model. % , which combines RL and neural networks to automatically learn the parameter values from limited samples, is used for configurations auto-tuning. % ADSTS can dynamically recommend suitable configurations for dynamic workloads due to the strong learning ability of RL model. % In addition, To accelerate the training of RL models, ADSTS adopts Stagewise Training strategy. \textbf{3) Performance evaluation} is cautious in ADSTS. In performance evaluation of updating parameters, ADSTS may increase the run time until the steady state to avoid pitfalls. %会增加运行时间直至性能达到平稳状态（大约30分钟） %不同的测试都会达到一个稳定的状态， ADSTS is implemented on Park\cite{park}, and is used in the real-world cloud system Ceph. The evaluation results show that ADSTS improves the Ceph performance by $1.5\times \thicksim 2.5\times$ compared with other strategies. % 3) %任务之间的相似性来加速收敛,和分布式训练来加速模型训练 % 4) %更谨慎的评估策略被使用，它使用

%通过构建参数模型，根据用户设置，通过自适应lasso算法筛选出影响性能关键参数 %缩减并标准化处理参数空间，提供一份干, %2) 使用深度强化学习，来自动调优 %3) %

## Background and Motivation

%在这一节中，我们首先展示存储系统自动调优的相关工作，主流的是使用传统机器学习算法（以贝叶斯为代表），也有部分尝试使用强化学习（起步阶段）。然后我们通过实验结果展示这些算法在云存储系统中暴露的问题，依次作为我们工作的动机。

In this section, we first present the related work of auto-tuning storage systems, mainly using traditional machine learning algorithms (represented by Bayesian Optimization), and some attempts to use Reinforcement Learning (in the initial stage). Then we show the problems exposed by these algorithms in the cloud storage system by the experiment results, which can be regarded as the motivation of our work.

### 1. Related Work

\textbf{Auto-tuning storage systems.} Many studies have built systems to automate the tuning of storage systems. Cao et al. \cite{black} provided a comparative study of applying multiple black-box optimization algorithms to auto-tune storage systems. Genetic Algorithm is adopted by Babak et al. \cite{GA} to optimize the I/O performance of HDFS applications. Friesz et al. \cite{SA} designs a Simulated Annealing approach to solve the network design problem. Among these algorithms, \textbf{Bayesian Optimization} is the most mature and widely used. Sapphire \cite{sapphire} uses BO to find near-optimal configurations for Ceph. BO has also been widely applied to auto-tuning for computer system\cite{BO2} and databases \cite{BO1}. However, Some of these traditional machine learning algorithms cannot deal with large parameter space, or some need large training samples, or need a long training time. They show great limitations in the cloud storage environment.

\textbf{Reinforcement Learning for auto-tuning.} In recent years, several researches utilized RL model to build auto-tuning systems, mainly in the database area, such as CDBTune \cite{cdbtune} and QTune \cite{qtune}. The researches on RL-based atuo-tuning for storage systems are just in the exploratory stage. Capes \cite{capes} adopts Deep Q-Networks (DQN) in optimizing performance for Lustre. Unfortunately, DQN is a discrete-oriented control algorithm, which means the actions of output are discrete. However, Capes does not work in the real storage systems, because configuration combinations in real-world system are high-dimensional and the values for many of them are continuous (Section \ref{cha}).

% CDBTune尝试用DDPG来优化，但是

### 2. Problem Discovery

In this section, we show a series of experiment results in Ceph (v12.2.13) that execute variations of the Rados$_$Bench workloads released with Ceph with different configuration settings and use Sapphire to obtain optimized configurations. The details of these experiment setup are present in Section \ref{eva}. From these results, we find many problems of using traditional auto-tuning in cloud storage systems:

% \textbf{Local Optimum Configurations.} % %图1：非线性

% %图2：局部最优

% %图3：适应性很差

每次需要对更新的工作负载重新遍历配置参数空间，导致耗时过大。同时，相似特征的工作负载具有类似的最优配置参数。

对于多数情况下的数据负载而言，根据工作负载的数据特征，快速匹配类似的最优配置参数，可以有效提高系统资源利用率。

% %图4：测试困难

% \textbf{Non-reusable Configurations.}

% \textbf{Evaluation Pitfalls.}

\textbf{Observation 1. \textit{Auto-tuning may lead to local optimum results in the face of huge parameter spaces.}} Causing system performance changes not simply linear to the parameter value. For example, Fig.\ref{Ob1} shows the 16K object random read performance of Ceph while changing only the value of one parameter, the number of PG (Placement Group, the logical grouping of objects) from 50 to 1000, and setting all the others to their default. The IOPS is not directly proportional to the number of PG. Such irregular and multi-peak correlations make it hard to achieve global optimal performance and may fall into local optima. We compared the bandwidth of Ceph using default configurations, BO-based auto-tuning and manual-tuning under different workloads. As shown in Fig.\ref{Ob2}, auto-tuning cannot find global optimum parameters because it repeatedly recommends the best configurations that have been explored and lacks of ability to explore more untried configuration combinations.

% In order to explore the whole parameter space, we only recommended a few excellent configurations, which led to average performance

%

%我们对比了默认配置，用Ob的自动调优和费时费力的人工调优，发现自动调优效果一般，由于为探索全部的参数空间而仅限于推荐几个优秀的配置，陷入局部最优而导致性能一般

% As shown in Fig.\ref{Ob1},

\textbf{Observation 2. \textit{The recommended configurations usually are non-reusable in the face of dynamic workloads.}} The optimal settings are dependent on hardware and workload characteristics, while traditional algorithm has poor adaptability. As shown in Fig.\ref{Ob2}, auto-tuning performs well on workload A and worse than the default configurations on workload C. %在负载A表现优异，而在负载C还不如默认配置

\textbf{Observation 3. \textit{The configuration evaluation may easily incur subtle pitfalls, which will lead to suboptimal results.}}\label{Observation 3} These “pitfalls” do not only mean the inaccurate results caused by high noise, but they may also come from the experiment settings including run time, workloads, etc. Fig.\ref{Ob3} shows the random read performance of Ceph when the store cache is increased in two groups of repeated experiments, in which two “pitfalls” can be found: 1) Benchmark results of storage systems often contain stochastic noises. In Fig.\ref{Ob3}, benchmark noises can deviate from system average performance by 10 KIOPS (5\%). 2) The effect of some parameters (such as cache size) cannot be seen immediately after changing. In Fig.\ref{Ob3}, if our evaluation time is less than 15 minutes, we may conclude that cache size has no effect on read performance, but the opposite is true.

%性能影响具有滞后性

Changing Some parameters (such as cache size) does not immediately affect performance. But running the workload for long enough, these parameters will improve performance.

## System Overview

In this section, a new RL-based auto-tuning system for cloud storage systems is described. The whole ADSTS framework and working mechanism are introduced in detail. % % first. Then, data preprocessing, RL model and training, and the system implementation are introduced in detail.

### 1. ADSTS Architecture

Fig.\ref{ADSTS} illustrates the architecture of ADSTS, a tuning service that works with any storage system in cloud. It uses offline log data collected from the storage system % including configurations and performance metrics in different workloads, to build parameter identification model for selecting important parameters, and RL model of how the cloud storage system responds to different configurations. When the client users send their own requests to the cloud server through the local interface, ADSTS collects user settings and workloads, which will be stored in the memory pool and fed into the RL network respectively, then recommends optimized configurations using the RL models online for storage system. ADSTS consists of four main parts, which will be described in detail as follows:

%分为离线的模型训练和在线的

%Parameter Preprocessor负责对采集的数据进行预处理，包括对旋钮配置进行分类，构建索引结构和裁剪，然后标准化和按照对性能影响程度排名。最后为RL模型和在线数据收集提供一个排序的干净的旋钮列表。Preprocessor propose general preprocessing guidelines to deal with parameter constraints, 很好的避免局部最优的问题。也通过关键参数的筛选，减少了模型训练和测试的难度。

\begin{figure}[htbp] \centerline{\includegraphics[width=0.4\textwidth]{fig//auto2.png}} \caption{The ADSTS Architecture.} \label{ADSTS} \end{figure}

\textbf{Preprocessor} is responsible for preprocessing the collected system data for constructing parameter identification model (\textbf{Parameter Model}), and providing training samples for the RL model. % is responsible for preprocessing the collected system data. It collects configuration parameters, system state metrics and performance data from the system log data. Preprocessor % has two main responsibilities: % ranking configuration parameters according to their impact on performance (\textbf{Parameter Model}) and providing training samples for the RL model. 1) Parameter Model selects key parameters and provides a clean, sorted list (\textit{Parameter List}) to online data collector according to user settings. 2) Preprocessor % collects system state metrics and performance data from the system log data, and generates the standard training samples for RL model.

%它将数据标准化处理，组织成训练样本数据用于RL模型训练

% 预处理所有数据，包括

%它主要有两个职责：筛选最重要的参数空间和为RL model提供训练样本 %对于庞大的参数空间，Preprocessor

%获取在列表中的参数初始值用于Tuning Agent进行动作空间的更新

Common interface is a canonical request/response interface online between ADSTS and the cloud storage system, including Metrics Collector and Action Controller. \textbf{Metrics Collector} is responsible for collecting and processing the metrics data that indicates the system status, and system performance/latency from the system. These metrics are sent to the RL Agent as the state information and reward information respectively. In addition, Collector will obtain configuration parameter information in Parameter List for updating the action space of the RL Agent. \textbf{Action Controller} applies recommended configurations suggested by the RL Agent after acquiring the user’s license.

%采用离线训练和在线调优 \textbf{Tuning Agent} can be regarded as the brain of the ADSTS system, which receives reward and state from the interface and updates the policy to guide how to adjust the configurations for getting higher reward. As shown in the Fig. \ref{ADSTS}, The base model in Agent is trained based on the training samples from the Preprocessor offline. Agent utilizes the model to recommend configuration settings and updates the model by taking the tuning request as training data online. To speed up RL training, Agent can generate the experience in parallel (experience storage in Memory Pool) and perform experience replay when the experience buffer reaches the batch size for training and parameter updating.

\textbf{Memory Pool} is used for two parts: 1) The experience replay memory (Replay Memory $D$), which used to store State-action-reward samples in RL agent. Each experience sample is a tuple $(s, a, r, s’)$ (called a transition), which contains four categories of information: the current system status $s$, the parameter updating action $a$, the reward value $r$, and the next state $s’$ after action $a$. The specific State-action-reward configurations will be introduced later. 2) Storing the training data, neural network parameters in model and the intermediate results updated during training.

% \setlength{\parskip}{0.5em}

### 2. Working Mechanism

%是一个用户设定到参数列表的映射

\textbf{Offline Training.} As shown in the black line in Fig.\ref{ADSTS}, offline ADSTS constructs Parameter Model and trains RL base model through system log data. \textbf{1) Parameter Model} is a mapping of user settings to Parameter List. For each user setting, Parameter Model is built by parameter pruning, preprocessing and key parameter selection to output a $K$ configurations list (Parameter List). $K$ (default 32, specified by users) is the number of selected key configurations that have a significant impact on system performance. %对于每个用户的设定，ADSTS基于参数域，通过pu处理参数 % ADSTS Parameter Model % Specifically, based on the multiple modules structure of the cloud storage system, Parameter Model arrange module selecting parameters into a tree-based indexing structure (\textit{Parameter Tree}) and classify parameters based on which module and sub-module they belong to. Then Parameter Model removes unconfigurable or not used parameters according to user settings to shrink the parameter space. % Finally, % Parameter Model produces Parameter List according to their impact on system performance by adaptive Lasso algorithm. \textbf{2) Training data} of RL model consists of configuration information, system state metrics and performance data. The state metrics are used to build the state space in RL model. These metrics can be obtained through the monitoring component of storage systems. For example, Ceph Monitor in Ceph is used for monitoring the overall cluster status. ADSTS chooses a certain number ($N$, default 20, specified by users) raw and secondary system statuses as metrics. Example raw statuses include buffer size, number of CPUs, CPU utilization, free memory, etc., and secondary system statuses include data reads, the total number of active threads, which needs to be calculated by counting the number of threads that are running, etc. All the collected metrics and parameters data will be stored in the memory pool. RL model of Tuning Agent is trained once offline with these training samples. %可以通过存储系统自带的监控功能获取，比如Ceph中，使用Ceph Monitor可以获取集群的整体状态。 %ADSTS % For example, % Ceph中，使用Ceph Monitor可以获取集群的

%如图1中黑线所示，离线下ADSTS主要通过系统日志数据，构建参数模型和训练RL初始模型。

\textbf{Online Tuning.} As shown in the red line in Fig.\ref{ADSTS}, ADSTS will work online when a user submits a tuning request to the cloud storage system. \textbf{1) Metrics Collector} collects the current system metrics to generate the current state $s$ and performance $r$. Then Collector gets Parameter List mapped by Parameter Model according to the user setting, and collects the $K$ configurations $a$. \textbf{2) Tuning Agent} uses the model obtained by offline training to do online tuning. Unlike offline training, the Agent replays the user’s current workloads to conduct stress testing on the storage system in order to fine-tune the model. Eventually, those configurations corresponding to the best performance in online tuning will be recommended to the user. When the tuning process is complete, ADSTS also needs to update the deep RL model and memory pool.

%它根据系统层次关系对参数建立树结构模型（参数树），并且基于用户设定，去除无法配置和不使用的参数，以缩小参数空间。在筛选之后，对所有参数定义边界，规范参数取值域。最后使用，produces a parameter ranking list （parameter List）according to their impact on system performance 通过自适应的lasso算法。

## Parameter Model

The most effective way to deal with complex and large configuration space is dimensionality reduction \cite{carver,black}. Since lots of researches have shown that some parameters have greater performance impact than others, focusing on a smaller number of more important parameters can speed up auto-tuning systems because they would have a smaller space to explore. Finding the most influential parameters is the main task.

% After pruning redundant parameters, standardizing parameter values, and ranking them according to their impact on performance, ADSTS provides a configurations ranking list (\textit{Parameter List}) and training samples (consist of system state metrics and performance data) for RL Agent.

%处理复杂庞大的参数空间，最有效的方式就是降维。并不是所有的参数对性能有影响，筛选出重要参数是Preprocessor的主要任务。但是云存储系统是多层次、分布式的，参数之间存在很大的约束关系，取值会互相影响，Preprocessor第一步要做的就是将参数空间构建为具象的层次关系。

%面对复杂的参数列表，

### 1. Parameter Pruning

% Based on the multiple modules structure of the cloud storage system, ADSTS arrange module selecting parameters into a tree-based indexing structure (\textit{Parameter Tree}) and classify parameters based on which module and sub-module they belong to.

ADSTS builds a tree-based indexing structure (\textit{Parameter Tree}) based on multiple layers of cloud storage system, and classifies parameters based on which module and sub-module they belong to. The Ceph example is shown in Fig. \ref{parameter}. \begin{figure}[htbp] \centerline{\includegraphics[width=0.35\textwidth]{fig//parameter2.png}} \caption{Parameter Pruning} \label{parameter} \end{figure} Based on the Parameter Tree, ADSTS firstly prunes the unconfigurable parameters, such as some static hardware and user information including IP addresses, port numbers, path strings, and debugging used parameters and so on. In addition, there are lots of categorical parameters. For the specific user case, some parameters are not used. For example, in Fig. \ref{parameter}, Ceph provides Bluestore and Filestore as two different store backends and uses \textit{osd_objectstore} to determine which to use. When the Bluestore is used, all parameters about Filestore will not be used. ADSTS will remove these unused parameters according to the user settings.

%另一方面，人为的静态限定参数边界不能保证能找到最优配置

### 2. Parameter Preprocessing

As mentioned above, the values of parameters are mixed with numerical and non-numerical ones. For the non-numerical values including boolean, string or categorical, ADSTS will firstly convert them to numbers by mapping candidate values to consecutive indexes. For example, for categorical/boolean parameters, ADSTS transforms each of them into dummy binary parameters that take values of 0 or 1. Finally, ADSTS normalizes the data values by log transformation to have the same order of magnitude.

In addition, most parameter values have no boundary, which may cause the “curse of dimensionality” in RL model action space. On the other hand, statically delimited parameter boundaries cannot guarantee the optimal setting is included. ADSTS adopts a \textbf{Dynamic Boundary}\cite{sapphire} approach, which dynamically enlarges the extent of parameters when the probing point comes near the boundary.

%补充算法

### 3. Parameter Identification

%以Lasso为代表的基于机器学习的特征选择算法被广泛应用于计算机系统中影响关键参数选择，但在 % % The important parameter identification can be regarded as a feature selection problem in machine learning. Many feature selection algorithms represented by Lasso are widely used to identify important parameters in compute systems. Lasso reduces the effect of irrelevant variables in regression models by penalizing models with large weights. % The Lasso is a regularization technique for simultaneous estimation and variable selection. Taking linear regression as an example, $y = X\beta + e$, and the Lasso estimates are defined as: % \begin{small} \begin{equation} \beta^{lasso}=\mathop\mathrm{argmin}\limits_{\beta} {||{\bf y}-{\bf X}\mbox{\boldmath $\beta$}||^2 +\lambda\sum_j|\beta_j|} \end{equation} % \end{small} where $\lambda$ is a nonnegative regularization parameter. However, Lasso provides sparse solutions that are biased, so the parameters that Lasso selects as meaningful can differ from the truly meaningful parameters. In ADSTS, \textbf{Adaptive Lasso}\cite{lasso}, which avoids this inconsistency by adding a weight, is used to identify key parameters. Adaptive Lasso is defined as: %通过添加一个权重来避免这种不一致问题 % \begin{small} \begin{equation} \hat\beta=\mathop\mathrm{argmin}\limits_{\beta} {||{\bf y}-{\bf X}\mbox{\boldmath $\beta$}||^2 +\lambda\sum_jw_j|\beta_j|} \end{equation} % \end{small} where $w$ is a known weights vector. The selection of weights can be data-driven and obtained through cross-validation.

After data preprocessing, ADSTS uses system parameters as independent variables ($X$) and performance metrics as dependent variables ($y$), as in the usual regression scenario. ADSTS then uses the Adaptive Lasso algorithm to determine the order of importance of the parameters and selects the most important $K$ parameters as \textit{Parameter List}. Finally, ADSTS build a mapping from user setting to Parameter List, i.e. Parameter Model.

%为了加速收敛，ADSTS采用递归方式从树根到树叶依次计算，根据参数树 %选出来的最影响性能的参数，其实往往是系统性能瓶颈所在，对开发人员优化系统有指导意义

% . Finally, Preprocessor selects the most important $K$ parameters to

% $K$ (default 32, specified by users) is the number of selected key configurations that have a significant impact on system performance.

%经过数据预处理后， %ADSTS将系统参数作为自变量，将性能指标作为因变量，然后再使用Adaptive Lasso algorithm to determine the order of importance of the DBMS’s parameters. %选择最重要的k个参数组成Parameter List输出给or online data collector, 并且将它和性能数据一起组成训练数据传到agent中 %and training samples (consist of Parameter List and performance data)

## Tuning Agent

Reinforcement Learning is an area of machine learning concerned with how artificial agents ought to take actions in a certain environment, with the goal of maximizing some notion of cumulative reward. RL problems can be formulated as a Markov decision process (MDP), which is used to describe the process of interaction between agent and environment, and the \textit{State}, \textit{Action}, \textit{Reward} constitute its basic elements.

### 1. Problem Formulation

ADSTS regard the instance of the cloud storage system which remains to be tuned as an \textit{environment}, and Tuning Agent can obtain normalized state metrics \textit{state} from the environment. Then Tuning Agent generates the configuration settings \textit{action} and will receive a \textit{reward} after deploying at on the instance. The State-action-reward configurations of our problem can be described as follows:

\textbf{State:} We describe state at time step $t$ as $S_t$, which means the current state of the agent, i.e., the $N$ system state metrics.

% $S_t$ can be defined by the list of the metrics: $S_t={V^t_0, V^t_1, … ,V^t_{N-1}}$, where $V^t_i$ is equal to the $i$-th metrics at step $t$.

\textbf{Action:} Action comes from the value space of $K$ parameters. We define the action at time step $t$ as $A_t$, which means the parameter tuning operation at step $t$. An action is to increase or decrease all $K$ tunable parameter values at a time. The system performs the corresponding action according to the newest policy under the corresponding state.

\textbf{Reward:} Reward is a scalar described as $R_t$. We use $P_t$ to represent the performance at $S_t$, which directly reflects the current performance. $R_t$ is equal to the diﬀerence between the performance at time $t$ and that at $t-1$ or the initial settings, i.e. $R_t = P_t - P_{t-1}$. $R_t$ means the performance change after the system performed the new configurations recommended by ADSTS at time $t$.

\textbf{RL Model}. Q-Learning and DQN are the most classic value-based RL methods. Q-learning records the Q-value by a Q-table, which is usually a two-dimensional table of states and actions. The policy is to choose the action, which has the maximum Q-value in the Q-table, according to state. DQN uses neural networks rather than Q-tables in Q-Learning to estimate Q-value. However, they are discrete-oriented control algorithms and cannot deal with the continuous parameter space, which occurs in auto-tuning system. To solve these problems, Deep Deterministic Policy Gradient (DDPG), which combines DQN and actor-critic algorithm and performs well in continuous space, becomes one of the most popular RL algorithms for auto-tuning\cite{qtune}, but it can cause local optimal due to the overestimation of rewards\cite{td3}. % cdbtune, % TD3 is actor-critic algorithm. It uses two Q network ($Q_{\theta_1}, Q_{\theta_2}$) to estimate critic network, with the smaller one as the update target.

### 2. TD3 for ADSTS

ADSTS utilize the optimized Twin-Delayed DDPG (\textbf{TD3}) algorithm\cite{td3}, which uses double Q network to avoid the overestimation problem in DDPG. TD3 is actor-critic algorithm, in which the \textit{actor network} outputs the policy mapping the state to the value of action and the \textit{critic network} aims to represent the value (\textit{score}) with specific action and state, which guides the learning of actor. TD3 uses two Q networks to estimate critic network, with the smaller one as the update target. TD3 is shown to be effective in learning over a large, complex optimization space. %一个输出策略，负责选择动作，我们把这个网络成为Actor； 一个负责计算每个动作的分数，我们把这个网络成为Critic。 % And TD3 uses two Q network to estimate critic network, with the smaller one as the update target.

% \begin{equation}

% \end{equation}

\textbf{Tuning Algorithm} (Algorithm \ref{alg:Framwork}) specifically describes our RL-based auto-tuning scheme. \begin{algorithm} \setstretch{0.9} \KwIn{ % The action-value function $Q$; State $s_t$, The batch size $\beta$, %\sim N(0,\sigma) Replay memory $\mathcal{D}$. } Initialize critic networks $Q_{\theta_1}, Q_{\theta_2}$, and actor network $\pi_\phi$ with random parameters $\theta_1, \theta_2, \phi$;

% Initialize target networks θ10 θ1, θ20 θ2, φ0 φ;

Define action $a_t$, reward $r_t$, next state $s_{t+1}$;

\SetKwFunction{UpdateNet}{UpdateNet}

\While{The tuning process is not terminal}{

```
$a_t \gets \pi_\phi(s_t) + \epsilon$;
\algorithmiccomment{Select with exploration noise}
Update the configurations in system based on $a$;
$s_{t+1} \gets$ Observe the state metrics from the system;
$r_t \gets$ Calculate performance change;
```

\If{$\mathcal{D}$ is full}{ Discard the earliest tuple in $\mathcal{D}$; }

Store experience $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$;

% \If{$len(\mathcal{D}) > \beta$}{ \If{The number of tuples in $\mathcal{D}$ exceeds $\beta$}{ $\Call{\UpdateNet}{\mathcal{D}}$ \algorithmiccomment{Update network} } % if len(replay_buffer) > batch_size: % $\Call{\UpdateNet}{D}$ \algorithmiccomment{select with exploration noise} % Sample mini-batch of $N$ transitions $(s,a, r, s’)$ from $D$;

} \SetKwProg{Fn}{Function}{:}{} \Fn{\UpdateNet{$\mathcal{D}$}}{ \label{UpdateNet} Sample $N$ transitions $(s,a, r, s’)$ from $\mathcal{D}$;

```
Select action based on $s'$:
$a' \gets \pi_\phi(s') + \epsilon$;
% \algorithmiccomment{}
Calculate the score for state $s'$:
$y = r + \gamma \mathop\mathrm{min}\limits_{i=1,2} Q_{\theta'_{i}}(s', a')$
```

Update critics: \begin{small}$\theta_i \gets \mathrm{argmin}*{\theta_i}N^{-1} \sum(y-Q*{\theta_i}(s, a))^2;$ \end{small}

```
Update actor by the deterministic policy gradient:
$\nabla_\phi J(\phi) = N^{-1} \sum\nabla_{a}Q_{\theta_1}(s,a)|_{a=\pi_\phi(s)}\nabla_{\phi}\pi_\phi(s)$
Update target networks:
$\theta'_i \gets \tau\theta_i + (1- \tau)\theta'_i$;
$\phi' \gets \tau\phi + (1- \tau)\phi'$;
% \algorithmiccomment{Update critics}
% \algorithmiccomment{Update target networks}
```

% \KwRet $s’$, $r$; }

\caption{RL-Based Auto-Tuning}\label{alg:Framwork} \end{algorithm} To begin with, we initialize critic networks $Q_{\theta_1}(s,a), Q_{\theta_2}(s,a)$, and actor network $\pi_\phi(s)$ with random parameters $\theta_1, \theta_2, \phi$.

The tuning process terminates if the user obtains a satisfied performance or tuning time out. Before that, the parameter tuning action $a_t$ will be selected with exploration noise based on the state $s_t$ in each step $t$: \begin{equation} a = \pi_\phi(s) + \epsilon, \epsilon \sim \mathcal{N}(0,\delta) \label{action} \end{equation} where $\epsilon$ is the random exploration noise value. Then Tuning Agent will obtain system metrics $s_{t+1}$ and calculate performance change as a reward $r_t$ after deploying $a_t$ in the system. A tuple $(s_t, a_t, r_t, s_{t+1})$ will be stored in the replay memory $\mathcal{D}$, which can be used for training agent and updating neural network parameters in the training process.

\textbf{Training process} is shown by function \textit{UpdateNet()} (Line \ref{UpdateNet} in Algorithm \ref{alg:Framwork}). Firstly, we extract mini-batch of $N$ transitions $(s, a, r, s’)$ from the experience replay memory. Then we feed $s’$ to the actor network and output the configuration settings $a’$ based on the Equation \eqref{action}, and get the score $y$ (minimum of the two Q networks) in critic network: \begin{equation} y = r + \gamma \mathop\mathrm{min}\limits_{i=1,2} Q_{\theta’*{i}}(s’, \pi*\phi(s’) + \epsilon) \end{equation} where $\gamma$ is a discount factor which denotes the importance of the future reward relative to the current reward $r$. Then the parameters $\theta_1,\theta_2$ of critic network can be updated with gradient descent to minimize the training objective: \begin{equation} \mathrm{min} L(\theta) = \mathbb{E}[(y-Q_\theta(s,a))^2] \end{equation}

Similar to DDPG, the parameter $\phi$ of actor network is updated by the deterministic policy gradient: \begin{equation} \nabla_\phi J(\phi) = \mathbb{E}[\nabla_{a}Q_{\theta_1}(s,a)|*{a=\pi*\phi(s)}\nabla_{\phi}\pi_\phi(s)] \end{equation} Finally, the target networks use a slow-moving update rate, parametrized by $\tau$: $\theta’ \gets \tau\theta + (1- \tau)\theta’$.

% \textbf{Training Process}

## Optimization

When actually running Tuning Agent in the system, we find two problems: 1) When obtaining system performance as the reward of agent, the results are are volatile and inaccurate (\textit{Observation 3} in Sec.\ref{Observation 3}.) 2) The training time is too long. Although key parameters are selected by Parameter Model to reduce the action space of the Tuning Agent, the training time increases a lot in order to lower the the training error.

% In practical training, if a small sample is selected, the training result is not ideal; if a large sample is selected, thetraining time is very long.

%在实际运行Tuning Agent时，我们发现了两个问题：1)在获取系统性能作为 Agent的奖励时，我们发现性能波动较大且结果容易不准确（） % 2)训练时间较长，尽管尽管通过参数识别选出关键参数，缩小了tuning agent的动作空间。但是在RL模型的动作和状态空间都比较大

\textbf{Cautious Evaluation} is used in Tuning Agent to avoid pitfalls. As shown in Fig.\ref{Ob3}, we find that the different test groups would reach a stable state (about 30 minutes), which happens in any parameter evaluation. After deploying the tuning action in the system, Tuning Agent increases the \textit{runtime} until the stable state, which is judged by the absence of performance fluctuations over time. %我们发现性能评估中性能都会达到一个稳定的状态，

\textbf{Stagewise Training} is adopted to speed up training and ensure low error. In order to improve the training accuracy, a larger number of samples (with size $n$) are used as training data sets. We randomly divide into many small samples ($m$). Assuming $n=k*m+b$ ($b \leq m$), the large samples can be divided into $k+1$ small samples in total. % By default, $k$ is set to 10, $m$ and $b$ can be calculated by the number of large samples $n$. A small sample with $m$ size is trained first and the trained model became the \textit{base model}. Then base model will directly be tested with the second small sample. If the test result is qualified, it’s turned to the next sample stage, otherwise base model will be retrained with the second small sample. And so on, until the final sample test is completed. %

% The function\ref{UpdateNet} in Algorithm \ref{alg:Framwork}, % Similar to other policy gradient methods (DQN,DDPG), TD3 has a parameterized policy function at = µ(st |θ µ) (θ µ, mapping the state st to the value of action at which is usually % called actor). Critic function Q(st , at |θQ) ( θQ is learnable parameters) of the network aims to represent the value (score) % with specifc action at and state st , which guides the learning of actor. Specifcally, critic function helps to evaluate the % knob settings generated by the actor according to the current % state of the instance. Inheriting the insights from Bellman % Equation and DQN, the expected Q(s, a) is defned as:

% We speed up the training process of our model using asynchronous distributed training, as shown in Figure 3. Our framework consists of several controllers, each of which execute the current policy defined by the attentional sequence-to-sequence model as described in Section 3.2. All of the controllers interact with a single shared parameter server. We note that the parameter server holds only the controllers’ parameters, and not the input graph’s parameters, because keeping the input graph’s parameters on the parameter server can potentially create a latency bottleneck to transfer these parameters. Each controller in our framework interacts with K workers, where K is the number of Monte Carlo samples in Equation 3.

% The training process has two alternating phases. In the first phase, each worker receives a signal that indicates that it should wait for placements from its controller, while each controller receives a signal that indicates it should sample K placements. Each sampled placement comes with a probability. Each controller then independently sends the % placements to their workers, one placement per worker, and sends a signal to indicate a phase change.

% In the second phase, each worker executes the placement it receives and measures the running time. To reduce the variance in these measurements, each placement is executed for 10 steps and the average running time of the steps but the first one is recorded. We observe that in TensorFlow, the first step can take longer to execute compared to the following steps, and hence we treat itss runing time as an outlier. Each controller waits for all of its workers to finish executing their assigned placements and returning their running times. When all of the running times are received, the controller uses the running times to scale the corresponding gradients to asynchronously update the controller parameters that reside in the parameter server.

% In our experiments, we use up to 20 controllers, each with either 4 or 8 workers. Under this setting, it takes between 12 to 27 hours to find the best placement for the models in our experiments. Using more workers per controller yields more accurate estimates of the policy gradient as in Equation 3, but comes at the expense of possibly having to put % more workers in idle states. We also note that due to the discrepancies between machines, it is more stable to let each controller have its own baseline

% The increase of data nodes makes training more difficult in two aspects: \textit{1) More data nodes leads to longer training time; 2) As data nodes increase, the dimensions of state, action space and neural network increase, making the original model unavailable.} Model fine-tuning method \cite{dota2} is adopted, which can update the neural network structure based on the old model to reduce the training time when the number of nodes increases. Specifically, when the dimension of the neural network increases, the parameters of the old model will be used as the initial parameters of the lower dimension of the new model, and the new dimension will be zeroed or randomly initialized. Take the neural network ($K$ hidden layers) in Fig. \ref{RL_NN} as an example, only the dimensions of $W_1$, $W_n$ and $B_n$ are related to the number of data nodes $n$, so after nodes are added, all parameters except them are initialized with the parameters of the old model. Suppose the number of data nodes increases to $n’$. This causes the shapes of three parameter arrays to change: $W_1$ (from [$n,h1$] to [$n’,h1$]), $W_n$ (from [$h_k, n$] to [$h_k, n’$]) and $B_n$ (from [$n$] to [$n’$]). In this case, the new variables are initialized as: % % [ W_1’ = \left[\begin{matrix}W_1 & 0\end{matrix}\right] % % \hspace{5mm} % % W_n’ = \left[\begin{matrix}W_n\R()\end{matrix}\right] % % \hspace{5mm} % % B_n’ = \left[\begin{matrix}B_n\R()\end{matrix}\right] % % ] % Where $R()$ indicates a random initialization. The first $n$ dimensions of $W_1, W_n$ and $B_n$ are initialized with the parameters of the old model. The new dimensions initialization of $W_n’$ and $B_n’$ will be randomized, which ensures that symmetry is broken among the new dimensions. The new dimensions of $W_1’$ are initialized to zero, which ensure that the newly added variables do not affect the output of the full connection layer, and the subsequent training will not be affected if the weight of only part is reset to zero. In this way, the training speed of the new model will be greatly increased, or even increased by hundreds or thousands of times.

## Evaluation

In this section, we detail our evaluation of ADSTS. The implementation and experimental settings are introduced first in Sec.\ref{set}. And Sec.\ref{pe} shows the evaluation results, including performance improvement, the effect of parameter identification and training optimization. % performance % improvements in Ceph using ADSTS

\subsection{Experiment Setup} \label{set} ADSTS is implemented on Park\cite{park} platform, which is an open platform for learning-augmented computer systems. All the codes in ADSTS are written in C++ and python, and the adaptive lasso and reinforcement learning model is implemented based on the Tensorflow. % \begin{figure}[htbp] % \centerline{\includegraphics[width=0.28\textwidth]{fig//ADSTS-Ceph.png}} % \caption{Ceph with ADSTS.} % \label{Ceph} % \end{figure} In addition, ADSTS is successfully packaged into the actual cloud storage systems, Ceph (v12.2.13). % % As shown in Fig. \ref{Ceph}, % The Metrics Collector interact with Ceph through the Ceph Monitor. % Metrics Collector fetches the system metrics from the Ceph OSDs at 30 second intervals by using the Linux SAR (System Activity Report) utility. % ADSTS is implemented as plug-ins, and retains the original architecture and other processes of Ceph. % \subsection{Experiment Setup} We compare ADSTS with the default (\textbf{Default}), manual-tuning (\textbf{Manual}), and \textbf{SAPPHIRE} (Ceph auto-tuning system using Bayesian Optimization) in Ceph using COSBench\cite{cosbench}, which is a benchmarking tool to measure the performance of cloud object storage services. We choose four different workload mixes derived from COSBench as listed in Table.\ref{tab1}. % workloads represent real-world applications that use cloud object storage and are described in Table.\ref{tab1}. The Ceph environment is conducted on six hosts including one Monitor host, three OSD host and two Client hosts. Each host is equipped with an Intel Skylake Xeon Processor (2.40 GHz), 54 GB of memory. Inside each OSD host, there are four Intel DC NVMe SSDs (P4510, 2 TB) as storage devices. % for OSD daemons to manage.

\begin{small} \begin{table} \caption{COSBench Workloads.} \scalebox{0.8}{ \begin{tabular}{c|c|c} \hline Workloads & Operations & App. scenario

\hline A & G: 90\%, P: 5\%, D: 5\% & Online video sharing

% \hline B & G: 5\%, P: 90\%, D: 5\% & Enterprise backup

% \hline C & G: 50\%, P: 50\% & General

\hline \multicolumn{3}{l}{G: Get, P: Put, D: Delete.} \end{tabular}} \label{tab1} \end{table} \end{small} % \begin{table} % \caption{COSBench Workloads.} % \scalebox{0.75}{ % \begin{tabular}{c|c|c|c} % \hline % Workloads & A & B & C

% \hline % Operations & G: 90\%, P: 5\%, D: 5\% & G: 5\%, P: 90\%, D: 5\% & G: 50\%, P: 50\%

% \hline % App. scenario & Online video sharing & Enterprise backup & General

% \hline % \multicolumn{4}{l}{G: Get, P: Put, D: Delete.} % \end{tabular}} % \label{tab1} % \end{table}

\subsection{Evaluation Results} \label{pe}

\textbf{Performance Evaluation.} Fig.\ref{test1} and Fig.\ref{test2} show the bandwidth and average latency of Ceph using configurations recommended by different strategies under different workloads. \begin{figure}[htbp] \centering

\subfigure[Bandwidth] { \begin{minipage}[t]{0.5\linewidth} \includegraphics[scale=0.6]{fig//test1.pdf}

\label{test1} \end{minipage} }\subfigure[Average Latency] { \begin{minipage}[t]{0.5\linewidth} \includegraphics[scale=0.6]{fig//test2.pdf} \label{test2} \end{minipage} } \caption{Performance evaluation under different workloads} \label{learned_read} \end{figure} Results show that ADSTS significantly boosts Ceph performance to $1.5\times \thicksim 2.5\times$ compared to other strategies and ADSTS has good adaptability varying workloads. SAPPHIRE can improve performance compared to the default configurations, but not as good as time-consuming manual tuning under some workloads.

% \subsection{Parameter Identification} \label{pi}

\textbf{Parameter Identification.} As shown in Fig. \ref{test3}, ADSTS without Parameter Model needs about 6 hours to reach the optimal configurations using all parameters (\textit{All}). With the 32 parameters (\textit{K=32)} selected by Parameter Model, ADSTS uses only 2 hours. While with the 64 parameters, it takes nearly 4 hours to reach the optima and using 16 parameters will lead to suboptimal performance. Parameter Model can identify the most important parameters to shrink the optimization time. Moreover, the number of important parameters needs to be chosen carefully.

%识别最重要的参数

% Comparing to using the top 64 parameters, we % find that using the top 16 ones only consumes 30\% of optimization time, while the performance of final recommendations has no apparent differences. This is because the Ceph parameter impacts decrease drastically; most knobs have little effect on the performance. Thus, by only using the top ones, we can significantly shrink the optimization time of SAPPHIRE while maintains the performance efficacy.

\begin{figure}[htbp] \begin{minipage}[t]{0.21\textwidth} \centering \includegraphics[width=4cm]{fig//test3.pdf} \caption{Parameters Test} \label{test3} \end{minipage} \begin{minipage}[t]{0.26\textwidth} \centering \includegraphics[width=4cm]{fig//test4.pdf} \caption{Stagewise Training} \label{test4} \end{minipage} \end{figure}

% \subsection{Training Optimization} \label{to} \textbf{Training Optimization.} As shown in Fig. \ref{test4}, the training time and training error in Stagewise Training and common training with large and small samples are statistically analyzed. It can be seen that training with small samples causes high error, while training with large samples takes longer time. Stagewise Training with large samples has less error and the training time is almost the same as that with small samples.

## Conclusion

We present ADSTS, an automatic cloud storage tuning system based on deep reinforcement learning. In ADSTS, Parameter Model is constructed to identify the key parameters that most affect performance. In addition, ADSTS utilizes TD3 algorithm to find the optimal configurations in high dimensional continuous space. Finally, ADSTS adopts a careful performance evaluation strategy to reduce noise. Evaluations show that recommended configurations by ADSTS outperforms existing auto-tuning strategies.