含微室: Safer/slower tape backup

Last week, I read an nice article relative to a video talking about Google backup: How Google Backs Up the Internet.

The article tells us that one way to back up data is using tapes. For every 4 tapes, a fifth tape, the xor of the 4 tapes, is used to recover from tapes corruption.

This strategy can recovers from any one tape corruption of 5 tapes, and about 20% capacity of tape is wasted. If we assumes simply that every tape have the same corrupt probability p in a fixed period(The time to next check), then if we have 100 tapes, the probability without data lost($P_1$) is $(4p(1-p)^4 + (1-p)^5)^{20}$.

Here We can assume that every tape is self checked, so we can check tapes independently.

The mainly problem is that, if there are 2 tapes corruption in one group of 5 tapes, we'll lost data. So we still have a big chance to lost some data.

I was thinking, there should be a another better strategy. We just need a encoding way, for a given number of tapes(n), use some extra number of tapes(e), so that we can recover the data at a limit number(k, $k \le e$) of tapes lost. The efficiency is $\frac{n}{n + e}$, and the probability without data lost($P_2$) is $\sum_{i=0}^{k}{n + e \choose i}(1-p)^{n + e - i}p^i$.

Or, let $X = 1$ if the tape is corrupted, otherwise $X=0$, and $S = \frac{\sum X}{n + e}$, The above can be rewrite as $S \le \frac{k}{n + e}$. Since $E\; S = E X = p$, $\mathrm{var}\;S = \frac{p(1-p)}{n + e}$, and $S \to p$. If we can keep $\frac{k}{n + e}$, and assume $\frac{k}{n + e} \gt p$, then the larger $n$, $e$, the safer we are. And for large $n, e$, we can assume $S$ suits a normal distrubution, and

\begin{align*}
P\left(S \le \frac{k}{n + e}\right) & = \Phi\left(\frac{\frac{k}{n + e} - \mu}{\sigma}\right) \\
& = \Phi\left(\frac{\sqrt{n + e}(\frac{k}{n + e} - p)}{\sqrt{p (1 - p)}}\right) \\
& \to 1
\end{align*}

Usually, $p$ is very small. For example, if $p=10^{-5}$, then for 100 tapes, 80 data tapes and 20 redundant tapes, we'll survival in $P_1 = 0.9998$. And if we use $n=80$, $k=e=20$, we'll survival in $P_2 = 1.0 - 2.04\times 10^{-84}$.

Keep the same tapes configuration, for different $p$, we get this numbers:

$p$	$P_1$	$P_2$
0.2	0.0002	0.559
0.1	0.042	0.9992
0.01	0.81	$1.0 - 9.58\times 10^{-22}$
0.001	0.98	$1.0 - 1.9\times 10^{-42}$
0.0001	0.998	$1.0 - 2.03\times 10^{-63}$
0.00001	0.9998	$1.0 - 2.04\times 10^{-84}$
0.000001	0.99998	$1.0 - 2.04\times 10^{-105}$

The strategy for this safer way does exist, for example, polynomial code, almost the reverse of secret sharing. We build the virtual whole message from first $n$ data parts, and split the whole message to $n + e$ parts, the first $n$ parts should be identity to original $n$ data parts. The problem is that, in order to recover a tape, we need at least $n$ tapes.

I once heard that, disk vendors have very good redundant encoding algorithms, what are they? May this problem have been resolved very well in that domain, we just need to transfer the solutions.

3 comments:

Davies said...: 我写了很长的comments, 提交后居然没了。。。; Jul 1, 2014, 7:04:00 AM
Davies said...: 这个问题的解决方案叫RAID，需要在Durability，成本和恢复速度中选择平衡。其中Reed Solomon (N,P) 是常用的一个算法，GFS2 里面好像用的是 RS(6,3)，磁带备份其实用的就是RS(4,1)，可能与磁带的访问特性有关系，比如可能不能同时访问多于N块磁带。

另外一个考虑因素是性能，磁带的顺序读写速度是几百兆每秒，那么RAID算法的性能就不能比这差。校验块越多性能会越差，包括生成和恢复。

像RS（80，20）这样的组合，虽然理论上看起来很好，但实际因为性能太差以及恢复太慢而无法在实际项目中采用。; Jul 1, 2014, 7:10:00 AM
blog.weizi.org said...: 我也写了很长评论, 提交后居然没了 ...

换成弹窗式评论了.

RS(80, 20) 确实有写和恢复性能问题. 写此篇之后有个想法是将每个 tape 做小点, 组合成 RS(80, 20), 外面的控制器负责读写分发. 每个小 tape 有独立错误指示, 可以热替换. 只要 RS(80, 20) 能达到单个 tape 的读写性能就可以了, 但稳定性高很多.

考虑仅追加的备份系统, 做成热替换也不算太难, 只是可能需要在控制器处节速, 以让后加的单元赶上, 稍稍有损性能. 但更多的是场景可能是备份的离线维护.

可以在坏了几个单元后再开始替换.; Jul 1, 2014, 8:51:00 AM