### Safer/slower tape backup

The article tells us that one way to back up data is using tapes. For every 4 tapes, a fifth tape, the xor of the 4 tapes, is used to recover from tapes corruption.

This strategy can recovers from any one tape corruption of 5 tapes, and about 20% capacity of tape is wasted. If we assumes simply that every tape have the same corrupt probability p in a fixed period(The time to next check), then if we have 100 tapes, the probability without data lost($P_1$) is $(4p(1-p)^4 + (1-p)^5)^{20}$.

Here We can assume that every tape is self checked, so we can check tapes independently.

The mainly problem is that, if there are 2 tapes corruption in one group of 5 tapes, we'll lost data. So we still have a big chance to lost some data.

I was thinking, there should be a another better strategy. We just need a encoding way, for a given number of tapes(n), use some extra number of tapes(e), so that we can recover the data at a limit number(k, $k \le e$) of tapes lost. The efficiency is $\frac{n}{n + e}$, and the probability without data lost($P_2$) is $\sum_{i=0}^{k}{n + e \choose i}(1-p)^{n + e - i}p^i$.

Or, let $X = 1$ if the tape is corrupted, otherwise $X=0$, and $S = \frac{\sum X}{n + e}$, The above can be rewrite as $S \le \frac{k}{n + e}$. Since $E\; S = E X = p$, $\mathrm{var}\;S = \frac{p(1-p)}{n + e}$, and $S \to p$. If we can keep $\frac{k}{n + e}$, and assume $\frac{k}{n + e} \gt p$, then the larger $n$, $e$, the safer we are. And for large $n, e$, we can assume $S$ suits a normal distrubution, and

\begin{align*}
P\left(S \le \frac{k}{n + e}\right) & = \Phi\left(\frac{\frac{k}{n + e} - \mu}{\sigma}\right) \\
& = \Phi\left(\frac{\sqrt{n + e}(\frac{k}{n + e} - p)}{\sqrt{p (1 - p)}}\right) \\
& \to 1
\end{align*}

Usually, $p$ is very small. For example, if $p=10^{-5}$, then for 100 tapes, 80 data tapes and 20 redundant tapes, we'll survival in $P_1 = 0.9998$. And if we use $n=80$, $k=e=20$, we'll survival in $P_2 = 1.0 - 2.04\times 10^{-84}$.

Keep the same tapes configuration, for different $p$, we get this numbers:

$p$ $P_1$ $P_2$
0.2 0.0002 0.559
0.1 0.042 0.9992
0.01 0.81 $1.0 - 9.58\times 10^{-22}$
0.001 0.98 $1.0 - 1.9\times 10^{-42}$
0.0001 0.998 $1.0 - 2.03\times 10^{-63}$
0.00001 0.9998 $1.0 - 2.04\times 10^{-84}$
0.000001 0.99998 $1.0 - 2.04\times 10^{-105}$

The strategy for this safer way does exist, for example, polynomial code, almost the reverse of secret sharing. We build the virtual whole message from first $n$ data parts, and split the whole message to $n + e$ parts, the first $n$ parts should be identity to original $n$ data parts. The problem is that, in order to recover a tape, we need at least $n$ tapes.

I once heard that, disk vendors have very good redundant encoding algorithms, what are they? May this problem have been resolved very well in that domain, we just need to transfer the solutions.

Davies said...

Davies said...

Changsheng Jiang said...

RS(80, 20) 确实有写和恢复性能问题. 写此篇之后有个想法是将每个 tape 做小点, 组合成 RS(80, 20), 外面的控制器负责读写分发. 每个小 tape 有独立错误指示, 可以热替换. 只要 RS(80, 20) 能达到单个 tape 的读写性能就可以了, 但稳定性高很多.