含微室: 201406

Last week, I read an nice article relative to a video talking about Google backup: How Google Backs Up the Internet.

The article tells us that one way to back up data is using tapes. For every 4 tapes, a fifth tape, the xor of the 4 tapes, is used to recover from tapes corruption.

This strategy can recovers from any one tape corruption of 5 tapes, and about 20% capacity of tape is wasted. If we assumes simply that every tape have the same corrupt probability p in a fixed period(The time to next check), then if we have 100 tapes, the probability without data lost($P_1$) is $(4p(1-p)^4 + (1-p)^5)^{20}$.

Here We can assume that every tape is self checked, so we can check tapes independently.

The mainly problem is that, if there are 2 tapes corruption in one group of 5 tapes, we'll lost data. So we still have a big chance to lost some data.

I was thinking, there should be a another better strategy. We just need a encoding way, for a given number of tapes(n), use some extra number of tapes(e), so that we can recover the data at a limit number(k, $k \le e$) of tapes lost. The efficiency is $\frac{n}{n + e}$, and the probability without data lost($P_2$) is $\sum_{i=0}^{k}{n + e \choose i}(1-p)^{n + e - i}p^i$.

Or, let $X = 1$ if the tape is corrupted, otherwise $X=0$, and $S = \frac{\sum X}{n + e}$, The above can be rewrite as $S \le \frac{k}{n + e}$. Since $E\; S = E X = p$, $\mathrm{var}\;S = \frac{p(1-p)}{n + e}$, and $S \to p$. If we can keep $\frac{k}{n + e}$, and assume $\frac{k}{n + e} \gt p$, then the larger $n$, $e$, the safer we are. And for large $n, e$, we can assume $S$ suits a normal distrubution, and

\begin{align*}
P\left(S \le \frac{k}{n + e}\right) & = \Phi\left(\frac{\frac{k}{n + e} - \mu}{\sigma}\right) \\
& = \Phi\left(\frac{\sqrt{n + e}(\frac{k}{n + e} - p)}{\sqrt{p (1 - p)}}\right) \\
& \to 1
\end{align*}

Usually, $p$ is very small. For example, if $p=10^{-5}$, then for 100 tapes, 80 data tapes and 20 redundant tapes, we'll survival in $P_1 = 0.9998$. And if we use $n=80$, $k=e=20$, we'll survival in $P_2 = 1.0 - 2.04\times 10^{-84}$.

Keep the same tapes configuration, for different $p$, we get this numbers:

$p$	$P_1$	$P_2$
0.2	0.0002	0.559
0.1	0.042	0.9992
0.01	0.81	$1.0 - 9.58\times 10^{-22}$
0.001	0.98	$1.0 - 1.9\times 10^{-42}$
0.0001	0.998	$1.0 - 2.03\times 10^{-63}$
0.00001	0.9998	$1.0 - 2.04\times 10^{-84}$
0.000001	0.99998	$1.0 - 2.04\times 10^{-105}$

The strategy for this safer way does exist, for example, polynomial code, almost the reverse of secret sharing. We build the virtual whole message from first $n$ data parts, and split the whole message to $n + e$ parts, the first $n$ parts should be identity to original $n$ data parts. The problem is that, in order to recover a tape, we need at least $n$ tapes.

I once heard that, disk vendors have very good redundant encoding algorithms, what are they? May this problem have been resolved very well in that domain, we just need to transfer the solutions.

Safer/slower tape backup