reorder to improve cache locality

In large machine learning problems, such as parameter server, we have to calculate over samples. And for every sample, we have to fetch some variables from remote server. To improve performance, we can cache the recent used variables in local memory, and only fetch ones not cached. The problem is that, can we reorder the samples, to minimize the fetching times, and number of fetching variables?

Another case is to reorder a sparse matrix to improve the performance of matrix and vector multiplication, esp. if the sparse matrix is unchanged(or, only values are changed, the sparse structure is kept) and used many times. 

The precise problem corresponding to the best cache algorithm. I think is very hard to solve.

A greedy solution is choose the sample from not calculated samples, whose number of fetching variables(not in cache) is smallest.

In the matrix-vector example, if the sparse matrix is not randomly generated, such as user-book rating, there is a lots of people rating almost the same books, then, I guess, the performance improvement may be considerable.

In the parameter server case, the greedy algorithm works for one machine. But we still needs a algorithm to distribute the samples over machines.