Mimic the interface of golang -- Part 1

I'm learning golang. As a C/C++ programmer, I like the feature of open method and dynamic interface, which are not supported(or not supported in native) by C/C++. In my experience,  dynamic interface damage performance a lot, so i write a small bench code:

package main
import (
"fmt"
"time"
)
var (
count = 0
)
func benchmarkFunc(N int, f func(int)) float64 {
start := time.Now()
f(N)
return float64(time.Since(start).Nanoseconds()) / float64(N)
}
func benchmarkFuncPrint(name string, N int, f func(int)) {
ns := benchmarkFunc(N, f)
fmt.Printf("%s %f nano-seconds/call\n", name, ns)
fmt.Printf("%s %f #/second\n", name, 1.e9/ns)
}
type Showable interface {
Show()
}
type Show1 struct {
}
func (p *Show1) Show() {
count++
}
type Show2 struct {
}
func (p *Show2) Show() {
count += 2
}
func call(p interface{}) {
p.(Showable).Show()
}
func call_show(s Showable) {
s.Show()
}
func dynamic(p interface{}, N int) {
for i := 0; i < N; i++ {
// call(p)
p.(Showable).Show()
}
}
func static(s Showable, N int) {
for i := 0; i < N; i++ {
// call_show(s)
s.Show()
}
}
func main() {
N := 10000000
benchmarkFuncPrint("dynamic", N,
func(N int) {
d := Show1{}
dynamic(&d, N)
})
benchmarkFuncPrint("static", N,
func(N int) {
d := Show2{}
static(&d, N)
})
}
This gives
dynamic 19.478761 nano-seconds/call
dynamic 51337966.752537 #/second
static 2.507811 nano-seconds/call
static 398754196.190095 #/second
Which is very quick. Only 19 nano seconds, in my last benchmark, only mutex will cost about 10 nano seconds. And the static version is almost the same as virtual function in C++.

I wrote a toy version of open method for C++, using std::unorderd_map, std::map, google::dense_hash_map, as the storage from std::type_info* to function pointer. And this's the performance counters
lambda: 0.795678 nano-seconds/call
lambda: 1.25679e+09 #/second
unordered_map: 30.4666 nano-seconds/call
unordered_map: 3.28228e+07 #/second
dense hash map: 22.4284 nano-seconds/call
dense hash map: 4.45863e+07 #/second
map: 22.5974 nano-seconds/call
map: 4.42528e+07 #/second
std::map is quicker in this small test, but in larger project, we should divide the performance counter by log(N), where N is the number of function pointers.

It seems that golang's runtime is quicker than google::dense_hash_map + std::mutex. So I felt interesting about it's implementation, and checked the file src/pkg/runtime/iface.c. golang uses a static size(1009) hash table, and only lock if finding first time failed. Chain list in every bucket of hash table is a multiple readers and single writer forward list, reader is wait free.

It's a good implementation, but I don't like the fix size to 1009?(consider a larger project, your performance counter should be divided by ceil(N / 1009), where N is the number of interfaces. And, I guess, using a bit modulo with quadratic open addressing may improve the performance.

So I tried, and I think it did(only test in C++).
template
struct ValueFree {
  inline void operator()(const T & v) const {}
};
template <>
struct ValueFree {
  inline void operator()(void (*v)(void)) const {  }
};

template
struct ValueFree {
  inline void operator()(T * v) const {
    delete v;
  }
};
template
struct TableHash : std::hash {};
template
struct TableHash {
  inline size_t operator()(const T*ptr) const {
    return (intptr_t)(ptr) >> 4;
  }
};
template ,
          class DeleteValue=ValueFree>
struct Table {
  struct State {
    State * old;
    size_t size;
    size_t buckets;
    std::pair table[0];
  };
  std::atomic state_;
  std::mutex add_mutex_;
  std::atomic num_find_;
  Table() {
    const size_t buckets = 8;
    auto state = (State*)malloc(sizeof(State) + sizeof(state_.load()->table[0]) * buckets);
    state->old = nullptr;
    state->size = 0;
    state->buckets = buckets;
    state_.store(state);
    num_find_.store(0);
  }
  ~Table() {
    DeleteValue delete_value;
    auto state = state_.load();
    for (size_t i = 0; i < state->buckets; ++i) {
      auto &kv = state->table[i];
      if (kv.first != Key()) {
        delete_value(kv.second);
      }
    }
    while (state) {
      auto old = state->old;
      free(state);
      state = old;
    }
  }
  Value Find(const Key &k) {
    auto state = state_.load(std::memory_order_relaxed);
    size_t b = Hash()(k) & (state->buckets - 1);
    size_t idx = 0;
    while (1) {
      auto &o = state->table[b];
      if (o.first == k) {
        return o.second;
      }
      if (o.first == Key()) break;
      b = (b + ++idx) & (state->buckets - 1);
    }
    return Value();
  }
  void Resize(int dir) {
    assert(dir == 1);
    auto old = state_.load(std::memory_order_acquire);
    size_t buckets = old->buckets * 2;
    State *state = (State*)malloc(sizeof(State) + sizeof(old->table[0]) * buckets);
    state->old = old;
    state->size = old->size;
    state->buckets = buckets;
    for (size_t i = 0; i < buckets; ++i) {
      state->table[i].first = Key();
    }
    for (size_t i = 0; i < old->buckets; ++i) {
      auto &o = old->table[i];
      if (o.first == Key()) continue;
      size_t b = Hash()(o.first) & (state->buckets - 1);
      size_t idx = 0;
      while (state->table[b].first != Key()) {
        b = (b + ++idx) & (state->buckets - 1);
      }
      state->table[b] = o;
    }
    state_.store(state, std::memory_order_release);
  }
  std::pair Add(const Key &k, const Value &value) {
    std::lock_guard lk(add_mutex_);
    auto state = state_.load(std::memory_order_acquire);
    if (state->size * 5 >= 4 * state->buckets) {
      Resize(1);
      state = state_.load(std::memory_order_relaxed);
    }
    size_t b = Hash()(k) & (state->buckets - 1);
    size_t idx = 0;
    while (1) {
      auto &o = state->table[b];
      if (o.first == k) {
        return std::make_pair(o.second, false);
      }
      if (o.first == Key()) {
        o.second = value;
        o.first = k;
        ++state->size;
        break;
      }
      b = (b + ++idx) & (state->buckets - 1);
    }
    return std::make_pair(value, true);
  }
};
Using this as the store to replace google::dense_hash_map, gives:
table: 3.96917 nano-seconds/call
table: 2.51942e+08 #/second
The Table implementation achieve the performance in cost of memory, about half bytes are wasted, and assuming that the value is never changed, deleted. And it also assumes Key() is the empty key.

TODO: open multiple method, with ambiguous resolutions.


No comments: