Optimization Approach for Boosting

In the Adaptive Boosting (AdaBoost) paper, the authors gave a constructive procedure for building a stronger learner from weak learners.

If we separate the result from the construction, we can view AdaBoost as optimization over the linear function space spanned by the weak learners; the construction method is then essentially a coordinate descent algorithm for that optimization problem.

We try to give some theoretical limits on the optimal point under some (arguably) reasonable assumptions.

Let $w_i$, $i = 0, 1, \ldots, n$ be weak learners, assumed to be completely independent. For any sample $x$, let $y(x)$ be the true class label ($-1$, $1$, or a real value). The sign of $w_i(x) y(x)$ indicates whether $w_i$ is correct on $x$. In most cases, $w_i(x)$ has the form $w_i(x) = \sum_{j=1}^{\infty} f_{i_j} x_{i_j}$, where the $x_{i_j}$ are features of $x$, transformed by some unknown kernel function associated with $w_i$.

Here we assume that the value of $w_i(x) y(x)$ follows $N(\mu_i, \sigma_i^2)$; then $\Phi(\frac{\mu_i}{\sigma_i})$ is the precision of $w_i$. We also assume that all $w_i(x) y(x)$ are independent, so $\sum_i \alpha_i w_i(x) y(x)$ follows $N(\sum_i \alpha_i \mu_i, \sum_i \alpha_i^2 \sigma_i^2)$.

The best weights are $\arg\max_{\alpha}\Phi(\frac{\sum_i\alpha_i\mu_i}{\sqrt{\sum_i\alpha_i^2\sigma_i^2}})$. Since $\Phi$ is monotone, this is equivalent to maximizing the ratio inside, and by the Cauchy–Schwarz inequality (or by setting the gradient to zero) the maximum is attained at $\alpha_i = \frac{c \mu_i}{\sigma_i^2}$, where $c$ is an arbitrary positive constant; the maximal value is $\Phi\bigl(\sqrt{\sum_i \mu_i^2/\sigma_i^2}\bigr)$.
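As a minimal sketch of this closed form (the function name and margin layout here are illustrative, not the ones used in boosting.go): estimate $\mu_i$ and $\sigma_i^2$ from the observed margins $w_i(x) y(x)$ and set $\alpha_i = \mu_i / \sigma_i^2$.

package main

import "fmt"

// closedFormWeights returns alpha_i = mu_i / sigma_i^2, where mu_i and
// sigma_i^2 are estimated from the margins w_i(x)*y(x).
// margins[i][k] is the margin of weak learner i on sample k.
func closedFormWeights(margins [][]float64) []float64 {
	alphas := make([]float64, len(margins))
	for i, m := range margins {
		n := float64(len(m))
		mean := 0.0
		for _, v := range m {
			mean += v
		}
		mean /= n
		variance := 0.0
		for _, v := range m {
			variance += (v - mean) * (v - mean)
		}
		variance /= n
		alphas[i] = mean / variance // alpha_i = c * mu_i / sigma_i^2 with c = 1
	}
	return alphas
}

func main() {
	// Toy margins for two hypothetical weak learners.
	margins := [][]float64{
		{1, 1, -1, 1, -1, 1},
		{1, -1, 1, 1, 1, 1},
	}
	fmt.Println(closedFormWeights(margins))
}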

If we assume that all weak learners have the same precision $p$, so that $\mu_i/\sigma_i = \Phi^{-1}(p)$ for every $i$, the stronger learner has precision $\Phi(\sqrt{n}\,\Phi^{-1}(p))$.
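As a quick numeric check (a sketch using the standard normal CDF and quantile built from math.Erf and math.Erfinv): for $n = 10$ independent weak learners with precision $p = 0.6$, the formula gives about $0.79$.

package main

import (
	"fmt"
	"math"
)

// phi is the standard normal CDF.
func phi(x float64) float64 { return 0.5 * (1 + math.Erf(x/math.Sqrt2)) }

// phiInv is the standard normal quantile function.
func phiInv(p float64) float64 { return math.Sqrt2 * math.Erfinv(2*p-1) }

func main() {
	n, p := 10.0, 0.6
	// Precision of the optimally weighted combination of n independent
	// weak learners, each with precision p.
	fmt.Println(phi(math.Sqrt(n) * phiInv(p))) // ~0.788
}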

I think this approach is easier to understand than the original paper, and easier for me to implement with an optimization framework. We also have a good starting point, and we can try different loss and link functions for boosting. But this approach may introduce more over-fitting than the construction method of the original AdaBoost, which is not analyzed in this article.
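To make "different loss and link functions" concrete, here is a sketch (not the actual interface in boosting.go) of the objective the optimizer sees: the average of a margin loss, such as $e^{-m}$ or $\log(1 + e^{-m})$, evaluated at the weighted margin $m = \sum_i \alpha_i w_i(x) y(x)$.

package main

import (
	"fmt"
	"math"
)

// expLoss is the AdaBoost exponential loss on a margin m = w(x)*y(x).
func expLoss(m float64) float64 { return math.Exp(-m) }

// logisticLoss is the logistic loss on a margin.
func logisticLoss(m float64) float64 { return math.Log(1 + math.Exp(-m)) }

// objective averages a margin loss over samples for given weights alpha.
// margins[k][i] is the margin of weak learner i on sample k.
func objective(margins [][]float64, alpha []float64, loss func(float64) float64) float64 {
	total := 0.0
	for _, row := range margins {
		m := 0.0
		for i, v := range row {
			m += alpha[i] * v
		}
		total += loss(m)
	}
	return total / float64(len(margins))
}

func main() {
	margins := [][]float64{{1, -1}, {1, 1}, {-1, 1}}
	alpha := []float64{0.6, 0.4}
	fmt.Println(objective(margins, alpha, expLoss))
	fmt.Println(objective(margins, alpha, logisticLoss))
}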

Experiment

For ease of experiment, we draw $n$ samples from $U(0, 1)^d$ and define the true label as positive when $\mathrm{mean}(x) > 0.5$. We choose weak learners of the form $x_i > 0.7$ for each coordinate, plus $\mathrm{mean}(x) > 0.45$ (a sketch of this setup is shown at the end of this section). Note that these weak learners are not independent, but we still compute the weights by the calculation above. There is no precision difference between the loss functions $e^{-w(x)y(x)}$, $\log(1 + e^{-w(x)y(x)})$ and our direct calculation. You can check the code in boosting.go and run it as follows:

git clone https://github.com/zuoyan/optimization_go
cd optimization_go
export GOPATH=$(pwd)
go run bin/boosting.go --generate_samples=10000 --generate_dimension=10 --generate_output=train.labels
go run bin/boosting.go --generate_samples=10000 --generate_dimension=10 --generate_output=test.labels --seed=13
go run bin/boosting.go --train_file=train.labels --test_file=test.labels --calculate_weights --optimize_weights --eval_weakers

And we get this output (it can be regenerated, since math/rand in Go is seeded):
weak precisions at train 0.5968 0.6022 0.5954 0.5947 0.5959 0.5995 0.5929 0.5872 0.6011 0.5995 0.7933
weak precisions at test 0.5918 0.5895 0.5869 0.5975 0.5867 0.5948 0.5931 0.597 0.59 0.5922 0.7953
weak precisions at all 0.5943 0.59585 0.59115 0.5961 0.5913 0.59715 0.593 0.5921 0.59555 0.59585 0.7943
calculate weights 0.06936016186937585 0.07355792072038556 0.06828066537841014 0.06774223410516975 0.06866579457265785 0.07145217471411021 0.06636165980817825 0.06202615041687599 0.07269833592484762 0.07145217471411021 0.30840272777587846
weights precision at train 0.834
weights precision at test 0.8357
weights precision at all 0.83485
2014/04/05 08:16:54 optimize ...
2014/04/05 08:16:54 solver[level=100]:iter=0 y=0.6207724781251298 #f=2/3 #g=2/3 result=Forward
2014/04/05 08:16:54 solver[level=100]:iter=1 y=0.5506796109437893 #f=1/4 #g=1/4 result=Forward
2014/04/05 08:16:54 solver[level=100]:iter=2 y=0.533068799117156 #f=2/6 #g=2/6 result=Forward
2014/04/05 08:16:54 solver[level=100]:iter=3 y=0.5319598811608891 #f=2/8 #g=2/8 result=Forward
2014/04/05 08:16:54 solver[level=100]:iter=4 y=0.531943122354469 #f=2/10 #g=2/10 result=Forward
2014/04/05 08:16:54 solver[level=100]:iter=5 y=0.5319426989350947 #f=2/12 #g=2/12 result=Forward
2014/04/05 08:16:54 solver[level=100]:iter=6 y=0.5319426852247808 #f=2/14 #g=2/14 result=Forward
2014/04/05 08:16:54 solver[level=100]:iter=7 y=0.5319426852122358 #f=2/16 #g=2/16 result=Forward
2014/04/05 08:16:54 solver[level=100]:iter=8 y=0.5319426852121972 #f=1/17 #g=1/17 result=Forward
2014/04/05 08:16:54 solver[level=100]:iter=9 y=0.5319426852121972 #f=3/20 #g=3/20 result=BreakKeep
2014/04/05 08:16:54 solver min value 0.5319426852121972 #f=20 #g=20 at [0.3588231913006707, 0.3626950627657966, 0.3320246202662729, 0.3396292934695077, 0.32662572027393794, 0.32597136863617504, 0.32435762645257327, 0.3182808901882726, 0.3276780489197852, 0.34074165915815785, 1.67455936635565]
optimize weights [0.07131695537553615 0.07208649895909681 0.06599067619146069 0.06750212292241123 0.06491763208738632 0.0647875781564208 0.06446684309222839 0.06325907743076396 0.06512678488713779 0.067723208225987 0.33282262267157076]
weights precision at train 0.834
weights precision at test 0.8357
weights precision at all 0.83485
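
For reference, here is a sketch of the synthetic setup described above (sample generation from $U(0,1)^d$ and the threshold weak learners); the names and layout are illustrative, and the real flags and code live in boosting.go.

package main

import (
	"fmt"
	"math/rand"
)

// sample is a point drawn from U(0,1)^d with its true label in {-1, +1}.
type sample struct {
	x []float64
	y float64
}

// generate draws n samples from U(0,1)^d; the true label is +1 iff mean(x) > 0.5.
func generate(n, d int, seed int64) []sample {
	r := rand.New(rand.NewSource(seed))
	samples := make([]sample, n)
	for k := range samples {
		x := make([]float64, d)
		s := 0.0
		for i := range x {
			x[i] = r.Float64()
			s += x[i]
		}
		y := -1.0
		if s/float64(d) > 0.5 {
			y = 1.0
		}
		samples[k] = sample{x, y}
	}
	return samples
}

// weakPredict implements the weak learners: one threshold per coordinate
// (x_i > 0.7) for i < d, plus mean(x) > 0.45 for i == d.
func weakPredict(s sample, i, d int) float64 {
	if i < d {
		if s.x[i] > 0.7 {
			return 1.0
		}
		return -1.0
	}
	m := 0.0
	for _, v := range s.x {
		m += v
	}
	if m/float64(d) > 0.45 {
		return 1.0
	}
	return -1.0
}

func main() {
	d := 10
	samples := generate(10000, d, 1)
	// Precision of the last weak learner, mean(x) > 0.45.
	correct := 0
	for _, s := range samples {
		if weakPredict(s, d, d)*s.y > 0 {
			correct++
		}
	}
	fmt.Println(float64(correct) / float64(len(samples)))
}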