Viola-Jones Face Detection: AdaBoost Feature Selection


The greatness of the Viola-Jones face detection algorithm lies not only in its real-time performance, but even more in the general approach it proposed for tackling object detection as a class of problems. The algorithm has two highlights: the integral image technique and the cascade training model. It attracted enormous attention as soon as it was published, and traces of these ideas appear in many excellent later papers, such as the detector module of the TLD algorithm and the two-stage SVM model used when training BING objectness; it is hard to argue these were not influenced by Viola-Jones. This post introduces AdaBoost, one of the basic building blocks of the cascade model.

AdaBoost is not original to Viola-Jones; it comes from the machine learning community and belongs to the boosting branch of ensemble learning. Ensemble methods fall into two families, bagging and boosting, both of which build a strong classifier out of weak classifiers; the representative bagging algorithm is Random Forests and the representative boosting algorithm is AdaBoost. For the theory behind AdaBoost, the paper "A Brief Introduction to Boosting" (Freund and Schapire) is recommended.

In the spirit of sharing, the rest of this post gives a theoretical introduction to AdaBoost together with a standard C++ implementation (the class in the code is named AdaptBoost). For readers who prefer not to depend on any particular library, this standard C++ version is a reasonable choice. Corrections are welcome.

1. The AdaBoost Principle

For an image window of a given size, the dimensionality of its Haar features is very high, and training a classifier directly on the Haar features computed from the training samples is impractical. We therefore need to select a subset of the high-dimensional Haar features to train on, and AdaBoost fits this need exactly. Its basic idea is to build a strong classifier from weak classifiers, taking the joint decision of the weak classifiers as the output of the strong classifier. In AdaBoost a weak classifier can be a stump, i.e. a binary decision tree of depth one. Feature selection among the many Haar feature dimensions then works as follows: pick a feature, and a threshold for binary classification on that feature, such that the weighted classification error over the training samples is minimized; that feature together with its threshold forms one trained weak classifier (see the bestStump() interface in the implementation). After each weak classifier has been selected, the distribution over the training samples (that is, the sample weights, usually initialized uniformly to 1/N for N samples) is updated based on the results of the weak classifier just trained: samples it misclassified get larger weights, and correctly classified samples get smaller weights. One difference between AdaBoost and Random Forests is that when computing the strong classifier's output, AdaBoost gives its weak classifiers different weights, whereas Random Forests weights all of its weak classifiers equally.
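For reference, the quantities the implementation in Section 2 works with can be written out explicitly; the notation below is added here, with f_t, θ_t and p_t corresponding to featureIndex, threshold and toggle of the stump selected in round t, and w_i to the current sample weights:

```latex
% decision stump selected in round t (the toggle p_t flips the direction of the test)
h_t(x) = p_t \cdot \operatorname{sign}\!\left(x^{(f_t)} - \theta_t\right), \qquad p_t \in \{+1, -1\}

% weighted error of the stump under the current weights (weightedError in the code)
\varepsilon_t = \sum_{i:\; h_t(x_i) \neq y_i} w_i

% committee member weight (memberWeight in the code)
\alpha_t = \log\!\left(\frac{1}{\varepsilon_t} - 1\right) = \log\frac{1 - \varepsilon_t}{\varepsilon_t}

% strong classifier: weighted vote of the whole committee
H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right)
```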

The AdaBoost training procedure can be described by the following pseudocode:

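In outline, and matching adaptBoostTraining()/oneRoundOfAdaboostTraining() in the implementation below:

1. Initialize the sample weights: each of the nPositives positives receives initialPositiveWeight/nPositives and each negative receives (1 − initialPositiveWeight)/nNegatives.
2. For t = 1, …, T (T being the requested number of weak classifiers):
   a. For every feature, sweep the sorted feature values and pick the threshold and toggle with the smallest weighted error (decisionStump); keep the best stump over all features (bestStump).
   b. Add the stump to the committee; its vote will be weighted by α_t = log(1/ε_t − 1), where ε_t is its weighted error.
   c. Multiply the weight of every misclassified sample by 1/ε_t − 1 and renormalize so the weights again sum to 1.
3. The resulting strong classifier labels a sample x positive exactly when Σ_t α_t h_t(x) > 0.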

2. A Standard C++ Implementation

The code below only exposes a training interface and does not include a test interface; you can extend it with a test part on this basis. A short usage sketch is given after the listing.

First the header (included below as VJAdaptBoost.h):

```cpp
#ifndef _ADAPTBOOST_H_
#define _ADAPTBOOST_H_

#include <vector>
#include <utility>
#include <cmath>

using namespace std;

/**
 * @brief decision stump declaration
 *
 * @param featureIndex
 * @param weightedError achieved weighted error
 * @param threshold
 * @param margin achieved margin
 * @param toggle +1 or -1
 */
struct StumpRule{
    int featureIndex;
    long double weightedError;
    double threshold;
    float margin;
    int toggle;
};

/**
 * @brief what's inside AdaptBoost
 *
 * @param nPositives number of positive examples
 * @param nNegatives number of negative examples
 * @param initialPositiveWeight how much weight we give to positives at the outset
 * @param ascendingFeatures for each feature, we have (float feature value, int exampleIndex)
 *
 * @param sampleCount nPositives + nNegatives
 * @param exponentialRisk exponential risk for the training set
 * @param positiveTotalWeight total weight received by positive examples currently
 * @param negativeTotalWeight total weight received by negative examples currently
 * @param minWeight minimum weight among all weights currently
 * @param maxWeight maximum weight among all weights currently
 * @param weights weight vector for all examples involved
 * @param labels are they positive or negative examples
 * @param featureCount how many features there are
 * @param committee the learned committee
 */
class AdaptBoost{
private:
    int nPositives;
    int nNegatives;
    long double initialPositiveWeight;
    vector< vector<pair<float, int>> > ascendingFeatures;
    int sampleCount;
    int featureCount;
    long double positiveTotalWeight;
    long double negativeTotalWeight;
    long double minWeight;
    long double maxWeight;
    long double exponentialRisk;
    vector<double> weights;
    vector<int> labels;
    vector<StumpRule> committee;

    /**
     * @brief prevent copy and assignment
     */
    AdaptBoost(const AdaptBoost&);
    AdaptBoost operator=(const AdaptBoost&);

protected:
    /**
     * @brief return, for an element pointed to by iterator and featureIndex, its exampleIndex
     */
    int getTrainingExampleIndex(int featureIndex, int iterator);

    /**
     * @brief return, for an element pointed to by iterator and featureIndex, its feature value
     */
    float getTrainingExampleFeature(int featureIndex, int iterator);

    /**
     * @brief sort each feature across the different samples
     */
    void sortFeatures(
        vector< vector<pair<float, int>> >& features
    );

    /**
     * @brief best stump given a feature
     */
    void decisionStump(
        int featureIndex
        , StumpRule & best
    );

    /**
     * @brief best stump among all features
     */
    StumpRule bestStump();

public:
    /**
     * @brief constructor
     * @param nPositives number of positive training examples
     * @param nNegatives number of negative training examples
     * @param initialPositiveWeight initial weight of the positives
     * @param data training examples, positives first followed by negatives
     */
    AdaptBoost(
        int nPositives
        , int nNegatives
        , long double initialPositiveWeight
        , const vector< vector<float> >& data
    );

    /**
     * @brief destructor
     */
    ~AdaptBoost();

    /**
     * @brief perform one round of adaboost
     */
    void oneRoundOfAdaboostTraining();

    /**
     * @brief get the committee adaptboost trained
     */
    vector<StumpRule> getCommittee() {
        return committee;
    }

    /**
     * @brief get the committee size
     */
    int getCommitteeSize() {
        return (int)committee.size();
    }

    /**
     * @brief given the number of weak classifiers, train a committee
     * @param numOfWeakClassifier number of weak classifiers for adapt boost
     */
    void adaptBoostTraining(int numOfWeakClassifier);

    /**
     * @brief evaluate how the committee fares on the training dataset
     *
     * @param tweak passed on to predictLabelOfTrainingExamples
     * @return falsePositive
     * @return detectionRate
     * @return a blackList: if an element of the blackList is 0, the corresponding
     * sample can be used again, otherwise it is not usable
     */
    vector<int> calcEmpiricalErrorInAdaBoostTraining(
        float tweak
        , float & falsePositive
        , float & detectionRate
    );

    /**
     * @brief given a tweak and a committee, what prediction do you make as to the training examples
     *
     * @param tweakThreshold tweak
     * @return prediction
     * @param onlyMostRecent use all the committee or only its most recent member (a weak learner)
     */
    void predictLabelOfTrainingExamples(
        float tweakThreshold
        , vector<int> & prediction
        , bool onlyMostRecent=false
    );
};
#endif
```
And the implementation file:

```cpp
#include <cassert>
#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <iomanip>
#include "VJAdaptBoost.h"

using namespace std;

#define VERBOSE true

//fail and messaging
static void fail(const char* message){
    cerr << "Error:" << message << endl;
    exit(EXIT_FAILURE);
}

//order definition for this type of pairs
//compare only the feature values
static bool myPairOrder(
    const pair<float, int>& one
    , const pair<float, int>& other
){
    return one.first < other.first;
}

//why is one stump better than the other
static bool myStumpOrder(
    const StumpRule & one
    , const StumpRule & other
){
    if(one.weightedError < other.weightedError)
        return true;
    if(one.weightedError == other.weightedError && one.margin > other.margin)
        return true;
    return false;
}

int AdaptBoost::getTrainingExampleIndex(int featureIndex, int iterator){
    assert(ascendingFeatures.size() > 0 && ascendingFeatures[0].size() > 0);
    return ascendingFeatures[featureIndex][iterator].second;
}

float AdaptBoost::getTrainingExampleFeature(int featureIndex, int iterator){
    assert(ascendingFeatures.size() > 0 && ascendingFeatures[0].size() > 0);
    if(std::isnan(ascendingFeatures[featureIndex][iterator].first)){
        cerr << "ERROR: nan feature " << featureIndex << " detected for example "
             << getTrainingExampleIndex(featureIndex, iterator) << endl;
        exit(EXIT_FAILURE);
    }
    return ascendingFeatures[featureIndex][iterator].first;
}

//constructor
AdaptBoost::AdaptBoost(
    int positives
    , int negatives
    , long double positiveWeight
    , const vector< vector<float> >& data) {
    assert(positives > 0 && negatives > 0);
    assert(positiveWeight > 0 && positiveWeight < 1);
    assert(data.size() > 0 && data[0].size() > 0);
    assert((int)data.size() == positives + negatives);
    //attach the example index to each feature value
    vector< vector<pair<float, int>> > features(data.size(), vector<pair<float, int>>(data[0].size(), pair<float, int>(0,0)));
    for(int i=0; i<(int)features.size(); i++) {
        for(int j=0; j<(int)features[0].size(); j++) {
            features[i][j] = pair<float, int>(data[i][j], i);
        }
    }
    //initialize the class attributes for the training set
    nPositives = positives;
    nNegatives = negatives;
    initialPositiveWeight = positiveWeight;
    sortFeatures(features);//initialize ascendingFeatures
    sampleCount = positives + negatives;
    featureCount = ascendingFeatures.size();
    positiveTotalWeight = positiveWeight;
    negativeTotalWeight = 1 - positiveWeight;
    long double posAverageWeight = positiveTotalWeight/(long double)nPositives;
    long double negAverageWeight = negativeTotalWeight/(long double)nNegatives;
    maxWeight = max(posAverageWeight, negAverageWeight);
    minWeight = min(posAverageWeight, negAverageWeight);
    exponentialRisk = 1;
    //set weights for each example
    for(int exampleIndex = 0; exampleIndex < sampleCount; exampleIndex++){
        weights.push_back(exampleIndex < nPositives ? posAverageWeight : negAverageWeight);
        labels.push_back(exampleIndex < nPositives ? 1 : -1);
    }
}

//destructor
AdaptBoost::~AdaptBoost() {
}

//adaptBoost interface for training
void AdaptBoost::adaptBoostTraining(int numOfWeakClassifier) {
    assert(numOfWeakClassifier > 0);
    for(int i=0; i<numOfWeakClassifier; i++) {
        oneRoundOfAdaboostTraining();
    }
}

//validation procedure using training examples
vector<int> AdaptBoost::calcEmpiricalErrorInAdaBoostTraining(
    float tweak
    , float & falsePositive
    , float & detectionRate
){
    vector<int> blackList;
    blackList.resize(nPositives, 0);
    blackList.resize(nPositives+nNegatives, 1);
    int nFalsePositive = 0;
    int nFalseNegative = 0;
    //initially let all be positive
    vector<int> prediction;
    prediction.resize(sampleCount, 0);
    predictLabelOfTrainingExamples(tweak, prediction, false);
    //evaluate prediction errors
    vector<int> agree(sampleCount);
    for(int i=0; i<sampleCount; i++) {
        agree[i] = labels[i]*prediction[i];
    }
    for(int exampleIndex = 0; exampleIndex < sampleCount; exampleIndex++){
        if(agree[exampleIndex] < 0) {
            if(exampleIndex < nPositives){
                nFalseNegative += 1;
                blackList[exampleIndex] = 1;
            }else{
                nFalsePositive += 1;
                blackList[exampleIndex] = 0;
            }
        }
    }
    //set the returned values
    falsePositive = nFalsePositive/(float)nNegatives;
    detectionRate = 1 - nFalseNegative/(float)nPositives;
    return blackList;
}

//given a tweak and a committee, what prediction does it make as to the training examples
void AdaptBoost::predictLabelOfTrainingExamples(
    float tweakThreshold
    , vector<int> & prediction
    , bool onlyMostRecent
){
    int committeeSize = committee.size();
    //no need to weigh a single member's decision
    onlyMostRecent = committeeSize == 1 ? true : onlyMostRecent;
    int start = onlyMostRecent ? committeeSize - 1 : 0;
    //double to be more precise
    vector<vector<double>> memberVerdict;
    for(int i=0; i<committeeSize; i++) {//initialize memberVerdict
        vector<double> row(sampleCount);
        memberVerdict.push_back(row);
    }
    vector<double> memberWeight(committeeSize);
    //members, go ahead
    for(int member = start; member < committeeSize; member++){
        //sanity check
        if(committee[member].weightedError == 0 && member != 0)
            fail("Boosting Error Occurred!");
        //0.5 does not count here
        //if a member's weightedError is zero, the member weight is nan, but it won't be used anyway
        memberWeight[member] = log(1./committee[member].weightedError - 1);
        int feature = committee[member].featureIndex;
#pragma omp parallel for schedule(static)
        for(int iterator = 0; iterator < sampleCount; iterator++){
            int exampleIndex = getTrainingExampleIndex(feature, iterator);
            memberVerdict[member][exampleIndex] = (getTrainingExampleFeature(feature, iterator) >
                committee[member].threshold ? 1 : -1)*committee[member].toggle + tweakThreshold;
        }
    }
    //joint session
    if(!onlyMostRecent){
        vector<double> finalVerdict(sampleCount);
        for(int i=0; i<sampleCount; i++) {
            double predict = 0;
            for(int j=0; j<committeeSize; j++) {
                predict += (memberWeight[j] * memberVerdict[j][i]);
            }
            finalVerdict[i] = predict;
        }
        for(int exampleIndex = 0; exampleIndex < sampleCount; exampleIndex++)
            prediction[exampleIndex] = finalVerdict[exampleIndex] > 0 ? 1 : -1;
    }else{
        for(int exampleIndex = 0; exampleIndex < sampleCount; exampleIndex++)
            prediction[exampleIndex] = memberVerdict[start][exampleIndex] > 0 ? 1 : -1;
    }
}

void AdaptBoost::oneRoundOfAdaboostTraining(){
    //try to be friendly here
    static int trainPhase = 0;
    if(VERBOSE && trainPhase == 0){
        cout << "\n#############################ADABOOST MESSAGE EXPLAINED####################################################\n\n";
        cout << "INFO: Adaboost starts. Exponential Risk is expected to go down steadily and strictly," << endl;
        cout << "INFO: and Exponential Risk should bound the (weighted) Empirical Error from above." << endl;
        cout << "INFO: Train Phase is the current boosting iteration." << endl;
        cout << "INFO: Best Feature is the most discriminative feature selected by decision stump at this iteration." << endl;
        cout << "INFO: Threshold and Toggle are two parameters that define a real valued decision stump.\n" << endl;
    }
    trainPhase++;
    //get and store the rule
    StumpRule rule = bestStump();
    committee.push_back(rule);
    //how it fares
    vector<int> prediction(sampleCount);
    predictLabelOfTrainingExamples(
        0
        , prediction
        , /*onlyMostRecent*/ true);
    vector<bool> agree(sampleCount);
    for(int i=0; i<sampleCount; i++) {
        if(prediction[i] == labels[i]) {
            agree[i] = true;
        }else {
            agree[i] = false;
        }
    }
    //update weights
    vector<double> weightUpdate;
    weightUpdate.resize(sampleCount, 1);
    bool errorFlag = false;
    for(int exampleIndex = 0; exampleIndex < sampleCount; exampleIndex++){
        //more weight for a difficult example
        if(!agree[exampleIndex]){
            weightUpdate[exampleIndex] = 1/rule.weightedError - 1;
            errorFlag = true;
        }
    }
    //update weights only if there is an error
    if(errorFlag){
        double weightSum = 0;
        for(int i=0; i<sampleCount; i++) {
            weights[i] *= weightUpdate[i];
            weightSum += weights[i];
        }
        for(int i=0; i<sampleCount; i++) {
            weights[i] /= weightSum;
        }
        double posTotalWeight = 0;
        for(int i=0; i<nPositives; i++) {
            posTotalWeight += weights[i];
        }
        positiveTotalWeight = posTotalWeight;
        negativeTotalWeight = 1 - positiveTotalWeight;
        double min, max;
        min = max = weights[0];
        for(int i=0; i<sampleCount; i++) {
            if(weights[i] < min) {
                min = weights[i];
            }else if(weights[i] > max) {
                max = weights[i];
            }
        }
        minWeight = min;
        maxWeight = max;
    }
    //exponentialRisk can be zero at the first boosting round
    exponentialRisk *= 2*sqrt((1-rule.weightedError)*rule.weightedError);
    //print some statistics
    if(VERBOSE){
        float tweak = 0;
        float falsePositive = 0;
        float detectionRate = 0;
        calcEmpiricalErrorInAdaBoostTraining(tweak, falsePositive, detectionRate);
        float empError = static_cast<float>(falsePositive*(1-initialPositiveWeight)+initialPositiveWeight*(1-detectionRate));
        cout << "Training Performance Explanation (before threshold tweaking): falsePositive " << falsePositive
             << " detectionRate " << detectionRate << endl;
        cout << "###########################################################################################################\n";
        cout << "\nTrain Phase " << trainPhase << endl << endl;
        // whatFeature(rule.featureIndex);
        cout << "\tExponential Risk " << setw(12) << exponentialRisk << setw(19) << "Weighted Error "
             << setw(11) << rule.weightedError << setw(14) << "Threshold " << setw(10) << rule.threshold
             << setw(13) << "Toggle " << setw(12) << rule.toggle << endl;
        cout << "\tPositive Weight" << setw(14) << positiveTotalWeight << setw(14) << "MinWeight "
             << setw(16) << minWeight << setw(14) << "MaxWeight " << setw(10) << maxWeight << setw(22)
             << "Empirical Error " << setw(10) << empError << endl << endl;
    }
}

//take each feature across the samples and put its values in ascending order,
//recording at the same time the permuted example order
void AdaptBoost::sortFeatures(vector< vector<pair<float, int>> >& features) {
    assert(features.size() != 0 && features[0].size() != 0);
    for(unsigned int i=0; i<features[0].size(); i++) {
        vector<pair<float, int>> temp = vector<pair<float, int>>();
        for(unsigned int j=0; j<features.size(); j++) {
            temp.push_back(features[j][i]);
        }
        //sort
        sort(temp.begin(), temp.end(), myPairOrder);
        ascendingFeatures.push_back(temp);
    }
}

//the base learner is a stump, a decision tree of depth 1
//decisionStump has to look at a feature and return a rule
void AdaptBoost::decisionStump(
    int featureIndex
    , StumpRule & best
){
    //a stump is determined by threshold and toggle, the other two attributes measure its performance
    //initialize with some crazy values
    best.featureIndex = featureIndex;
    best.weightedError = 2;
    best.threshold = getTrainingExampleFeature(featureIndex, 0) - 1;
    best.margin = -1;
    best.toggle = 0;
    StumpRule current = best;
    //error_p and error_n allow to set the best toggle
    long double error_p, error_n;
    //initialize: r denotes the right hand side and l the left hand side
    //convention: the nPositives positive samples are followed by the negative samples
    long double rPositiveWeight = positiveTotalWeight;
    long double rNegativeWeight = negativeTotalWeight;
    //yes, nothing to the left of the sample with the smallest feature
    long double lPositiveWeight = 0;
    long double lNegativeWeight = 0;
    //go through all these observations one after another
    int iterator = -1;
    //to build a decision stump, you need a toggle and an admissible threshold
    //which doesn't coincide with any of the observations
    while(true){
        //We've got a threshold. So determine the best toggle based on two types of error
        //toggle = 1, positive prediction if and only if the observed feature > the threshold
        //toggle = -1, positive prediction if and only if the observed feature < the threshold
        //error_p denotes the error introduced by toggle = 1, error_n the error by toggle = -1
        error_p = rNegativeWeight + lPositiveWeight;
        error_n = rPositiveWeight + lNegativeWeight;
        current.toggle = error_p < error_n ? 1 : -1;
        //sometimes shit happens, prevent the error from being negative
        long double smallerError = min(error_p, error_n);
        //this prevents some spurious nonzero: the current error must be at least equal to minWeight
        current.weightedError = smallerError < minWeight * 0.9 ? 0 : smallerError;
        //update if necessary
        if(myStumpOrder(current, best))
            best = current;
        //move on
        iterator++;
        //we don't actually need to look at the sample with the largest feature
        //because its rule is exactly equivalent to those produced
        //by the sample with the smallest feature on training observations
        //but it won't do any harm anyway
        if(iterator == sampleCount)
            break;
        //handle duplicates, update left/right weights and find a new threshold
        while(true){
            //take this guy's attributes
            int exampleIndex = getTrainingExampleIndex(featureIndex, iterator);
            int label = labels[exampleIndex];
            long double weight = weights[exampleIndex];
            //update weights
            if(label < 0){
                lNegativeWeight += weight;
                rNegativeWeight -= weight;
            }else{
                lPositiveWeight += weight;
                rPositiveWeight -= weight;
            }
            //if a new threshold can be found, break
            //two cases are possible: either it is the last observation
            if(iterator == sampleCount - 1)
                break;
            //or there is no duplicate. If there is a duplicate, repeat
            if(getTrainingExampleFeature(featureIndex, iterator) != getTrainingExampleFeature(featureIndex, iterator + 1)){
                double test = ((double)getTrainingExampleFeature(featureIndex, iterator)
                    + (double)getTrainingExampleFeature(featureIndex, iterator + 1))/2;
                //that's a bit frustrating: I want to keep float because of memory constraints, but apparently
                //feature values are sometimes so close that numerical precision becomes an unexpected problem,
                //so I decided to use a double threshold so as to separate float features
                if(getTrainingExampleFeature(featureIndex, iterator) < test && test < getTrainingExampleFeature(featureIndex, iterator + 1))
                    break;
                else{
#pragma omp critical
                    {
                        cout << "ERROR: numerical precision breached: problem feature values "
                             << getTrainingExampleFeature(featureIndex, iterator)
                             << " : " << getTrainingExampleFeature(featureIndex, iterator+1)
                             << ". Problem feature " << featureIndex << " and problem example "
                             << getTrainingExampleIndex(featureIndex, iterator) << " : "
                             << getTrainingExampleIndex(featureIndex, iterator+1) << endl;
                    }
                    fail("failed to find a suitable threshold.");
                }
            }
            iterator++;
        }
        //update threshold
        if(iterator < sampleCount - 1){
            current.threshold = ((double)getTrainingExampleFeature(featureIndex, iterator)
                + (double)getTrainingExampleFeature(featureIndex, iterator + 1))/2;
            current.margin = getTrainingExampleFeature(featureIndex, iterator + 1) - getTrainingExampleFeature(featureIndex, iterator);
        }else{
            //slightly to the right of the biggest observation
            current.threshold = getTrainingExampleFeature(featureIndex, iterator) + 1;
            current.margin = 0;
        }
    }
}

//implement the feature selection's outer loop
//return the most discriminative feature and its rule
StumpRule AdaptBoost::bestStump(){
    vector<StumpRule> candidates;
    candidates.resize(featureCount);
#pragma omp parallel for schedule(static)
    for(int featureIndex = 0; featureIndex < featureCount; featureIndex++)
        decisionStump(featureIndex, candidates[featureIndex]);
    //loop over all the features
    //the best rule has the smallest weighted error and the largest margin
    StumpRule best = candidates[0];
    for(int featureIndex = 1; featureIndex < featureCount; featureIndex++){
        if(myStumpOrder(candidates[featureIndex], best))
            best = candidates[featureIndex];
    }
    //if shit happens, tell me
    if(best.weightedError >= 0.5)
        fail("Decision Stump failed: base error >= 0.5");
    //return
    return best;
}
```
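As a closing illustration (not part of the original code), here is a minimal, hypothetical usage sketch: it builds a tiny made-up dataset of 5 positives followed by 5 negatives with two features each, trains a committee of two stumps, and then applies the committee to a new feature vector in the same way predictLabelOfTrainingExamples() votes internally (tweak = 0). The classify() helper and all numbers are invented for illustration only.

```cpp
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <vector>
#include "VJAdaptBoost.h"
using namespace std;

// Weighted committee vote for a single feature vector, mirroring the internal
// prediction rule: each stump votes toggle*(x[f] > threshold ? 1 : -1),
// weighted by log(1/weightedError - 1).
static int classify(const vector<StumpRule>& committee, const vector<float>& x){
    double vote = 0;
    for(const StumpRule& rule : committee){
        double alpha = log(1./rule.weightedError - 1);
        int verdict = (x[rule.featureIndex] > rule.threshold ? 1 : -1) * rule.toggle;
        vote += alpha * verdict;
    }
    return vote > 0 ? 1 : -1;
}

int main(){
    //made-up data: 5 positives followed by 5 negatives, 2 features per sample
    vector< vector<float> > data = {
        {2.0f, 1.0f}, {2.5f, 1.5f}, {3.0f, 0.5f}, {2.2f, 1.2f}, {0.5f, 2.5f},   //positives
        {0.8f, 0.4f}, {1.0f, 0.6f}, {0.6f, 0.2f}, {2.8f, 0.3f}, {0.9f, 0.45f}   //negatives
    };
    AdaptBoost booster(5, 5, 0.5L, data);
    booster.adaptBoostTraining(2);   //committee of two stumps

    float falsePositive = 0, detectionRate = 0;
    booster.calcEmpiricalErrorInAdaBoostTraining(0.f, falsePositive, detectionRate);
    cout << "false positive rate: " << falsePositive
         << ", detection rate: " << detectionRate << endl;

    vector<float> probe = {2.4f, 1.3f};   //looks like a positive
    cout << "probe label: " << classify(booster.getCommittee(), probe) << endl;
    return EXIT_SUCCESS;
}
```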

Reference:

Yi-Qing Wang, "An Analysis of the Viola-Jones Face Detection Algorithm", Image Processing On Line (IPOL).

Reposted from: https://blog.51cto.com/remyspot/1638356
