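Climbing the Kaggle MNIST leaderboard with Python's Theano library

Background

Theano is a Python library for defining, optimizing, and evaluating mathematical expressions, widely used to implement machine learning methods. Its most notable feature is that it can use the GPU transparently, so the code looks just like an ordinary Python program.

Theano homepage: http://deeplearning.net/software/theano/index.html

Theano also supports symbolic computation and is compatible with NumPy. NumPy is a Python library for matrix computation that gives Python MATLAB-like numerical capabilities, though it is not quite as convenient as MATLAB.

NumPy homepage: http://www.numpy.org/

MNIST is a public dataset for handwritten digit recognition. I assumed everyone on Earth already knew that.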

http://kaggle2.blob.core.windows.net/competitions/kaggle/3004/logos/front_page.png

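MNIST homepage: http://yann.lecun.com/exdb/mnist/

Most of the other resources are on the Deep Learning Tutorials homepage:

deeplearning.net tutorials: http://deeplearning.net/tutorial/

Kaggle is a platform where anyone can publicly benchmark machine learning algorithms. Competitions associated with ICML and the KDD Cup, among others, have been hosted there. Its introductory benchmark is MNIST:

Kaggle's MNIST page: http://www.kaggle.com/c/digit-recognizer

The best published result is a 0.23% error rate from a convolutional neural network^[1], and the best result accepted on Kaggle is 0.5%. By the look of it, MNIST is basically solved. Still, in the spirit of learning by doing and picking up how to use Theano, I figured it would be fun to take a shot at the Kaggle MNIST leaderboard with Python's Theano library.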

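Data conversion and code changes

The Theano tutorial code is at:

https://github.com/lisa-lab/DeepLearningTutorials

My modified code is at:

https://github.com/chaosconst/DeepLearningTutorials

Changing the input data

The original code loads the data from a cPickle file: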

    #############
    # LOAD DATA #
    #############

    # Download the MNIST dataset if it is not present
    data_dir, data_file = os.path.split(dataset)
    if (not os.path.isfile(dataset)) and data_file == 'mnist.pkl.gz':
        import urllib
        origin = 'http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz'
        print 'Downloading data from %s' % origin
        urllib.urlretrieve(origin, dataset)

    print '... loading data'

    # Load the dataset
    f = gzip.open(dataset, 'rb')
    train_set, valid_set, test_set = cPickle.load(f)
    f.close()

I changed this to read train.csv and test.csv instead. First, initialize four lists.

    print '... loading data'
    train_set=list();
    valid_set=list();
    test_set=list();
    predict_set=list();

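valid_set is a dataset used to check performance during the SGD iterations without taking part in training. Optimization of the objective continues only as long as the model keeps improving on valid_set, which guards against overfitting. See early stopping^[2].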
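The early-stopping idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the tutorial's actual loop: the names `train_with_early_stopping`, `train_step`, `validation_error`, and `patience` are hypothetical, and the validation errors are made-up toy values.

```python
def train_with_early_stopping(train_step, validation_error,
                              max_epochs=100, patience=5):
    """Train epoch by epoch; stop once the validation error has not
    improved for `patience` consecutive epochs."""
    best_error = float("inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_step(epoch)                 # one pass of SGD over train_set
        error = validation_error(epoch)   # error on valid_set, never trained on
        if error < best_error:
            best_error = error
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break                         # validation stopped improving
    return best_epoch, best_error


# Toy stand-ins: training does nothing, validation errors come from a list.
errors = [0.30, 0.20, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20]
best_epoch, best_error = train_with_early_stopping(
    train_step=lambda epoch: None,
    validation_error=lambda epoch: errors[epoch],
    max_epochs=len(errors),
    patience=3,
)
print(best_epoch, best_error)
```

In a real run you would also save the model parameters whenever a new best validation error is found, so that the best model, not the last one, is used for prediction.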

Set the dataset sizes; in debug mode, shrink them.

    train_set_size = 36000;
    valid_set_size = 5000;
    test_set_size = 1000;
    predict_set_size = 28000;

    debug = "false";
    if debug == "true":
      train_set_size = 3600;
      valid_set_size = 500;
      test_set_size = 100;
      predict_set_size = 2800;

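MNIST has 70,000 records in total: 60,000 for training and 10,000 for testing. The Theano sample program uses exactly this split, but Kaggle divides the 70,000 records into two parts: train.csv has 42,000 rows and test.csv has 28,000 rows. Only 42,000 rows are actually available for training, so I expect the final result to suffer accordingly. Theano splits the 60,000 training records into a 50,000-row train_set and a 10,000-row valid_set; here I split the 42,000 rows into a 36,000-row train_set, a 5,000-row valid_set, and a 1,000-row test_set (not used during training).

I also created a predict_set to hold the data whose predictions will be submitted to Kaggle. I then initialized the variables and read the values from the files, converting Kaggle's integer pixel values into the floats Theano needs by dividing by 255.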

    train_set.append(numpy.ndarray(shape=(train_set_size,28*28), dtype=theano.config.floatX));
    train_set.append(numpy.ndarray(shape=(train_set_size), dtype=int));
    valid_set.append(numpy.ndarray(shape=(valid_set_size,28*28), dtype=theano.config.floatX));
    valid_set.append(numpy.ndarray(shape=(valid_set_size), dtype=int));
    test_set.append(numpy.ndarray(shape=(test_set_size,28*28), dtype=theano.config.floatX));
    test_set.append(numpy.ndarray(shape=(test_set_size), dtype=int));
    predict_set.append(numpy.ndarray(shape=(predict_set_size,28*28), dtype=theano.config.floatX));
    predict_set.append(numpy.ndarray(shape=(predict_set_size), dtype=int));

    # load data from the kaggle training set
    with open('train.csv', 'rb') as csvfile:
      datareader = csv.reader(csvfile, delimiter=',')
      index=0;
      for row in datareader:
        if index<train_set_size : 
          train_set[1][index] = string.atoi(row[0]);
          for pixel_index in xrange(1,28*28+1) : 
            train_set[0][index][pixel_index-1] = string.atof(row[pixel_index])/255;
        elif index < train_set_size + valid_set_size :
          valid_set[1][index-train_set_size] = string.atoi(row[0]);
          for pixel_index in xrange(1,28*28+1) : 
            valid_set[0][index-train_set_size][pixel_index-1] = string.atof(row[pixel_index])/255;
        else :
          test_set[1][index-train_set_size-valid_set_size] = string.atoi(row[0]);
          for pixel_index in xrange(1,28*28+1) : 
            test_set[0][index-train_set_size-valid_set_size][pixel_index-1] = string.atof(row[pixel_index])/255;
        index+=1;
        if index == train_set_size + valid_set_size + test_set_size : 
          break; 
    
    print '... loading predict dataset'
    #load data from kaggle test set
    with open('test.csv', 'rb') as csvfile:
      datareader = csv.reader(csvfile, delimiter=',')
      index=0;
      for row in datareader:
        for pixel_index in xrange(0,28*28) : 
          predict_set[0][index][pixel_index] = string.atof(row[pixel_index])/255;
        index+=1;
        if index == predict_set_size: 
          break;

    train_set = tuple(train_set);
    valid_set = tuple(valid_set);
    test_set = tuple(test_set);
    predict_set = tuple(predict_set);
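For comparison, here is a more compact sketch of the same split in Python 3 using only the standard library. The helper `split_rows` and its parameters are hypothetical, and it assumes the CSV starts with a header row such as `label,pixel0,...`; pass `has_header=False` if the header has already been stripped. The sample uses tiny 2-pixel "images" so it can run without the real files.

```python
import csv
import io


def split_rows(csv_file, sizes, has_header=True):
    """Read `label,pixel,...` rows and cut them into consecutive subsets.

    Returns one (pixels, labels) pair per entry in `sizes`, with pixel
    values scaled from 0..255 to 0.0..1.0.
    """
    reader = csv.reader(csv_file)
    if has_header:
        next(reader)  # assumption: the first row is a header, skip it
    boundaries = []
    total = 0
    for size in sizes:
        total += size
        boundaries.append(total)
    subsets = [([], []) for _ in sizes]
    for index, row in enumerate(reader):
        if index >= total:
            break
        # Find the first subset whose end boundary lies beyond this row.
        subset = next(i for i, end in enumerate(boundaries) if index < end)
        pixels, labels = subsets[subset]
        labels.append(int(row[0]))
        pixels.append([int(value) / 255.0 for value in row[1:]])
    return subsets


sample = io.StringIO("label,pixel0,pixel1\n3,0,255\n7,51,0\n1,255,255\n")
train, valid, test = split_rows(sample, sizes=(1, 1, 1))
print(train, valid, test)
```

With the real data the call would look like `split_rows(open('train.csv', newline=''), sizes=(36000, 5000, 1000))`, and the resulting lists can be converted to NumPy arrays before being handed to Theano.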

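Changing the output

Theano's convnet consists of two convolutional layers, one hidden layer, and one logistic regression layer, as shown in the figure^[3]: http://deeplearning.net/tutorial/_images/mylenet.png

What we need is the output of the last layer. In that layer (the logistic regression), the Theano sample program gives us a symbolic variable, y_pred, defined as follows: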

        # initialize with 0 the weights W as a matrix of shape (n_in, n_out)
        self.W = theano.shared(value=numpy.zeros((n_in, n_out),
                                                 dtype=theano.config.floatX),
                                name='W', borrow=True)
        # initialize the biases b as a vector of n_out 0s
        self.b = theano.shared(value=numpy.zeros((n_out,),
                                                 dtype=theano.config.floatX),
                               name='b', borrow=True)

        # compute vector of class-membership probabilities in symbolic form
        self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)

        # compute prediction as class whose probability is maximal in
        # symbolic form
        self.y_pred = T.argmax(self.p_y_given_x, axis=1)

The manual says it can be evaluated with eval()^[4]:

        predict_results = layer3.y_pred.eval({input:predict_set_x});

But that did not work for me, so I had to fall back on a far from ideal workaround. Forgive me.

    predict_model = theano.function([index], layer3.predict(),
             givens={
                x: predict_set_x[index * batch_size: (index + 1) * batch_size]})

where the predict function is:

    def predict(self):
        return T.mul(self.y_pred,1);

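I am not treating the technology with due reverence; my apologies to everyone.

This gives us arrays we can work with, which we write to the output file: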

                    predict_res_array = [predict_model(i) for i in xrange(n_predict_batches)]
                    print predict_res_array;
                    f = open("predict_res","w+");
                    for y_pred_item_array in predict_res_array:
                      for y_pred_item in y_pred_item_array:
                        f.write(str(y_pred_item)+'\n');
                    f.close();

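Shifting the data

The steps above get to roughly a 1.0% error, still some way from the 0.5% target. I suspected the cause was too little data, so I added shift-based processing to both the input and the output. Shifting the input data: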

#!/bin/bash
awk -F , '{

for (shiftx=-1;shiftx<=1;shiftx++) {
  for (shifty=-1;shifty<=1;shifty++) {

    printf $1","; 

    for (y=0;y<28;y++) {
      for (x=1;x<=28;x++) {
        x_shift = x + shiftx;
        y_shift = y + shifty;
        if ((x_shift<1) || (x_shift>28) || (y_shift<0) || (y_shift>=28)) {
          printf "0,";
        } else {
        i=y_shift*28+x_shift+1;
        printf $i",";
      }
    }
  }

  printf"\n"

}}
}' | sed 's/,$//g'
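The same shifting logic, sketched in Python for readers who find the awk version hard to follow. The function names `shift_image` and `nine_shifts` are hypothetical; like the awk script, each output pixel at (x, y) takes its value from input pixel (x + dx, y + dy), and positions that fall outside the image are filled with 0.

```python
def shift_image(pixels, dx, dy, width=28, height=28):
    """Return a shifted copy of a flat, row-major image.

    Output pixel (x, y) is read from input pixel (x + dx, y + dy);
    reads that fall outside the image become 0.
    """
    shifted = []
    for y in range(height):
        for x in range(width):
            src_x, src_y = x + dx, y + dy
            if 0 <= src_x < width and 0 <= src_y < height:
                shifted.append(pixels[src_y * width + src_x])
            else:
                shifted.append(0)
    return shifted


def nine_shifts(pixels, width=28, height=28):
    """All 9 combinations of dx, dy in {-1, 0, 1}, in the awk loop order."""
    return [shift_image(pixels, dx, dy, width, height)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)]


# A 3x3 toy image with a single bright pixel in the centre.
image = [0, 0, 0,
         0, 9, 0,
         0, 0, 0]
variants = nine_shifts(image, width=3, height=3)
print(len(variants))
print(shift_image(image, 1, 0, width=3, height=3))
```

Each training row therefore becomes nine rows with the same label, and each row to predict becomes nine consecutive rows, which is what the voting step relies on.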

On the output side, the nine shifted versions of each image vote on the label, which gives a small boost:

#!/bin/bash
awk '{
  dist[$0]++; 
  if (NR%9==0) {
    best=1;
    for (x in dist) {
      if (dist[x]>dist[best]) best=x
    } 
    printf best"\n"
    delete dist;
  }
}'
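The voting step can also be sketched in Python. The function name `vote` and the sample predictions are hypothetical; the sketch assumes, as the awk script does, that the predictions for the nine shifted copies of one image appear on nine consecutive lines.

```python
from collections import Counter


def vote(predictions, group_size=9):
    """Majority vote over each consecutive group of predictions."""
    results = []
    for start in range(0, len(predictions), group_size):
        group = predictions[start:start + group_size]
        # most_common(1) returns [(label, count)] for the top label.
        results.append(Counter(group).most_common(1)[0][0])
    return results


# Two images, nine predictions each (made-up values).
predictions = [3, 3, 3, 8, 3, 3, 5, 3, 3,
               7, 1, 7, 7, 1, 7, 7, 7, 1]
print(vote(predictions))
```

On a tie, Counter keeps the label that appeared first in the group, which is a different tie-break from the awk version.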

OK, everything is ready. Time to climb the leaderboard!

Results

Link to Kaggle

valid_set_error=0.90 test_set_error=0.68


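That got me into the top 10, which feels like enough; climbing another 10 places would get me suspected of cheating.

Impressive, even if I don't fully understand it

How is the step from simple cells to complex cells implemented?

Scan the whole image with a filter of a given orientation
Split the image into n x n parts and pool over each (for example, max pooling)

Convolving two 2-D arrays simply means scanning one across the other, much like reading an article in the dark with a flashlight. At each frame of the scan the operation is a multiplication (to find similar features, multiplying is all it takes). Convolution is not the goal; scanning across the image and computing similarity is.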
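To make the "scan and multiply" picture concrete, here is a minimal pure-Python sketch. The function name `scan_filter` and the toy image are hypothetical. Strictly speaking it computes a cross-correlation; a textbook convolution would flip the kernel first, but the idea of sliding a filter and scoring similarity is the same.

```python
def scan_filter(image, kernel):
    """Slide `kernel` over `image` and, at every position where it fits
    entirely inside the image, sum the elementwise products."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kernel_h) for j in range(kernel_w))
             for x in range(out_w)]
            for y in range(out_h)]


# A 4x4 image containing a short anti-diagonal stroke.
image = [
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
# A 2x2 filter that responds to strokes with that slope.
anti_diagonal = [[0, 1],
                 [1, 0]]
response = scan_filter(image, anti_diagonal)
for row in response:
    print(row)
```

The response is largest exactly where the patch under the "flashlight" looks like the filter, which is why the output is called a feature map.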

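Once pooling is applied, the spatial information is lost: what was a 28x28 space retains only 7x7 positions if you pool over 4x4 blocks. What you get in exchange is information in the feature domain. It is a typical "space-time to sample transformation", and it was already done back in 1998, which is really impressive.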
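The 28x28 to 7x7 reduction is easy to check with a small sketch of non-overlapping max pooling. The function name `max_pool` and the synthetic feature map are hypothetical.

```python
def max_pool(feature_map, size):
    """Non-overlapping max pooling over size x size blocks."""
    height, width = len(feature_map), len(feature_map[0])
    return [[max(feature_map[y + i][x + j]
                 for i in range(size) for j in range(size))
             for x in range(0, width - size + 1, size)]
            for y in range(0, height - size + 1, size)]


# A synthetic 28x28 feature map; the values are arbitrary.
feature_map = [[(y * 28 + x) % 10 for x in range(28)] for y in range(28)]
pooled = max_pool(feature_map, 4)
print(len(pooled), len(pooled[0]))
```

Each pooled value says only that a feature fired somewhere inside its 4x4 block, not where, which is the loss of positional detail described above.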

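Learning works by constructing a loss function and minimizing it with SGD. Because there are many layers, computing the gradient of the loss is extremely complicated, and there are a great many parameters. Fortunately Theano can compute the gradients automatically: it first derives them symbolically, then evaluates them on sampled batches of input data.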
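As a toy illustration of the SGD part only, here is a one-parameter example in plain Python. Everything in it is hypothetical (the function `sgd_fit_slope`, the data, the learning rate), and the gradient is written out by hand, which is exactly the step Theano would derive symbolically for a deep network.

```python
import random


def sgd_fit_slope(samples, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x by SGD on the squared error, one sample at a time."""
    rng = random.Random(seed)
    data = list(samples)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)                      # visit samples in random order
        for x, y in data:
            gradient = 2.0 * (w * x - y) * x   # d/dw of (w*x - y)**2
            w -= learning_rate * gradient
    return w


# Noise-free toy data generated with a true slope of 3.
samples = [(x, 3.0 * x) for x in (-2.0, -1.0, 0.5, 1.0, 2.0)]
w = sgd_fit_slope(samples)
print(w)
```

In the convnet the single weight becomes many weight tensors and the hand-written gradient line is replaced by a symbolic gradient of the cost with respect to each parameter, but the update rule has the same shape.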

Well, that is roughly how it works.

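Improvements

Xiao Da says: "The cuda-convnet GPU convolution library from Hinton's group really is fast: implementing a convolutional neural network of the same structure for MNIST handwritten digit classification, it is more than 5 times faster than Theano's GPU convolution. Another finding: when optimizing with SGD, a max kernel norm constraint works better than weight decay."
cuda-convnet, https://code.google.com/p/cuda-convnet/
I don't know whether the Theano example uses weight decay.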

References

pylearn2's convnet, http://nbviewer.ipython.org/urls/raw.github.com/lisa-lab/pylearn2/master/pylearn2/scripts/tutorials/convolutional_network.ipynb
Xiao Da told me about Theano, and the GPU was on his workstation, which he let me borrow. Many thanks!
Deep learning reading group
lwta-theano


[1] http://yann.lecun.com/exdb/mnist/, MNIST homepage

[2] http://deeplearning.net/tutorial/gettingstarted.html#early-stopping

[3] http://deeplearning.net/tutorial/lenet.html

[4] http://deeplearning.net/software/theano/tutorial/adding.html#adding-two-scalars
