# Bayes' Theorem and Sentiment Analysis

## Probability Model

### Deriving Bayes' Theorem from the Conditional Probability Formula

$\displaystyle{ P(A|B) P(B) =P(A \cap B) =P(B|A) P(A) }$

$\displaystyle{ P(B|A) = \frac{P(B)P(A|B)}{P(A)} }$
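The identity can be checked numerically. The probabilities below are made up purely for illustration; they are not taken from any data set:

```python
# A quick numeric check of Bayes' theorem with illustrative numbers.
P_B = 0.4                          # P(B)
P_A_given_B = 0.5                  # P(A|B)
P_AB = P_B * P_A_given_B           # P(A ∩ B) = P(B) P(A|B) = 0.2
P_A = 0.25                         # P(A), chosen so the quotient is clean

P_B_given_A = P_B * P_A_given_B / P_A   # Bayes: P(B|A) = P(B) P(A|B) / P(A)
print(P_B_given_A)                 # → 0.8
```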

### A Bayesian Sentiment Model

| Variable | Values               | Meaning                                          |
|----------|----------------------|--------------------------------------------------|
| $C$      | "pos", "neg"         | class of a single tweet                          |
| $F_1$    | non-negative integer | number of occurrences of "awesome" in the tweet  |
| $F_2$    | non-negative integer | number of occurrences of "crazy" in the tweet    |

$\displaystyle{ P(C|F_1,F_2) = \frac{P(C)P(F_1,F_2|C)}{P(F_1,F_2)} }$

The second equality below is the naive Bayes assumption: given the class, the features are treated as conditionally independent.

$\displaystyle{ P(F_1,F_2|C) = P(F_1|C)P(F_2|F_1,C) = P(F_1|C)P(F_2|C) }$

$\displaystyle{ P(C|F_1,F_2) = \frac{P(C)P(F_1|C)P(F_2|C)}{P(F_1,F_2)} }$

$\displaystyle{ \mathrm{posterior} = \frac{\mathrm{prior} \times \mathrm{likelihood}}{\mathrm{evidence}} }$
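Plugging toy numbers into the factored posterior makes the prior/likelihood/evidence roles concrete. All probabilities here are made up for illustration; they are not estimated from real tweets:

```python
# Illustrative (made-up) model parameters for the two-feature tweet model.
P_C = {'pos': 0.5, 'neg': 0.5}           # prior P(C)
P_F1_given_C = {'pos': 0.6, 'neg': 0.1}  # likelihood of "awesome" given C
P_F2_given_C = {'pos': 0.1, 'neg': 0.5}  # likelihood of "crazy" given C

# Numerators P(C) P(F1|C) P(F2|C); the evidence P(F1,F2) is their sum.
num = {c: P_C[c] * P_F1_given_C[c] * P_F2_given_C[c] for c in P_C}
evidence = sum(num.values())
posterior = {c: num[c] / evidence for c in num}
print(posterior)   # posteriors over both classes, summing to 1
```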

### Adjustments to the Bayesian Sentiment Model

Two adjustments are needed in practice:

1. $P(F_1,F_2)$ may equal 0;

2. the product $P(F_1|C)P(F_2|C)$ can become extremely small, underflowing the floating-point range of packages such as Python's NumPy, which would otherwise force the use of special arbitrary-precision libraries.

$\displaystyle{ \log P(C|F_1,F_2) = \log P(C) + \log P(F_1|C) + \log P(F_2|C) - \log P(F_1,F_2) }$

Since the evidence $\log P(F_1,\dots,F_k)$ is the same for every class, it can be dropped when comparing classes, giving, for $k$ features:

$\displaystyle{ \log P(C|F_1,\dots,F_k) \propto \log P(C) + \sum_{i=1}^{k} \log P(F_i|C) }$
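The underflow problem from point 2, and why the log transform fixes it, can be seen directly:

```python
import numpy as np

# The product of many small likelihoods underflows double precision
# (values below ~1e-308 become 0.0), while the sum of logs stays finite.
probs = np.full(500, 1e-4)        # 500 features, each with likelihood 1e-4
product = np.prod(probs)          # 1e-2000 underflows to 0.0
log_sum = np.sum(np.log(probs))   # 500 * log(1e-4), a perfectly finite number
print(product, log_sum)
```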

## Python Code and Example: Using a Naive Bayes Model to Detect Abusive Forum Posts


```python
# ------ revised from http://blog.csdn.net/marvin521/article/details/9262445
from numpy import *

# ---------- 1. prepare training set ------------
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]   # 1 is abusive, 0 not
    return postingList, classVec

# --------- 2. define feature space / tokenize function ----------
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)   # union of all words seen
    return list(vocabSet)

# --------- 3. function to project test data into the feature space -------
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

# ---------- 4. function to train a naive Bayesian model from the training data set ----
def trainNB0(trainMatrix, trainCategory):
    NdataPoints = len(trainCategory)
    featureSpaceDimensions = len(trainMatrix[0])
    pAbusive = sum(trainCategory) * 1.0 / len(trainCategory)
    p0Num = ones(featureSpaceDimensions); p1Num = ones(featureSpaceDimensions)   # zeros() changed to ones() (Laplace smoothing)
    p0Denom = 2.0; p1Denom = 2.0                                                 # 1.0 changed to 2.0
    for i in range(NdataPoints):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num / p1Denom)   # changed to log() to avoid underflow
    p0Vect = log(p0Num / p0Denom)   # changed to log() to avoid underflow
    return p0Vect, p1Vect, pAbusive

# ------------ 5. apply the trained model to classify posts (projected into the feature space) --------
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    # element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

# ---------------------- test ----------------
```
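An end-to-end run of the same pipeline, written here as a compact, self-contained sketch (the data, smoothing constants, and log-space scoring mirror `loadDataSet`, `trainNB0`, and `classifyNB` above) so it can be executed on its own:

```python
import numpy as np

# Same training posts and labels as loadDataSet() above.
posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
labels = np.array([0, 1, 0, 1, 0, 1])      # 1 = abusive, 0 = not

# Feature space and binary set-of-words matrix (createVocabList + setOfWords2Vec).
vocab = sorted(set(w for post in posts for w in post))
X = np.array([[1 if w in post else 0 for w in vocab] for post in posts])

# Training (trainNB0): Laplace smoothing starts counts at 1, denominators at 2.
p_abusive = labels.mean()
p1 = np.log((1 + X[labels == 1].sum(axis=0)) / (2 + X[labels == 1].sum()))
p0 = np.log((1 + X[labels == 0].sum(axis=0)) / (2 + X[labels == 0].sum()))

# Classification (classifyNB): compare log posteriors of the two classes.
def classify(words):
    v = np.array([1 if w in words else 0 for w in vocab])
    s1 = (v * p1).sum() + np.log(p_abusive)
    s0 = (v * p0).sum() + np.log(1 - p_abusive)
    return 1 if s1 > s0 else 0

print(classify(['love', 'my', 'dalmation']))   # → 0 (not abusive)
print(classify(['stupid', 'garbage']))         # → 1 (abusive)
```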