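Machine Learning in Seconds: Spam Email Classification with Naive Bayes in Practice

I. Summary

One-sentence summary:

There is no need to learn many algorithms at once; if you do, you end up not really understanding any of them. Understand one thoroughly before moving on to the next.

How to explain this topic: a worked example plus plain language. Naive Bayes: P(result | keyword1, keyword2, …) = P(keyword1, keyword2, … | result) * P(result) / P(keyword1, keyword2, …)

How to keep learning this algorithm: write a program that is simple and uses very little data, then work from the easy cases toward the harder ones.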

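1. What is a stop-word list?

It is a list of characters or words that are filtered out automatically, for example punctuation marks and function words.

In information retrieval, stop words are characters or words that are automatically filtered out before or after processing natural-language data (text), in order to save storage space and improve search efficiency. Stop words are entered by hand rather than generated automatically, and together they form a stop-word list. However, no single stop-word list suits every tool, and some tools deliberately avoid stop words so that they can support phrase search.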
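As a minimal sketch of how a stop-word list is applied, the snippet below filters an already segmented sentence against a small set. The `stop_words` set and the `tokens` list are made up for illustration; in the project below the list is loaded from China_stop_word.txt instead.

```python
# Hypothetical stop-word set and tokens, for illustration only.
stop_words = {"的", "了", "和", "，", "。"}
tokens = ["我", "和", "朋友", "去", "了", "北京", "。"]

# A set gives constant-time membership checks, unlike a list.
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
```

Storing the stop words in a set rather than a list is a design choice that matters once the list has thousands of entries.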

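2. How are lists used in Python?

stopList=[]: defines an empty list

stopList.append(line[:len(line)-1]): appends an element to the end of the list

# Load the stop-word list
def getStopWords(self):
    stopList=[]
    for line in open("../data/China_stop_word.txt"):
        # Drop the last character of each line (the trailing newline)
        stopList.append(line[:len(line)-1])
    return stopList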

3. What is the syntax of the for loop in Python?

for ... in ...: the general form is for <variable> in <iterable>:, for example for line in open("../data/China_stop_word.txt"):
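A small sketch of the same loop that runs without the data file: `io.StringIO` stands in for the opened stop-word file, and its contents are made up. It also shows `rstrip("\n")` as an alternative to `line[:len(line)-1]`, which would cut off a real character if the last line had no trailing newline.

```python
import io

# io.StringIO stands in for an opened stop-word file (contents are made up).
fake_file = io.StringIO("的\n了\n和")

stop_list = []
for line in fake_file:
    # Iterating a file yields one line at a time, newline included.
    # rstrip("\n") removes the newline only when it is present.
    stop_list.append(line.rstrip("\n"))
print(stop_list)
```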

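4. What is Python's jieba package for?

It is a word segmentation module. Word clouds are everywhere these days, and there are many segmentation modules besides jieba.

5. Which segmentation modes do segmentation packages usually offer (for example, when jieba segments '我想和女朋友一起去北京天安门闲逛。。')?

Precise mode: jieba.cut(s): each word is segmented only once: 我,想,和,女朋友,一起,去,北京,天安门,闲逛,。,。

Full mode: jieba.cut(s,cut_all = True): extracts as many words as possible: 我,想,和,女朋友,朋友,一起,去,北京,天安,天安门,闲逛,,,

Search engine mode: jieba.cut_for_search(s): 我,想,和,朋友,女朋友,一起,去,北京,天安,天安门,闲逛,。,。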

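6. How do you filter out non-Chinese characters?

rule=re.compile(r"[^\u4e00-\u9fa5]"): this pattern matches every character outside the range U+4E00 to U+9FA5, so rule.sub("",line) removes everything except Chinese characters.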
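A short runnable sketch of this filter, using a made-up input line. Note that full-width punctuation such as "，" and "！" also falls outside the range and is removed.

```python
import re

# Match any character that is NOT in the range U+4E00..U+9FA5.
rule = re.compile(r"[^\u4e00-\u9fa5]")

line = "Hello，世界！123 abc 你好"  # made-up sample text
cleaned = rule.sub("", line)
print(cleaned)  # 世界你好
```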

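7. What does P(x'|c)=P((sunny,cool,high,strong)|yes) mean?

Given that the result is yes, it is the probability of observing sunny, cool, high, and strong together. Under the naive independence assumption, each value's probability is estimated as its share among the yes samples, and these shares are then multiplied.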

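8. How should the naive Bayes formula P(result | keyword1, keyword2, …) = P(keyword1, keyword2, … | result) * P(result) / P(keyword1, keyword2, …) be read?

P(result | keyword1, keyword2, …): the probability of each category given keyword1, keyword2, …. For example, an email contains these keywords and you want to decide whether it is spam.

P(keyword1, keyword2, … | result): the probability that these keywords (keyword1, keyword2, …) appear within a given category. For example, keyword1 appears in spam with probability 0.7.

P(result): the share of each category among all results. For example, spam makes up 0.5 of the emails in the training data.

P(keyword1, keyword2, …): the probability that these keywords appear in any email. It is the same for every category, so it can be ignored when categories are compared.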
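To make the formula concrete, here is a one-keyword worked example. The values 0.7 and 0.5 come from the examples above; `p_word_given_normal = 0.1` is an assumed number added only so that the denominator can be computed.

```python
p_spam = 0.5               # P(result = spam), from the example above
p_normal = 1 - p_spam      # P(result = normal)
p_word_given_spam = 0.7    # P(keyword1 | spam), from the example above
p_word_given_normal = 0.1  # P(keyword1 | normal), assumed for illustration

# Denominator: total probability of seeing the keyword in any email.
p_word = p_word_given_spam * p_spam + p_word_given_normal * p_normal

# Bayes' rule: P(spam | keyword1).
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.875
```

Under these numbers, seeing keyword1 raises the probability of spam from the prior 0.5 to 0.875.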

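9. What is the principle behind classifying email with naive Bayes (P(result | keyword1, keyword2) = P(keyword1, keyword2 | result) * P(result) / P(keyword1, keyword2))?

Understand this formula: P(result | keyword1, keyword2) = P(keyword1, keyword2 | result) * P(result) / P(keyword1, keyword2)

Pick the top 15 words: for the email under test, the top 15 words are enough. In the code below, getTestWords is meant to rank the words by p(s|w), not by raw frequency.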
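Selecting the top 15 entries from a dictionary of scores can be sketched as follows. The `word_probs` values are made up. The point to note is that `sorted` returns a new list and leaves the dictionary unchanged, so its result has to be kept.

```python
# Hypothetical per-word scores P(spam | word) for one test email:
# 19 words with scores 0.05, 0.10, ..., 0.95.
word_probs = {f"w{i}": i / 20 for i in range(1, 20)}

# sorted() does not modify word_probs; keep the returned list.
ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
top15 = dict(ranked[:15])

print(len(word_probs), len(top15), min(top15.values()))
```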

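10. In naive Bayes, are the individual probabilities inside P(keyword1, keyword2, … | result) added or multiplied?

Multiplied: the condition is that all the keywords appear together, and naive Bayes assumes the keywords are independent of one another given the result, so the individual probabilities are multiplied.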
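A minimal sketch of the multiplication, with made-up per-keyword probabilities. It also shows the common trick of summing logarithms instead, which gives the same result but avoids floating-point underflow when many small probabilities are multiplied; the log form is an addition here, not something the code below uses.

```python
import math

# Hypothetical values of P(keyword_i | spam) for three keywords.
p_keywords_given_spam = [0.7, 0.4, 0.2]

# Naive Bayes: the joint likelihood is the product of the individual ones.
likelihood = math.prod(p_keywords_given_spam)

# Equivalent computation in log space.
log_likelihood = sum(math.log(p) for p in p_keywords_given_spam)

print(likelihood)
print(math.exp(log_likelihood))
```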

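11. To decide whether a new email is spam, which one do we actually need: P(result | keyword1, keyword2, …) or P(keyword1, keyword2, … | result)?

P(result | keyword1, keyword2, …)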

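12. What are the steps for classifying email with naive Bayes (P(result | keyword1, keyword2) = P(keyword1, keyword2 | result) * P(result) / P(keyword1, keyword2))?

Get the word frequencies in normal emails and in spam emails: P(keyword1, keyword2 | result)

Test an email: compute p(s|w) for each file to find the 15 words that influence the classification most: P(keyword1, keyword2 | result) * P(result)

Compare the resulting probabilities and choose a result: compare P(keyword1, keyword2 | result) * P(result) across the different results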
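The three steps above can be sketched end to end on a tiny made-up data set. This is a textbook-style naive Bayes with add-one smoothing, which is an assumption of this sketch; the project code below takes a different route (per-word p(s|w) with fixed fallback values). All names and data here are hypothetical.

```python
import math
from collections import Counter

# Tiny made-up training set: (tokens, label).
train = [
    (["win", "money", "now"], "spam"),
    (["free", "money", "offer"], "spam"),
    (["meeting", "at", "noon"], "normal"),
    (["project", "meeting", "notes"], "normal"),
]

# Step 1: word counts per class, plus class counts for P(result).
word_counts = {"spam": Counter(), "normal": Counter()}
class_counts = Counter()
for tokens, label in train:
    word_counts[label].update(tokens)
    class_counts[label] += 1
vocab = {w for counts in word_counts.values() for w in counts}

def log_score(tokens, label):
    """Step 2: log(P(keywords | label) * P(label)) with add-one smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for w in tokens:
        score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return score

def classify(tokens):
    # Step 3: choose the label with the larger score. P(keywords) is the
    # same for both labels, so it is left out of the comparison.
    return max(word_counts, key=lambda label: log_score(tokens, label))

print(classify(["free", "money"]))
print(classify(["meeting", "notes"]))
```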

II. Content (covered in the summary above)

1. Background knowledge

2. Code

spamEmail.py

#encoding=utf-8
'''
Created on March 11, 2018
@author: Fan Renyi
'''
import jieba
import os
class spamEmailBayes:
    # Load the stop-word list
    def getStopWords(self):
        stopList=[]
        for line in open("../data/China_stop_word.txt"):
            # Drop the last character of each line (the trailing newline)
            stopList.append(line[:len(line)-1])
        return stopList
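    # Collect the distinct words of one email into wordsList
    def get_word_list(self,content,wordsList,stopList):
        # Segment the text and put the result in res_list
        res_list = list(jieba.cut(content))
        for i in res_list:
            # Skip stop words and empty or whitespace-only tokens
            if i is not None and i not in stopList and i.strip()!='':
                if i not in wordsList:
                    wordsList.append(i)
    # If a word from the list is already in the dictionary, add 1; otherwise insert it with count 1
    def addToDict(self,wordsList,wordsDict):
        for item in wordsList:
            if item in wordsDict:
                wordsDict[item]+=1
            else:
                wordsDict[item]=1
    # Return the names of all files in a directory
    def get_File_List(self,filePath):
        filenames=os.listdir(filePath)
        return filenames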
    # Compute p(s|w) for every word in the file and keep the 15 words with the highest values
    def getTestWords(self,testDict,spamDict,normDict,normFilelen,spamFilelen):
        wordProbList={}
        for word,num in testDict.items():
            if word in spamDict and word in normDict:
                # Fraction of spam / normal emails that contain the word
                pw_s=spamDict[word]/spamFilelen
                pw_n=normDict[word]/normFilelen
                ps_w=pw_s/(pw_s+pw_n)
            elif word in spamDict:
                # Word seen only in spam: use 0.01 for p(w|normal)
                pw_s=spamDict[word]/spamFilelen
                pw_n=0.01
                ps_w=pw_s/(pw_s+pw_n)
            elif word in normDict:
                # Word seen only in normal email: use 0.01 for p(w|spam)
                pw_s=0.01
                pw_n=normDict[word]/normFilelen
                ps_w=pw_s/(pw_s+pw_n)
            else:
                # Word not seen in either training dictionary: set the probability to 0.47
                ps_w=0.47
            wordProbList[word]=ps_w
        # sorted() returns a new list, so keep its result and return only the top 15
        top15=sorted(wordProbList.items(),key=lambda d:d[1],reverse=True)[0:15]
        return dict(top15)
    # Combine the per-word probabilities into one Bayes probability for the email
    def calBayes(self,wordList,spamdict,normdict):
        ps_w=1
        ps_n=1
        for word,prob in wordList.items():
            print(word+"/"+str(prob))
            # Product of p(s|w) and product of (1 - p(s|w)) over the selected words
            ps_w*=(prob)
            ps_n*=(1-prob)
        p=ps_w/(ps_w+ps_n)
        return p
    # Compute the accuracy of the predictions
    # (test files named below 1000 are normal email, those above 1000 are spam)
    def calAccuracy(self,testResult):
        rightCount=0
        errorCount=0
        for name,category in testResult.items():
            if (int(name)<1000 and category==0) or (int(name)>1000 and category==1):
                rightCount+=1
            else:
                errorCount+=1
        return rightCount/(rightCount+errorCount)
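The combination rule used in calBayes can be tried on its own. The sketch below is a standalone rewrite with a hypothetical function name `combine` and made-up input values; it is meant only to show how the per-word probabilities turn into a single score.

```python
def combine(probs):
    """Combine per-word P(spam | word) values the way calBayes does."""
    p_spam = 1.0
    p_norm = 1.0
    for p in probs:
        p_spam *= p        # product of p(s|w)
        p_norm *= 1 - p    # product of (1 - p(s|w))
    return p_spam / (p_spam + p_norm)

print(combine([0.9, 0.8, 0.6]))  # several spam-leaning words
print(combine([0.5, 0.5]))       # neutral words leave the score at 0.5
```

With the first input the score is about 0.98, which is above the 0.9 threshold that main.py uses to label an email as spam.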

main.py

#encoding=utf-8
'''
Created on March 11, 2018
@author: Fan Renyi
'''
from spam.spamEmail import spamEmailBayes
import re
import time
# Start timing
begin_time=time.time()
# Instance of the spam classifier class
spam=spamEmailBayes()
# Dictionaries that hold word frequencies
spamDict={}
normDict={}
testDict={}
# Words that appear in the current email
wordsList=[]
wordsDict={}
# Prediction results: key is the file name, value is the predicted category
testResult={}
# Get the file name lists for normal email, spam, and test files
normFileList=spam.get_File_List("./../data/normal")
spamFileList=spam.get_File_List("./../data/spam")
testFileList=spam.get_File_List("./../data/test")
# Number of normal and spam emails in the training set
normFilelen=len(normFileList)
spamFilelen=len(spamFileList)
# Load the stop-word list, used to filter out stop words
stopList=spam.getStopWords()
# Get the word frequencies in normal email
for fileName in normFileList:
    wordsList.clear()
    for line in open("./../data/normal/"+fileName):
        # Filter out non-Chinese characters
        rule=re.compile(r"[^\u4e00-\u9fa5]")
        line=rule.sub("",line)
        # Store the words that appear in this email in wordsList
        spam.get_word_list(line,wordsList,stopList)
    # Count in how many emails each word appears
    spam.addToDict(wordsList, wordsDict)
normDict=wordsDict.copy()
# Get the word frequencies in spam
wordsDict.clear()
for fileName in spamFileList:
    wordsList.clear()
    for line in open("./../data/spam/"+fileName):
        rule=re.compile(r"[^\u4e00-\u9fa5]")
        line=rule.sub("",line)
        spam.get_word_list(line,wordsList,stopList)
    spam.addToDict(wordsList, wordsDict)
spamDict=wordsDict.copy()
# Classify the test emails
for fileName in testFileList:
    testDict.clear()
    wordsDict.clear()
    wordsList.clear()
    for line in open("./../data/test/"+fileName):
        rule=re.compile(r"[^\u4e00-\u9fa5]")
        line=rule.sub("",line)
        spam.get_word_list(line,wordsList,stopList)
    spam.addToDict(wordsList, wordsDict)
    testDict=wordsDict.copy()
    # Compute p(s|w) for each file to get the 15 words that influence the classification most
    wordProbList=spam.getTestWords(testDict, spamDict,normDict,normFilelen,spamFilelen)
    # Compute the Bayes probability from the 15 words selected for this email
    p=spam.calBayes(wordProbList, spamDict, normDict)
    # Label the email as spam (1) only when the probability exceeds 0.9
    if(p>0.9):
        testResult.setdefault(fileName,1)
    else:
        testResult.setdefault(fileName,0)
# Write the results to answer/ans.txt
f2=open('../data/answer/ans.txt',encoding='utf-8',mode='w')
# Compute the classification accuracy (test files named below 1000 are normal email)
testAccuracy=spam.calAccuracy(testResult)
for i,ic in testResult.items():
    print(i+"/"+str(ic))
    f2.write(i+"/"+str(ic)+'\n')
print(testAccuracy)
f2.write(str(testAccuracy)+'\n')
# Close the file so that all results are flushed to disk
f2.close()
end_time=time.time()
print('Total running time:',(end_time-begin_time),'(s)')