Bayesian Networks: An Introductory Tutorial (Theory and Python Implementation)
Contents
- Bayesian Network Principles
- Local Markov Property
- Hands-on Example
- pgmpy Source Code Walkthrough
- References
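Bayesian Network Principles
A Bayesian network is a probabilistic graphical model that drops the blanket assumption that variables are conditionally independent and instead represents their dependencies explicitly. Its structure is a directed acyclic graph (DAG): each node represents a random variable and has an associated conditional probability table, and each directed edge represents a direct dependency between two nodes.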
Local Markov Property
One property of Bayesian networks is the local Markov property.
Assumption 1.1 (Local Markov Assumption): Given its parents in the DAG, a node $X$ is independent of all of its non-descendants.
By the chain rule, the joint probability distribution of any set of variables can be written as:
$$P(x_1,x_2,\dots,x_n)=P(x_1)\,P(x_2\mid x_1)\,P(x_3\mid x_1,x_2)\cdots P(x_n\mid x_{n-1},\dots,x_1) \qquad (1.1)$$
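If the variables satisfy the local Markov property, each factor only needs to condition on the node's parents, and the joint distribution simplifies to: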
$$P(x_1,x_2,\dots,x_n)=\prod_{i=1}^{n}P\big(x_i\mid \mathrm{Parents}(x_i)\big) \qquad (1.2)$$
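To make the factorization concrete, here is a minimal pure-Python sketch that multiplies one factor P(x_i | Parents(x_i)) per node. The three-node chain A -> B -> C and all of its probabilities are hypothetical values chosen only for illustration; they do not come from the article.

```python
from itertools import product

# Hypothetical binary chain A -> B -> C (illustrative numbers only).
parents = {"A": [], "B": ["A"], "C": ["B"]}

# cpt[node][parent_values][value] = P(node = value | parents = parent_values)
cpt = {
    "A": {(): [0.6, 0.4]},
    "B": {(0,): [0.7, 0.3], (1,): [0.2, 0.8]},
    "C": {(0,): [0.9, 0.1], (1,): [0.5, 0.5]},
}

def joint_probability(assignment):
    """Multiply P(x_i | Parents(x_i)) over all nodes."""
    prob = 1.0
    for node, node_parents in parents.items():
        parent_values = tuple(assignment[p] for p in node_parents)
        prob *= cpt[node][parent_values][assignment[node]]
    return prob

# Summing the factorized joint over every assignment should give 1.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product([0, 1], repeat=len(parents))
)

print(f"P(A=1, B=0, C=1) = {joint_probability({'A': 1, 'B': 0, 'C': 1}):.4f}")
print(f"sum over all assignments = {total:.4f}")
```

Each node only stores a table indexed by its parents' values, which is why the factorized form usually needs far fewer numbers than a full joint table.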
Hands-on Example
This section uses the Python package pgmpy to build a Bayesian network and run inference on it.
Installation:
pip install pgmpy
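As an example, we build a Bayesian network for the quality of a student's recommendation letter. The DAG formed by the nodes and the corresponding probability tables are shown in the figure. The quality of the letter (letter) is directly influenced by the student's grade (grade), while the exam difficulty (diff) and the student's intelligence (intel) directly influence the grade. Intelligence also influences the SAT score.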
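Under the local Markov assumption, the structure just described would factorize as follows. This is a sketch derived from the edges named in the paragraph above, not quoted from the original figure. It uses the single-letter abbreviations that appear in the code below (D = diff, I = intel, G = grade, S = SAT, L = letter).

```latex
P(D, I, G, S, L) = P(D)\,P(I)\,P(G \mid I, D)\,P(S \mid I)\,P(L \mid G)
```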
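We use the Bayesian network to infer the probability distribution of the letter quality for a highly intelligent student who took a difficult exam. The steps are as follows.
Step 1: Define the structure of the Bayesian network.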
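    # Note: depending on the installed pgmpy version, this class may instead
    # be exposed as DiscreteBayesianNetwork.
    from pgmpy.models import BayesianNetwork

    # Edges: D -> G, I -> G, I -> S, G -> L
    letter_bn = BayesianNetwork([
        ('D', 'G'), ('I', 'G'), ('I', 'S'), ('G', 'L')
    ])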
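The edge list alone already determines which independencies the local Markov assumption implies. The following pure-Python sketch derives them without pgmpy; the helper names `descendants` and `local_independencies` are hypothetical and are not part of any library.

```python
edges = [("D", "G"), ("I", "G"), ("I", "S"), ("G", "L")]
nodes = sorted({n for edge in edges for n in edge})

parents = {n: set() for n in nodes}
children = {n: set() for n in nodes}
for u, v in edges:
    parents[v].add(u)
    children[u].add(v)

def descendants(node):
    """All nodes reachable from `node` by following directed edges."""
    found, stack = set(), [node]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def local_independencies(node):
    """Non-descendants of `node`, excluding the node itself and its parents."""
    return set(nodes) - descendants(node) - parents[node] - {node}

for n in nodes:
    others = local_independencies(n)
    if others:
        print(f"{n} is independent of {sorted(others)} given {sorted(parents[n])}")
```

For instance, the sketch reports that L is independent of D, I and S once G is given, which matches the intuition that the letter depends on the other variables only through the grade.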
Step 2: Build the conditional probability distribution (CPD) of each node.
    from pgmpy.factors.discrete import TabularCPD

    # Arguments: variable name, number of states, probability table
    d_cpd = TabularCPD(variable='D', variable_card=2, values=[[0.6], [0.4]])
    i_cpd = TabularCPD(variable='I', variable_card=2, values=[[0.7], [0.3]])

    # Rows = states of G (3); columns = joint states of the parents I and D (2 * 2 = 4)
    g_cpd = TabularCPD(variable='G', variable_card=3,
                       values=[[0.3, 0.05, 0.9, 0.5],
                               [0.4, 0.25, 0.08, 0.3],
                               [0.3, 0.7, 0.02, 0.2]],
                       evidence=['I', 'D'],     # names of the parent variables
                       evidence_card=[2, 2])    # number of states of each parent

    s_cpd = TabularCPD(variable='S', variable_card=2,
                       values=[[0.95, 0.2], [0.05, 0.8]],
                       evidence=['I'], evidence_card=[2])

    l_cpd = TabularCPD(variable='L', variable_card=2,
                       values=[[0.1, 0.4, 0.99], [0.9, 0.6, 0.01]],
                       evidence=['G'], evidence_card=[3])  # evidence_card must be a list
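Note: a conditional probability table has one row per state of the variable and one column per combination of parent states; evidence_card must be a list.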
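The shape rule in the note can be checked by hand before a table is passed to pgmpy. Below is a minimal pure-Python sketch; `check_cpt` is a hypothetical helper, not a pgmpy function. The column labelling assumes that the last evidence variable varies fastest, which is consistent with the query result shown in step 4.

```python
from itertools import product
from math import isclose, prod

def check_cpt(values, variable_card, evidence_card):
    """Return True if the table has the expected shape and every column sums to 1."""
    n_cols = prod(evidence_card)  # prod([]) == 1 for a node without parents
    if len(values) != variable_card or any(len(row) != n_cols for row in values):
        return False
    return all(isclose(sum(col), 1.0) for col in zip(*values))

# The table of G from step 2, with parents I and D.
g_values = [[0.3, 0.05, 0.9, 0.5],
            [0.4, 0.25, 0.08, 0.3],
            [0.3, 0.7, 0.02, 0.2]]
print(check_cpt(g_values, variable_card=3, evidence_card=[2, 2]))

# Column labels, assuming the last evidence variable (D) varies fastest.
for col, (i, d) in enumerate(product(range(2), range(2))):
    print(f"column {col}: I={i}, D={d}, P(G | I, D) = {[row[col] for row in g_values]}")
```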
Step 3: Add the CPDs to the Bayesian network, check the model, and print the CPDs.
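    letter_bn.add_cpds(d_cpd, i_cpd, g_cpd, s_cpd, l_cpd)
    print(letter_bn.check_model())    # True if the structure and the CPDs are consistent; raises otherwise
    for cpd in letter_bn.get_cpds():  # the CPDs attached to the network
        print(cpd)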
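Step 4: Run inference on the network. Here we infer the probability distribution of the letter quality for a highly intelligent student (I = 1) who took a difficult exam (D = 1).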
    from pgmpy.inference import VariableElimination

    letter_infer = VariableElimination(letter_bn)  # exact inference by variable elimination
    prob_I = letter_infer.query(variables=['L'], evidence={'I': 1, 'D': 1})
    print(f"prob_I:\n{prob_I}")
The output is shown below: the probability of receiving a poor letter is 0.368, and the probability of receiving a good letter is 0.632.
    prob_I:
    +------+----------+
    | L    |   phi(L) |
    +======+==========+
    | L(0) |   0.3680 |
    +------+----------+
    | L(1) |   0.6320 |
    +------+----------+
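This result can be reproduced by hand. With D and I observed, L depends on them only through G, and S is not an ancestor of L, so only G has to be summed out: P(L | I=1, D=1) = Σ_g P(g | I=1, D=1) P(L | g). The sketch below uses the numbers from step 2 and assumes that the last column of the G table corresponds to I=1, D=1.

```python
# P(G | I=1, D=1): the last column of the G table from step 2
p_g = [0.5, 0.3, 0.2]

# P(L | G): rows are L=0 and L=1, columns are G=0, 1, 2
p_l_given_g = [[0.1, 0.4, 0.99],
               [0.9, 0.6, 0.01]]

# Sum out G: P(L | I=1, D=1) = sum_g P(L | g) * P(g | I=1, D=1)
p_l = [sum(row[g] * p_g[g] for g in range(3)) for row in p_l_given_g]
print([round(p, 4) for p in p_l])  # [0.368, 0.632]
```

The two values agree with the table printed by the variable elimination query.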
pgmpy Source Code Walkthrough
- BayesianNetwork
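In pgmpy, the BayesianNetwork class inherits from the DAG class. DAG is built on the directed graph class (DiGraph) of networkx and represents a directed acyclic graph.
An abridged sketch of the BayesianNetwork class is shown below: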
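What makes a directed graph a DAG is the absence of directed cycles. As a minimal pure-Python sketch of such a check, the function below uses Kahn's algorithm. It is not how pgmpy or networkx implement the check, and the name `is_acyclic` is hypothetical.

```python
from collections import deque

def is_acyclic(edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed
    in topological order (repeatedly removing nodes with no incoming edges)."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for u, v in edges:
        children[u].append(v)
        indegree[v] += 1

    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return removed == len(nodes)

student_edges = [("D", "G"), ("I", "G"), ("I", "S"), ("G", "L")]
print(is_acyclic(student_edges))                 # True
print(is_acyclic(student_edges + [("L", "D")]))  # False: D -> G -> L -> D is a cycle
```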
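    import logging
    from collections import defaultdict

    # Import paths as in the pgmpy release this excerpt is based on;
    # they may differ in other versions.
    from pgmpy.base import DAG
    from pgmpy.factors.continuous import ContinuousFactor
    from pgmpy.factors.discrete import TabularCPD


    class BayesianNetwork(DAG):
        def __init__(self, ebunch=None, latents=set()):
            """
            Construct a Bayesian network model.
            The model stores nodes and directed edges, together with
            conditional probability tables and other attributes.
            """
            super(BayesianNetwork, self).__init__(ebunch=ebunch, latents=latents)
            self.cpds = []
            self.cardinalities = defaultdict(int)

        def add_cpds(self, *cpds):
            """
            Add conditional probability distributions to the Bayesian network.
            cpds : list, set, tuple (array-like),
                List of CPDs which will be associated with the model
            """
            for cpd in cpds:
                if not isinstance(cpd, (TabularCPD, ContinuousFactor)):
                    raise ValueError("Only TabularCPD or ContinuousFactor can be added.")

                if set(cpd.scope()) - set(cpd.scope()).intersection(set(self.nodes())):
                    raise ValueError("CPD defined on variable not in the model", cpd)

                # Replace an existing CPD for the same variable, otherwise append.
                for prev_cpd_index in range(len(self.cpds)):
                    if self.cpds[prev_cpd_index].variable == cpd.variable:
                        logging.info(f"Replacing existing CPD for {cpd.variable}")
                        self.cpds[prev_cpd_index] = cpd
                        break
                else:
                    self.cpds.append(cpd)

        def get_cardinality(self, node=None):
            """
            Return the number of states of a node (random variable),
            or of every node if no node is given.
            """
            if node is not None:
                # get_cpds is defined in the full class and omitted from this excerpt.
                return self.get_cpds(node).cardinality[0]
            else:
                cardinalities = defaultdict(int)
                for cpd in self.cpds:
                    cardinalities[cpd.variable] = cpd.cardinality[0]
                return cardinalities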
Example of constructing a Bayesian network:
    from pgmpy.models import BayesianNetwork
    from pgmpy.factors.discrete import TabularCPD

    # Edges: diff -> grade, intel -> grade
    student = BayesianNetwork([('diff', 'grade'), ('intel', 'grade')])

    cpd_diff = TabularCPD('diff', 2, [[0.6], [0.4]])
    cpd_intel = TabularCPD('intel', 2, [[0.7], [0.3]])
    cpd_grade = TabularCPD('grade', 2,
                           [[0.1, 0.9, 0.2, 0.7],
                            [0.9, 0.1, 0.8, 0.3]],
                           ['intel', 'diff'], [2, 2])
    student.add_cpds(cpd_diff, cpd_intel, cpd_grade)

    student.get_cardinality()
    # defaultdict(int, {'diff': 2, 'intel': 2, 'grade': 2})
    print(f"Number of states of the grade node: {student.get_cardinality('grade')}")
    # Number of states of the grade node: 2
- TabularCPD
The TabularCPD class defines a conditional probability distribution (CPD) table for a discrete variable.
Example of constructing a conditional probability table:
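    from pgmpy.factors.discrete import TabularCPD

    cpd = TabularCPD(variable='grade',                         # name of the random variable
                     variable_card=3,                          # number of states of the variable
                     values=[[0.1, 0.1, 0.1, 0.1, 0.1, 0.1],   # probability table of the variable
                             [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
                             [0.8, 0.8, 0.8, 0.8, 0.8, 0.8]],
                     evidence=['diff', 'intel'],               # parent variables of the variable
                     evidence_card=[2, 3])                     # number of states of each parent
    print(cpd)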
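To read a single entry out of such a table, the combination of parent states has to be mapped to a column. The sketch below does this with a mixed-radix index, assuming that the last evidence variable varies fastest; `column_index` is a hypothetical helper and not part of pgmpy.

```python
def column_index(evidence_values, evidence_card):
    """Mixed-radix index of a parent-state combination (last variable varies fastest)."""
    index = 0
    for value, card in zip(evidence_values, evidence_card):
        if not 0 <= value < card:
            raise ValueError(f"state {value} is out of range for cardinality {card}")
        index = index * card + value
    return index

# Same table as in the example above: 3 states of grade, 2 * 3 = 6 columns.
values = [[0.1] * 6, [0.1] * 6, [0.8] * 6]

col = column_index([1, 2], [2, 3])  # diff = 1, intel = 2
print(col)             # 5
print(values[2][col])  # 0.8, read as P(grade=2 | diff=1, intel=2)
```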
References
- https://pgmpy.org/detailed_notebooks/2.%20Bayesian%20Networks.html
- Neal B. Introduction to Causal Inference from a Machine Learning Perspective[M]. 2020: chapter 3, pp. 28-30.
- Conrady S, Jouffe L. Bayesian Networks & BayesiaLab: A Practical Introduction for Researchers[M]. 2015.
- Pearl J. Causality: Models, Reasoning, and Inference[M]. Cambridge University Press, 2000.
- A Bayesian network approach for population synthesis. https://www.sciencedirect.com/science/article/pii/S0968090X15003599
For more detail, the book Causality: Models, Reasoning, and Inference by Judea Pearl, who introduced Bayesian networks, is strongly recommended.