前言
算法来自国外大牛的一篇博客:点击此处可查看
算法不涉及任何人工智能领域知识,仅仅是针对上下文无关文法提出的生成句子的思路。
上下文无关文法
上下文无关文法仅与句子结构有关,与上下文语意无关。
属性|单词
--|--
S |NP VP
NP |Det N / Det N
NP |I / he / she / Joe
VP |V NP / VP
Det |a / the / my / his
N |elephant / cat / jeans / suit
V |kicked / followed / shot
上面是一个上下文无关文法规则的例子,S代表一个句子,从S开始,逐渐递归填充单词,便可以生成一个句子。
基本实现
import random
from collections import defaultdict
class CFG(object):
def __init__(self):
self.prod = defaultdict(list) # 默认dict值为list,对于空键值对来说
def add_prod(self, lhs, rhs):
""" Add production to the grammar. 'rhs' can
be several productions separated by '|'.
Each production is a sequence of symbols
separated by whitespace.
Usage:
grammar.add_prod('NT', 'VP PP')
grammar.add_prod('Digit', '1|2|3|4')
"""
prods = rhs.split('|') # 按照|分割
for prod in prods:
self.prod[lhs].append(tuple(prod.split())) # 默认split按空格进行分割,但是这里的分割是生成一个元组,整体添加到prod里
def gen_random(self, symbol):
""" Generate a random sentence from the
grammar, starting with the given
symbol.
"""
sentence = ''
# select one production of this symbol randomly
rand_prod = random.choice(self.prod[symbol]) # 从符号列表中随机选择一个词组
for sym in rand_prod: #遍历词组中的单词
# for non-terminals, recurse
if sym in self.prod: #如果这个位置的单词并不是一个确切的单词,而是一个词法结构,那么递归选择相应的符合条件的单词
sentence += self.gen_random(sym)
else:
sentence += sym + ' ' #如果已经是一个确切的单词,那么直接连接到句子上即可
return sentence
cfg1 = CFG()
cfg1.add_prod('S', 'NP VP')
cfg1.add_prod('NP', 'Det N | Det N')
cfg1.add_prod('NP', 'I | he | she | Joe')
cfg1.add_prod('VP', 'V NP | VP')
cfg1.add_prod('Det', 'a | the | my | his')
cfg1.add_prod('N', 'elephant | cat | jeans | suit')
cfg1.add_prod('V', 'kicked | followed | shot')
for i in range(10):
print(cfg1.gen_random('S'))
这里给出了一个基于Python的基本实现,通过递归填充单词即可。
上下文无关文法导致无法终止的问题
上面的算法很简单,可以看起来很棒。但实际上存在一个问题,容易导致无法终止的问题。
属性|表达式
--|--
EXPR|TERM + EXPR
EXPR|TERM - EXPR
EXPR|TERM
TERM|FACTOR * TERM
TERM|FACTOR / TERM
TERM|FACTOR
FACTOR|ID // NUM // ( EXPR )
ID|x // y // z // w
NUM|0//1//2//3//4//5//6//7//8//9
例如上面一个生成算数表达式的例子,上面的规则都符合正常的数学知识,但是在生成表达式的过程中产生了不能终结的问题。EXPR->TERM + EXPR->TERM + EXPR,类似这样的无限循环。
解决无法终止问题
破解无法终止的问题,可以采用概率生成算法。
这里引用了作者原文中的图,由于TERM-EXPR的祖先已经使用过这个表达式,那么此时这个表达式的生成概率会相应地降低,例如图中的降低因子是0.5,也就是说使用过一次,那么下一次使用这个表达式的概率只有原来的50%。
上述算法使用代码实现如下
import random
from collections import defaultdict
# 概率选择算法
def weighted_choice(weights):
rnd = random.random() * sum(weights)
for i, w in enumerate(weights):
rnd -= w
if rnd < 0:
return i
class CFG(object):
def __init__(self):
self.prod = defaultdict(list) # 默认dict值为list,对于空键值对来说
def add_prod(self, lhs, rhs):
""" Add production to the grammar. 'rhs' can
be several productions separated by '|'.
Each production is a sequence of symbols
separated by whitespace.
Usage:
grammar.add_prod('NT', 'VP PP')
grammar.add_prod('Digit', '1|2|3|4')
"""
prods = rhs.split('|') # 按照|分割
for prod in prods:
self.prod[lhs].append(tuple(prod.split())) # 默认split按空格进行分割,但是这里的分割是生成一个元组,整体添加到prod里
def gen_random_convergent(self,
symbol,
cfactor=0.25,
pcount=defaultdict(int)
):
""" Generate a random sentence from the
grammar, starting with the given symbol.
Uses a convergent algorithm - productions
that have already appeared in the
derivation on each branch have a smaller
chance to be selected.
cfactor - controls how tight the
convergence is. 0 < cfactor < 1.0
pcount is used internally by the
recursive calls to pass on the
productions that have been used in the
branch.
"""
sentence = ''
# The possible productions of this symbol are weighted
# by their appearance in the branch that has led to this
# symbol in the derivation
#
weights = []
for prod in self.prod[symbol]: # 对于满足某个要求的所有表达式,计算相应的生成概率
if prod in pcount:
weights.append(cfactor ** (pcount[prod])) # 对于父节点已经引用过的表达式,此处需要根据因子减小生成概率
else:
weights.append(1.0) #
rand_prod = self.prod[symbol][weighted_choice(weights)] # 根据概率选择新生成的表达式
# pcount is a single object (created in the first call to
# this method) that's being passed around into recursive
# calls to count how many times productions have been
# used.
# Before recursive calls the count is updated, and after
# the sentence for this call is ready, it is rolled-back
# to avoid modifying the parent's pcount.
#
pcount[rand_prod] += 1
for sym in rand_prod:
# for non-terminals, recurse
if sym in self.prod: # 如果不是一个确切的单词,那么递归填充表达式
sentence += self.gen_random_convergent(
sym,
cfactor=cfactor,
pcount=pcount)
else:
sentence += sym + ' ' # 如果是一个确切的单词,那么直接添加到句子后面即可
# backtracking: clear the modification to pcount
pcount[rand_prod] -= 1 # 由于pcount是引用传值,因此需要恢复原来状态
return sentence
cfg1 = CFG()
cfg1.add_prod('S', 'NP VP')
cfg1.add_prod('NP', 'Det N | Det N')
cfg1.add_prod('NP', 'I | he | she | Joe')
cfg1.add_prod('VP', 'V NP | VP')
cfg1.add_prod('Det', 'a | the | my | his')
cfg1.add_prod('N', 'elephant | cat | jeans | suit')
cfg1.add_prod('V', 'kicked | followed | shot')
for i in range(10):
print(cfg1.gen_random_convergent('S'))
cfg2 = CFG()
cfg2.add_prod('EXPR', 'TERM + EXPR')
cfg2.add_prod('EXPR', 'TERM - EXPR')
cfg2.add_prod('EXPR', 'TERM')
cfg2.add_prod('TERM', 'FACTOR * TERM')
cfg2.add_prod('TERM', 'FACTOR / TERM')
cfg2.add_prod('TERM', 'FACTOR')
cfg2.add_prod('FACTOR', 'ID | NUM | ( EXPR )')
cfg2.add_prod('ID', 'x | y | z | w')
cfg2.add_prod('NUM', '0|1|2|3|4|5|6|7|8|9')
for i in range(10):
print(cfg2.gen_random_convergent('EXPR'))
小结
通过递归,可以很容易地实现基于上下文无关文法生成句子的算法。但是需要注意的是,普通算法会导致无法终止的问题,针对这个问题,有人提出了基于概率的句子生成算法,很好地解决了无法终止的问题。
原文地址:https://www.cnblogs.com/zhenlingcn/p/9210540.html