[Algorithm] Tree-Based Word Segmentation

#!/usr/bin/env python
# encoding=gbk
import os
import sys

G_ENCODING = "gbk"
r"""
===============================
Chinese word segmentation
1. Mechanical segmentation (dictionary matching)
2. Statistical segmentation
3. Understanding-based segmentation
===============================
Tree-based segmentation strategy (combines mechanical and statistical segmentation)
Example: 笔记本电脑 (laptop computer)
    dict = {"笔":0.8,"记":0.8,"本":0.8,"电":0.8,"脑":0.8,"笔记":0.9,"笔记本":0.9,"电脑":0.9,"笔记本电脑":0.9}
         -------------------------------
        |              <s>              |
         -------------------------------
        /         /         \            \
      [笔]     [笔记]    [笔记本]    [笔记本电脑]
      /          /        /    \
    [记]       [本]     [电]  [电脑]
    /         /   \      /
  [本]      [电] [电脑] [脑]
  /  \       /
[电] [电脑] [脑]
 /
[脑]
-------------------------------
path: 笔 记 本 电 脑  -- score: [0.32768]
path: 笔 记 本 电脑   -- score: [0.4608]
path: 笔记 本 电 脑   -- score: [0.4608]
path: 笔记 本 电脑    -- score: [0.648]
path: 笔记本 电 脑    -- score: [0.576]
path: 笔记本 电脑     -- score: [0.81]
path: 笔记本电脑      -- score: [0.9]

best path: 笔记本电脑 -- score: [0.9]

-------------------------------
1. Path weighting (look up each word's frequency with a search engine and derive the word's weight from it)
2. Strategies such as fewest segments, OOV handling, and fewest single-character words
== obtain the best segmentation path
-------------------------------
Q1. If the sentence is too long, the tree becomes very large and traversing it is slow (needs optimization)
Q2. Dictionary loading (needs optimization)
A simple Python implementation of this idea follows:
"""

class Stack():
    """A simple LIFO stack backed by a Python list."""
    def __init__(self, volume = 0):
        # volume is accepted for compatibility with calls such as Stack(30);
        # the list grows on demand, so it does not cap the stack size
        self.list = []

    def push(self, element):
        self.list.append(element)

    def pop(self):
        # return the most recently pushed element, or None if the stack is empty
        if self.list:
            return self.list.pop()
        return None

    def empty(self):
        return len(self.list) == 0

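class Node():
    """A tree node holding one candidate word."""
    def __init__(self, data, next = None, prev = None, depth = 0, wlen = 0, weight = 0.0):
        self.data = data                              # the word stored in this node
        self.next = next if next is not None else []  # child nodes
        self.prev = prev                              # parent node
        self.depth = depth                            # distance from the root
        self.wlen = wlen                              # characters consumed up to the end of this word
        self.weight = weight                          # word weight taken from the dictionary

    def isLeaf(self):
        # a node with no children (None or an empty list) is a leaf
        return not self.next
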
class Tree():
    def __init__(self, root = None):
        self.root = root

    def append(self, node, cnode):
        """append the child node cnode to node; return 0 on success, -1 otherwise"""
        if node is not None and cnode is not None:
            node.next.append(cnode)
            cnode.prev = node
            cnode.depth = node.depth + 1
            return 0
        return -1

    def depth_first_search(self, node):
        """depth-first search (preorder) of the n-ary subtree rooted at node"""
        visited = []
        if node is not None:
            stack = Stack(30)
            stack.push(node)
            while not stack.empty():
                tmp = stack.pop()
                visited.append(tmp)
                # push children right to left so the leftmost child is visited first
                for i in range(len(tmp.next) - 1, -1, -1):
                    stack.push(tmp.next[i])
        return visited

class Tokenizer():
    def load(self, tree, pnode, cache, dict):
        """init the tree: under every leaf, hang each dictionary word (or
        single character) that can follow it, then recurse into the new node"""
        clen = len(cache)
        for node in tree.depth_first_search(pnode):
            if node.isLeaf():
                i = node.wlen
                j = i
                while j < clen:
                    j += 1
                    tmp = cache[i:j]
                    # a single character is always accepted, even when it is
                    # not in the dictionary (OOV); it then gets weight 0.0
                    if tmp in dict or len(tmp) == 1:
                        tnode = Node(tmp, wlen = j, weight = dict.get(tmp, 0.0))
                        tree.append(node, tnode)
                        self.load(tree, tnode, cache, dict)
        return 0

    def backtrace(self, node, path):
        """walk from node up towards the root, collecting its ancestors (excluding <s>)"""
        if node.prev is not None and node.prev.data != "<s>":
            path.append(node.prev)
            self.backtrace(node.prev, path)
        return 0

    def bestpath(self, tree):
        highestScore = 0
        bestpath = ""
        for node in tree.depth_first_search(tree.root):
            # find each leaf node and backtrace from it to recover one path
            if node.isLeaf():
                path = [node]
                self.backtrace(node, path)
                path.reverse()
                # 1. Path weighting (look up each word's frequency with a
                #    search engine and derive the word's weight from it)
                # 2. Strategies such as fewest segments, OOV handling, and
                #    fewest single-character words
                # Here the score is simply the product of the weights on the path;
                # a weight that is not positive (e.g. an OOV character) counts as 1
                sc = 1.0
                tp = ""
                for xn in path:
                    sc *= xn.weight if xn.weight > 0 else 1
                    tp += xn.data + " "
                if sc > highestScore:
                    highestScore = sc
                    bestpath = tp.strip()
                print("path: %s -- score: [%g]" % (tp.strip(), sc))
        print("\nbest path: %s -- score: [%g]" % (bestpath, highestScore))
        return bestpath


def example():
    sent = "笔记本电脑"
    dict = {"笔":0.8,"记":0.8,"本":0.8,"电":0.8,"脑":0.8,"笔记":0.9,"笔记本":0.9,"电脑":0.9,"笔记本电脑":0.9}
    cache = sent  # in Python 3 a str is already a sequence of Unicode characters
    tokenizer = Tokenizer()
    tree = Tree(Node("<s>"))
    # init tree
    tokenizer.load(tree, tree.root, cache, dict)
    # backtrace and find the best path
    tokenizer.bestpath(tree)


if __name__ == "__main__":
    example()
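
Q1 in the notes above points out that the tree grows quickly with sentence length, because every possible segmentation becomes its own root-to-leaf path. One common way to avoid enumerating all paths is dynamic programming over character positions, keeping only the best-scoring segmentation of each prefix. The sketch below is not part of the original code. It swaps the tree for a table, and the names `best_segmentation` and `oov_weight` are made up for illustration. It assumes the same scoring rule (product of word weights, with an unknown single character counting as 1) and weights that are positive.

```python
def best_segmentation(sentence, lexicon, oov_weight=1.0):
    """Return (score, words) for the segmentation with the highest weight product.

    best[j] stores (score, start) for the best segmentation of sentence[:j],
    where `start` is the position at which the last word of that prefix begins.
    """
    n = len(sentence)
    # no candidate word can be longer than the longest lexicon entry
    max_len = max((len(w) for w in lexicon), default=1)
    best = [(-1.0, 0)] * (n + 1)
    best[0] = (1.0, 0)  # the empty prefix has score 1
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            word = sentence[i:j]
            if word in lexicon:
                weight = lexicon[word]
            elif j - i == 1:
                weight = oov_weight  # unknown single character (OOV)
            else:
                continue  # not a word, and too long to be a fallback character
            score = best[i][0] * weight
            if score > best[j][0]:
                best[j] = (score, i)
    # follow the stored start positions back from the end of the sentence
    words = []
    j = n
    while j > 0:
        i = best[j][1]
        words.append(sentence[i:j])
        j = i
    words.reverse()
    return best[n][0], words


lexicon = {"笔": 0.8, "记": 0.8, "本": 0.8, "电": 0.8, "脑": 0.8,
           "笔记": 0.9, "笔记本": 0.9, "电脑": 0.9, "笔记本电脑": 0.9}
score, words = best_segmentation("笔记本电脑", lexicon)
print(words)  # ['笔记本电脑']
```

The table has one entry per character position, so the work grows with the sentence length times the longest word length rather than with the number of paths. The trade-off is that only the single best path is recovered, whereas the tree version can print every path with its score.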

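Q2 only says that dictionary loading needs optimization and does not name a technique. One option that also helps `load`, which slices and looks up every substring up to the end of the sentence, is to store the dictionary in a prefix tree (trie), so the scan from a given position stops as soon as no dictionary word continues. The following is a minimal sketch under that assumption; `Trie` and `candidates` are hypothetical names, not part of the code above.

```python
class Trie:
    """Prefix tree over dictionary words; weight is set only where a word ends."""
    def __init__(self):
        self.children = {}
        self.weight = None

    def insert(self, word, weight):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.weight = weight


def candidates(trie, sentence, start):
    """Yield (end, word, weight) for every dictionary word beginning at `start`."""
    node = trie
    for end in range(start, len(sentence)):
        node = node.children.get(sentence[end])
        if node is None:
            return  # no dictionary word continues with this prefix
        if node.weight is not None:
            yield end + 1, sentence[start:end + 1], node.weight


lexicon = {"笔": 0.8, "记": 0.8, "本": 0.8, "电": 0.8, "脑": 0.8,
           "笔记": 0.9, "笔记本": 0.9, "电脑": 0.9, "笔记本电脑": 0.9}
trie = Trie()
for word, weight in lexicon.items():
    trie.insert(word, weight)

for end, word, weight in candidates(trie, "笔记本电脑", 0):
    print(end, word, weight)
```

Starting at position 1 the scan yields only 记 and then stops, because no dictionary word begins with 记本. A dictionary stored this way could also be built once and reused across sentences instead of being rebuilt for each call.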
Time: 2024-11-09 01:39:54
