Word Segmentation with a Given Dictionary

I needed word segmentation recently, so I wrote a small algorithm for it in my spare time.

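Problem: given a dictionary and a sentence, segment the sentence into words.

Goal: take the dictionary as input and output every possible segmentation of the sentence.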

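Approach: depth-first search (DFS). At each position, try every dictionary word that matches the start of the remaining text, then recurse on what is left.

Speed-up: first check (validate) that every character in the sentence occurs in at least one dictionary word. If some character never occurs in the dictionary, no segmentation can exist and the search is skipped. This check is necessary but not sufficient.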
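
Because validate only compares characters, it can pass for a sentence that still has no segmentation. As an alternative (this is not the author's method), a dynamic-programming pass can decide exactly whether at least one segmentation exists before the DFS runs. A minimal sketch, assuming a hypothetical free function `canSegment` that works on bytes like the code below:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not part of the original class): returns true if
// s can be written as a concatenation of words from dict.
// reachable[i] means the first i bytes of s can be fully segmented.
bool canSegment(const std::string& s,
                const std::unordered_set<std::string>& dict) {
    std::vector<bool> reachable(s.size() + 1, false);
    reachable[0] = true;  // the empty prefix is trivially segmentable
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (!reachable[i]) continue;
        // Try to extend position i with every dictionary word.
        for (const std::string& w : dict) {
            if (!w.empty() && s.compare(i, w.size(), w) == 0) {
                reachable[i + w.size()] = true;
            }
        }
    }
    return reachable[s.size()];
}
```

The cost is roughly the sentence length times the total size of the dictionary, which is cheap next to enumerating every segmentation. Calling `canSegment(s, dict)` before the search would reject every input that has no solution, not only those containing unknown characters.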

//
//  Wordsplit.cpp
//
//  Target: Find all possible segmentations of a sentence given a dictionary dict
//  Usage:  see main() at the bottom of this file
//
//  Created by Rachel on 14-8-16.
//  Copyright (c) 2014 ZJU. All rights reserved.
//

#include <iostream>
#include <set>
#include <string>
#include <unordered_set>
#include <vector>
using namespace std;

class Wordsplit {
private:
    vector<string> list;
    // Returns true if s starts with cur_ele.
    bool match(const string &s, const string &cur_ele){
        return s.compare(0, cur_ele.length(), cur_ele) == 0;
    }

    // Quick rejection test: every byte of s must occur in at least one
    // dictionary word. Necessary but not sufficient for a split to exist.
    bool validate(const string &s, const unordered_set<string> &dict){
        // 1. collect all bytes that occur in the query
        set<char> alpha(s.begin(), s.end());
        // 2. collect all bytes that occur in the dictionary
        set<char> beta;
        for (const string &word : dict) {
            beta.insert(word.begin(), word.end());
        }
        // 3. reject the query if it uses a byte the dictionary never uses
        for (char c : alpha) {
            if (beta.find(c)==beta.end()) {
                return false;
            }
        }
        return true;
    }

public:
    // Depth-first search: s is the part still to be split, cur_str holds the
    // words chosen so far, each followed by a space.
    void split(const string &s, const unordered_set<string> &dict, const string &cur_str){
        if (s.empty()) {
            // whole sentence consumed: store the result without the trailing space
            list.push_back(cur_str.substr(0, cur_str.empty() ? 0 : cur_str.length()-1));
            return;
        }
        for (const string &word : dict) {
            // skip empty words, which would recurse forever
            if (!word.empty() && match(s, word)) {
                // consume this word and recurse on the remainder; cur_str itself
                // is left untouched, so no explicit backtracking is needed
                split(s.substr(word.length()), dict, cur_str + word + " ");
            }
        }
    }

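    // Entry point: returns every segmentation of s, or an empty vector if none exists.
    vector<string> main(const string &s, unordered_set<string> &dict) {
        list.clear(); // drop results left over from a previous call
        if (!validate(s, dict)) {
            return list;
        }
        split(s, dict, "");
        return list;
    }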
};

int main()
{
    Wordsplit s;
    unordered_set<string> L={"程序员","公务员","员","我","喜","做","程序","一","欢","喜欢","做一个","一个"};
    vector<string> V = s.main("我喜欢做一个程序员", L);
    vector<string>::iterator it;
    for (it=V.begin(); it!=V.end(); it++) {
        cout<<(*it)<<endl;
    }
}

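Output (8 segmentations; the order depends on the iteration order of unordered_set and may differ between runs and compilers):

我 喜 欢 做 一个 程序 员

我 喜 欢 做 一个 程序员

我 喜 欢 做一个 程序 员

我 喜 欢 做一个 程序员

我 喜欢 做 一个 程序 员

我 喜欢 做 一个 程序员

我 喜欢 做一个 程序 员

我 喜欢 做一个 程序员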
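
The listing above scans the whole dictionary at every step, so its running time grows with the dictionary size and its output order is not fixed. One possible variation (an assumption on my part, not the original code) is to try prefixes of increasing length and look each one up in the set. The names `dfs` and `segmentAll` below are hypothetical:

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: depth-first search from byte offset pos.
// path holds the words chosen so far; out collects finished segmentations.
static void dfs(const std::string& s, std::size_t pos, std::size_t maxLen,
                const std::unordered_set<std::string>& dict,
                std::vector<std::string>& path,
                std::vector<std::string>& out) {
    if (pos == s.size()) {
        // Join the chosen words with single spaces.
        std::string line;
        for (std::size_t i = 0; i < path.size(); ++i) {
            if (i > 0) line += ' ';
            line += path[i];
        }
        out.push_back(line);
        return;
    }
    // No prefix longer than the longest dictionary word can match.
    std::size_t limit = std::min(maxLen, s.size() - pos);
    for (std::size_t len = 1; len <= limit; ++len) {
        std::string word = s.substr(pos, len);
        if (dict.count(word)) {
            path.push_back(word);
            dfs(s, pos + len, maxLen, dict, path, out);
            path.pop_back();  // backtrack
        }
    }
}

// Hypothetical entry point: returns all segmentations of s, shorter
// first words before longer ones.
std::vector<std::string> segmentAll(const std::string& s,
                                    const std::unordered_set<std::string>& dict) {
    std::size_t maxLen = 0;
    for (const std::string& w : dict) maxLen = std::max(maxLen, w.size());
    std::vector<std::string> out, path;
    dfs(s, 0, maxLen, dict, path, out);
    return out;
}
```

Called as `segmentAll("我喜欢做一个程序员", L)` with the dictionary from main(), this sketch should return the same 8 segmentations, in an order that no longer depends on the hash table. Like the original, it compares raw bytes, so the sentence and the dictionary must use the same encoding.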

Time: 2024-10-17 04:07:01
