Fixing "UnicodeDecodeError: 'utf-8' codec can't decode byte..." when loading a corpus with gensim.models.word2vec.LineSentence

  On Windows, loading the (already tokenized) Chinese Wikipedia corpus with gensim.models.word2vec.LineSentence raises the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte

  Encoding problems like this are a real headache. They typically surface at a call such as xxx.decode("utf-8"), so let's look at the gensim source:
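
  To see where the message comes from, here is a minimal sketch that reproduces it without gensim. It assumes the corpus was saved as GBK, which is consistent with the 0xca byte in the traceback; the post does not state the file's actual encoding, and the sample line is made up.

```python
# Hypothetical sample line; the real corpus is not shown in the post.
# Assumption: the file is GBK-encoded, so the line starts with bytes 0xca 0xfd.
raw = "数学 是 基础".encode("gbk")

try:
    raw.decode("utf-8")          # same kind of call that fails inside gensim
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)                     # the UTF-8 decode error
print(raw.decode("gbk"))         # decoding with the matching codec succeeds
```

  If the same bytes decode cleanly with another codec, the file is simply not UTF-8, and the fix is a matter of telling the reader which encoding to use.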

class LineSentence(object):
    """Iterate over a file that contains sentences: one line = one sentence.
    Words must be already preprocessed and separated by whitespace.

    """
    def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
        """

        Parameters
        ----------
        source : string or a file-like object
            Path to the file on disk, or an already-open file object (must support `seek(0)`).
        limit : int or None
            Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).

        Examples
        --------
        .. sourcecode:: pycon

            >>> from gensim.test.utils import datapath
            >>> sentences = LineSentence(datapath('lee_background.cor'))
            >>> for sentence in sentences:
            ...     pass

        """
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        """Iterate through the lines in the source."""
        try:
            # Assume it is a file-like object and try treating it as such
            # Things that don't have seek will trigger an exception
            self.source.seek(0)
            for line in itertools.islice(self.source, self.limit):
                line = utils.to_unicode(line).split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length
        except AttributeError:
            # If it didn't work like a file, use it as a string filename
            with utils.smart_open(self.source) as fin:
                for line in itertools.islice(fin, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length

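  The source shows that the __iter__ method is what makes LineSentence an iterable object, and that all of the file reading is defined inside __iter__. The source argument we pass is usually a file path (a string), so in the try branch self.source.seek(0) raises AttributeError because a string has no seek method. The code that actually runs is therefore the except branch.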
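
  The dispatch described above can be sketched in isolation. The function name source_kind is hypothetical and exists only for illustration; it mirrors the try/except shape of __iter__ without reading any data.

```python
import io


def source_kind(source):
    """Report which branch a LineSentence-style __iter__ would take for `source`."""
    try:
        source.seek(0)          # works for open files and in-memory buffers
        return "file-like"
    except AttributeError:      # a str has no .seek, so file paths land here
        return "path"


kind_for_path = source_kind("./zhwiki_token.txt")
kind_for_buffer = source_kind(io.BytesIO(b"w1 w2\n"))
print(kind_for_path, kind_for_buffer)
```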

  There are two ways to solve the problem:

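  1) from gensim import utils

    utils.smart_open(url, mode="rb", **kw)

    In the source, utils.smart_open() opens the file in binary mode by default. Change mode="rb" to mode="r" so that the file is opened as text; the lines then arrive already decoded as str, and to_unicode passes a str through unchanged.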

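  2) from gensim import utils

    utils.to_unicode(text, encoding='utf8', errors='strict')

    In the source, the decode to "utf8" uses errors="strict" by default. Change it to errors="ignore", i.e. utils.to_unicode(line, errors="ignore"). Keep in mind that "ignore" silently drops every byte that cannot be decoded.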
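
  The effect of the errors argument can be checked without gensim. The helper to_text below is a hypothetical stand-in that only imitates the behaviour relied on here (a str passes through, bytes are decoded); the GBK encoding of the sample line is an assumption, not something the post confirms.

```python
def to_text(raw, encoding="utf8", errors="strict"):
    """Hypothetical stand-in for a bytes-to-str helper; not gensim's own code."""
    if isinstance(raw, str):
        return raw                      # already text: nothing to decode
    return raw.decode(encoding, errors)


line = "数学 是 基础".encode("gbk")      # made-up sample, assumed to be GBK bytes

lossy = to_text(line, errors="ignore").split()     # no exception, but bytes are dropped
exact = to_text(line, encoding="gbk").split()      # matching codec keeps every token
passthrough = to_text("数学 是 基础").split()       # text-mode input is left untouched

print(lossy)
print(exact)
```

  Under that assumption, errors="ignore" makes the exception disappear but does not recover the original tokens, so decoding with the file's real encoding is the safer route when it is known.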

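  That said, it is better not to edit the installed gensim source directly. Copy the class into your own script and modify it there, for example: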

import logging
import itertools
import gensim
from gensim.models import word2vec
from gensim import utils

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class LineSentence(object):
    """Iterate over a file that contains sentences: one line = one sentence.
    Words must be already preprocessed and separated by whitespace.

    """
    def __init__(self, source, max_sentence_length=10000, limit=None):
        """

        Parameters
        ----------
        source : string or a file-like object
            Path to the file on disk, or an already-open file object (must support `seek(0)`).
        limit : int or None
            Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).

        Examples
        --------
        .. sourcecode:: pycon

            >>> from gensim.test.utils import datapath
            >>> sentences = LineSentence(datapath('lee_background.cor'))
            >>> for sentence in sentences:
            ...     pass

        """
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        """Iterate through the lines in the source."""
        try:
            # Assume it is a file-like object and try treating it as such
            # Things that don't have seek will trigger an exception
            self.source.seek(0)
            for line in itertools.islice(self.source, self.limit):
                line = utils.to_unicode(line).split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length
        except AttributeError:
            # If it didn't work like a file, use it as a string filename
            with utils.smart_open(self.source, mode="r") as fin:
                for line in itertools.islice(fin, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length

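our_sentences = LineSentence("./zhwiki_token.txt")
# Large corpus: train with CBOW (the default) and raise the number of iterations.
# `size` and `iter` are the gensim 3.x names; gensim 4.x renamed them `vector_size` and `epochs`.
model = gensim.models.Word2Vec(our_sentences, size=200, iter=30)
# model.save(save_model_file)  # saves the full model, which is what continued training needs
# Export only the word vectors in binary word2vec format; this file cannot be used to resume training.
model.wv.save_word2vec_format("./mathWord2Vec" + ".bin", binary=True)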
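
  As a lighter alternative to copying the whole class, the same iteration can be written with an explicit encoding. This is only a sketch: EncodedLineSentence is a hypothetical name, not part of gensim, and the "gbk" default is an assumption about how the corpus was saved.

```python
import itertools


class EncodedLineSentence:
    """Hypothetical helper (not part of gensim): yield token lists from a
    whitespace-tokenized text file opened with an explicit encoding."""

    def __init__(self, path, encoding="gbk", max_sentence_length=10000, limit=None):
        self.path = path
        self.encoding = encoding
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding=self.encoding) as fin:
            for line in itertools.islice(fin, self.limit):
                tokens = line.split()
                # Split over-long lines into chunks, as LineSentence does.
                for i in range(0, len(tokens), self.max_sentence_length):
                    yield tokens[i:i + self.max_sentence_length]
```

  An instance such as EncodedLineSentence("./zhwiki_token.txt", encoding="gbk") could then be passed to Word2Vec in place of our_sentences, since it is a restartable iterable of token lists.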

Original article: https://www.cnblogs.com/jiangxinyang/p/10411595.html

Time: 2024-10-08 13:38:36
