Fixing "UnicodeDecodeError: 'utf-8' codec can't decode byte..." when loading a corpus with gensim.models.word2vec.LineSentence

On Windows, loading the (already tokenized) Chinese Wikipedia corpus with gensim.models.word2vec.LineSentence raises the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte
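A byte such as 0xca at position 0 is what one would expect if the corpus had been saved in a legacy Chinese encoding such as GBK (the default ANSI code page on Chinese-language Windows) rather than UTF-8. That is an assumption about this particular corpus, not something the traceback proves. A minimal sketch that reproduces the same class of error with the standard library alone:

```python
# Assumption: the corpus bytes are GBK-encoded, not UTF-8.
text = "是"
raw = text.encode("gbk")  # the bytes a GBK-encoded file would contain

try:
    raw.decode("utf-8")   # what a strict UTF-8 decode attempts
    error = None
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding with the codec the bytes were written in works.
roundtrip = raw.decode("gbk")
print(roundtrip)
```

If the real file decodes cleanly with another codec in the same way, the file encoding is the root cause rather than gensim itself.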

Encoding problems like this are a real headache. This error is raised when bytes are decoded with xxx.decode("utf-8"), so let's look at the relevant gensim source:

class LineSentence(object):
    """Iterate over a file that contains sentences: one line = one sentence.
    Words must be already preprocessed and separated by whitespace.

    """
    def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
        """

        Parameters
        ----------
        source : string or a file-like object
            Path to the file on disk, or an already-open file object (must support `seek(0)`).
        limit : int or None
            Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).

        Examples
        --------
        .. sourcecode:: pycon

            >>> from gensim.test.utils import datapath
            >>> sentences = LineSentence(datapath('lee_background.cor'))
            >>> for sentence in sentences:
            ...     pass

        """
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        """Iterate through the lines in the source."""
        try:
            # Assume it is a file-like object and try treating it as such
            # Things that don't have seek will trigger an exception
            self.source.seek(0)
            for line in itertools.islice(self.source, self.limit):
                line = utils.to_unicode(line).split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length
        except AttributeError:
            # If it didn't work like a file, use it as a string filename
            with utils.smart_open(self.source) as fin:
                for line in itertools.islice(fin, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length

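The source shows that the __iter__ method is what makes LineSentence an iterable object, and that all of the file reading happens inside __iter__. The source argument we pass in is usually a file path (a string), so inside the try block self.source.seek(0) raises AttributeError because str has no seek method. The code that actually runs is therefore the except branch.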
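The dispatch described above can be sketched without gensim. The function name `iter_tokens` and the sample data are hypothetical, and the sketch hard-codes UTF-8 decoding where gensim calls utils.to_unicode; it only illustrates the "try seek(0), fall back to a filename" pattern.

```python
import io
import os
import tempfile


def iter_tokens(source):
    """Yield the whitespace-separated tokens of each line in `source`."""
    try:
        # File-like objects have seek(); a str path does not.
        source.seek(0)
        for line in source:
            yield line.decode("utf-8").split()
    except AttributeError:
        # str has no seek(), so treat `source` as a filename instead.
        with open(source, "rb") as fin:
            for line in fin:
                yield line.decode("utf-8").split()


data = "hello world\nfoo bar\n".encode("utf-8")

# Case 1: an already-open, seekable binary file object.
from_object = list(iter_tokens(io.BytesIO(data)))

# Case 2: a path string, which takes the except branch.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")
    with open(path, "wb") as f:
        f.write(data)
    from_path = list(iter_tokens(path))
```

Both calls produce the same token lists; only the branch that does the reading differs.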

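There are two ways to fix the problem:

1) from gensim import utils

   utils.smart_open(url, mode="rb", **kw)

   In the source, utils.smart_open() opens the file in binary mode by default. You can change mode="rb" to mode="r". In text mode the file object decodes each line itself, presumably with the platform default encoding (typically GBK on Chinese-language Windows), so utils.to_unicode receives a str and returns it unchanged.

2) from gensim import utils

   utils.to_unicode(text, encoding='utf8', errors='strict')

   In the source, decode("utf8") is called with errors="strict" by default. You can change this to errors="ignore", i.e. utils.to_unicode(line, errors="ignore"). Note that this silently drops every byte sequence that is not valid UTF-8, so it only hides the error if the file is not actually UTF-8.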
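The difference between the two fixes can be sketched with the standard library. This is an analogue, not gensim code: plain open() stands in for utils.smart_open, bytes.decode stands in for utils.to_unicode, and the file is assumed to be GBK-encoded. The encoding is passed explicitly here instead of relying on the platform default.

```python
import os
import tempfile

text = "这 是 一个 已 分词 的 句子"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")  # hypothetical file name
    with open(path, "w", encoding="gbk") as f:
        f.write(text + "\n")

    # Analogue of fix 2: read bytes, decode as UTF-8, ignore bad bytes.
    with open(path, "rb") as f:
        ignored = f.readline().decode("utf-8", errors="ignore").split()

    # Analogue of fix 1: text mode, so the file object does the decoding.
    with open(path, "r", encoding="gbk") as f:
        decoded = f.readline().split()

print(ignored)  # no exception, but the original tokens are lost
print(decoded)  # the original tokens
```

Under this assumption, errors="ignore" removes the exception at the cost of the data, while decoding with the right codec keeps the tokens intact.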

That said, I recommend not editing the installed source directly. Instead, copy the class into your own script, for example:

import logging
import itertools
import gensim
from gensim.models import word2vec
from gensim import utils

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class LineSentence(object):
    """Iterate over a file that contains sentences: one line = one sentence.
    Words must be already preprocessed and separated by whitespace.

    """
    def __init__(self, source, max_sentence_length=10000, limit=None):
        """

        Parameters
        ----------
        source : string or a file-like object
            Path to the file on disk, or an already-open file object (must support `seek(0)`).
        limit : int or None
            Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).

        Examples
        --------
        .. sourcecode:: pycon

            >>> from gensim.test.utils import datapath
            >>> sentences = LineSentence(datapath('lee_background.cor'))
            >>> for sentence in sentences:
            ...     pass

        """
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        """Iterate through the lines in the source."""
        try:
            # Assume it is a file-like object and try treating it as such
            # Things that don't have seek will trigger an exception
            self.source.seek(0)
            for line in itertools.islice(self.source, self.limit):
                line = utils.to_unicode(line).split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length
        except AttributeError:
            # If it didn't work like a file, use it as a string filename
            with utils.smart_open(self.source, mode="r") as fin:
                for line in itertools.islice(fin, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length

our_sentences = LineSentence("./zhwiki_token.txt")
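# size and iter are the gensim 3.x parameter names (vector_size and epochs in gensim 4).
model = gensim.models.Word2Vec(our_sentences, size=200, iter=30)  # large corpus: use CBOW and raise the number of iterations somewhat
# model.save(save_model_file)  # saves the full model, which is what continuing training later requires
model.wv.save_word2vec_format("./mathWord2Vec" + ".bin", binary=True)   # saves only the word vectors, in binary word2vec format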
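An alternative that avoids copying the class is to re-encode the corpus to UTF-8 once, so that the stock LineSentence can read it. This is a different technique from the two fixes above, offered only as a sketch: the helper name `reencode_to_utf8`, the file names, and the GBK source encoding are all assumptions.

```python
import os
import tempfile


def reencode_to_utf8(src_path, dst_path, src_encoding="gbk"):
    """Copy a text file line by line, decoding it and writing UTF-8."""
    with open(src_path, "r", encoding=src_encoding) as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:  # streams, so large corpora are not held in memory
            dst.write(line)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "corpus_gbk.txt")    # hypothetical input file
    dst = os.path.join(tmp, "corpus_utf8.txt")   # hypothetical output file
    with open(src, "w", encoding="gbk") as f:
        f.write("这 是 一个 句子\n")

    reencode_to_utf8(src, dst)

    # The converted file now decodes cleanly as strict UTF-8.
    with open(dst, "rb") as f:
        converted = f.read().decode("utf-8")
```

If the source encoding guess is wrong, the open() call in the helper raises its own UnicodeDecodeError, which is a useful signal that the assumption needs revisiting.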

Original article: https://www.cnblogs.com/jiangxinyang/p/10411595.html

Time: 2024-10-08 13:38:36
