R - package 'tm': DocumentTermMatrix throws an error

> dtm <- DocumentTermMatrix(corpus)
Error: inherits(doc, "TextDocument") is not TRUE

Solution:

It seems this would have worked just fine in tm 0.5.10, but changes in tm 0.6.0 seem to have broken it. The problem is that the functions tolower and trim won't necessarily return TextDocuments (it looks like the older version may have done the conversion automatically). They instead return character vectors, and DocumentTermMatrix isn't sure how to handle a corpus of characters.

So you could change to

corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

Or you can run

corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

after all of your non-standard transformations (those not in getTransformations()) are done and just before you create the DocumentTermMatrix. That should make sure all of your data is in PlainTextDocument and should make DocumentTermMatrix happy.
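
As a rough end-to-end sketch of the first approach: the three sample strings below are made up for illustration, and trimws stands in for the custom trim function, so treat this as one possible arrangement rather than the original poster's code. It assumes tm 0.6.0 or later.

```r
library(tm)

# Hypothetical sample documents, used only for illustration.
docs <- c("The Quick brown fox ", "  Jumps over the lazy dog", "R text mining with tm")
corpus <- VCorpus(VectorSource(docs))

# Base functions such as tolower return plain character vectors, so wrap
# them in content_transformer() to keep every element a TextDocument.
corpus_clean <- tm_map(corpus, content_transformer(tolower))
# trimws is an assumed stand-in for the custom trim function mentioned above.
corpus_clean <- tm_map(corpus_clean, content_transformer(trimws))
# stripWhitespace is one of tm's own transformations and needs no wrapper.
corpus_clean <- tm_map(corpus_clean, stripWhitespace)

# Check that every document still inherits from TextDocument.
all_text_docs <- all(vapply(
  seq_along(corpus_clean),
  function(i) inherits(corpus_clean[[i]], "TextDocument"),
  logical(1)
))

# With the classes intact, building the matrix should no longer fail.
dtm <- DocumentTermMatrix(corpus_clean)
```

If all_text_docs is FALSE at this point, some transformation was probably applied without content_transformer().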

From: http://stackoverflow.com/questions/24191728/documenttermmatrix-error-on-corpus-argument

Time: 2024-07-28 23:53:53
