Huge CSV and XML Files in Python

January 22, 2009. Filed under python

I, like most people, never realized I'd be dealing with large files. Oh, I knew there would be some files with megabytes of data, but I never suspected I'd be begging Perl to process hundreds of megabytes of XML, nor that this week I'd be asking Python to process 6.4 gigabytes of CSV into 6.5 gigabytes of XML [1].

As a few out-of-memory experiences will teach you, the trick for dealing with large files is pretty easy: use code that treats everything as a stream. For inputs, read from disk in chunks. For outputs, frequently write to disk and let system memory forge onward unburdened.

When reading and writing files yourself, this is easier to do correctly...

from __future__ import with_statement  # for Python 2.5

with open('data.in', 'r') as fin:
    with open('data.out', 'w') as fout:
        for line in fin:
            fout.write(','.join(line.split(' ')))

...than it is to do incorrectly...

with open('data.in', 'r') as fin:
    data = fin.readlines()  # slurps the entire file into memory at once

data2 = [','.join(x.split(' ')) for x in data]

with open('data.out', 'w') as fout:
    fout.writelines(data2)

...at least in simple cases.

Loading Large CSV Files in Python

Python has an excellent csv library, which can handle large files right out of the box. Sort of.

>>> import csv
>>> r = csv.reader(open('doc.csv', 'rb'))
>>> for row in r:
...     print row
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: field larger than field limit (131072)

Staring at the module documentation [2], I couldn't find anything of use. So I cracked open the csv.py file and confirmed what the _csv in the error message suggests: the bulk of the module's code (and the input parsing in particular) is implemented in C rather than Python.

After a while staring at that error, I began dreaming of how I would create a stream pre-processor using StringIO, but it didn't take too long to figure out I would need to recreate my own version of csv in order to accomplish that.

So back to the blogs, one of which held the magic grain of information I was looking for: csv.field_size_limit.

>>> import csv
>>> csv.field_size_limit()
131072
>>> csv.field_size_limit(1000000000)
131072
>>> csv.field_size_limit()
1000000000

Yep. That's all there is to it. The sucker just works after that.
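If you'd rather not guess at a "big enough" number, you can hand csv.field_size_limit the largest value a plain Python 2 int can hold. A minimal sketch (process() here is just a placeholder for whatever you do with each row):

import csv
import sys

# Lift the field size limit as far as a plain Python 2 int allows,
# instead of picking an arbitrary magic number.
csv.field_size_limit(sys.maxint)

reader = csv.reader(open('doc.csv', 'rb'))
for row in reader:
    process(row)  # placeholder: replace with your own per-row handling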

Well, almost. I did run into an issue with a NULL byte 1.5 gigs into the data. Because the streaming code is implemented with C-based IO, the NULL byte shorts out the reading of data in an abrupt and non-recoverable manner. To get around this we need to pre-process the stream somehow; you could do that in Python by wrapping the file with a custom class that cleans each line before returning it, but I went with some command line utilities for simplicity.

cat data.in | tr -d '\0' > data.out

After that, the 6.4 gig CSV file processed without any issues.
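For completeness, here is a rough sketch of the pure-Python alternative mentioned above: a small generator that strips NUL bytes from each line before csv ever sees them, so everything stays streaming. This is an illustration rather than the code I actually ran, and process() is again a placeholder.

import csv

def strip_nulls(lines):
    # Remove NUL bytes from each line lazily, so the whole file is never
    # held in memory and csv.reader never sees a '\0'.
    for line in lines:
        yield line.replace('\0', '')

reader = csv.reader(strip_nulls(open('data.in', 'rb')))
for row in reader:
    process(row)  # placeholder for the real per-row work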

Creating Large XML Files in Python

This part of the process, taking each row of CSV and converting it into an XML element, went fairly smoothly thanks to the xml.sax.saxutils.XMLGenerator class. The API for creating elements isn't a model of simplicity, but it is (unlike many of the more creative schemes) predictable, and it has one killer feature: it correctly writes output to a stream.

As I mentioned, the mechanism for creating elements is a bit verbose, so I made a couple of wrapper functions to simplify it (note that I am sending output to standard out, which lets me simply print strings into the file I am generating, for example when creating the XML file's version declaration).

import sys
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl

g = XMLGenerator(sys.stdout, 'utf-8')

def start_tag(name, attr={}, body=None, namespace=None):
    # Repackage the plain attribute dict into the parallel
    # (namespace, name)-keyed dicts that AttributesNSImpl expects.
    attr_vals = {}
    attr_keys = {}
    for key, val in attr.iteritems():
        key_tuple = (namespace, key)
        attr_vals[key_tuple] = val
        attr_keys[key_tuple] = key

    attr2 = AttributesNSImpl(attr_vals, attr_keys)
    g.startElementNS((namespace, name), name, attr2)
    if body:
        g.characters(body)

def end_tag(name, namespace=None):
    g.endElementNS((namespace, name), name)

def tag(name, attr={}, body=None, namespace=None):
    # Convenience for a complete element: open it, write its body, close it.
    start_tag(name, attr, body, namespace)
    end_tag(name, namespace)

From there, usage looks like this:

print """<?xml version="1.0" encoding="utf-8‘?>"""
start_tag(u‘list‘, {u‘id‘:10})

for item in some_list:
    start_tag(u‘item‘, {u‘id‘: item[0]})
    tag(u‘title‘, body=item[1])
    tag(u‘desc‘, body=item[2])
    end_tag(u‘item‘)

end_tag(u‘list‘)
g.endDocument()
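With a hypothetical one-row some_list, the output of the above would look roughly like this (XMLGenerator does no pretty-printing or indentation, so everything lands on one line after the declaration):

<?xml version="1.0" encoding="utf-8"?>
<list id="10"><item id="1"><title>first title</title><desc>first description</desc></item></list>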

The one issue I did run into (in my data) was some page-break characters floating around (^L, aka ASCII 12, aka '\x0c'), which were tweaking the XML encoder, but you can strip them out in a variety of places, for example by rewriting the main loop:

for item in some_list:
    item = [x.replace('\x0c', '') for x in item]
    # etc

Really, the XMLGenerator just worked, even when dealing with a quite large file.

Performance

Although my script created a different mix of XML elements than the above example, it wasn't any more complex, and it had fairly reasonable performance. Processing the 6.4 gig CSV file into a 6.5 gig XML file took between 19 and 24 minutes, which means it was able to read-process-write about five megabytes per second.
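(As a sanity check on that figure: 6.4 gigabytes is roughly 6,550 megabytes, and the midpoint of that 19-24 minute range is about 1,300 seconds, which works out to a shade over 5 MB/s.)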

In terms of raw speed that isn't particularly epic, but when I performed a similar operation with Perl's XML::Twig (it was actually XML to XML rather than CSV to XML), it took eight minutes to process a ~100 megabyte file, so I'm pretty pleased with the quality of the Python standard library and how it handles large files.

The breadth and depth of the standard library really make Python a joy to work with for these simple one-shot scripts. If only it had Perl's easier-to-use regex syntax...


  1. This is a peculiar property of data that makes it different from media: data files can, on a large enough system, grow arbitrarily large. Media files, on the other hand, can be extremely dense (a couple of gigs for a high-quality movie), but they conform to predictable limits.

    If you are dealing with large files, you're probably dealing with a company's logs from the last decade or the entire dump of their MySQL database.

  2. I really want to like the new Python documentation. It certainly looks much better, but I think it has made it harder to actually find what I'm looking for. I think they've hit the same stumbling block as the Django documentation: the more you customize your documentation, the greater the learning curve for using it.

    The big thing giving me trouble is the incompleteness of the documentation. The new docs are sure to cover all the important and frequently used components (along with helpful overviews and examples), but they often don't even mention less important methods and objects.

    For the time being, I am throwing around a lot more dir() calls.
