Traversing files for data processing in Python

Background

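I once wrote a Python program that walks through a folder and processes the files in it, but it was so long ago that I can no longer find it, so I had to write it again. This time I am putting the code on my blog so that I can reuse it later.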

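#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import glob
import jieba
import jieba.analyse


def read_from_file(directions):
    """Return the text of the file at `directions`, trying several encodings in turn."""
    # Candidate encodings, tried in order. ISO-8859-2 maps every byte value,
    # so in practice the entries after it should not be reached.
    decode_set = ['utf-8', 'gb18030', 'ISO-8859-2', 'gb2312', 'gbk']
    for k in decode_set:
        try:
            with open(directions, "r", encoding=k) as file:
                # read() raises UnicodeDecodeError when this encoding does not fit;
                # the next candidate is then tried.
                readfile = file.read()
            # print("open file %s with encoding %s" % (directions, k))  # report a successful read
            # For mixed encodings, undecodable characters could be replaced with "?":
            # readfile = readfile.encode(encoding="utf-8", errors="replace")
            return readfile
        except UnicodeDecodeError:
            continue
    # None of the candidate encodings could decode the file.
    raise Exception("%s had no way to decode" % directions)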

filenames = glob.glob(r"C:/Users/Administrator/Documents/Tencent Files/937610520/FileRecv/TXT/*.txt")
filenameslen = len(filenames)
count = 0
for filename in filenames:
    print("%d : %d" % (count, filenameslen))  # progress: files done : total
    # Slice positions of the bare file name: after "TXT" plus the path separator, before ".txt".
    names = filename.find('TXT') + 4
    namee = filename.find('.txt')
    # The file is read in binary mode, so each joined line is the str() form of a bytes object.
    with open(filename, "rb") as f:
        content = f.readlines()
    content = " ".join('%s' % line for line in content)
    # The text to analyse starts 16 characters after "description" and ends at the
    # next "#", or at "comments" if that comes first.
    start = content.find('description') + 16
    overflow = content.find('comments')
    end = content[start:].find('#') + start
    if end >= overflow:
        end = overflow

    file_data = content[start:end]
    # Extract keywords with the TF-IDF algorithm.
    keywords = jieba.analyse.extract_tags(file_data)
    if not keywords:
        print("error,please check your input")

    # Write one keyword per line to "<name>keyword.txt" in the same folder.
    with open(r"C:/Users/Administrator/Documents/Tencent Files/937610520/FileRecv/TXT/" + filename[names:namee] + 'keyword' + '.txt', 'w') as file:
        for keyword in keywords:
            file.write(keyword + '\n')
    count = count + 1
print("%d : %d" % (count, filenameslen))
print("finished")
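
The slicing logic in the loop above is easier to check when it is pulled out into small functions. The following is only a sketch, not the code I actually ran: the helper names `read_text_with_fallback`, `extract_span` and `keyword_path` are made up for illustration, the marker strings and the offset of 16 are copied from the script, and the encoding list is an assumption. Unlike the script, it decodes the file to text first and only looks for "comments" after the start position.

```python
from pathlib import Path

# Marker strings and offset copied from the script above; adjust them to the real file layout.
START_MARKER = "description"
END_MARKERS = ("#", "comments")
START_OFFSET = 16


def read_text_with_fallback(path, encodings=("utf-8", "gb18030")):
    """Decode the file with the first encoding that fits (hypothetical helper)."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("%s could not be decoded with %s" % (path, list(encodings)))


def extract_span(content, start_offset=START_OFFSET):
    """Return the text after START_MARKER up to the earliest end marker (hypothetical helper)."""
    marker_pos = content.find(START_MARKER)
    if marker_pos == -1:
        return ""  # no start marker: nothing to analyse
    start = marker_pos + start_offset
    # Positions of the end markers found at or after the start; missing markers are ignored.
    ends = [pos for pos in (content.find(m, start) for m in END_MARKERS) if pos != -1]
    end = min(ends) if ends else len(content)
    return content[start:end]


def keyword_path(source):
    """Build '<name>keyword.txt' next to the source file (hypothetical helper)."""
    source = Path(source)
    return source.with_name(source.stem + "keyword.txt")
```

With these in place, the body of the loop could become `jieba.analyse.extract_tags(extract_span(read_text_with_fallback(filename)))`, with the result written to `keyword_path(filename)`. Using `pathlib` for the output name avoids the `find('TXT') + 4` arithmetic, which depends on the folder being called TXT.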

Original article: https://www.cnblogs.com/harrysong666/p/10347124.html

Time: 2024-10-08 07:57:52
