Processing Large Data with C++ fstream + string

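I. Motivation

(1) Until now, whenever I processed text data, all the data cleaning was done with Java classes such as File, FileReader/FileWriter, and BufferedReader/BufferedWriter; see "Reading and Writing Files in Java".

(2) One reason for using Java is that its Map is very flexible. The Eclipse IDE is also a great help, and Ctrl+click lets you jump to a function's definition; see "Sorting a Java Map".

(3) Another reason is that Java's String class makes string handling very flexible, with just about every function you could want.

(4) The two reasons above turned out to be a misconception on my part, because C++ has counterparts too: the fstream classes and the map container; see "Introduction to C++ map".

(5) C++ also has a fairly mature string class. Most of its member functions are flexible, and the ones it lacks, such as split and trim, are easy to implement yourself; see "Implementing C++ string Helpers".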
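As a rough illustration of how little code such helpers need, here is a minimal sketch of trim and split built only on std::string member functions. The names trim and split and their exact behavior (single-character delimiter, empty fields kept) are assumptions made for this example, not functions from the standard library or from the linked post.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: remove leading and trailing whitespace.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";   // empty or all whitespace
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: split on one delimiter character, keeping empty fields.
std::vector<std::string> split(const std::string& s, char delim) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type pos = s.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(s.substr(start));    // last (or only) field
            break;
        }
        parts.push_back(s.substr(start, pos - start));
        start = pos + 1;                         // continue after the delimiter
    }
    return parts;
}
```

Calling split("a,,b", ',') yields the three fields "a", "" and "b", and trim("  hello \r\n") yields "hello". Keeping empty fields is a design choice that suits column-oriented data, where a missing value should not shift the remaining columns.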

(6) Recently I came across a classic remark online: C++'s fstream classes plus the string class can also handle text files very well. Let's see for ourselves.

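II. fstream: Background and Basics

(1) Introduction

Header to include: #include <fstream>, followed by using namespace std;

The three file streams in C++

a ---- ofstream ofs("filename", open mode);

b ---- ifstream ifs("filename", open mode);

c ---- fstream fs("filename", input open mode | output open mode);

The three file streams are used for writing a file, reading a file, and both reading and writing a file, respectively. In most cases a and b are enough; c is meant for the case where the same file has to be read and written at the same time.

All three file streams can also be declared first and opened later. Taking fstream as an example:

fstream fs; fs.open("filename", input open mode | output open mode);

The open mode may be omitted. If it is, ofstream defaults to ios::out and ifstream defaults to ios::in.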
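The two styles described above (construct and open in one step, or declare first and call open() later) can be sketched as follows. The function name writeThenRead and the file path passed to it are made up for this illustration.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: write one line to `path`, then read it back.
// Returns the line that was read, or an empty string if either open fails.
std::string writeThenRead(const std::string& path, const std::string& text) {
    std::ofstream ofs;                 // declare first ...
    ofs.open(path.c_str());            // ... then open; the mode defaults to ios::out
    if (!ofs.is_open()) return "";
    ofs << text << '\n';
    ofs.close();                       // flush and release the file before reading

    std::ifstream ifs(path.c_str());   // constructor form; the mode defaults to ios::in
    if (!ifs.is_open()) return "";
    std::string line;
    std::getline(ifs, line);           // reads up to, but not including, the newline
    return line;
}
```

For instance, writeThenRead("demo.txt", "hello fstream") should hand back the same text, assuming the current directory is writable.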

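(2) The file-opening function

In C++, file operations go through fstream (file stream), which derives from iostream, so to work with files this way you must include the header <fstream> (fstream.h was its pre-standard name). The steps involved are described one by one below.

Opening a file: the fstream class has a member function open() for exactly this purpose. In standard C++ its prototype is:

void open(const char* filename, ios_base::openmode mode = ios_base::in | ios_base::out);

Parameters:

filename: the name of the file to open

mode: how the file is to be opened

Older, pre-standard libraries also accepted a third parameter, access (the attributes of the file being opened); it is not part of standard C++.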

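(3) Open modes

ios::out: open for output (the default mode for writing; the file is created if it does not exist, and its existing contents are discarded if it does)

ios::app: append output to the end of the existing file (every write goes to the end; existing contents are kept)

ios::ate: open the file and move the file position to the end

ios::in: open for input (the default mode for reading)

ios::trunc: discard the existing contents of the file on open (this is what ios::out does by default when used on its own)

ios::binary: open in binary mode, for reading or writing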
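To make the difference between truncating and appending concrete, the sketch below writes to the same file three times with different modes and records what the file contains after the second and third write. The helper names slurp and modeDemo are invented for this example.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: return the whole file as one string (empty if it cannot be opened).
std::string slurp(const std::string& path) {
    std::ifstream in(path.c_str());
    return std::string((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
}

// Hypothetical demo: contrasts ios::out (truncate) with ios::app (append).
// Returns "<contents after append>|<contents after truncate>".
std::string modeDemo(const std::string& path) {
    { std::ofstream f(path.c_str());                 f << "first\n";  } // default ios::out: create or truncate
    { std::ofstream f(path.c_str(), std::ios::app);  f << "second\n"; } // append after "first"
    const std::string afterAppend = slurp(path);
    { std::ofstream f(path.c_str(), std::ios::out | std::ios::trunc); f << "third\n"; } // old contents discarded
    return afterAppend + "|" + slurp(path);
}
```

Each ofstream lives in its own scope so that its destructor closes the file before the next step opens it again.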

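(4) Positioning the file pointer

Unlike the C file API, the C++ I/O system manages two positions associated with a stream. One is the read pointer, which marks where the next input operation takes place; the other is the write pointer, which marks where the next write takes place. Each input or output operation advances the corresponding pointer automatically. File positioning in C++ is therefore split into positioning for reading and positioning for writing. (For file streams both are backed by a single position in the underlying file buffer, so on an fstream opened for reading and writing, moving one also moves the other.)

The corresponding member functions are seekg() and seekp(): seekg() sets the read position and seekp() sets the write position. Their most general forms are:

istream& seekg(streamoff offset, ios_base::seekdir origin);

ostream& seekp(streamoff offset, ios_base::seekdir origin);

streamoff is a signed integer type declared in <ios> (iostream.h in pre-standard libraries) that is large enough to hold any file offset. seekdir (seek_dir in pre-standard libraries) gives the base position for the move and is an enumeration with the following values:

ios::beg: beginning of the file

ios::cur: current position in the file

ios::end: end of the file

These two functions are generally used with binary files, because in text mode the system may translate certain characters (line endings, for example), so positions can differ from what you expect.

Examples:

file1.seekg(1234,ios::cur);//move the read pointer 1234 bytes forward from the current position

file2.seekp(1234,ios::beg);//move the write pointer 1234 bytes forward from the beginning of the file

file1.seekg(-128,ios::end); //move the read pointer 128 bytes back from the end of the file

Note: offsets are counted in bytes, not characters. An ASCII letter takes one byte, while a Chinese character takes two bytes in GBK/GB2312 and typically three bytes in UTF-8.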
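A common use of seekg() together with tellg() is to find a file's size and to read only its tail, which matters for large files. The following is a minimal sketch under that assumption; fileSize and tailBytes are hypothetical helper names, and the files are opened in binary mode so that byte offsets are not affected by newline translation.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: size of a file in bytes, or -1 if it cannot be opened.
std::streamoff fileSize(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    if (!in) return -1;
    in.seekg(0, std::ios::end);        // jump to the end ...
    return in.tellg();                 // ... and ask where we are
}

// Hypothetical helper: the last n bytes of a file (the whole file if it is shorter).
std::string tailBytes(const std::string& path, std::streamoff n) {
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    if (!in || n <= 0) return "";
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    if (size <= 0) return "";
    if (n > size) n = size;            // never seek before the beginning
    in.seekg(-n, std::ios::end);       // negative offset: move back from the end
    std::string buf(static_cast<std::string::size_type>(n), '\0');
    in.read(&buf[0], static_cast<std::streamsize>(n));
    buf.resize(static_cast<std::string::size_type>(in.gcount()));
    return buf;
}
```

Clamping n to the file size avoids seeking to a position before the start of the file, which would put the stream into a failed state. Since the offsets are in bytes, a tail taken this way may start in the middle of a multi-byte character.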

(5) How fstream and iostream, ifstream and istream, ofstream and ostream relate to each other

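III. In Practice

(1) Read word by word; no writing

 //Reading mode: word by word. Words are separated by whitespace (a read ends at the first whitespace); each word is printed before the next one is read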
 //read data from the file, Word By Word
 //when used in this manner, we'll get space-delimited bits of text from the file
 //but all of the whitespace that separated words (including newlines) was lost.
void ReadDataFromFileWBW()
{
    ifstream fin("data.txt");
    string s;
    cout << "*****start*******" << endl;
    while( fin >> s )
    {
          cout << "Read from file: " << s << endl;
    }
    cout << "*****over*******" << endl;
}

(2) Read line by line: fin.getline(char*, n)

 //Reading mode: line by line. Each line is read into a character array; lines are separated by newline characters
 //If we were interested in preserving whitespace,
 //we could read the file in Line-By-Line using the I/O getline() function.
void ReadDataFromFileLBLIntoCharArray()
{
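    ifstream fin("data.txt",ios::in);// ios::in is already the default open mode
    ofstream fout("out.txt",ios::out);// ios::out is already the default open mode
    const int LINE_LENGTH = 100;
    char str[LINE_LENGTH];
    fout << "****CharArray start******" << endl;
    cout << "****CharArray start******" << endl;
    fin.seekg(-20,ios::end);// -20 moves 20 bytes back from the end (a positive offset moves forward); a multi-byte character such as a Chinese character may be cut in half
    while( fin.getline(str,LINE_LENGTH) )// the loop also stops if a line is longer than LINE_LENGTH-1 characters
    {
        fout << str << endl;
        cout << "Read from file: " << str << "..." << endl;// getline() discards the newline, so str holds only the text of the line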
    }
    fout << "*****over*******" << endl;
    cout << "*****over*******" << endl;
}

(3) Read line by line: getline(fin, string)

 //Reading mode: line by line. Each line is read into a string; lines are separated by newline characters
 //If you want to avoid reading into character arrays,
 //you can use the C++ string getline() function to read lines into strings
void ReadDataFromFileLBLIntoString()
{
    ifstream fin("data.txt",ios::in);// ios::in is already the default open mode
    ofstream fout("out.txt",ios::app);// open in append mode: writes go to the end of the file
    string s;
    cout << "****start******" << endl;
    while( getline(fin,s) )
    {
        fout << s << endl;// ofstream creates the file if it does not exist; getline() has already stripped the newline, so endl puts it back, just as with cout
        cout << "Read from file: " << s << endl;// like str above, s does not contain the newline; apart from that the text is kept exactly as it was in the file
    }
    fout << "*****over*******" << endl;
    cout << "*****over*******" << endl;
    fout.close();
}

(4) The main function

#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib>// for exit()

using namespace std;

 //Print an empty line
void OutPutAnEmptyLine()
{
      cout<<"\n";
}

 //Reading with error checking
 //Simply evaluating an I/O object in a boolean context will return false
 //if any errors have occurred
void ReadDataWithErrChecking()
{
    string filename = "dataFUNNY.txt";
    ifstream fin( filename.c_str());
    if( !fin )
    {
          cout << "Error opening " << filename << " for input" << endl;
          exit(-1);
    }
}

int main()
{
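      ReadDataFromFileWBW(); //read word by word into a string
      OutPutAnEmptyLine(); //print an empty line

      ReadDataFromFileLBLIntoCharArray(); //read line by line into a character array
      OutPutAnEmptyLine(); //print an empty line

      ReadDataFromFileLBLIntoString(); //read line by line into a string
      OutPutAnEmptyLine(); //print an empty line

      ReadDataWithErrChecking(); //read with error checking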
     return 0;
}
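ReadDataWithErrChecking() above only checks whether the file could be opened, and it terminates the program on failure. As a complementary sketch, the function below reports failure through its return value and also distinguishes a normal end of file from a read error once the loop has finished. The name readAllLines is hypothetical.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: append every line of `path` to `lines`.
// Returns false if the file cannot be opened or reading stops for a
// reason other than reaching the end of the file.
bool readAllLines(const std::string& path, std::vector<std::string>& lines) {
    std::ifstream fin(path.c_str());
    if (!fin) {
        return false;                    // open failed
    }
    std::string line;
    while (std::getline(fin, line)) {
        lines.push_back(line);
    }
    return fin.eof() && !fin.bad();      // true only when the loop ended at end of file
}
```

Returning a bool instead of calling exit() leaves the decision about how to react (retry, skip the file, abort) to the caller, which is usually more convenient when many input files are processed in one run.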

[Figure: data format of the data.txt text file]

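(5) Summary

First (after writing this much, let me sum up in two sentences): as the classic remark I came across online puts it, C++'s fstream classes plus the string class can also handle text files very well.

Second, a language is only a tool; no language is inherently better or worse than another.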

Date: 2024-10-10 16:21:07
