Perfect UTF-8 to GB2312 Conversion for Source Insight

Preface

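Many people find that when they open certain source files in Source Insight, the Chinese characters show up as garbled text. The cause is a mismatch between encodings. Notepad and some other editors default to ANSI, which on a Simplified Chinese Windows system means that Chinese text is stored in a GB-family encoding. Unfortunately Source Insight, a widely popular code browsing tool, supports Chinese characters but does not support UTF-8. I wonder where the people who developed Source Insight have gone. It is a real pity that such a good tool is no longer updated.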

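Pity or not, the code still has to be read. What can be done about the garbled text? Opening each source file in Notepad and re-saving it in another encoding is simple but clumsy, and it becomes tedious when there are many UTF-8 files. This article describes a batch approach.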

Overview

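This program was adapted from source code found online. Thanks to the original author for open-sourcing it; it took me only one afternoon to turn it into the program I wanted. Open source helps people build better software faster. Long live open source! I hope readers will also contribute to it. The modified program has the following advantages: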

- Command-line execution

It can be run from the command line or called from a .bat file, and it can also be integrated into an editor such as Source Insight.

- Automatic encoding detection

It automatically recognizes UTF-8 files (with or without a BOM) and pure ASCII files, and decides whether conversion is needed.

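At present the program is written only for source code files and supports files with extensions such as .c, .cpp, .cxx, .h, .xml, .java and .txt. Other extensions can be supported in the same way.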
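The extension check described above could look roughly like the following. This is only a sketch: the function name has_supported_extension is hypothetical, and the case-insensitive comparison is an assumption, since the original program's filtering code is not shown in this article.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <string>

// Hypothetical helper (not taken from the original program): returns true
// when `path` ends in one of the extensions listed in the article.
// Assumption: the comparison ignores case, so "Foo.CPP" is accepted.
bool has_supported_extension(const std::string& path) {
    static const std::array<const char*, 7> kExtensions = {
        ".c", ".cpp", ".cxx", ".h", ".xml", ".java", ".txt"};

    // The extension starts at the last dot, provided that dot comes
    // after the last path separator.
    const std::size_t slash = path.find_last_of("/\\");
    const std::size_t dot = path.find_last_of('.');
    if (dot == std::string::npos) return false;
    if (slash != std::string::npos && dot < slash) return false;

    // Lower-case the extension before comparing.
    std::string ext = path.substr(dot);
    std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char ch) {
        return static_cast<char>(std::tolower(ch));
    });

    return std::any_of(kExtensions.begin(), kExtensions.end(),
                       [&ext](const char* known) { return ext == known; });
}
```

Supporting another file type would then only mean adding its extension to the list.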

How to integrate it into Source Insight

First, here is how to integrate it into Source Insight.

Converting a file

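In Source Insight, choose "Options" → "Custom Commands" → "Add" and enter the new command name CodeConvert File. After confirming, click "Browse" and select the path of our codeConvert.exe program. Then append the arguments -u2g %f in the input box (mind the space, and do not type any quotation marks; for the meaning of %f, see the "Command Line Substitutions" section of the Source Insight help).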

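The file conversion command is now set up. Open a file, go back to the custom command dialog as described above, select this command and click "Run" to convert the currently open file from UTF-8 to GB2312.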
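For illustration, here is a minimal sketch of how a tool might interpret the "-u2g path" arguments that Source Insight passes. The Options struct and parse_arguments function are hypothetical names; the article does not show how codeConvert.exe actually parses its command line, and it may accept other switches that are not modeled here.

```cpp
#include <string>
#include <vector>

// Hypothetical result type; the original program's internals are not shown.
struct Options {
    bool valid = false;   // true when the arguments were understood
    std::string target;   // file or directory given after -u2g
};

// Accepts exactly two arguments: the switch -u2g followed by a path.
// `args` is assumed to hold argv[1..], without the program name.
Options parse_arguments(const std::vector<std::string>& args) {
    Options opts;
    if (args.size() == 2 && args[0] == "-u2g" && !args[1].empty()) {
        opts.valid = true;
        opts.target = args[1];
    }
    return opts;
}
```

In a main function this would be called as parse_arguments(std::vector<std::string>(argv + 1, argv + argc)), printing a usage message when valid is false.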

Converting a directory

To convert a directory (including its subdirectories), simply replace %f with %d.

Main source code

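Below is the function that detects whether a file is UTF-8 encoded. The remaining parts are not listed one by one; please refer to the source code.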

#include <afx.h>      // CStdioFile (MFC)
#include <cstring>    // memcmp
#include <iostream>

/*! Detect whether a file is UTF-8 encoded.
\param filename  path of the file to inspect
\return true if the file looks like UTF-8 (or pure ASCII), false otherwise
*/
bool detect_utf8file(LPCTSTR filename)
{
    const unsigned char BOM[3] = {0xEF, 0xBB, 0xBF};
    int count_good_utf = 0;
    int count_bad_utf = 0;
    unsigned char buf[3] = {0};
    CStdioFile file_r;

    // Read-only binary mode: detection never writes to the file, and
    // text mode would translate line endings.
    if (!file_r.Open(filename, CFile::modeRead | CFile::typeBinary))
        return false;

    // memcmp rather than strncmp: these bytes are not a C string.
    if (file_r.Read(buf, 3) == 3 && memcmp(buf, BOM, 3) == 0) {
        std::cout << "this is utf(BOM). " << std::endl;
        file_r.Close();
        return true;
    }

    // No BOM: scan the whole file.
    // UTF-8 text encoding auto-detection:
    // classify each pair of consecutive bytes (previous, current)
    // by their leading bits, as shown in the table:
    //    11..    10..    good
    //    00..    10..    bad
    //    10..    10..    don't care
    //    11..    00..    bad
    //    11..    11..    bad
    //    00..    00..    don't care
    //    10..    00..    don't care
    //    00..    11..    don't care
    //    10..    11..    don't care
    unsigned char current_byte = 0, previous_byte = 0;
    file_r.SeekToBegin();
    file_r.Read(&current_byte, 1);
    previous_byte = current_byte;
    while (file_r.Read(&current_byte, 1)) {
        if ((current_byte & 0xC0) == 0x80) {
            if ((previous_byte & 0xC0) == 0xC0) {
                count_good_utf++;
            } else if ((previous_byte & 0x80) == 0x00) {
                count_bad_utf++;
            }
        } else if ((previous_byte & 0xC0) == 0xC0) {
            count_bad_utf++;
        }
        previous_byte = current_byte;
    }
    file_r.Close();

    // A tie, including 0 == 0 for a pure ASCII file, is reported as UTF-8;
    // return false in the second branch to change that.
    if (count_good_utf > count_bad_utf) {
        std::cout << "the code of this file is utf(no BOM). " << std::endl;
        return true;
    } else if (count_good_utf == count_bad_utf) {
        std::cout << "the code of this file is pure ASCII file as UTF-8. " << std::endl;
        return true;
    } else {
        std::cout << "the code of this file is not utf. " << std::endl;
        return false;
    }
}

Reference:

http://blog.chinaunix.net/uid-24118190-id-1759447.html

Time: 2024-12-28 23:55:40