What is a text file and what is a binary file :)

If you are not coming from a programming background, it might not yet be clear what a file really is. What is a binary file, and what makes something a text file?

It really comes down to one sentence: "Files that consist exclusively of ASCII characters are known as text files. All other files are known as binary files."

Is a Microsoft Word document a text file or a binary file?

Is a spreadsheet a text file or a binary file?

Let's try to explain this.

A little disclaimer:

There is actually a lot more variation to this, but I'll focus on files on Unix/Linux systems, Windows, and Mac. Wikipedia has more to say about text files and binary files, so if this article does not satisfy your curiosity, please check out those articles.

What is a file?

Basically every file is just a series of bytes, one after the other. That is, numbers between 0 and 255. To suit the storage device it is on, a file might be spread out over several areas of that device, but from our point of view each file is just a series of bytes. In general every file is a binary file, but if the data in it contains only text (letters, numbers, and other symbols one would use in writing) and it consists of lines, then we consider it a text file.

What is a text file?

(I am going to simplify a bit here for clarity and, for now, assume that the files only use ASCII characters.)

When you open a text file with Notepad or some other simple text editor, you will see several lines of text. The file on the disk, on the other hand, isn't broken up into such lines. It is a series of numbers, one after the other. When you open the file using Notepad, it translates each number to a visual representation. For example, if it encounters the number 97, it will show the letter a. We can say that the a character is represented by the number 97 in ASCII.
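The translation from numbers to characters can be tried out directly. The following is a minimal Python sketch; the byte values are made up for illustration and stand in for the content of a small file on disk.

```python
# The raw numbers as they would sit on disk (an illustrative sample, not a real file).
raw = bytes([72, 101, 108, 108, 111, 10, 97, 10])

# What a text editor does conceptually: map every number to a character.
text = raw.decode("ascii")

print(list(raw))            # the numbers stored in the "file"
print(repr(text))           # the characters; the number 10 shows up as \n
print(ord("a"), chr(97))    # the mapping works in both directions: 97 a
```

The number 10 in the sample is not displayed as a letter at all; the sections below explain what it is.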

The reason that you see several lines in your editor is that some of the bytes in the file, called newlines, are actually instructions to the editor to go to the beginning of the next line. Thus the character that comes after the newline in the file is displayed on the next line.

What is a newline?

Which number represents the newline?

Actually, none of the characters in the ASCII table is called a newline. When we say newline, we usually mean the sign that can convince the computer to go to the beginning of the next row.

There are various sets of bytes that represent a newline, depending on the operating system. In the operating systems we care about in this article, a newline is represented by one or both of two characters that the ASCII table calls LF, line feed (hex 0x0A, decimal 10), and CR, carriage return (hex 0x0D, decimal 13).

If you have ever seen a typewriter, you will remember that in order to go to the next line the user had to pull a handle towards the beginning of the line (usually to the left side of the paper). This movement first pushed the "carriage" to the beginning of the paper, and when it arrived at the beginning (and got stuck), further pulling of the handle turned the paper a bit so the carriage would point to the next line. Then the typewriter was ready for the next line.

That is, they used two operations: carriage return, pushing the "carriage" to the beginning of the paper, and line feed, going to the next line.

(Image: typewriter, from Wikipedia)

Therefore on MS Windows a newline is represented by two characters, CRLF: a carriage return followed by a line feed.

On Unix/Linux systems and on Mac OS X, the newline is represented by a single LF (line feed).

Just for curiosity: Mac OS Classic (before OS X), Commodore, ZX Spectrum, and TRS-80 all used a CR (carriage return) to represent a newline.

(I learned programming on a HT-1080Z which was a TRS-80 clone and later switched to a ZX Spectrum.)

Wikipedia has even more to say about newline.
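To see which convention a given piece of data uses, one can simply count the byte sequences. The sketch below is only an illustration: the function name `count_newline_styles` and the three sample byte strings are invented for this example.

```python
def count_newline_styles(data: bytes) -> dict:
    """Count CRLF pairs, lone LF bytes and lone CR bytes in raw file content."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair also contains one LF and one CR, so subtract the pairs.
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf
    return {"CRLF": crlf, "LF": lf, "CR": cr}


# Illustrative samples of the same two lines in the three conventions.
windows = b"first\r\nsecond\r\n"
unix = b"first\nsecond\n"
classic_mac = b"first\rsecond\r"

for name, sample in [("Windows", windows), ("Unix", unix), ("Mac OS Classic", classic_mac)]:
    print(name, count_newline_styles(sample))
```

When trying this on a real file, open it in binary mode (`open(path, "rb")`), because Python's text mode translates the different newline conventions to a single `\n` by default and would hide the difference.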

So if you have a file filled with ASCII printable characters with a few "newlines" sprinkled in, then you have a text file.

Encoding

Of course, if you looked at the ASCII table you saw that only very few languages can be written with those letters, mostly the Latin-based languages. Many languages that use those characters have a few extra letters. For example, in Hungarian there are a few more vowels: aáeéiíoóöőuúüű. (The 5 from Latin and 9 extra. For fun.) You cannot represent them within the ASCII table.

Therefore people have invented other encodings besides ASCII. Without going into the details, each encoding is a mapping between numbers that can be saved in a computer file and "drawings" that should be displayed on the screen.

Remember, even in ASCII, you don't have a letter a in a file. You have the decimal number 97 saved, which your computer knows to display as the letter a. The computer will display the letter a if it thinks that your file is in ASCII encoding, or in any of the ASCII-based or ASCII-compatible encodings, such as Latin-1 or UTF-8.

So in ancient times people used various encodings to represent their own languages, but these encodings overlapped. The same number was used to represent different characters (drawings) in different languages. That did not allow mixing these languages in the same file, and if an application used the incorrect encoding to display a file, all you got was an unintelligible mix of characters from some other language.
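The overlap is easy to reproduce. The sketch below assumes ISO-8859-2 (Latin-2) as the encoding used to save a Hungarian word and Latin-1 as the wrong guess made by the reader; both were picked only as an example of two overlapping single-byte encodings.

```python
word = "erdő"  # Hungarian for "forest"; the last letter is not in ASCII

# Save the text using an older single-byte encoding (assumed here: ISO-8859-2).
stored = word.encode("iso8859_2")

# Read the very same bytes back with the right and with a wrong encoding.
right = stored.decode("iso8859_2")
wrong = stored.decode("latin_1")

print(list(stored))   # four numbers, one per character
print(right)          # the original word
print(wrong)          # same bytes, but the last "drawing" is a different letter
```

The bytes never changed between the two reads. Only the table used to turn numbers into characters did.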

You can still see this problem when a web page is written in one of these ancient-time encodings, but the browser uses a different encoding to show it. The solution would be to include a hint about the encoding in the HTML page, but at times people forget to do this.

The other good solution is to use the UTF-8 encoding, as it can represent every character in Unicode, which aims to map out all the characters in the known universe. Unfortunately Klingon is not yet included.

UTF-8 is one of the good ways to map Unicode characters to numbers. As Unicode currently includes more than 110,000 characters, it cannot be represented in one byte, which can hold only numbers between 0 and 255. So in UTF-8 every character is represented by 1 to 4 bytes. If you open a file that was written using the UTF-8 encoding with a tool that can only handle ASCII characters, you will see lots of "garbage" wherever non-ASCII characters appear. That's because in UTF-8 those characters are represented by bytes with values between 128 and 255, which have no meaning in ASCII.

So to the casual viewer, the file would be indistinguishable from a binary file.
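The variable length of UTF-8 can be checked with a few characters. This is a small Python sketch; the sample characters and the word "kávé" were chosen only for illustration.

```python
# One sample character for each possible UTF-8 length (1 to 4 bytes).
samples = ["a", "é", "€", "😀"]
for ch in samples:
    encoded = ch.encode("utf-8")
    print(repr(ch), len(encoded), list(encoded))

# A tool that only understands ASCII cannot make sense of the multi-byte parts.
text = "kávé"
data = text.encode("utf-8")
try:
    data.decode("ascii")
    ascii_only = True
except UnicodeDecodeError:
    ascii_only = False
print("pure ASCII:", ascii_only)

# Replacing every byte ASCII does not know gives the typical "garbage" look.
garbled = data.decode("ascii", errors="replace")
print(garbled)
```

Plain ASCII letters keep their one-byte values in UTF-8, which is why a UTF-8 file containing only English text is also a valid ASCII file.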

Binary file

A binary file is basically any file that is not "line-oriented". Any file where besides the actual written characters and newlines there are other symbols as well.
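The simplified definition used in this article can be turned into a rough check. The sketch below is not how any particular tool decides; the function name `looks_like_ascii_text` is invented here, and allowing the tab character is an extra assumption that the article does not mention.

```python
PRINTABLE = set(range(32, 127))   # space through tilde in the ASCII table
NEWLINE_BYTES = {10, 13}          # LF and CR
TAB = {9}                         # assumption: tabs are treated as ordinary text


def looks_like_ascii_text(data: bytes) -> bool:
    """Return True if every byte is printable ASCII, a newline byte or a tab."""
    allowed = PRINTABLE | NEWLINE_BYTES | TAB
    return all(byte in allowed for byte in data)


c_source = b"int main(void) { return 0; }\n"
fake_binary = b"\x7fELF\x02\x01\x01\x00"      # illustrative bytes, not a real program
utf8_text = "kávé\n".encode("utf-8")

print(looks_like_ascii_text(c_source))      # True
print(looks_like_ascii_text(fake_binary))   # False
print(looks_like_ascii_text(utf8_text))     # False
```

Under this strict ASCII-only rule a UTF-8 file with accented letters is reported as not being text, which matches the remark above that such a file can look binary to an ASCII-only viewer. An empty file passes the check, since it contains no disallowed bytes.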

So a program written in the C programming language is a text file, but after you compile it, the compiled version is binary.

A Perl program is a text file, but if you package it with PAR::Packer it will be a binary file.

A Microsoft Word file is a binary file, as besides the actual text it also contains data representing things like font size and color.

An OpenOffice Writer file is binary, as it is a zipped set of XML files, but the XML files inside are considered text files, even though they contain both text and characters that represent font size and color.

An HTML file is a text file too, even though it contains lots of characters that are invisible when viewed in a browser. It is considered a text file even though a newline, as described above, won't cause the next character to be displayed on the next line when viewed in a browser. It is considered a text file because all the "control characters" are themselves "printable characters" when viewed in a regular text editor.
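The "text inside a binary container" idea can be imitated with a toy zip archive built in memory. This is only a sketch: the XML snippet is made up and the result is not a valid OpenOffice document, just a zip file holding one small XML member named `content.xml`.

```python
import io
import zipfile

# A made-up XML snippet standing in for the document content.
xml_text = '<?xml version="1.0"?>\n<doc><p size="12">Hello</p></doc>\n'

# Pack the XML into a zip archive held in memory instead of on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("content.xml", xml_text)

container = buffer.getvalue()
print(container[:4])   # zip signature: b'PK\x03\x04', not printable text

# Unpacking gives back the plain text file that was put in.
with zipfile.ZipFile(io.BytesIO(container)) as archive:
    inner = archive.read("content.xml")

print(inner.decode("ascii"))
```

The container starts with bytes such as 3 and 4 that are not written characters, so it counts as binary, while the member extracted from it is ordinary line-oriented text.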
