Hadoop: The Definitive Guide (4th Edition), Key Points in Translation (5) — Chapter 3. The HDFS (5)

5) The Java Interface

a) Reading Data from a Hadoop URL.


b) Although we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should strive to write your code against the FileSystem abstract class, to retain portability across filesystems.

Although we focus mainly on the HDFS implementation, DistributedFileSystem, you should in general write your code against the abstract FileSystem class so that it stays portable across filesystems.

c) One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to open a stream to read the data from. The general idiom is:

The simplest way to read a file from a Hadoop filesystem is to use a java.net.URL object to open a stream and read the data from it. The general idiom is:

InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}

There’s a little bit more work required to make Java recognize Hadoop’s hdfs URL scheme. This is achieved by calling the setURLStreamHandlerFactory() method on URL with an instance of FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it is typically executed in a static block.

A little extra work is needed before Java will recognize Hadoop's hdfs URL scheme: call URL's setURLStreamHandlerFactory() method with an instance of FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it is typically executed in a static block.

d) Example 3-1. Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler.


public class URLCat {

    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

run:
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt

e) We make use of the handy IOUtils class that comes with Hadoop for closing the stream in the finally clause, and also for copying bytes between the input stream and the output stream (System.out, in this case). The last two arguments to the copyBytes() method are the buffer size used for copying and whether to close the streams when the copy is complete. We close the input stream ourselves, and System.out doesn’t need to be closed.

We make use of Hadoop's handy IOUtils class to close the stream in the finally clause and to copy bytes between the input stream and the output stream (System.out in this case). The last two arguments to copyBytes() are the buffer size used for copying and whether to close the streams when the copy is complete. Here we close the input stream ourselves, while System.out does not need to be closed.

f) Reading Data Using the FileSystem API.


g) FileSystem is a general filesystem API, so the first step is to retrieve an instance for the filesystem we want to use — HDFS, in this case. There are several static factory methods for getting a FileSystem instance:

FileSystem is a general filesystem API, so the first step is to obtain an instance of the filesystem we want to use — HDFS in this case. There are several static factory methods for getting a FileSystem instance:

public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException

h) A Configuration object encapsulates a client or server’s configuration, which is set using configuration files read from the classpath, such as etc/hadoop/core-site.xml. The first method returns the default filesystem (as specified in core-site.xml, or the default local filesystem if not specified there). The second uses the given URI’s scheme and authority to determine the filesystem to use, falling back to the default filesystem if no scheme is specified in the given URI. The third retrieves the filesystem as the given user, which is important in the context of security.

A Configuration object encapsulates a client's or server's configuration, which is set using configuration files read from the classpath, such as etc/hadoop/core-site.xml. The first method returns the default filesystem (as specified in core-site.xml, or the default local filesystem if nothing is specified there). The second uses the given URI's scheme and authority to determine which filesystem to use, falling back to the default filesystem if no scheme is specified in the given URI. The third retrieves the filesystem as the given user, which is important in the context of security.
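
The following is a minimal sketch, not taken from the book, that exercises all three factory methods side by side; the URI hdfs://localhost/ and the user name "tom" are assumptions for illustration only (imports are omitted, as in the book's listings).

public class GetFileSystems {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // the default filesystem, as configured by fs.defaultFS in core-site.xml
        FileSystem defaultFs = FileSystem.get(conf);

        // the filesystem chosen from the URI's scheme and authority
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://localhost/"), conf);

        // the filesystem accessed as a particular user (relevant when security is enabled)
        FileSystem asTom = FileSystem.get(URI.create("hdfs://localhost/"), conf, "tom");

        System.out.println(defaultFs.getUri());
        System.out.println(hdfs.getUri());
        System.out.println(asTom.getUri());
    }
}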


j) The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class. This class is a specialization of java.io.DataInputStream with support for random access, so you can read from any part of the stream:

The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class. This class is a specialization of java.io.DataInputStream that supports random access, so you can read from any part of the stream:

package org.apache.hadoop.fs;

public class FSDataInputStream extends DataInputStream
        implements Seekable, PositionedReadable {
    // implementation elided
}

k) The Seekable interface permits seeking to a position in the file and provides a query method for the current offset from the start of the file (getPos()):

The Seekable interface permits seeking to a position in the file and provides a query method, getPos(), for the current offset from the start of the file:

public interface Seekable {
    void seek(long pos) throws IOException;
    long getPos() throws IOException;
}

l) Example 3-3. Displaying files from a Hadoop filesystem on standard output twice, by using seek():


public class FileSystemDoubleCat {

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0); // go back to the start of the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

run:

% hadoop FileSystemDoubleCat hdfs://localhost/user/tom/quangle.txt

m) Finally, bear in mind that calling seek() is a relatively expensive operation and should be done sparingly. You should structure your application access patterns to rely on streaming data (by using MapReduce, for example) rather than performing a large number of seeks.

Finally, remember that seek() is a relatively expensive operation and should be used sparingly. Structure your application's access patterns around streaming the data (with MapReduce, for example) rather than performing large numbers of seeks.

n) Writing Data

o) The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to:

The FileSystem class has a number of methods for creating a file. The simplest takes a Path object for the file to be created and returns an output stream to write to:

public FSDataOutputStream create(Path f) throws IOException

p) There’s also an overloaded method for passing a callback interface, Progressable, so your application can be notified of the progress of the data being written to the datanodes:

There is also an overloaded method that takes a callback interface, Progressable, so your application can be notified of the progress of the data being written to the datanodes:

package org.apache.hadoop.util;
public interface Progressable {
    public void progress();
}

q) As an alternative to creating a new file, you can append to an existing file using the append() method (there are also some other overloaded versions):

As an alternative to creating a new file, you can append to an existing file using the append() method (there are also other overloaded versions):

public FSDataOutputStream append(Path f) throws IOException
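
Below is a minimal sketch, not from the book, of appending to an existing file; the path comes from the command line, and note that append support is optional and depends on the filesystem and cluster configuration, so treat this only as an illustration.

public class FileAppend {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataOutputStream out = null;
        try {
            // reopen the existing file and continue writing from its end
            out = fs.append(new Path(uri));
            out.writeBytes("appended line\n");
        } finally {
            IOUtils.closeStream(out);
        }
    }
}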

r) Example 3-4. Copying a local file to a Hadoop filesystem


public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];

        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print(".");
            }
        });

        IOUtils.copyBytes(in, out, 4096, true);
    }
}

s) The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file:

The create() method on FileSystem returns an FSDataOutputStream which, like FSDataInputStream, has a method for querying the current position in the file:

package org.apache.hadoop.fs;

public class FSDataOutputStream extends DataOutputStream implements Syncable {
    public long getPos() throws IOException {
        // implementation elided
    }
    // implementation elided
}

However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because HDFS allows only sequential writes to an open file or appends to an already written file. In other words, there is no support for writing to anywhere other than the end of the file, so there is no value in being able to seek while writing.

However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because HDFS allows only sequential writes to an open file, or appends to an already written file. In other words, writing anywhere other than the end of the file is not supported, so there is no value in being able to seek while writing.
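
A minimal sketch, not from the book, of what this means in practice: getPos() can be queried while writing, but the position only ever moves forward with each write. The path and the written bytes here are assumptions for illustration.

public class WritePosition {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataOutputStream out = null;
        try {
            out = fs.create(new Path(uri));
            System.out.println(out.getPos()); // 0: nothing written yet
            out.writeBytes("hello");
            System.out.println(out.getPos()); // 5: the position advances with each write
        } finally {
            IOUtils.closeStream(out);
        }
    }
}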

t) FileSystem provides a method to create a directory:

FileSystem provides a method for creating a directory:

public boolean mkdirs(Path f) throws IOException

Often, you don’t need to explicitly create a directory, because writing a file by calling create() will automatically create any parent directories.

Often you don't need to create a directory explicitly, because writing a file with create() automatically creates any parent directories that are needed.
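
For the cases where you do want to create a directory explicitly, here is a minimal sketch (not from the book); the directory URI is assumed to be passed on the command line.

public class MakeDirectories {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // e.g. hdfs://localhost/user/tom/new/subdir
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // creates the directory and any missing parents; true means success
        boolean created = fs.mkdirs(new Path(uri));
        System.out.println(created);
    }
}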

u) Querying the Filesystem

v) An important feature of any filesystem is the ability to navigate its directory structure and retrieve information about the files and directories that it stores. The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information.

An important feature of any filesystem is the ability to navigate its directory structure and retrieve information about the files and directories it stores. The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information.

w) The method getFileStatus() on FileSystem provides a way of getting a FileStatus object for a single file or directory.

The getFileStatus() method on FileSystem provides a way of getting a FileStatus object for a single file or directory.
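
A minimal sketch, not from the book, that fetches a FileStatus and prints the metadata fields listed above; the class name and output labels are mine, and the path is assumed to come from the command line.

public class ShowFileStatus {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FileStatus stat = fs.getFileStatus(new Path(uri));
        System.out.println("path:              " + stat.getPath());
        System.out.println("is directory:      " + stat.isDirectory());
        System.out.println("length:            " + stat.getLen());
        System.out.println("block size:        " + stat.getBlockSize());
        System.out.println("replication:       " + stat.getReplication());
        System.out.println("modification time: " + stat.getModificationTime());
        System.out.println("owner:             " + stat.getOwner());
        System.out.println("group:             " + stat.getGroup());
        System.out.println("permissions:       " + stat.getPermission());
    }
}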

x) Finding information on a single file or directory is useful, but you also often need to be able to list the contents of a directory. That’s what FileSystem’s listStatus() methods are for:

Finding information on a single file or directory is useful, but you often also need to list the contents of a directory. That is what FileSystem's listStatus() methods are for:

public FileStatus[] listStatus(Path f) throws IOException
public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException
public FileStatus[] listStatus(Path[] files) throws IOException
public FileStatus[] listStatus(Path[] files, PathFilter filter) throws IOException

When the argument is a file, the simplest variant returns an array of FileStatus objects of length 1. When the argument is a directory, it returns zero or more FileStatus objects representing the files and directories contained in the directory.

When the argument is a file, the simplest variant returns an array of FileStatus objects of length 1. When the argument is a directory, it returns zero or more FileStatus objects representing the files and directories contained in that directory.

y) Example 3-6. Showing the file statuses for a collection of paths in a Hadoop filesystem.


public class ListStatus {

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path[] paths = new Path[args.length];
        for (int i = 0; i < paths.length; i++) {
            paths[i] = new Path(args[i]);
        }

        FileStatus[] status = fs.listStatus(paths);
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
            System.out.println(p);
        }
    }
}

z) Rather than having to enumerate each file and directory to specify the input, it is convenient to use wildcard characters to match multiple files with a single expression, an operation that is known as globbing. Hadoop provides two FileSystem methods for processing globs:

Rather than enumerating each file and directory to specify the input, it is convenient to use wildcard characters to match multiple files with a single expression, an operation known as globbing. Hadoop provides two FileSystem methods for processing globs:

public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

Hadoop supports the same set of glob characters as the Unix bash shell.

Hadoop supports the same set of glob characters as the Unix bash shell.
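
To close, a minimal sketch (not from the book) of expanding a glob pattern with globStatus(); the hdfs://localhost/ URI and the date-based /logs/2007/12/* layout are assumptions chosen purely for illustration.

public class GlobStatusExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
        // expand one wildcard expression into all matching paths,
        // e.g. /logs/2007/12/30, /logs/2007/12/31, ...
        FileStatus[] status = fs.globStatus(new Path("/logs/2007/12/*"));
        for (Path p : FileUtil.stat2Paths(status)) {
            System.out.println(p);
        }
    }
}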
