5105 pa3 Distributed File System based on Quorum Protocol


1 Design document

1.1 System overview

We implemented a distributed file system using a quorum based protocol. The basic idea of this protocol is that the clients need to obtain permission from multiple servers before either reading or writing a file to the server. Using this system, multiple clients can share files together. The whole system is composed of 1 coordinator (also a server), 1 or more clients, and several other nodes.

Our system also supports multiple reads on the same file.

Multiple writes or read+writes to the same file is not allowed.

Writes to a single file is processed in the order of request sequence.

The client can write (update) files in the file system, or read files. If there is no corresponding filename on the file system, the client will see an appropriate error message.

The nodes will contain replicas of the files and listen to requests from the clients. When a node wants to join the file system, it contacts the coordinator using Thrift. The coordinator will then add it to the server list.  A file server which gets a request from the client will contact the coordinator to carry out the operation. That is, any servers can receive read/write requests from users and they will forward the requests to the coordinator.

The coordinator could listen to requests from the servers, and record the information whether a file is free, being read/ written/ synchronizing. The coordinator will build the quorum and then contact the other randomly chosen servers needed for the quorum to complete the operation requested by the server node. The coordinator will be well known to all file servers.

1.2 Assumptions

We made these assumptions in this system:

• Assume all the files are text file only and ignore its encoding-format.

• The number of files you need to handle will be small (< 10).

• Servers know other servers’ and coordinator’s information (IP and port).

• The file contents will be very simple (e.g., a file name with version number).

• The coordinator will hold a lock for each file.

• Accessing different files should is done concurrently.

• The system works on the CSELabs machines (separate machines), e.g., KH 4-250 (csel-kh4250-xx.cselabs.umn.edu).

1.3 Component design

1.3.1 Common part

In the common part, we defined a Thrift struct Address, which is used to describe a node (IP, port).

1.3.2 Coordinator

To initiate the supernode, input java -cp ".:/usr/local/Thrift/*" Coordinator Port NR NW, for example, “java -cp ".:/usr/local/Thrift/*" Coordinator 9090 4 4”.

The coordinator implements the following methods:

1. String Coord_Read(String filename): When a node wants to read a file, it sends the read request to the coordinator by this method. The coordinator will check whether the file can be read first (not being written). If could, the coordinator will build a quorum with NR servers. The coordinator then asks these servers to read files and pick the latest version to get it back to the server.

2. boolean Coord_Write(String filename, String fileContent): When a node wants to write a file, it sends the write request to the coordinator by this method. The coordinator will check whether the file can be written first (not being written or read). If could, the coordinator will build a quorum with NW servers. The coordinator then asks these servers to update the files.

3. void sync(): This is the background synchronization function. It updates all the files to the latest version every 5 seconds. When the file is not being processed, the synchronization function will get the latest version from all the server and then distribute it to all server nodes.

4. ConcurrentHashMap<String, Integer> Coord_lsDir(): this function return all the files in the system and its version number.

5. boolean Join(Address server): The server node calls this function to notify the coordinator to join the file system.

6. boolean reset(int NR, int NW): The function is used to reset the value of NR and NW.

1.3.3 Client

The client is the user interface. It will connect to an arbitrary node.

To initiate it, input  “java -cp ".:/usr/local/Thrift/*" Client <nodeIP> <nodePort>”, for example,  “java -cp ".:/usr/local/Thrift/*" Client cuda02 7625”.

The client could handle the following operations:

<setdir> local_dirname : Set the local working directory

<getdir> : Show the local working directory. The default working directory is ./ClientDir/

<read> remote_filename : Read a remote file. The client will call read() function on server.

<write> remote_filename local_filename : Write a remote file with the content of a local file. The client will call write() function on server.

<lsremote> : Show the list of all remote files (filename and its version) on the file system. The client will call lsDir() function on server.

<lslocal> : Show the list of all local files in working directory.

<bench-write> : Perform a benchmark: write all files in local working directory to remote file system.

<bench-read> : Perform a benchmark: read all files in remote file system.

<benchmark> tw tr : Perform a benchmark: tw times of [bench-write], then tr times of [bench-read].

<setcod> nr nw : set NR / NW on coordinator. The client will call Coord_reset() function on server.

<quit> : Quit

1.3.4 Node

The nodes will contain replicas of the files and listen to requests from the clients. The working dir of the node is ./ServerDir_xxxxx (xxxxx is a random number to ensure that the working dir is always empty when every time starting the node).

To initiate the node, input  “java -cp ".:/usr/local/Thrift/*" Node <coordinatorIP> <coordinatorPort> <nodePort>”, for example,  “java -cp ".:/usr/local/Thrift/*" Node csel-kh1250-02 9090 7625”.

The node implements the following methods:

string Read(1: string FileName): client call Read() to read a file. It will call Coord_read() on coordinator.

i32 Write(1: string FileName, 2: string FileContent): client call Write() to write a file. It will call Coord_write() on coordinator.

map<string,i32> lsDir(): client call Write() to write a file. It will call Coord_lsDir() on coordinator.

i32 GetVersion(1: string FileName): coordinator call this to get the version of file on this server. It will return the local version of the specific local file. Return -1 if the file does not exist.

string DirectRead(1: string FileName): a coordinator call DirectRead() to read file. It will return the local content of the specific local file. Return “NULL” if the file does not exist.

i32 DirectWrite(1: string FileName, 2: string FileContent, 3: i32 FileNewVer): a coordinator call DirectWrite() to write file. It will write the file on local server, and set its version to FileNewVer.

bool Coord_reset(1: i32 nr, 2: i32 nw): client call Coord_reset() to reset parameters on coordinator. It will call reset on coordinator.


2 User document

2.1 How to compile

We have written a make script to compile the whole project.

cd pa3/src

./make.sh

2.2 How to run the project

1.    Run Coordinator
cd pa3/src/
 java -cp ".:/usr/local/Thrift/*" Coordinator Port NR NW
        <Port>: The port of super
        <NR>: quorum size for read operation
            <NW>: quorum size for write operation
    Eg:  java -cp ".:/usr/local/Thrift/*" Coordinator 9090 4 4
      2.  Run compute node
    Start compute node on 7different machines.
    cd pa2/src/
    java -cp ".:/usr/local/Thrift/*" Node <coordinatorIP> <coordinatorPort> <nodePort>
        <coordinatorIP>: The ip address of coordinator
        <coordinatorPort>: The port of coordinator
        <NodePort>: The port of node
    Eg:  java -cp ".:/usr/local/Thrift/*" Node csel-kh1250-02 9090 7625
      3. Run client
    cd pa2/src/
    java -cp ".:/usr/local/Thrift/*" Client <nodeIP> <nodePort>
<nodeIP>: The ip address of node
        <nodePort>: The port of node
E.g:  java -cp ".:/usr/local/Thrift/*" Client cuda02 7625

Sample operations on client:

2.3 What will happen after running

The results and log (operation return, succeed flag, time spent, synchronization condition, file version) will be output on the screen. You will be asked to input the next command.


3 Testing Document

3.1 Testing Environment

Machines:

We use 10 machines to perform the test, including 1 coordinator machine (csel-kh1250-01), 6 server machines (csel-kh4250-03, csel-kh4250-01, csel-kh4250-22, csel-kh4250-25, csel-kh4250-34), 3 client machines (csel-kh1250-03).

Test Set:

We use a test set (./ClientDir) including 10 items, totally 314 bytes. The data uses a shared directory via NSF.

Logging:

Logging (operation return, succeed flag, time spent, synchronization condition, file version) is output on the window.

Testing Settings:

We test:

3 clients

read-heavy/ write-heavy workloads

small/big value of NR/NW

 

3.2 read/write mixed (300 read, 100 read for each client, 300 write, 100 write for each client)

Unit: ms

1)NR = 4, NW = 4

Client 1, read time: 812, write time: 2007

Client 2, read time: 817 , write time: 1876

Client 3, read time: 809, write time: 1796

2)NR = 1, NW = 7

Client 1, read time: 531, write time: 2970

Client 2, read time: 565, write time: 3276

Client 3, read time: 706 , write time: 3606

3)NR = 7, NW = 7

Client 1, read time: 1081, write time: 2754

Client 2, read time: 774, write time: 2854

Client 3, read time: 644, write time: 2875

4)NR = 7, NW = 4

Client 1, read time: 802, write time: 1822

Client 2, read time: 946, write time: 1820

Client 3, read time: 1031, write time: 2225

When NR=1, the read time is the shortest. When NW=4, the write time is the shortest. NR = 4, NW = 4 reach the shortest total time. Maybe that’s because the size of the quorum is small,  less time were spent in distributing updates to quorum and collect the latest version from Quorum. Also, we found that the time spent on single write operation is much longer than a single read operation.

3.3 write heavy  (120 read, 40 read for each client, 480 write, 160 write for each client)

 1)NR = 4, NW = 4

Client 1, read time: 284, write time: 2693

Client 2, read time: 228, write time: 2693

Client 3, read time: 262, write time: 2629

2)NR = 1, NW = 7

Client 1, read time: 180, write time: 4052

Client 2, read time: 248, write time: 4083

Client 3, read time: 266, write time: 4117

3)NR = 7, NW = 7

Client 1, read time: 310, write time: 4961

Client 2, read time: 453, write time: 4793

Client 3, read time: 374, write time: 4693

4)NR = 7, NW = 4

Client 1, read time: 301, write time: 2539

Client 2, read time: 317, write time: 2601

Client 3, read time: 361, write time: 2585

When NR=1, the read time is the shortest. When NW=4, the write time is the shortest. NR=4/NW=4 and NR=7/NW=4 reaches the best performance. Since this is the write heavy case, so minimal NW could get the best performance.

3.4 read heavy (480 read, 160 read for each client, 120 write, 40 write for each client)

1)NR = 4, NW = 4

Client 1, read time: 1114, write time: 781

Client 2, read time: 893, write time: 751

Client 3, read time: 818, write time: 539

2)NR = 1, NW = 7

Client 1, read time: 880, write time: 1252

Client 2, read time: 690, write time: 1028

Client 3, read time: 584, write time: 982

3)NR = 7, NW = 7

Client 1, read time: 1841, write time: 1646

Client 2, read time: 1740, write time: 1697

Client 3, read time: 1929, write time: 2082

4)NR = 7, NW = 4

Client 1, read time: 1205, write time: 729

Client 2, read time: 1233, write time: 876

Client 3, read time: 1503, write time: 1255

When NR=1, the read time is the shortest. When NW=4, the write time is the shortest. NR=4/NW=4 reaches the best performance. Since this is the read heavy case, so minimal NR could get the best performance.

3.5 Negative cases

We tested the 2 cases:

1. read a remote file, while the file does not exist in remote file system

2. write a local file to remote, while the file does not exist in client

原文地址:https://www.cnblogs.com/pdev/p/11331871.html

时间: 2024-08-23 08:31:26

5105 pa3 Distributed File System based on Quorum Protocol的相关文章

DFS(distributed file system)

A clustered file system is a file system which is shared by being simultaneously mounted on multiple servers. There are several approaches to clustering, most of which do not employ a clustered file system (only direct attached storage for each node)

Hadoop -&gt;&gt; HDFS(Hadoop Distributed File System)

HDFS全称是Hadoop Distributed File System.作为分布式文件系统,具有高容错性的特点.它放宽了POSIX对于操作系统接口的要求,可以直接以流(Stream)的形式访问文件系统中的数据. HDFS能快速检测到硬件故障,也就是数据节点的Failover,并且自动恢复数据访问. 使用流形式的数据方法特点不是对数据访问时快速的反应,而是批量数据处理时的吞吐能力的最大化. 文件操作原则: HDFS文件的操作原则是“只写一次,多次读取”.一个文件一旦被创建再写入数据完毕后就不再

HDFS(Hadoop Distributed File System )

HDFS(Hadoop Distributed File System ) HDFS(Hadoop Distributed File System )Hadoop分布式文件系统.是根据google发表的论文翻版的.论文为GFS(Google File System)Google 文件系统(中文,英文). 1. 架构分析 基础名词解释: Block: 在HDFS中,每个文件都是采用的分块的方式存储,每个block放在不同的datanode上,每个block的标识是一个三元组(block id, n

HDFS分布式文件系统(The Hadoop Distributed File System)

The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execu

【整理学习HDFS】Hadoop Distributed File System 一个分布式文件系统

Hadoop分布式文件系统(HDFS)被设计成适合运行在通用硬件(commodity hardware)上的分布式文件系统.它和现有的分布式文件系统有很多共同点.但同时,它和其他的分布式文件系统的区别也是很明显的.HDFS是一个高度容错性的系统,适合部署在廉价的机器上.HDFS能提供高吞吐量的数据访问,非常适合大规模数据集上的应用.HDFS放宽了一部分POSIX约束,来实现流式读取文件系统数据的目的.HDFS在最开始是作为Apache Nutch搜索引擎项目的基础架构而开发的.HDFS是Apac

NameNode Recovery Tools for the Hadoop Distributed File System

转自:http://blog.cloudera.com/blog/2012/05/namenode-recovery-tools-for-the-hadoop-distributed-file-system/ Warning: The procedure described below can cause data loss. Contact Cloudera Support before attempting it. Most system administrators have had to

5105 pa2 Distributed Hash Table based on Chord

1 Design document 1.1 System overview We implemented a Book Finder System using a distributed hash table (DHT) based on the Chord protocol. Using this system, the client can set and get the book’s information (title and genre for simplicity) using th

HDFS(Hadoop Distributed File System)的组件架构概述

1.hadoop1.x和hadoop2.x区别 2.组件介绍 HDFS架构概述1)NameNode(nn): 存储文件的元数据,如文件名,文件目录结构,文件属性(生成时间,副本数,文件权限),以及每个文件的块列表和块所在的DataNode等.2)DataNode(dn): 在本地文件系统存储文件块数据,以及块数据的校验和.3)SecondaryNameNode(2nn): 用来监控HDFS状态的辅助后台程序,每隔一段时间获取DHFS元数据的快照. YARN架构概述 1)ResourceManag

Parallel file system processing

A treewalk for splitting a file directory is disclosed for parallel execution of work items over a filesystem. The given work item is assigned to a worker. Thereafter, a request is sent to split the file directory to share a portion of the file direc