Notes of Principles of Parallel Programming - partial

0.1 Topic
Notes on Lin C. and Snyder L., Principles of Parallel Programming. Beijing: China Machine Press, 2008.

(1) Parallel Computer Architecture - done 2015/5/24
(2) Parallel Abstraction
(3) Scalable Algorithm Techniques
(4) PP Languages: Java(Thread), MPI(local view), ZPL(global view)

0.2 Audience
Novice PP programmers who want to gain fundamental PP concepts

0.3 Related Topics
Computer Architecture, Sequential Algorithms,
PP Programming Languages

--------------------------------------------------------------------
### 1 introduction
real world cases:
house construction, manufacturing pipeline, call center

ILP(Instruction Level Parallelism)
(a+b) * (c+d)
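
A tiny C restatement of this expression (illustrative only; the compiler and hardware, not the programmer, extract this parallelism): the two additions are independent, so a superscalar processor can issue them in the same cycle, and only the multiply must wait for both results.

```c
int ilp_example(int a, int b, int c, int d) {
    int t1 = a + b;   /* independent of t2 */
    int t2 = c + d;   /* independent of t1 */
    return t1 * t2;   /* depends on both t1 and t2 */
}
```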

Parallel Computing vs. Distributed Computing
the goal of PC is to provide performance, either in terms of
processing power or memory, that a single processor cannot provide;
the goal of DC is to provide convenience, including availability,
reliability, and physical distribution.

Concurrency vs. Parallelism
CONCURRENCY is widely used in the OS and DB communities to describe
executions that are LOGICALLY simultaneous;
PARALLELISM is typically used by the architecture and supercomputing
communities to describe executions that PHYSICALLY execute simultaneously.
In either case, the codes that execute simultaneously exhibit unknown
timing characteristics.

iterative sum/pair-wise summation
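
A C sketch of the two summation orders (function names and the in-place combining are my own choices, not the book's code): the iterative sum chains every addition through one accumulator, while pair-wise summation combines disjoint pairs, so each round's additions are independent and could run in parallel.

```c
#include <stddef.h>

/* Iterative sum: every addition depends on the previous one,
 * so the loop is inherently sequential. */
long iterative_sum(const long *x, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Pair-wise (tree) summation, written sequentially for clarity.
 * Within each round the pair additions touch disjoint elements and
 * are therefore independent, which is what a parallel version would
 * exploit.  The array is combined in place; assumes n >= 1. */
long pairwise_sum(long *x, size_t n) {
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            x[i] += x[i + stride];     /* independent of the other pairs */
    return x[0];
}
```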

parallel prefix sum
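
A sequentially written sketch of the data-parallel (Hillis-Steele) prefix-sum idea; the function name and the scratch buffer are my own choices, not the book's notation.

```c
#include <stddef.h>
#include <string.h>

/* Inclusive prefix sum: after about log2(n) rounds, x[i] holds
 * x[0] + ... + x[i].  Within one round every update reads only the
 * snapshot of the previous round, so all updates of a round are
 * independent and could run on separate processors.
 * tmp must point to scratch space for n longs. */
void prefix_sum(long *x, long *tmp, size_t n) {
    for (size_t stride = 1; stride < n; stride *= 2) {
        memcpy(tmp, x, n * sizeof *x);        /* previous round's values */
        for (size_t i = stride; i < n; i++)   /* independent updates */
            x[i] = tmp[i] + tmp[i - stride];
    }
}
```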

Parallelism using multiple instruction streams: threads
multithreaded solutions to count the number of 3s in an array
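
A minimal pthreads sketch of a race-free count-3s program (compile with -lpthread); the thread count, array size, and names are placeholders, not the book's code. Each thread counts in a local variable and writes one private slot, so no lock on a shared counter is needed.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N (1 << 20)

static int array[N];
static long partial[NTHREADS];   /* one private result slot per thread */

/* Each thread scans its own contiguous slice and accumulates into a
 * local variable; only a single write per thread touches shared data. */
static void *count3s_worker(void *arg) {
    long id = (long)arg;
    size_t chunk = N / NTHREADS;
    size_t lo = id * chunk;
    size_t hi = (id == NTHREADS - 1) ? N : lo + chunk;
    long count = 0;
    for (size_t i = lo; i < hi; i++)
        if (array[i] == 3)
            count++;
    partial[id] = count;
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    long total = 0;

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, count3s_worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("number of 3s: %ld\n", total);
    return 0;
}
```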

characteristics of good parallel programs:
(1) correct;
(2) good performance;
(3) scalable to large numbers of processors;
(4) portable across a wide variety of parallel platforms.

### 2 parallel computers
6 parallel computers
(1) Chip multiprocessors *
Intel Core Duo
AMD Dual Core Opteron
(2) Symmetric Multiprocessor Architecture
Sun Fire E25K
(3) Heterogeneous Chip Design
Cell
(4) Clusters
(5) Supercomputers
BlueGene/L

sequential computer abstraction
Random Access Machine (RAM) model, i.e. the von Neumann model
abstract a sequential computer as a device with an instruction
execution unit and an unbounded memory.

2 abstract models of parallel computers:
(1) PRAM: parallel random access machine model
the PRAM consists of an unspecified number of instruction execution units,
connected to a single unbounded shared memory that contains both
programs and data.
(2) CTA: candidate type architecture
the CTA consists of P standard sequential computers (processors, or processor elements),
connected by an interconnection network (communication network);
it separates two types of memory references: inexpensive local references
and expensive non-local references.

Locality Rule:
Fast programs tend to maximize the number of local memory references, and
minimize the number of non-local memory references.

3 major communication (memory reference) mechanisms:
(1) shared memory
a natural extension of the flat memory of sequential computers.
(2) one-sided communication
a relaxation of the shared memory concept: it supports a single shared address space,
and all threads can reference all memory locations, but it does not attempt to keep
the memory coherent.
(3) message passing
memory references are used to access local memory;
message passing is used to access non-local memory.
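
A minimal MPI (C) sketch of the message-passing style (run with at least two processes, e.g. `mpirun -np 2`): ordinary loads and stores stay local, and the only way to reach another process's memory is an explicit send/receive pair. The ranks, tag, and payload are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                           /* local memory reference */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit non-local transfer */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```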

### 3 reasoning about parallel performance
thread: thread-based/shared-memory parallel programming
process: message-passing/non-shared-memory parallel programming

latency: the amount of TIME it takes to complete a given unit of work
throughput: the amount of WORK that can be completed per unit time
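
A quick worked instance of the distinction (the numbers are illustrative, not from the book): if a request takes 10 ms to complete but 8 requests can be in flight at once, the latency is 10 ms per request while the throughput is 8 requests / 10 ms = 800 requests per second.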

## source of performance loss
(1) overhead
communication
synchronization
computation
memory
(2) non-parallelizable computation
Amdahl's Law: the portions of a computation that are sequential will,
as parallelism is applied, come to dominate the execution time (see the formula after this list).
(3) idle processors
idle time is often a consequence of synchronization and communication
load imbalance: uneven distribution of work to processors
memory-bound computation: bandwidth, latency
(4) contention for resources
spin lock, false sharing
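
The usual algebraic form of Amdahl's Law, referenced in (2) above (my notation: S is the sequential fraction of the work, P the number of processors), which makes the dominance of the sequential part explicit:

```latex
\mathrm{Speedup}(P) = \frac{1}{\,S + \frac{1-S}{P}\,}
\;\xrightarrow{\;P \to \infty\;}\; \frac{1}{S}
\qquad \text{e.g. } S = 0.1 \Rightarrow \mathrm{Speedup} < 10 \text{ regardless of } P.
```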

## parallel structure
(1) dependences
an ordering relationship between two computations (a small loop example follows this list)
(2) granularity
the frequency of interactions among threads or processes
(3) locality
temporal locality: memory references are clustered in TIME
spatial locality: memory references are clustered by ADDRESS
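
A small C illustration of a dependence, referenced in (1) above (my own example, not the book's): the second loop carries a dependence from one iteration to the next, so its iterations cannot simply be distributed across processors, while the first loop's iterations are independent.

```c
/* a, b, c are arrays of length n, assumed allocated elsewhere. */
void dependence_demo(const int *a, int *b, int *c, int n) {
    /* No dependence between iterations: any order (or all at once) works. */
    for (int i = 0; i < n; i++)
        b[i] = 2 * a[i];

    /* Loop-carried flow dependence: iteration i reads c[i-1], which
     * iteration i-1 wrote, so the iterations are ordered. */
    for (int i = 1; i < n; i++)
        c[i] = c[i - 1] + a[i];
}
```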

## performance trade-off
sequential computation: 90/10 rule
communication V.S. Computation
Memory V.S. Parallelism
Overhead V.S. Parallelism

## measuring performance
(1) execution time/latency
(2) speedup/efficiency (defined below)
(3) superlinear speedup
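
Written out (my notation: T_S is the best sequential time, T_P the parallel time on P processors); superlinear speedup means the ratio exceeds P, typically because the aggregate cache and memory of P processors changes the memory behavior of the computation.

```latex
\mathrm{Speedup} = \frac{T_S}{T_P},
\qquad
\mathrm{Efficiency} = \frac{\mathrm{Speedup}}{P} = \frac{T_S}{P \cdot T_P}
```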

## scalable performance *
is difficult to achieve
