Process/Thread Pinning Overview

https://www.nas.nasa.gov/hecc/support/kb/ProcessThread-Pinning-Overview_259.html

Pinning, the binding of a process or thread to a specific core, can improve the performance of your code by increasing the percentage of local memory accesses.

Once your code runs and produces correct results on a system, the next step is performance improvement. For a code that uses multiple cores, the placement of processes and/or threads can play a significant role in performance.

Given a set of processor cores in a PBS job, the Linux kernel usually does a reasonably good job of mapping processes/threads to physical cores, although the kernel may also migrate processes/threads. Some OpenMP runtime libraries and MPI libraries may also perform certain placements by default. In cases where the placements by the kernel or the MPI or OpenMP libraries are not optimal, you can try several methods to control the placement in order to improve the performance of your code. Using the same placement from run to run also has the added benefit of reducing runtime variability.

Pay attention to maximizing data locality while minimizing latency and resource contention, and have a clear understanding of the characteristics of your own code and the machine that the code is running on.

Characteristics of NAS Systems

NAS provides two distinctly different types of systems: Pleiades, Aitken, Electra, and Merope are cluster systems, and Endeavour is a global shared-memory system. Each type is described in this section.

Pleiades, Aitken, Electra, and Merope

On Pleiades, Aitken, Electra, and Merope, memory on each node is accessible and shared only by the processes and threads running on that node. Pleiades is a cluster system consisting of different processor types: Sandy Bridge, Ivy Bridge, Haswell, and Broadwell. Merope is a cluster system that currently consists of Westmere nodes that have been repurposed from Pleiades. Electra is a cluster system that consists of Broadwell and Skylake nodes, and Aitken is a cluster system that consists of Cascade Lake nodes.

Each node contains two sockets, with a symmetric memory system inside each socket. These nodes are considered non-uniform memory access (NUMA) systems, and memory is accessed across the two sockets through the Intel QuickPath Interconnect (QPI). So, for optimal performance, data locality should not be overlooked on these processor types.
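
Before choosing a placement strategy, it helps to look at the actual NUMA layout of the node type you are running on. The commands below are a minimal sketch using standard Linux tools (assuming numactl and lscpu are available on the node); the output differs by processor type.

# List NUMA nodes, the cores belonging to each node, and per-node memory
numactl --hardware

# Summarize sockets, cores per socket, and the NUMA node-to-CPU mapping
lscpu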

However, compared to a global shared-memory NUMA system such as Endeavour, data locality is less of a concern on the cluster systems. Rather, minimizing latency and resource contention will be the main focus when pinning processes/threads on these systems.

For more information on Pleiades, Aitken, Electra, and Merope, see the following articles:

Endeavour

Endeavour comprises two hosts (endeavour1 and endeavour2). Each host is a NUMA system that contains several dozen Sandy Bridge nodes, with memory physically located at varying distances from the processors that access it. A process/thread can access the local memory on its own node and the remote memory across nodes through the NUMAlink, with varying latencies. So, data locality is critical for achieving good performance on Endeavour.

Note: When developing an application, we recommend that you initialize data in parallel so that each processor core initializes the data it is likely to access later for calculation. Under the Linux first-touch policy, memory is allocated on the NUMA node of the core that first writes to it, so parallel initialization places data close to the threads that will use it.

For more information, see Endeavour Configuration Details.

Methods for Process/Thread Pinning

Several pinning approaches for OpenMP, MPI, and MPI+OpenMP hybrid applications are listed below. We recommend using the Intel compiler (and its runtime library) and the SGI MPT software on NAS systems, so most of the approaches pertain specifically to them. You can also use the mbind tool, which works with multiple OpenMP libraries and MPI environments. Minimal command-line sketches for each case follow the list below.

OpenMP codes

MPI codes

MPI+OpenMP hybrid codes
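
As an illustration only, the following minimal sketches show common ways of requesting pinning with the Intel OpenMP runtime and SGI MPT. The thread counts, rank counts, core list, and the executable name a.out are assumptions; adapt them to your code and to the processor type of the nodes in your PBS job.

# OpenMP (Intel runtime): bind threads compactly to cores
export OMP_NUM_THREADS=8
export KMP_AFFINITY=compact          # or the standard OMP_PLACES=cores and OMP_PROC_BIND=close
./a.out

# MPI (SGI MPT): pin ranks to an explicit list of cores, in rank order
export MPI_DSM_CPULIST=0-7
mpiexec -np 8 ./a.out

# MPI+OpenMP hybrid (SGI MPT): omplace reserves a block of cores for each rank's threads
export OMP_NUM_THREADS=4
mpiexec -np 4 omplace -nt 4 ./a.out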

Checking Process/Thread Placement

Each of the approaches listed above provides some verbose capability to print out the tool's placement results. In addition, you can check the placement using the following approaches.
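
For example, with the Intel OpenMP runtime and SGI MPT (a sketch; the variable values shown are assumptions), placement reporting can be turned on through environment variables:

# Intel OpenMP runtime: print the thread-to-core binding at startup
export KMP_AFFINITY=verbose,compact

# SGI MPT: print the rank-to-core placement at startup
export MPI_DSM_VERBOSE=1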

Use the ps Command

ps -C executable_name -L -o psr,comm,time,pid,ppid,lwp

In the generated output, use the core ID under the PSR column, the process ID under the PID column, and the thread ID under the LWP column to find where the processes and/or threads are placed on the cores.
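
For instance, in a hypothetical output line such as the following (the values are made up for illustration), the thread with LWP 7892, belonging to process 7890, is currently running on core 11:

  PSR COMMAND      TIME   PID  PPID   LWP
   11 a.out    00:02:33  7890  7885  7892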

Note: The ps command provides a snapshot of the placement at that specific time. You may need to monitor the placement from time to time to make sure that the processes/threads do not migrate.
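
A simple way to do this (sketch only; the executable name a.out and the interval are assumptions) is to repeat the ps command at a fixed interval and watch the PSR column for changes:

# Re-run the placement query every 5 seconds
watch -n 5 'ps -C a.out -L -o psr,comm,time,pid,ppid,lwp'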

Instrument Your Code to Get Placement Information

  • Call the mpi_get_processor_name function to get the name of the processor an MPI process is running on
  • Call the Linux C function sched_getcpu() to get the processor number that the process or thread is running on

For more information, see Instrumenting your Fortran Code to Check Process/Thread Placement.
