Parallel and distributed framework
2022-07-18 20:00:00 【Grass in the crack of the wall】
Table of contents
Parallel and distributed framework
Common techniques of parallel computing
Overview of shared memory system and distributed memory system



Uniform memory access (UMA) systems, also known as symmetric multiprocessors (SMP)
In a NUMA system, the affinity of a multicore CPU differs according to its clock frequency: the higher the clock frequency, the higher the affinity and the lower the power consumption.
OpenMP technology
(Open Multi-Processing)
OpenMP is mainly used for parallel programming on shared-memory computers. It is a language specification and API for multi-platform shared-memory multiprocessor programming in C/C++ and Fortran.
OpenMP provides a high-level abstraction for describing parallelism, which reduces the difficulty and complexity of parallel programming.
OpenMP Programming model
Adopts the traditional fork-join model; memory is shared among the control flows that execute in parallel.
An OpenMP program starts execution with a single thread; when a parallel directive is encountered, the master thread forks multiple child threads (see the sketch below).
Shortcomings: it is not well suited to complex inter-thread communication.
It also does not work well on systems without shared memory.
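A minimal fork-join sketch in C (the directive and omp_get_thread_num are standard OpenMP; the team size of 4 is an illustrative choice):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("before parallel region: 1 thread\n");   // serial part, master only

    // fork: the master thread spawns a team of child threads
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();             // each thread has its own id
        printf("hello from thread %d\n", tid);
    }   // join: implicit barrier, children finish, the master continues alone

    printf("after parallel region: 1 thread again\n");
    return 0;
}
```

Compile with an OpenMP-aware compiler, e.g. gcc -fopenmp.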
Components of the OpenMP API
Compiler directives
Pragma directives: markers that guide the compiler's handling of the corresponding code segment; the compiler parallelizes the program according to the directives and inserts synchronization and mutual exclusion where needed.
eg:#pragma omp parallel
Runtime library functions
eg:int omp_get_num_threads();
Environment variables
eg: export OMP_THREAD_LIMIT=8
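A short sketch exercising all three components together (the loop and its bounds are illustrative; the directive, runtime call, and environment variable are standard OpenMP):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    double sum = 0.0;

    // compiler directive: split the loop across threads and
    // combine the per-thread partial sums
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += 1.0 / (i + 1);

    // runtime library function: query the default team size
    printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}
```

Run with, e.g., `OMP_NUM_THREADS=8 ./a.out` to control the team size through the environment variable without recompiling.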
OpenMP commonly used directives (several are demonstrated in the sketch after this list)
parallel: used before a code block; multiple threads execute the block in parallel.
for: used before a for loop; distributes the loop iterations across multiple threads for parallel execution. The iterations must be independent of one another.
parallel for: the combination of parallel and for; also used before a for loop, indicating that the loop will be executed by multiple threads in parallel.
sections: used before code sections that may be executed in parallel.
parallel sections: the combination of the parallel and sections directives.
critical: used before a critical section of code.
single: used before a code block to be executed by a single thread, indicating that the following block will be executed by one thread only.
barrier: synchronizes the threads of a parallel region; every thread stops when it reaches the barrier and continues only after all threads have reached it.
atomic: specifies that a memory location is updated atomically.
master: specifies that a code block is executed by the master thread only.
ordered: specifies that the iterations of a parallel loop are executed in sequential order.
threadprivate: specifies that a variable is private to each thread.
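A minimal sketch combining several of these directives (the array contents and the thread-id sum are illustrative):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int data[100], count = 0;
    for (int i = 0; i < 100; i++) data[i] = i % 7;

    #pragma omp parallel
    {
        // for: iterations are divided among the team;
        // each iteration is independent, as the directive requires
        #pragma omp for
        for (int i = 0; i < 100; i++)
            data[i] *= 2;            // implicit barrier at the end of the loop

        // critical: only one thread at a time updates the shared counter
        #pragma omp critical
        count += omp_get_thread_num();

        // barrier: wait until every thread has made its update
        #pragma omp barrier

        // single: exactly one thread prints the result
        #pragma omp single
        printf("sum of thread ids = %d\n", count);
    }
    return 0;
}
```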
OpenMP common functions (the lock functions are demonstrated in the sketch after this list)
omp_get_num_procs: returns the number of processors available to the program.
omp_get_num_threads: returns the number of active threads in the current parallel region.
omp_get_thread_num: returns the calling thread's number.
omp_set_num_threads: sets the number of threads to use when executing code in parallel.
omp_init_lock: initializes a simple lock.
omp_set_lock: acquires a lock.
omp_unset_lock: releases a lock; must be paired with omp_set_lock.
omp_destroy_lock: the counterpart of omp_init_lock; destroys a lock.
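A minimal sketch of the lock functions protecting a shared counter (the counter and iteration count are illustrative; the lock API is standard OpenMP):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_lock_t lock;
    int counter = 0;

    omp_init_lock(&lock);           // initialize the lock before first use

    #pragma omp parallel num_threads(4)
    {
        for (int i = 0; i < 1000; i++) {
            omp_set_lock(&lock);    // acquire: only one thread proceeds
            counter++;              // protected update of shared state
            omp_unset_lock(&lock);  // release: paired with omp_set_lock
        }
    }

    omp_destroy_lock(&lock);        // destroy: paired with omp_init_lock
    printf("counter = %d (expected 4000)\n", counter);
    return 0;
}
```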
OpenMP environment variables (see the sketch after this list)
●OMP_NUM_THREADS: specifies the default number of threads to use when executing parallel regions.
●OMP_PROC_BIND: determines whether threads may migrate between processors. When set to true, the runtime does not migrate threads between processors.
●OMP_THREAD_LIMIT: specifies the maximum number of OpenMP threads the program may create.
●OMP_NESTED: determines whether nested parallelism is supported; true enables it.
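A small sketch showing how these variables change a program's behavior without recompiling (the invocation lines in the comment are illustrative):

```c
#include <stdio.h>
#include <omp.h>

// Try:   OMP_NUM_THREADS=2 ./env_demo
// then:  OMP_NUM_THREADS=8 OMP_PROC_BIND=true ./env_demo
int main(void) {
    #pragma omp parallel
    {
        #pragma omp single   // one thread reports the team size
        printf("team size = %d\n", omp_get_num_threads());
    }
    return 0;
}
```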
MPI technology
●MPI is a cross-language communication protocol. It supports both point-to-point and broadcast communication.
●MPI is a message-passing application programming interface; it includes protocol and semantic specifications that state how its features must behave in any implementation.
●MPI's goals are high performance, scalability, and portability.
●Unlike OpenMP parallel programs, MPI is a parallel programming technology based on message passing. The MPI standard defines a set of portable programming interfaces.
MPI message passing

Message passing in MPI consists of three phases:
1. Message assembly: the data to be sent is taken from the send buffer and a message envelope is attached, forming a complete message.
2. Message transfer: the assembled message is transmitted from the sender to the receiver.
3. Message disassembly: the data is extracted from the received message and placed into the receive buffer.
Communicator
●A communicator defines a group of processes that can send messages to one another. Within this group, each process is assigned a unique sequence number, called its rank; processes communicate explicitly by specifying ranks.
●Communication is built on send and receive operations between processes. A process can send a message to another process by specifying the destination process's rank and a unique message tag. The receiver can ask for a message with a specific tag (or ignore the tag entirely and receive any message), and then process the received data. A communication like this, involving one sender and one receiver, is called point-to-point communication; a minimal example follows below.
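A minimal point-to-point sketch in C, assuming a standard MPI installation (the payload value and tag 42 are arbitrary; the calls are standard MPI):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's rank in the group

    if (rank == 0) {
        int payload = 12345;
        // envelope: destination rank 1, tag 42, communicator MPI_COMM_WORLD
        MPI_Send(&payload, 1, MPI_INT, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload;
        // receive only a message from rank 0 carrying tag 42
        MPI_Recv(&payload, 1, MPI_INT, 0, 42, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", payload);
    }

    MPI_Finalize();
    return 0;
}
```

Compile with mpicc and run with at least two processes, e.g. `mpirun -np 2 ./p2p`.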

Nvidia NCCL technology
●NCCL is short for Nvidia Collective Communication Library. It implements multi-GPU collective communication primitives (all-gather, reduce, broadcast), and Nvidia has optimized it heavily, achieving high communication bandwidth over PCIe, NVLink, and InfiniBand.
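As a rough sketch of how the library is driven, here is a minimal single-process all-reduce across two GPUs, assuming CUDA and NCCL are installed and two devices are visible (the device count, buffer size, and zero-initialized data are all illustrative; error checking is omitted for brevity):

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define NDEV  2        // illustrative: two GPUs driven by one process
#define COUNT 1024     // illustrative: elements per buffer

int main(void) {
    int devs[NDEV] = {0, 1};
    ncclComm_t comms[NDEV];
    cudaStream_t streams[NDEV];
    float *sendbuf[NDEV], *recvbuf[NDEV];

    // create one communicator per GPU, all inside a single process
    ncclCommInitAll(comms, NDEV, devs);

    for (int i = 0; i < NDEV; i++) {
        cudaSetDevice(devs[i]);
        cudaMalloc((void **)&sendbuf[i], COUNT * sizeof(float));
        cudaMalloc((void **)&recvbuf[i], COUNT * sizeof(float));
        cudaMemset(sendbuf[i], 0, COUNT * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // group the per-GPU calls so they execute as one collective
    ncclGroupStart();
    for (int i = 0; i < NDEV; i++)
        ncclAllReduce(sendbuf[i], recvbuf[i], COUNT, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    // wait for the collective to finish on each stream, then clean up
    for (int i = 0; i < NDEV; i++) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("all-reduce complete\n");
    return 0;
}
```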



