[Literature reading] TENET: A Framework for Modeling Tensor Dataflow Based on Relation-centric Notation
2022-07-19 01:52:00 【feimla】
Title: TENET: A Framework for Modeling Tensor Dataflow Based on Relation-centric Notation
Year: 2021
Venue: ISCA
Institution: Peking University
TENET
Summary
The inputs are an iteration domain $D_S$, an access function $A_{S,F}$, and a PE interconnection. We define an affine transformation, represented by an $m \times n$ matrix. This matrix is the dataflow to be optimized, i.e., the relation between S and the spacetime-stamps. From these we obtain a series of relations: the data assignment relation (the relation between spacetime-stamps and tensor elements, derived from the dataflow) and the spacetime-stamp map relation (derived from the PE interconnection and used as a constraint in the performance calculation). Performance is then computed from these relations. The constraints that appear when computing the relations become the constraints of the optimization problem, and the optimal solution is found by integer programming.
I. Introduction
A typical spatial architecture consists of an array of processing elements (PEs) and on-chip scratchpad memory. The main feature of a spatial architecture is its wide choice of hardware dataflows.
A hardware dataflow describes: 1) how loop instances are distributed across the PE array, and 2) the order in which loop instances execute within each PE.
Summary of some existing work: hardware dataflow is crucial for achieving high throughput and low latency, because it determines PE utilization, the data access pattern, and the on-chip bandwidth requirement. Different tensor computations prefer different hardware dataflows. For example, Google's Tensor Processing Unit (TPU) [22] connects PEs in a systolic dataflow, where each PE is responsible for one multiply-accumulate operation, whereas Cambricon [31] connects PEs through a multicast communication network, where each PE executes a dot product. Other spatial architectures, such as DySER [19] and Plasticine [42], integrate PEs and their interconnect in a flexible way and can therefore support a wider range of applications.
At present, a formal notation is still strongly desired to represent hardware dataflow. Such a notation should systematically cover the entire dataflow design space and enable simple, accurate performance modeling. Two kinds exist: compute-centric notation, which cannot capture data movement and reuse, and the data-centric notation used by MAESTRO, whose model has shortcomings.
This paper defines TENET, whose core is a relation-centric notation. It defines the following relations: 1) the relation between loop instances and the PEs that execute them, 2) the relation between loop instances and their execution order within a PE, 3) the relation between PEs and the tensor elements assigned to them, and 4) the connections between PEs in the interconnection network. Overall, TENET can estimate a range of hardware metrics, including data reuse, latency, PE communication bandwidth, and on-chip memory bandwidth.
II. Background
A. Spatial Architectures
The PE also contains register files for data storage.
A spatial architecture usually has a three-level memory hierarchy: PE registers, the on-chip scratchpad, and off-chip memory. For simplicity, we make two assumptions when modeling hardware behavior. First, the ALU can perform one multiply-accumulate (MAC) operation. Second, transferring data between adjacent PEs over the interconnect takes one cycle.
When designing a dataflow, data reuse is the key factor for achieving high performance and low energy consumption. It can be further divided into temporal reuse and spatial reuse. Temporal reuse occurs when the same data is reused in different cycles; spatial reuse occurs when the same data is reused by different PEs.
B. Notation Basics
In this paper, TENET supports tensor applications with perfectly nested loops and a single unconditional statement.
What is a perfectly nested loop?
A perfectly nested loop is a nested loop, where all the assignment instructions are inside the innermost loop.

Iteration domain: for example, $D_S = \{ S[i,j] : 0 \leq i < 4,\ 0 \leq j < 3 \}$.
Access function: given a loop instance, the access function returns the tensor elements accessed by the statement S. We use a relation to represent the access function of tensor F: $A_{S,F} = \{ S[\vec{n}] \rightarrow F[\vec{f}] \}$.
For example, in the 1D-CONV operator of Figure 1, the access function of tensor Y is $\{ S[i,j] \rightarrow Y[i] : 0 \leq i < 4,\ 0 \leq j < 3 \}$, which means that loop instance S[i, j] accesses tensor element Y[i].
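As a small illustration of the two definitions above, the iteration domain and the access function of Y can be enumerated explicitly. This is only a sketch in plain Python; the helper names `iteration_domain` and `access_Y` are made up for this note and are not part of TENET.

```python
from itertools import product

def iteration_domain(i_max=4, j_max=3):
    """Enumerate D_S = { S[i, j] : 0 <= i < i_max, 0 <= j < j_max }."""
    return [(i, j) for i, j in product(range(i_max), range(j_max))]

def access_Y(instance):
    """Access function A_{S,Y} = { S[i, j] -> Y[i] } (hypothetical helper)."""
    i, _j = instance
    return ("Y", i)

domain = iteration_domain()
# The relation S[i, j] -> Y[i], stored as a dictionary.
accesses = {inst: access_Y(inst) for inst in domain}
# Distinct elements of Y touched by the whole loop nest.
touched = set(accesses.values())

print(len(domain), sorted(touched))
```

Each of the four elements of Y is touched by three loop instances (one per value of j), which is the kind of reuse opportunity the later sections quantify.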
C. Limitations of Existing Dataflow Notations
III. TENET Overview
IV. Relation-centric Notation
TENET defines four relations: the mapping from loop instances to the PE array (Section IV-A), data assignment (Section IV-B), PE interconnection (Section IV-C), and the mapping between different spacetime-stamps (Section IV-D). With the relation-centric notation, we can determine exactly where and when each loop instance executes in the spatial architecture, where and when tensor elements are accessed, and how tensor elements move across PEs.
A. Dataflow Relation
Definition 1 (Dataflow): Given a statement S with iteration domain $D_S$ and iteration vector $\vec{n}$, the dataflow is defined as
$$\Theta_{S,D} = \{ S[\vec{n}] \rightarrow (PE[\vec{p}] \,|\, T[\vec{t}]) \}, \quad S[\vec{n}] \in D_S$$
It assigns the loop instance S[n] to a spacetime-stamp, which is a pair of a space-stamp PE[p] and a time-stamp T[t]. The subscript D of Θ should refer to the set of spacetime-stamps, not to the iteration domain $D_S$ of S. Section IV-D later confirms this.
In the figure above, $\Theta_{S,D} = \{ S[i,j,k] \rightarrow (PE[i,j] \,|\, T[i+j+k]) \}$. In general, $\Theta$ amounts to an affine transformation from the loop iterators of S to spacetime-stamps.
Dataflow design space: n is the number of loop levels and $n \times n$ is the size of the affine transformation matrix. Each element of the matrix is 0 or 1, so the count is a power of 2 and the design space size is $O(2^{n^2})$. I think the transformation should instead be an $m \times n$ matrix multiplied by the $n \times 1$ index vector of S, giving an $m \times 1$ vector, where m is the total number of spacetime-stamp dimensions. In the example above m = 3: PE has two dimensions and T has one.
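The matrix view of the dataflow can be made concrete with a few lines of Python. This is a sketch of the $m \times n$ reading suggested in these notes, not code from the paper: the matrix `THETA` encodes the example $S[i,j,k] \rightarrow (PE[i,j] \,|\, T[i+j+k])$, and the names `apply_dataflow`, `spacetime_stamp` and `design_space_size` are hypothetical.

```python
def apply_dataflow(matrix, n_vec):
    """Multiply an m x n 0/1 matrix by the n-dimensional loop index vector."""
    return tuple(sum(c * x for c, x in zip(row, n_vec)) for row in matrix)

# Rows 1-2 produce the space-stamp PE[i, j]; row 3 produces the time-stamp T[i+j+k].
THETA = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 1],
]
PE_DIMS = 2  # the first two output dimensions are the space-stamp

def spacetime_stamp(n_vec, matrix=THETA, pe_dims=PE_DIMS):
    """Split the transformed vector into (space-stamp, time-stamp)."""
    stamp = apply_dataflow(matrix, n_vec)
    return stamp[:pe_dims], stamp[pe_dims:]

def design_space_size(m, n):
    """Number of m x n matrices with 0/1 entries."""
    return 2 ** (m * n)

print(spacetime_stamp((1, 2, 3)))  # ((1, 2), (6,))
print(design_space_size(3, 3))     # 512
```

With m = n the count reduces to $2^{n^2}$, which matches the design space size quoted above.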
B. Data Assignment Relation
Definition 2 (Data assignment): Given a dataflow $\Theta_{S,D}$, the data assignment is defined as:
The subscript D of A is the set of spacetime-stamps. Taking Figure 3 as an example, the data assignment of tensor Y is:
In this example, we observe that each PE always computes the same output tensor element (Y) at different time-stamps. This means that the tensor element Y[i, j] stays stationary and is reused across time-stamps until the computation is complete.
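The formulas for this definition did not survive in this copy, so the sketch below rests on an assumption: the data assignment is taken to be the composition of the inverse dataflow with the access function, i.e. each spacetime-stamp is mapped to the tensor elements accessed by the loop instances scheduled there. The loop bound `N = 2` and the GEMM-style access `Y[i, j]` are also assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import product

N = 2  # hypothetical loop bound for i, j and k

def dataflow(i, j, k):
    """Theta: S[i, j, k] -> (PE[i, j] | T[i + j + k])."""
    return ((i, j), i + j + k)

def access_Y(i, j, k):
    """Assumed access function S[i, j, k] -> Y[i, j]."""
    return ("Y", i, j)

def data_assignment(access):
    """Map every spacetime-stamp to the tensor elements accessed there."""
    assignment = defaultdict(set)
    for i, j, k in product(range(N), repeat=3):
        assignment[dataflow(i, j, k)].add(access(i, j, k))
    return dict(assignment)

A_DY = data_assignment(access_Y)

# Collect, per PE, every element of Y it touches over all time-stamps.
per_pe = defaultdict(set)
for (pe, _t), elems in A_DY.items():
    per_pe[pe] |= elems

for pe in sorted(per_pe):
    print(pe, sorted(per_pe[pe]))
```

Every PE ends up with exactly one element of Y across all of its time-stamps, which is the stationary behaviour described above.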
C. Interconnection Relation
Definition 3 (PE interconnection): Given a PE array, the interconnection is a set in which each element describes a mapping from one PE to another PE.
c1, …, ck describe the conditions of the PE topology. A PE can only obtain data from other PEs or from the scratchpad. In this paper, three common PE interconnection topologies are modeled:

D. Spacetime-stamp Map Relation
Definition 4 (Spacetime-map): Given a dataflow, the spacetime-map is a relation that maps one set of spacetime-stamps D to another set of spacetime-stamps D'. Both D and D' come from the given dataflow.
We can assume that the time distance between D and D' is within 1, and that the PEs specified by D and D' are interconnected. This lets us write down some maps. Taking Figure 3 as an example:
Combined with the data assignment, we have:
The figure above illustrates the reuse of Y[0,0].
The figure above illustrates the reuse of A[0,0] and B[0,0].
In practice, we can: 1) restrict D and D' to contain the same PE, which captures temporal reuse of data; 2) restrict D and D' to contain connected PEs, which captures spatial reuse of data.
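The two restrictions can be tried out on a toy example. The sketch below is illustrative only and makes several assumptions that are not in the original: a GEMM-style statement with accesses Y[i, j], A[i, k] and B[k, j]; a loop bound of 2; a time distance of exactly 1; and an interconnect in which PE[i, j] forwards data to PE[i, j+1] and PE[i+1, j].

```python
from itertools import product

N = 2  # hypothetical loop bound for i, j and k

def dataflow(i, j, k):
    """Theta: S[i, j, k] -> (PE[i, j] | T[i + j + k])."""
    return ((i, j), i + j + k)

# Assumed GEMM-style access functions.
ACCESS = {
    "Y": lambda i, j, k: ("Y", i, j),
    "A": lambda i, j, k: ("A", i, k),
    "B": lambda i, j, k: ("B", k, j),
}

def data_assignment(tensor):
    """Map every spacetime-stamp to the elements of `tensor` accessed there."""
    out = {}
    for i, j, k in product(range(N), repeat=3):
        out.setdefault(dataflow(i, j, k), set()).add(ACCESS[tensor](i, j, k))
    return out

def neighbours(pe):
    """Assumed interconnect: each PE forwards to its right and lower neighbour."""
    i, j = pe
    return [(i, j + 1), (i + 1, j)]

def reuse_counts(tensor):
    """Count elements found again one time step later (temporal, spatial)."""
    assign = data_assignment(tensor)
    temporal = spatial = 0
    for (pe, t), elems in assign.items():
        # Same PE, next time-stamp: temporal reuse.
        temporal += len(elems & assign.get((pe, t + 1), set()))
        # Connected PE, next time-stamp: spatial reuse.
        for nb in neighbours(pe):
            spatial += len(elems & assign.get((nb, t + 1), set()))
    return temporal, spatial

for name in ("Y", "A", "B"):
    print(name, reuse_counts(name))
```

Under these assumptions Y shows only temporal reuse (it stays in place), while A and B show only spatial reuse (they are passed along rows and columns respectively).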
V. Performance Model
A. Data reuse and Volume Model

In Table 2, sum denotes the number of elements in a set. All metrics are defined with respect to a particular tensor F.
TotalVolume: the data used by every computation. For example, the TotalVolume of tensor A in Figure 3 can be calculated as
ReuseVolume: the data that is reused.
UniqueVolume: the new data needed at each point in time, fetched from the scratchpad memory.
ReuseFactor: the average number of times that data fetched from the scratchpad memory is reused.
SpatialReuseVolume: The SpatialReuseVolume sums up the volume of data with spatial reuse at different space-stamps, where D and D’ contain interconnected PEs only.
TemporalReuseVolume: only refers to the temporal reuse at the same PE, where D and D’ contain the same PE.
Therefore, ReuseVolume = SpatialReuseVolume + TemporalReuseVolume.
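The relationships between these metrics can be written down as a small helper. Table 2 is not preserved in this copy, so two of the formulas are assumptions inferred from the verbal definitions: UniqueVolume = TotalVolume - ReuseVolume, and ReuseFactor = TotalVolume / UniqueVolume. The class name `VolumeMetrics` and the sample numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VolumeMetrics:
    """Per-tensor volume metrics, all measured in tensor elements."""
    total: int           # TotalVolume
    spatial_reuse: int   # SpatialReuseVolume
    temporal_reuse: int  # TemporalReuseVolume

    @property
    def reuse(self):
        # ReuseVolume = SpatialReuseVolume + TemporalReuseVolume (from the text).
        return self.spatial_reuse + self.temporal_reuse

    @property
    def unique(self):
        # Assumption: whatever is not reused must be fetched from the scratchpad.
        return self.total - self.reuse

    @property
    def reuse_factor(self):
        # Assumption: average number of uses per element fetched from the scratchpad.
        if self.unique == 0:
            raise ValueError("UniqueVolume is zero; ReuseFactor is undefined")
        return self.total / self.unique

# Hypothetical numbers: 8 accesses in total, 4 of them served by temporal reuse.
y = VolumeMetrics(total=8, spatial_reuse=0, temporal_reuse=4)
print(y.reuse, y.unique, y.reuse_factor)  # 4 4 2.0
```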
B. Latency and Bandwidth Model
The latency of a dataflow includes the communication delay and the computation delay of each PE. We assume that the buffers, networks, and arithmetic units work in a pipelined fashion, and that techniques such as double buffering are used to hide latency. The overall latency is then simply the maximum of the communication delay and the computation delay.
Communication delay between the PE array and the scratchpad:
Computation delay:
where $D_S$ is the iteration domain, sum gives the total number of instances in the iteration domain, and $Util_{PE}$ is the average PE utilization.
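A rough sketch of how the pieces fit together follows. Only the "maximum of the two delays" rule comes from the text; the two delay formulas are not preserved in this copy, so the versions below are assumptions: computation delay as the number of loop instances divided by the number of effectively busy PEs, and communication delay as a data volume divided by a bandwidth. All function and parameter names are hypothetical.

```python
def compute_delay(num_instances, num_pes, util_pe):
    """Assumed form: sum(D_S) / (Util_PE * number of PEs), in cycles."""
    if not 0 < util_pe <= 1:
        raise ValueError("util_pe must be in (0, 1]")
    return num_instances / (util_pe * num_pes)

def communication_delay(volume, bandwidth):
    """Assumed form: elements to transfer / elements per cycle."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    return volume / bandwidth

def overall_delay(num_instances, num_pes, util_pe, volume, bandwidth):
    """With pipelining and double buffering, the slower of the two dominates."""
    return max(
        compute_delay(num_instances, num_pes, util_pe),
        communication_delay(volume, bandwidth),
    )

# Hypothetical numbers: 8 loop instances on 4 fully used PEs,
# 12 elements moved at 4 elements per cycle.
print(overall_delay(8, 4, 1.0, 12, 4))  # 3.0
```

In this made-up case the dataflow is communication bound: computation needs 2 cycles but data movement needs 3.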
IBW(interconnection bandwidth):
SBW(scratchpad bandwidth):
C. Model implementation
Implemented in C++ using the ISL and Barvinok libraries.
VI. Evaluation
Benchmarks: GEMM, 2D-CONV, matrix multiplication chain (MMc), matricized tensor times Khatri-Rao product (MTTKRP), and Jacobi-2D.





