Documented restriction that operator-overloads cannot be __global__ functions in .
Removed guidance to break 8-byte shuffles into two 4-byte instructions. 8-byte shuffle variants are provided since CUDA 9.0.
Passing __restrict__ references to __global__ functions is now supported. Updated comment in .
Documented CUDA_ENABLE_CRC_CHECK in .
now support matrix products with m=32, n=8, k=16 and m=8, n=32, k=16 in addition to m=n=k=16.
Added new Unified Memory sections.
Driven by the insatiable market demand for realtime, high-definition 3D graphics, the programmable Graphic Processor Unit
or GPU has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational horsepower and
very high memory bandwidth, as illustrated by Figure 1 and Figure 2.
Figure 1. Floating-Point Operations per Second for the CPU and GPU
Figure 2. Memory Bandwidth for the CPU and GPU
The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized
for compute-intensive, highly parallel computation - exactly what graphics rendering is about - and therefore designed such
that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 3.
Figure 3. The GPU Devotes More Transistors to Data Processing
More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations
- the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic
operations to memory operations. Because the same program is executed for each data element, there is a lower requirement
for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory
access latency can be hidden with calculations instead of big data caches.
Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets
can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices
are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images,
video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel
processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel
processing, from general signal processing or physics simulation to computational finance or computational biology.
In November 2006, NVIDIA introduced CUDA(R), a general
purpose parallel computing platform and programming model that leverages
the parallel compute engine in NVIDIA GPUs to solve many complex
computational problems in a more efficient way than on a CPU.
CUDA comes with a software environment that allows developers to use C
as a high-level programming language. As illustrated by Figure 4,
other languages, application programming interfaces, or directives-based
approaches are supported, such as FORTRAN, DirectCompute, OpenACC.
Figure 4. GPU Computing Applications. CUDA is designed to support various languages and application
programming interfaces.
The advent of multicore CPUs and manycore GPUs means that mainstream
processor chips are now parallel systems. Furthermore, their parallelism
continues to scale with Moore's law. The challenge is to develop
application software that transparently scales its parallelism to
leverage the increasing number of processor cores, much as 3D graphics
applications transparently scale their parallelism to manycore GPUs with
widely varying numbers of cores.
The CUDA parallel programming model is designed to overcome this
challenge while maintaining a low learning curve for programmers familiar
with standard programming languages such as C.
At its core are three key abstractions - a hierarchy of thread groups,
shared memories, and barrier synchronization - that are simply exposed to
the programmer as a minimal set of language extensions.
These abstractions provide fine-grained data parallelism and thread
parallelism, nested within coarse-grained data parallelism and task
parallelism. They guide the programmer to partition the problem into
coarse sub-problems that can be solved independently in parallel by
blocks of threads, and each sub-problem into finer pieces that can be
solved cooperatively in parallel by all threads within the block.
This decomposition preserves language expressivity by allowing threads
to cooperate when solving each sub-problem, and at the same time enables
automatic scalability. Indeed, each block of threads can be scheduled on
any of the available multiprocessors within a GPU, in any order,
concurrently or sequentially, so that a compiled CUDA program can execute
on any number of multiprocessors as illustrated by Figure 5, and only
the runtime system needs to know the physical multiprocessor count.
This scalable programming model allows the GPU architecture to span a
wide market range by simply scaling the number of multiprocessors and
memory partitions: from the high-performance enthusiast GeForce GPUs and
professional Quadro and Tesla computing products to a variety of
inexpensive, mainstream GeForce GPUs (see
for a list of all CUDA-enabled GPUs).
Figure 5. Automatic Scalability
Note: A GPU is built around an array of Streaming Multiprocessors (SMs) (see for
more details). A multithreaded program is partitioned into blocks of
threads that execute independently from each other, so that a GPU with
more multiprocessors will automatically execute the program in less
time than a GPU with fewer multiprocessors.
This document is organized into the following chapters:
is a general introduction to CUDA.
outlines the CUDA programming model.
describes the programming interface.
describes the hardware implementation.
gives some guidance on how to achieve maximum performance.
lists all CUDA-enabled devices.
is a detailed description of all extensions to the C language.
describes synchronization primitives for various groups of CUDA threads.
describes how to launch and synchronize one kernel from another.
lists the mathematical functions supported in CUDA.
lists the C++ features supported in device code.
gives more details on texture fetching
gives the technical specifications of various devices, as well as more architectural details.
introduces the low-level driver API.
lists all the CUDA environment variables.
introduces the Unified Memory programming model.
This chapter introduces the main concepts behind the CUDA programming model by outlining how they are exposed in C. An extensive
description of CUDA C is given in .
Full code for the vector addition example used in this chapter and the next can be found in the vectorAdd CUDA sample.
CUDA C extends C by allowing the programmer to define C functions,
called kernels, that, when called, are executed N times in
parallel by N different CUDA threads, as opposed to only
once like regular C functions.
A kernel is defined using the __global__ declaration
specifier and the number of CUDA threads that execute that kernel for a
given kernel call is specified using a new
<<<...>>> execution
configuration syntax (see ). Each thread that executes the kernel
is given a unique thread ID that is accessible within the
kernel through the built-in threadIdx variable.
As an illustration, the following sample code adds two vectors
A and B of size N and stores the
result into vector C:
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
int main()
{
    ...
    VecAdd<<<1, N>>>(A, B, C);
}
Here, each of the N threads that execute
VecAdd() performs one pair-wise addition.
For convenience, threadIdx is a 3-component vector, so
that threads can be identified using a one-dimensional, two-dimensional,
or three-dimensional thread index, forming a
one-dimensional, two-dimensional, or three-dimensional block of threads, called a thread block. This
provides a natural way to invoke computation across the elements in a
domain such as a vector, matrix, or volume.
The index of a thread and its thread ID relate to each other in a
straightforward way: For a one-dimensional block, they are the same; for
a two-dimensional block of size (Dx, Dy), the
thread ID of a thread of index (x, y) is (x + y
Dx); for a three-dimensional block of size
(Dx, Dy, Dz), the thread ID of a
thread of index (x, y, z) is (x + y Dx + z Dx Dy).
As an example, the following code adds two matrices A and
B of size NxN and stores the result into matrix C:
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}
int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
There is a limit to the number of threads per block, since all threads
of a block are expected to reside on the same processor core and must
share the limited memory resources of that core. On current GPUs, a
thread block may contain up to 1024 threads.
However, a kernel can be executed by multiple equally-shaped thread
blocks, so that the total number of threads is equal to the number of
threads per block times the number of blocks.
Blocks are organized into a one-dimensional, two-dimensional, or
three-dimensional grid of thread blocks as illustrated by
. The number of
thread blocks in a grid is usually dictated by the size of the data being
processed or the number of processors in the system, which it can greatly exceed.
Figure 6. Grid of Thread Blocks
The number of threads per block and the number of blocks per grid
specified in the <<<...>>> syntax can be of
type int or dim3. Two-dimensional
blocks or grids can be specified as in the example above.
Each block within the grid can be identified by a one-dimensional,
two-dimensional, or three-dimensional index accessible within the kernel
through the built-in blockIdx variable. The dimension of
the thread block is accessible within the kernel through the built-in
blockDim variable.
Extending the previous MatAdd() example to handle
multiple blocks, the code becomes as follows.
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}
int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
A thread block size of 16x16 (256 threads), although arbitrary in this
case, is a common choice. The grid is created with enough blocks to have
one thread per matrix element as before. For simplicity, this example
assumes that the number of threads per grid in each dimension is evenly
divisible by the number of threads per block in that dimension, although
that need not be the case.
Thread blocks are required to execute independently: It must be possible
to execute them in any order, in parallel or in series. This independence
requirement allows thread blocks to be scheduled in any order across any
number of cores as illustrated by , enabling programmers to
write code that scales with the number of cores.
Threads within a block can cooperate by sharing data through some
shared memory and by synchronizing their execution to
coordinate memory accesses. More precisely, one can specify
synchronization points in the kernel by calling the
__syncthreads()
__syncthreads()
acts as a barrier at which all threads in the
block must wait before any is allowed to proceed.
gives an example of
using shared memory. In addition to __syncthreads(),
provides a rich set of thread-synchronization
primitives.
For efficient cooperation, the shared memory is expected to be a
low-latency memory near each processor core (much like an L1 cache) and
__syncthreads() is expected to be lightweight.
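As a minimal sketch of this pattern (the kernel name, buffer size, and launch configuration are illustrative and not part of this guide's sample set), the following kernel stages data in shared memory and uses __syncthreads() so that no thread reads another thread's element before it has been written; it could be launched as ReverseInBlock<<<1, 64>>>(d_data) with a single block of 64 threads:
__global__ void ReverseInBlock(float* data)
{
    __shared__ float buffer[64];          // one buffer per thread block
    int i = threadIdx.x;
    buffer[i] = data[i];                  // each thread stages one element
    __syncthreads();                      // wait until the whole block has written
    data[i] = buffer[blockDim.x - 1 - i]; // now safe to read another thread's element
}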
CUDA threads may access data from multiple memory spaces during their execution as illustrated by . Each thread has private local memory. Each thread block has shared memory visible to all threads of the block and with the
same lifetime as the block. All threads have access to the same global memory.
There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The
global, constant, and texture memory spaces are optimized for different memory usages (see
). Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats (see ).
The global, constant, and texture memory spaces are persistent across kernel launches by the same application.
Figure 7. Memory Hierarchy
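As a brief sketch of how these memory spaces appear in source code (the variable and kernel names are illustrative and assume a one-dimensional block of at most 128 threads), the declarations below place data in global, constant, and shared memory respectively, while per-thread local variables ordinarily live in registers:
__device__   float globalArray[256];   // global memory, visible to all threads and kernels
__constant__ float coefficients[16];   // constant memory, read-only from device code
__global__ void ScaleAndStage(float* out)
{
    __shared__ float tile[128];          // shared memory, one copy per thread block
    float scale = coefficients[0];       // per-thread value, ordinarily kept in a register
    tile[threadIdx.x] = globalArray[threadIdx.x] * scale;
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}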
As illustrated by , the CUDA programming model assumes that the CUDA threads execute on a physically separate device that operates as a coprocessor to the host running the C program. This is the case, for example, when the kernels execute on a GPU and the rest of the C program executes on a CPU.
The CUDA programming model also assumes that both the host and the device maintain their own separate memory spaces in DRAM,
referred to as host memory and device memory, respectively. Therefore, a program manages the global, constant, and texture memory spaces visible to kernels through calls
to the CUDA runtime (described in ). This includes device memory allocation and deallocation as well as data transfer between host and device memory.
Unified Memory provides managed memory to bridge the host and device memory spaces. Managed memory is accessible from all CPUs and GPUs in the system as a single,
coherent memory image with a common address space. This capability enables oversubscription of device memory and can greatly
simplify the task of porting applications by eliminating the need to explicitly mirror data on host and device. See
for an introduction to Unified Memory.
Figure 8. Heterogeneous Programming
Note: Serial code executes on the host while parallel code executes on the device.
The compute capability of a device is represented by a
version number, also sometimes called its "SM version".
This version
number identifies the features supported by the GPU hardware and is
used by applications at runtime to determine which hardware features
and/or instructions are available on the present GPU.
The compute capability comprises a major revision number X and a minor
revision number Y and is denoted by X.Y.
Devices with the same major revision number are
of the same core architecture. The major revision number is 7 for devices
based on the Volta architecture, 6 for devices based on the
Pascal architecture, 5 for
devices based on the Maxwell architecture, 3 for devices
based on the Kepler architecture, 2 for devices based on
the Fermi architecture, and 1 for devices based on the
Tesla architecture.
The minor revision number corresponds to an incremental improvement
to the core architecture, possibly including new features.
lists of all
CUDA-enabled devices along with their compute capability.
technical specifications of each compute capability.
Note: The compute capability version of a particular GPU should not be
confused with the CUDA version (e.g., CUDA 7.5, CUDA 8, CUDA 9),
which is the version of the CUDA software platform.
The CUDA platform is used by application developers to create applications that
run on many generations of GPU architectures, including future GPU
architectures yet to be invented.
While new versions of the CUDA
platform often add native support for a new GPU architecture by
supporting the compute capability version of that architecture, new
versions of the CUDA platform typically also include software features
that are independent of hardware generation.
The Tesla and Fermi architectures are no longer supported starting with CUDA 7.0 and CUDA 9.0, respectively.
CUDA C provides a simple path for users familiar with the C programming
language to easily write programs for execution by the device.
It consists of a minimal set of extensions to the C language and a
runtime library.
The core language extensions have been introduced in . They allow programmers to define a kernel
as a C function and use some new syntax to specify the grid and block
dimension each time the function is called. A complete description of all
extensions can be found in . Any
source file that contains some of these extensions must be compiled with
nvcc as outlined in .
The runtime is introduced in . It
provides C functions that execute on the host to allocate and deallocate
device memory, transfer data between host memory and device memory,
manage systems with multiple devices, etc. A complete description of the
runtime can be found in the CUDA reference manual.
The runtime is built on top of a lower-level C API, the CUDA driver API,
which is also accessible by the application. The driver API provides an
additional level of control by exposing lower-level concepts such as CUDA
contexts - the analogue of host processes for the device - and CUDA
modules - the analogue of dynamically loaded libraries for the device.
Most applications do not use the driver API as they do not need this
additional level of control and when using the runtime, context and
module management are implicit, resulting in more concise code. The
driver API is introduced in
described in the reference manual.
Kernels can be written using the CUDA instruction set architecture, called PTX, which is described in the PTX reference manual. It is however usually more effective to use a high-level programming language
such as C. In both cases, kernels must be compiled into binary code by nvcc to execute on the device.
nvcc is a compiler driver that simplifies the process of compiling C or PTX code: It provides simple and familiar command line options and executes them by invoking the collection of tools that implement
the different compilation stages. This section gives an overview of nvcc workflow and command options. A complete description can be found in the nvcc user manual.
Source files compiled with nvcc can include a mix of host code (i.e., code that executes on the host) and device code (i.e., code that executes on the device).
nvcc's basic workflow consists in separating device code from host code and then:
compiling the device code into an assembly form (PTX code) and/or binary form (cubin object),
and modifying the host code by replacing the <<<...>>> syntax introduced in
(and described in more details in ) by the necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code and/or cubin object.
The modified host code is output either as C code that is left to be compiled using another tool or as object code directly
by letting nvcc invoke the host compiler during the last compilation stage.
Applications can then:
Either link to the compiled host code (this is the most common case),
Or ignore the modified host code (if any) and use the CUDA driver API (see ) to load and execute the PTX code or cubin object.
Binary code is architecture-specific. A cubin object is
generated using the compiler option
-code that specifies the targeted
architecture: For example, compiling with
-code=sm_35 produces binary code for
devices of compute capability 3.5. Binary compatibility is guaranteed from one minor revision
to the next one, but not from one minor revision to the previous one or
across major revisions. In other words, a cubin object
generated for compute capability X.y will only execute
on devices of compute capability X.z where z≥y.
Some PTX instructions are only supported on devices of higher compute capabilities. For example,
warp shuffle functions are only supported on devices of compute capability 3.0 and above. The -arch compiler option specifies the compute capability that is assumed when compiling C to PTX code.
So, code that contains warp shuffle, for example, must be compiled with -arch=compute_30 (or higher).
PTX code produced for some specific compute capability can
always be compiled to binary code of greater or equal compute capability.
Note that a binary compiled from an earlier PTX version may not make
use of some hardware features. For example, a binary targeting devices
of compute capability 7.0 (Volta) compiled from PTX generated for
compute capability 6.0 (Pascal) will not make use of Tensor Core
instructions, since these were not available on Pascal. As a result,
the final binary may perform worse than would be possible if the binary
were generated using the latest version of PTX.
To execute code on devices of specific compute capability, an
application must load binary or PTX code that is compatible
with this compute capability as described in
In particular, to be able to execute code on future architectures with
higher compute capability (for which no binary code can be generated
yet), an application must load PTX code that will be
just-in-time compiled for these devices (see ).
Which PTX and binary code gets embedded in a CUDA C
application is controlled by the -arch and
-code compiler options or the
-gencode compiler option as detailed in
the nvcc user manual. For example,
-gencode arch=compute_35,code=sm_35
-gencode arch=compute_50,code=sm_50
-gencode arch=compute_60,code=\'compute_60,sm_60\'
embeds binary code compatible with compute capability 3.5 and 5.0 (first
and second
-gencode options) and PTX
and binary code compatible with compute capability 6.0 (third
-gencode option).
Host code is generated to automatically select at runtime the most
appropriate code to load and execute, which, in the above example, will be:
3.5 binary code for devices with compute capability 3.5 and 3.7,
5.0 binary code for devices with compute capability 5.0 and 5.2,
6.0 binary code for devices with compute capability 6.0 and 6.1,
PTX code which is compiled to binary code at runtime for devices with compute capability 7.0 and higher.
For example, x.cu can have an optimized code path that uses warp shuffle
operations, for example, which are only supported in devices of compute
capability 3.0 and higher. The __CUDA_ARCH__ macro can
be used to differentiate various code paths based on compute capability.
It is only defined for device code. When compiling with
-arch=compute_35 for example,
__CUDA_ARCH__ is equal to 350.
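For instance (a sketch; the kernel name and the two code paths are illustrative), a kernel can select between architecture-specific implementations at compile time:
__global__ void Reduce(float* data)
{
#if __CUDA_ARCH__ >= 300
    // Code path compiled for devices of compute capability 3.0 and higher,
    // e.g., one based on warp shuffle operations.
#else
    // Fallback path compiled for earlier architectures,
    // e.g., one based on shared memory.
#endif
}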
Applications using the driver API must compile code to separate files
and explicitly load and execute the most appropriate file at runtime.
The Volta architecture introduces Independent Thread Scheduling which changes the way threads are scheduled on the GPU. For code relying on specific behavior of
in previous architectures, Independent Thread Scheduling may alter the set of participating threads, leading to incorrect
results. To aid migration while implementing the corrective actions detailed in , Volta developers can opt-in to Pascal's thread scheduling with the compiler option combination -arch=compute_60 -code=sm_70.
The nvcc user manual lists various shorthand for the -arch,
-code, and
-gencode compiler options. For example,
-arch=sm_35 is a shorthand for
-arch=compute_35 -code=compute_35,sm_35 (which is the same as
-gencode arch=compute_35,code=\'compute_35,sm_35\').
The front end of the compiler processes CUDA source files according to C++ syntax
rules. Full C++ is supported for the host code. However, only a subset of C++ is fully
supported for the device code as described in .
The 64-bit version of nvcc compiles device code in 64-bit mode (i.e., pointers are 64-bit). Device code compiled in 64-bit mode is only supported with
host code compiled in 64-bit mode.
Similarly, the 32-bit version of nvcc compiles device code in 32-bit mode and device code compiled in 32-bit mode is only supported with host code compiled in
32-bit mode.
The 32-bit version of nvcc can compile device code in 64-bit mode also using the -m64 compiler option.
The 64-bit version of nvcc can compile device code in 32-bit mode also using the -m32 compiler option.
There is no explicit initialization function for the runtime; it
initializes the first time a runtime function is called (more
specifically any function other than functions from the device and
version management sections of the reference manual). One needs to keep
this in mind when timing runtime function calls and when interpreting the
error code from the first call into the runtime.
During initialization, the runtime creates a CUDA context for each
device in the system (see
for more details on
CUDA contexts). This context is the primary context for this
device and it is shared among all the host threads of the application.
As part of this context creation, the device code is just-in-time compiled if necessary (see ) and loaded into device memory.
This all happens under the hood and the runtime does not expose the
primary context to the application.
When a host thread calls cudaDeviceReset(), this
destroys the primary context of the device the host thread currently
operates on (i.e., the current device as defined in ). The next runtime function call made by
any host thread that has this device as current will create a new primary
context for this device.
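For example (a minimal sketch; ptr and its size are illustrative), the next runtime call after cudaDeviceReset() implicitly creates a new primary context:
float* ptr;
size_t bytes = 256 * sizeof(float);
cudaSetDevice(0);            // device 0 is the current device for this host thread
cudaMalloc(&ptr, bytes);     // first runtime call: the primary context is created
cudaDeviceReset();           // destroys the primary context of device 0
cudaMalloc(&ptr, bytes);     // a new primary context is created implicitly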
As mentioned in , the CUDA
programming model assumes a system composed of a host and a device, each
with their own separate memory.
Kernels operate out of device memory, so
the runtime provides functions to allocate, deallocate, and copy device
memory, as well as transfer data between host memory and device memory.
Device memory can be allocated either as linear memory or
as CUDA arrays.
CUDA arrays are opaque memory layouts optimized for texture fetching.
They are described in .
Linear memory exists on the device in a 40-bit address space, so separately allocated entities can reference one
another via pointers, for example, in a binary tree.
Linear memory is typically allocated using cudaMalloc()
and freed using cudaFree() and data transfer between
host memory and device memory are typically done using
cudaMemcpy(). In the vector addition code sample of
, the vectors need to be copied from
host memory to device memory:
// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);
    // Allocate input vectors h_A and h_B and output vector h_C in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);
    // Initialize input vectors
    ...
    // Allocate vectors in device memory
    float* d_A;
    cudaMalloc(&d_A, size);
    float* d_B;
    cudaMalloc(&d_B, size);
    float* d_C;
    cudaMalloc(&d_C, size);
    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =
            (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    // Free host memory
    ...
}
Linear memory can also be allocated through
cudaMallocPitch() and cudaMalloc3D().
These functions are recommended for allocations of 2D or 3D arrays as they make sure
makes sure that the allocation is appropriately padded to meet the
alignment requirements described in , therefore ensuring best performance
when accessing the row addresses or performing copies between 2D arrays
and other regions of device memory (using the
cudaMemcpy2D() and cudaMemcpy3D()
functions). The returned pitch (or stride) must be used to access array
elements. The following code sample allocates a width x
height 2D array of floating-point values and shows how
to loop over the array elements in device code:
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);
// Device code
__global__ void MyKernel(float* devPtr,
                         size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
The following code sample allocates a width x
height x depth 3D array of
floating-point values and shows how to loop over the array elements in
device code:
// Host code
int width = 64, height = 64, depth = 64;
cudaExtent extent = make_cudaExtent(width * sizeof(float),
                                    height, depth);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);
// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr,
                         int width, int height, int depth)
{
    char* devPtr = (char*)devPitchedPtr.ptr;
    size_t pitch = devPitchedPtr.pitch;
    size_t slicePitch = pitch * height;
    for (int z = 0; z < depth; ++z) {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < height; ++y) {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < width; ++x) {
                float element = row[x];
            }
        }
    }
}
The reference manual lists all the various functions used to copy memory
between linear memory allocated with cudaMalloc(),
linear memory allocated with cudaMallocPitch() or
cudaMalloc3D(), CUDA arrays, and memory allocated for
variables declared in global or constant memory space.
The following code sample illustrates various ways of accessing global
variables via the runtime API:
__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));
__device__ float devData;
float value = 3.14f;
cudaMemcpyToSymbol(devData, &value, sizeof(float));
__device__ float* devPointer;
float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));
cudaGetSymbolAddress() is used to retrieve the address
pointing to the memory allocated for a variable declared in global memory
space. The size of the allocated memory is obtained through
cudaGetSymbolSize().
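For example, a short sketch retrieving the device address and size of the devData variable declared above (the subsequent cudaMemset() call is illustrative):
float* d_addr;
size_t symbolSize;
cudaGetSymbolAddress((void**)&d_addr, devData);   // device address of the symbol
cudaGetSymbolSize(&symbolSize, devData);          // size of the underlying allocation
cudaMemset(d_addr, 0, symbolSize);                // the address can be used like any device pointer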
As detailed in
shared memory
is allocated using the __shared__ memory space specifier.
Shared memory is expected to be much faster than global memory as
mentioned in
and detailed in . Any opportunity to replace global memory
accesses by shared memory accesses should therefore be exploited as
illustrated by the following matrix multiplication example.
The following code sample is a straightforward implementation of matrix
multiplication that does not take advantage of shared memory. Each thread
reads one row of A and one column of B and computes the
corresponding element of C as illustrated in .
A is therefore read B.width times from global memory and
B is read A.height times.
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;
// Thread block size
#define BLOCK_SIZE 16
// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);
// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size,
               cudaMemcpyHostToDevice);
    // Allocate C in device memory
    Matrix d_C;
    d_C.width = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);
    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size,
               cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}
// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e]
                * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}
Figure 9. Matrix Multiplication without Shared Memory
The following code sample is an implementation of matrix multiplication
that does take advantage of shared memory. In this implementation, each
thread block is responsible for computing one square sub-matrix
Csub of C and each thread within the block is
responsible for computing one element of Csub. As
illustrated in ,
Csub is equal to the product of two rectangular
matrices: the sub-matrix of A of dimension (A.width,
block_size) that has the same row indices as
Csub, and the sub-matrix of B of dimension
(block_size, A.width) that has the same column indices as
Csub. In order to fit into the device's resources,
these two rectangular matrices are divided into as many square matrices
of dimension block_size as necessary and Csub is
computed as the sum of the products of these square matrices. Each of
these products is performed by first loading the two corresponding square
matrices from global memory to shared memory with one thread loading one
element of each matrix, and then by having each thread compute one
element of the product. Each thread accumulates the result of each of
these products into a register and once done writes the result to global
By blocking the computation this way, we take advantage of fast shared
memory and save a lot of global memory bandwidth since A is only
read (B.width / block_size) times from global memory and B
is read (A.height / block_size) times.
The Matrix type from the previous code sample is augmented
with a stride field, so that sub-matrices can be efficiently
represented with the same type.
__device__ functions are used to get and set
elements and build any sub-matrix from a matrix.
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;
// Thread block size (defined before the __device__ helpers that use it)
#define BLOCK_SIZE 16
// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}
// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col,
                           float value)
{
    A.elements[row * A.stride + col] = value;
}
// Get the BLOCK_SIZE x BLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width    = BLOCK_SIZE;
    Asub.height   = BLOCK_SIZE;
    Asub.stride   = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                                         + BLOCK_SIZE * col];
    return Asub;
}
// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);
// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = d_A.stride = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = d_B.stride = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size,
               cudaMemcpyHostToDevice);
    // Allocate C in device memory
    Matrix d_C;
    d_C.width = d_C.stride = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);
    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size,
               cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}
// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;
    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;
    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;
    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        // Get sub-matrix Asub of A and sub-matrix Bsub of B
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        Matrix Bsub = GetSubMatrix(B, m, blockCol);
        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);
        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();
        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];
        // Synchronize to make sure that the preceding computation is done
        // before loading two new sub-matrices in the next iteration
        __syncthreads();
    }
    // Write Csub to device memory; each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}
Figure 10. Matrix Multiplication with Shared Memory
The runtime provides functions to allow the use of page-locked (also known as
pinned) host memory (as opposed to regular pageable host memory allocated by
malloc()):
cudaHostAlloc() and cudaFreeHost() allocate and free page-locked host memory;
cudaHostRegister() page-locks a range of memory allocated by
malloc() (see reference manual for limitations).
Using page-locked host memory has several benefits:
Copies between page-locked host memory and device memory can be performed concurrently with
kernel execution for some devices as mentioned in .
On some devices, page-locked host memory can be mapped into the address space of the device,
eliminating the need to copy it to or from device memory as detailed in .
On systems with a front-side bus, bandwidth between host memory and device memory is higher
if host memory is allocated as page-locked and even higher if in addition it is
allocated as write-combining as described in .
Page-locked host memory is a scarce resource however, so allocations in page-locked memory will start failing long before
allocations in pageable memory. In addition, by reducing the amount of physical memory available to the operating system for
paging, consuming too much page-locked memory reduces overall system performance.
The simple zero-copy CUDA sample comes with a detailed document on the page-locked memory APIs.
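As a short sketch (buffer names and sizes are illustrative, and d_data and stream are assumed to have been allocated and created earlier), page-locked memory is allocated with cudaHostAlloc() and is what allows cudaMemcpyAsync() to overlap with kernel execution:
float* h_data;
size_t bytes = 1024 * sizeof(float);
cudaHostAlloc(&h_data, bytes, cudaHostAllocDefault);                     // page-locked (pinned) allocation
cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);  // can overlap with kernel execution
cudaStreamSynchronize(stream);                                           // wait for the copy to finish
cudaFreeHost(h_data);                                                    // release the page-locked block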
A block of page-locked memory can be used in conjunction with any device in the system (see
for more details on multi-device systems), but by default, the benefits of using page-locked memory described above are only
available in conjunction with the device that was current when the block was allocated (and with all devices sharing the same
unified address space, if any, as described in ). To make these advantages available to all devices, the block needs to be allocated by passing the flag cudaHostAllocPortable to cudaHostAlloc() or page-locked by passing the flag cudaHostRegisterPortable to cudaHostRegister().
By default page-locked host memory is allocated as cacheable. It can
optionally be allocated as write-combining instead by
passing flag cudaHostAllocWriteCombined to
cudaHostAlloc(). Write-combining memory frees up the
host's L1 and L2 cache resources, making more cache available to the rest
of the application. In addition, write-combining memory is not snooped
during transfers across the PCI Express bus, which can improve transfer
performance by up to 40%.
Reading from write-combining memory from the host is prohibitively slow,
so write-combining memory should in general be used for memory that the
host only writes to.
A block of page-locked host memory can also be mapped into the address space of the device by passing flag cudaHostAllocMapped to cudaHostAlloc() or by passing flag cudaHostRegisterMapped to cudaHostRegister(). Such a block has therefore in general two addresses: one in host memory that is returned by cudaHostAlloc() or malloc(), and one in device memory that can be retrieved using cudaHostGetDevicePointer() and then used to access the block from within a kernel. The only exception is for pointers allocated with cudaHostAlloc() and when a unified address space is used for the host and the device as mentioned in .
Accessing host memory directly from within a kernel has several advantages:
There is no need to allocate a block in device memory and copy data between this block and the block in host memory; data transfers are implicitly performed as needed by the kernel;
There is no need to use streams (see ) to overlap data transfers with kernel execution; the kernel-originated data transfers automatically overlap with kernel
execution.
Since mapped page-locked memory is shared between host and device however, the application must synchronize memory accesses
using streams or events (see ) to avoid any potential read-after-write, write-after-read, or write-after-write hazards.
To be able to retrieve the device pointer to any mapped page-locked memory, page-locked memory mapping must be enabled by
calling cudaSetDeviceFlags() with the cudaDeviceMapHost flag before any other CUDA call is performed. Otherwise, cudaHostGetDevicePointer() will return an error.
cudaHostGetDevicePointer() also returns an error if the device does not support mapped page-locked host memory. Applications may query this capability
by checking the canMapHostMemory device property (see ), which is equal to 1 for devices that support mapped page-locked host memory.
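A sketch of the sequence described above (error checking omitted; bytes, MyKernel, blocks, and threads are illustrative):
cudaSetDeviceFlags(cudaDeviceMapHost);              // must precede any other CUDA call
float* h_buf;
cudaHostAlloc(&h_buf, bytes, cudaHostAllocMapped);  // mapped, page-locked host allocation
float* d_buf;
cudaHostGetDevicePointer(&d_buf, h_buf, 0);         // device-side address of the same memory
MyKernel<<<blocks, threads>>>(d_buf);               // the kernel reads/writes host memory directly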
Note that atomic functions (see ) operating on mapped page-locked memory are not atomic from the point of view of the host or other devices.
Also note that CUDA runtime requires that 1-byte, 2-byte, 4-byte, and 8-byte naturally aligned loads and stores to host memory
initiated from the device are preserved as single accesses from the point of view of the host and other devices. On some platforms,
atomics to memory may be broken by the hardware into separate load and store operations. These component load and store operations
have the same requirements on preservation of naturally aligned accesses. As an example, the CUDA runtime does not support
a PCI Express bus topology where a PCI Express bridge splits 8-byte naturally aligned writes into two 4-byte writes between
the device and the host.
CUDA exposes the following operations as independent tasks that can operate concurrently with one another:
Computation on the host;
Computation on the device;
Memory transfers from the host to the device;
Memory transfers from the device to the host;
Memory transfers within the memory of a given device;
Memory transfers among devices.
The level of concurrency achieved between these operations will depend on the feature set and compute capability of the device
as described below.
Concurrent host execution is facilitated through asynchronous library functions that return control to the host thread before
the device completes the requested task.
Using asynchronous calls, many device operations can be queued up together to be executed by the CUDA driver when appropriate
device resources are available.
This relieves the host thread of much of the responsibility to manage the device, leaving it free for other tasks.
The following device operations are asynchronous with respect to the host:
Kernel launches;
Memory copies within a single device's memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
Programmers can globally disable asynchronicity of kernel launches for all CUDA applications
running on a system by setting the CUDA_LAUNCH_BLOCKING environment
variable to 1. This feature is provided for debugging purposes only and should not be
used as a way to make production software run reliably.
Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent
kernel profiling is enabled.
Async memory copies will also be synchronous if they involve host memory that is not page-locked.
Some devices of compute capability 2.x and higher can execute multiple kernels concurrently.
Applications may query this capability by checking the concurrentKernels device property (see ), which is equal to 1 for devices that support it.
The maximum number of kernel launches that a device can execute concurrently depends on its compute capability and is listed
A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context.
Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels.
Some devices can perform an asynchronous memory copy to or from the GPU concurrently with kernel execution.
Applications may query this capability by checking the asyncEngineCount device property (see ), which is greater than zero for devices that support it.
If host memory is involved in the copy, it must be page-locked.
It is also possible to perform an intra-device copy simultaneously with kernel execution (on devices that support the concurrentKernels device property) and/or with copies to or from the device (for devices that support the asyncEngineCount property). Intra-device copies are initiated using the standard memory copy functions with destination and source addresses
residing on the same device.
Some devices of compute capability 2.x and higher can overlap copies to and from the device.
Applications may query this capability by checking
the asyncEngineCount device property (see ), which is equal to 2 for devices that
support it.
In order to be overlapped, any host memory involved in the transfers must be page-locked.
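The capability checks mentioned in this and the preceding sections can be combined into a small query (a sketch for device 0; the printed wording is illustrative):
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("Concurrent kernels:   %d\n", prop.concurrentKernels);  // nonzero if supported
printf("Async copy engines:   %d\n", prop.asyncEngineCount);   // 2 if copies can overlap both ways
printf("Can map host memory:  %d\n", prop.canMapHostMemory);   // nonzero if mapping is supported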
Applications manage the concurrent operations described above through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams,
on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not
guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined).
A stream is defined by creating a stream object and specifying it as the stream parameter to a sequence of kernel launches
and host <-> device memory copies. The following code sample creates two streams and allocates an array hostPtr of float in page-locked memory.
cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);
float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size);
Each of these streams is defined by the following code sample as a sequence of one memory copy from host to device, one kernel
launch, and one memory copy from device to host:
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>
            (outputDevPtr + i * size, inputDevPtr + i * size, size);
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}
Each stream copies its portion of input array hostPtr to array inputDevPtr in device memory, processes inputDevPtr on the device by calling MyKernel(), and copies the result outputDevPtr back to the same portion of hostPtr.
describes how the streams overlap in this example depending on the capability of the device. Note that hostPtr must point to page-locked host memory for any overlap to occur.
Streams are released by calling cudaStreamDestroy().
for (int i = 0; i < 2; ++i)
cudaStreamDestroy(stream[i]);
In case the device is still doing work in the stream when cudaStreamDestroy() is called, the function will return immediately and the resources associated with the stream will be released automatically
once the device has completed all work in the stream.
Kernel launches and host <-> device memory
copies that do not specify any stream parameter, or equivalently that set
the stream parameter to zero, are issued to the default stream. They are
therefore executed in order.
For code that is compiled using the --default-stream per-thread compilation flag (or that defines the CUDA_API_PER_THREAD_DEFAULT_STREAM macro before including CUDA headers (cuda.h and cuda_runtime.h)), the default stream is a regular stream and
each host thread has its own default stream.
For code that is compiled using the --default-stream legacy compilation flag, the default stream is a special stream called the NULL stream
and each device has a single NULL stream used for all host threads. The NULL stream is special as it causes implicit synchronization
as described in
For code that is compiled without specifying a --default-stream compilation flag, --default-stream legacy is assumed as the default.
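As a sketch of the macro form mentioned above (whether the definition takes effect can depend on how the translation unit is compiled, so the --default-stream per-thread compiler flag is the more common route):
#define CUDA_API_PER_THREAD_DEFAULT_STREAM 1   // must appear before any CUDA header is included
#include <cuda_runtime.h>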
There are various ways to explicitly synchronize streams with each other.
cudaDeviceSynchronize() waits until all preceding
commands in all streams of all host threads have completed.
cudaStreamSynchronize() takes a stream as a parameter
and waits until all preceding commands in the given stream have
completed. It can be used to synchronize the host with a specific stream,
allowing other streams to continue executing on the device.
cudaStreamWaitEvent() takes a stream and an event as
parameters (see
for a description of events) and
makes all the commands added to the given stream after the call to
cudaStreamWaitEvent() delay their execution until the
given event has completed. The stream can be 0, in which case all the
commands added to any stream after the call to
cudaStreamWaitEvent() wait on the event.
cudaStreamQuery() provides applications with a way to
know if all preceding commands in a stream have completed.
To avoid unnecessary slowdowns, all these synchronization functions are
usually best used for timing purposes or to isolate a launch or memory
copy that is failing.
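A sketch of cross-stream ordering with cudaStreamWaitEvent(), reusing the two streams created earlier (OtherKernel and devPtr are illustrative names, not part of this guide's samples):
cudaEvent_t event;
cudaEventCreate(&event);
MyKernel<<<100, 512, 0, stream[0]>>>(devPtr);     // work submitted to stream[0]
cudaEventRecord(event, stream[0]);                // record its completion point
cudaStreamWaitEvent(stream[1], event, 0);         // stream[1] waits for the event
OtherKernel<<<100, 512, 0, stream[1]>>>(devPtr);  // starts only after MyKernel has finished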
Two commands from different streams cannot run concurrently if any one of the following
operations is issued in-between them by the host thread:
a page-locked host memory allocation,
a device memory allocation,
a device memory set,
a memory copy between two addresses to the same device memory,
any CUDA command to the NULL stream,
a switch between the L1/shared memory configurations described in
For devices that support concurrent kernel execution and are of compute capability 3.0 or
lower, any operation that requires a dependency check to see if a streamed kernel launch
is complete:
Can start executing only when all thread blocks of all prior kernel launches from any stream
in the CUDA context have started executing;
Blocks all later kernel launches from any stream in the CUDA context until the kernel launch
being checked is complete.
Operations that require a dependency check include any other commands within the same stream as
the launch being checked and any call to cudaStreamQuery() on that
stream. Therefore, applications should follow these guidelines to improve their potential for
concurrent kernel execution:
All independent operations should be issued before dependent operations,
Synchronization of any kind should be delayed as long as possible.
The amount of execution overlap between two streams depends on the order in which the commands are issued to each stream and
whether or not the device supports overlap of data transfer and kernel execution (see ), concurrent kernel execution (see ), and/or concurrent data transfers (see ).
For example, on devices that do not support concurrent data transfers, the two streams of the code sample of
do not overlap at all because the memory copy from host to device is issued to stream[1] after the memory copy from device
to host is issued to stream[0], so it can only start once the memory copy from device to host issued to stream[0] has completed.
If the code is rewritten the following way (and assuming the device supports overlap of data transfer and kernel execution)
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    MyKernel<<<100, 512, 0, stream[i]>>>
            (outputDevPtr + i * size, inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
then the memory copy from host to device issued to stream[1] overlaps with the kernel launch issued to stream[0].
On devices that do support concurrent data transfers, the two streams of the code sample
do overlap: The memory copy from
host to device issued to stream[1] overlaps with the memory copy from device to host
issued to stream[0] and even with the kernel launch issued to stream[0] (assuming the
device supports overlap of data transfer and kernel execution). However, for devices of
compute capability 3.0 or lower, the kernel executions cannot possibly overlap because
the second kernel launch is issued to stream[1] after the memory copy from device to host
is issued to stream[0], so it is blocked until the first kernel launch issued to stream[0]
is complete as per . If the code is rewritten
as above, the kernel executions overlap (assuming the device supports concurrent kernel
execution) since the second kernel launch is issued to stream[1] before the memory copy
from device to host is issued to stream[0]. In that case however, the memory copy from
device to host issued to stream[0] only overlaps with the last thread blocks of the
kernel launch issued to stream[1] as per , which
can represent only a small portion of the total execution time of the kernel.
The runtime provides a way to insert a callback at any point into a stream via cudaStreamAddCallback(). A callback is a function that is executed on the host once all commands issued to the stream before the callback have completed.
Callbacks in stream 0 are executed once all preceding tasks and commands issued in all streams before the callback have completed.
The following code sample adds the callback function
MyCallback to each of two streams
after issuing a host-to-device memory copy, a kernel launch and a
device-to-host memory copy into each stream. The callback will
begin execution on the host after each of the device-to-host memory
copies completes.
void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t status, void *data){
    printf("Inside callback %d\n", (size_t)data);
}
...
for (size_t i = 0; i < 2; ++i) {
    cudaMemcpyAsync(devPtrIn[i], hostPtr[i], size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>(devPtrOut[i], devPtrIn[i], size);
    cudaMemcpyAsync(hostPtr[i], devPtrOut[i], size, cudaMemcpyDeviceToHost, stream[i]);
    cudaStreamAddCallback(stream[i], MyCallback, (void*)i, 0);
}
The commands that are issued in a stream (or all commands issued to any stream if the callback is issued to stream 0) after
a callback do not start executing before the callback has completed.
The last parameter of cudaStreamAddCallback() is reserved for future use.
A callback must not make CUDA API calls (directly or indirectly), as it might end up waiting on itself if it makes such a
call leading to a deadlock.
The relative priorities of streams can be specified at creation using cudaStreamCreateWithPriority(). The range of allowable priorities, ordered as [ highest priority, lowest priority ] can be obtained using the cudaDeviceGetStreamPriorityRange() function. At runtime, as blocks in low-priority streams finish, waiting blocks in higher-priority streams are scheduled in
their place.
The following code sample obtains the allowable range of priorities for the current device, and creates streams with the
highest and lowest available priorities
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
cudaStream_t st_high, st_low;
cudaStreamCreateWithPriority(&st_high, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&st_low, cudaStreamNonBlocking, priority_low);
The runtime also provides a way to closely monitor the device's
progress, as well as perform accurate timing, by letting the
application asynchronously record events at any point in
the program and query when these events are completed. An event has
completed when all tasks - or optionally, all commands in a given
stream - preceding the event have completed. Events in stream zero are
completed after all preceding tasks and commands in all streams are
completed.
The following code sample creates two events:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
They are destroyed this way:
cudaEventDestroy(start);
cudaEventDestroy(stop);
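They can then be used for timing, as in the following sketch; the kernel launch between the two records stands in for whatever work is being measured:
cudaEventRecord(start, 0);
MyKernel<<<100, 512>>>(devPtr);                    // the work being timed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                        // wait for the stop event to complete
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);   // elapsed time in milliseconds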
When a synchronous function is called, control is not returned to the host thread before the
device has completed the requested task. Whether the host thread will then yield, block, or spin
can be specified by calling cudaSetDeviceFlags() with some specific flags (see
reference manual for details) before any other CUDA call is performed by the host thread.
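For example (a minimal sketch), a host thread can ask to block on a synchronization primitive rather than spin:
cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);  // must precede any other CUDA call by this thread
cudaDeviceSynchronize();                             // the host thread now blocks instead of spinning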
A host system can have multiple devices. The following code sample shows how to enumerate these devices, query their properties,
and determine the number of CUDA-enabled devices.
int deviceCount;
cudaGetDeviceCount(&deviceCount);
int device;
for (device = 0; device < deviceCount; ++device) {
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, device);
    printf("Device %d has compute capability %d.%d.\n",
           device, deviceProp.major, deviceProp.minor);
}
