Khóa luận A parallel implementation on modern hardware for geo-Electrical tomographical software

INTRODUCTION 1

CHAPTER 1. HIGH PERFORMANCE COMPUTING ON MODERN HARDWARE 3

1.1 An overview of modern parallel architectures 3

1.1.1 Instruction-Level Parallel Architectures 4

1.1.2 Process-Level Parallel Architectures 5

1.1.3 Data parallel architectures 8

1.1.4 Future trends in hardware 13

1.2 Programming tools for scientific computing on personal desktop systems 15

1.2.1 CPU Thread-based Tools: OpenMP, Intel Threading Building Blocks, and Cilk++ 16

1.2.2 GPU programming with CUDA 21

1.2.3 Heterogeneous programming and OpenCL 26

CHAPTER 2. THE FORWARD PROBLEM IN RESISTIVITY TOMOGRAPHY 28

2.1 Inversion theory 28

2.2 The geophysical model 30

2.3 The forward problem by differential method 35

CHAPTER 3 SOFTWARE IMPLEMENTATION 39

3.1 CPU implementation 39

3.2 Example Results 40

3.3 GPU Implementation using CUDA 42

CONCLUSION 46

REFERENCES 47

57 trang | Chia sẻ: maiphuongdc | Lượt xem: 1554 | Lượt tải: 2

Bạn đang xem trước 20 trang tài liệu Khóa luận A parallel implementation on modern hardware for geo-Electrical tomographical software, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên

Each core can run a Linux OS independently. The processor also has Dynamic Distributed Cache technology which provides a fully coherent shared cache system across an arbitrary sized array of tiles. Programming can be done normally on a Linux derivative with full support for C and C++ and Tilera parallel libraries. The processor utilizes VLIW (Very Long Instruction Word) with RISC instructions for each core. The primary focus of this processor is for networking, multimedia and clouding computing with a strong emphasis on integer computation to complement GPU’s floating point computation. From all these trends, it would be reasonable to assume that in the near future, we will be able to see new architectures which resemble all current architectures, such as many-core processors where each core has a CPU core and stream-processors as co-processors. Such systems would provide tremendous computing power per processor that would cause major changes in the field of computing. 1.2 Programming tools for scientific computing on personal desktop systems Traditionally, most scientific computing tasks have been done on clusters. However, with the advent of modern hardware that provide great level of parallelism, many small to medium-sized tasks can now be run on a single high-end desktop computer in reasonable time. Such systems are called “personal supercomputers”. Although they have variable configurations, most today employ multicore CPUs with multiple GPUs. An example is the Fastra II desktop supercomputer [3] at University of Antwep, Belgium, which can achieve 12 Tflops computing power. The FASTRA II contains six NVIDIA GTX295 dual-GPU cards, and one GTX275 single-GPU card with a total cost of less than six thousands euros. The real processing speed of this system can equal that of a cluster with thousands of CPU cores. Although these systems are more cost-effective, consume less power and provide greater convenience for their users, they pose serious problems for software developers. Traditional programming tools and algorithms for cluster computing are not appropriate for exploiting the full potential of multi-core CPUs and GPUs. There are many kinds of interaction between components in such heterogeneous systems. The link between the CPU and the GPUs is through the PCI Express bus. The GPU has to go through the CPU to access system memory. The inside of the multicore CPU is a SMP system. As each GPU has separate graphics memory, the relationship between GPUs is like in a distributed-memory system. As these systems are in early stages of development, programming tools do not provide all the functionalities programmers need and many tasks still need to be done manually. Algorithms also need to be adapted to the limitations of current hardware and software tools. In the following parts, we will present some programming tools for desktop systems with multi-core CPUs and multi GPUs that we that we consider useful for exploiting parallelism in scientific computing. The grouping is just for easy comparison between similar tools as some tools provide more than one kind of parallelization. 1.2.1 CPU Thread-based Tools: OpenMP, Intel Threading Building Blocks, and Cilk++ Windows and Linux (and other Unixes) provide API’s for creating and manipulating operating system threads using WinAPI threads and POSIX threads (Pthreads), respectively. These threading approaches may be convenient when there's a natural way to functionally decompose an application - for example, into a user interface thread, a compute thread or a render thread. However, in the case of more complicated parallel algorithms, the manual creating and scheduling thread can lead to more complex code, longer development time and not optimal execution. The alternative is to program atop a concurrency platform — an abstraction layer of software that coordinates, schedules, and manages the multicore resources. Using thread pools is a parallel pattern that can provide some improvements. A thread pool is a strategy for minimizing the overhead associated with creating and destroying threads and is possibly the simplest concurrency platform. The basic idea of a thread pool is to create a set of threads once and for all at the beginning of the program. When a task is created, it executes on a thread in the pool, and returns the thread to the pool when finished. A problem is when the task arrives and the pool has no thread available. The pool then suspends the task and wakes it up when a new thread is available. This requires synchronization such as locks to ensure atomicity and avoid concurrency bugs. Thread pools are common for the server-client model but for other tasks, scalability and deadlocks still pose problems. This calls for concurrency platforms with higher levels of abstraction that provide more scalability, productivity and maintainability. Some examples are OpenMP, Intel Threading Building Blocks, and Cilk++ . OpenMP (Open Multiprocessing) [25] is an open concurrency platform with support for multithreading through compiler pragmas in C, C++ and Fortran. It is an API specification and compilers can provide different implementations. OpenMP is governed by the OpenMP Architecture Review Board (ARB). The first OpenMP specification came out in 1997 with support for Fortran, followed by C/C++ support in 1998. Version 2.0 was released in 2000 for Fortran and 2002 for C/C++. Version 3.0 was released in 2008 and is the current API specification. It contains many major enhancements, especially the task construct. Most recent compilers have added some level of support for OpenMP. Programmers can inspect the code to find places that require parallelization and insert the pragmas to tell the compiler to produce multithreaded code. This makes the code have a fork-join model, in which when the parallel section has finished; all the threads join back the master thread. Workloads in one loop are given to threads using work-sharing. There are four kinds of loop workload scheduling in OpenMP: Static scheduling, each thread is given an equal chunk of iterations. Dynamic scheduling, the iterations are assigned to threads as the threads request them. The thread executes the chunk of iterations (controlled through the chunk size parameter), then requests another chunk until there are no more chunks to work on. Guided scheduling is almost the same as dynamic scheduling, except that for a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations, divided by the number of threads, decreasing to 1. For a chunk size of “k” (k > 1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations. Runtime scheduling, if this schedule is selected, the decision regarding scheduling kind is made at run time. The schedule and (optional) chunk size are set through the OMP_SCHEDULE environment variable. Beside scheduling clauses, OpenMP also has clauses for data sharing attribute, synchronization, IF control, initialization, data copying, reduction and other concurrent operations. A typical OpenMP parallelized loop may look like: #pragma omp for schedule(dynamic, CHUNKSIZE) for(int i = 2; i <= N-1; i++) for(int j = 2; j <= i; j++) for(int k = 1; k <= M; k++) b[i][j]+=a[i-1][j]/k+a[i+1][j]/k; Figure 9 OpenMP fork-join model. Intel’s Threading Building Blocks (TBB) [4] is an open source C++ template library developed by Intel for writing task based multithreaded applications with ideas and models inherited from many previous languages and libraries. While OpenMP uses the pragma approach for parallelization, TBB uses the library approach. The first version came out in August 2006 and since then TBB has seen widespread use in many applications, especially game engines such as Unreal. TBB is available with both a commercial license and an open source license. The latest version 2.2 was introduced in August 2009. The library has also received Jolt Productivity award and InfoWorld OSS award. It is a library based on generic programming, requires no special compiler support, and is processor and OS independent. This makes TBB ideal for parallelizing legacy applications. TBB has support for Windows, Linux, OS X, Solaris, PowerPC, Xbox, QNX, FreeBSD and can be compiled using Visual C++, Intel C++, gcc and other popular compilers. TBB is not a thread-replacement library but provides a higher level of abstraction. Developers do not work directly with threads but tasks, which are mapped to threads by the library runtime. The number of threads are automatically managed by the library or set manually by the user, just like the case with OpenMP. Beside basic loop parallelizing parallel_for constructs, TBB also have parallel patterns such as parallel_reduce, parallel_scan, parallel_do; concurrent data containers including vectors, queues and hash map; scalable memory allocator and synchronization primitives such as atomics and mutexes; and pipelining. TBB parallel algorithms operate on the concept of blocked_range, which is the iteration space. Work is divided between threads using work stealing, in which the range is recursively divided into smaller tasks. A thread work on the task it meets depth first and steals the task breadth first following the principles: do task from own queue (FIFO) steal task from another queue. This can be applied to one, two or three-dimensional ranges, allowing for effective blocking on data structures with dimensions greater than one. Task stealing also enables better cache utilization and avoid false sharing as much as possible. The figures below illustrate the mechanism behind task stealing. Figure 10 TBB’s range split Figure 11 TBB’s task stealing illustration Normally, TBB calls using function objects can be quite verbose as parallelization is done through libraries. However, with the new C++0x lambda syntax, TBB code can be much shorter and easier to read. Below is an example TBB call using C++0x lambdas: void ParallelApplyFoo(float a[], size_t n ){ parallel_for( blocked_range( 0, n ), [=](const blocked_range& range) { for(int i= range.begin();i!=range.end();++i) Foo(a[i]); }, auto_partitioner() );} } Another platform for multi-threading development is Cilk++. Cilk++ extends the C++ programming language with three new keywords: cilk_spawn, cilk_sync, cilk_for; and Cilk++ hyper objects which can be global variables but avoid data races. Parallelized code using Cilk++ can therefore be quite compact. void matrix_multiply(matrix_t A,matrix_t B,matrix_t C) { cilk_for (int i = 0; i < A.rows; ++i) { cilk_for (int j = 0; j < B.cols; ++j) { for (int k = 0; k < A.cols; ++k) C[i][j] += A[i][k] * B[k][j]; } } } Matrix multiplication implementation in Cilk++ However, this requires the Cilk++ compiler which extends on a standard C++ compiler such as Visual C++ or GCC. Cilk++ also uses work stealing like TBB for scheduling threads but with its own modification. Both are based on the work stealing method by the MIT Cilk project. Intel has recently acquired the Cilk++ company and Cilk++ will be integrated into Intel’s line of parallel programming tools together with TBB and other tools. 1.2.2 GPU programming with CUDA The first GPU platform we present is CUDA [16]. CUDA is Nvidia’s parallel computing architecture that allows programmer to program GPGPU applications on Nvidia’s graphics card architecture including Geforce, Quadro and Tesla. CUDA was introduced in the late 2006 and received a lot of attention when released, especially from high performance computing communities. It is now the basis for all other kinds of parallel programming tools on the Nvidia hardware. CUDA is now the most popular tool for programming GPGPU applications with various research papers, libraries and commercial development tools. Previous to CUDA, GPGPU was done using graphics API. Although speedup for this was great, there were several drawbacks that limits the growth of GPGPU. Firstly, the programmer is required to possess in-depth knowledge of graphics APIs and GPU architecture. Secondly, problems had to be expressed in terms of vertex coordinates, textures and shader programs, which results in greatly increased program complexity, low productivity and poor code maintenance. Thirdly, basic programming features such as random reads and writes to memory were not supported, greatly restricting the programming model and algorithms available. CUDA has to some extent made GPU programming more accessible to programmers with a more intuitive programming model and a good collections of libraries for popular tasks. Figure 12 CUDA processing flow [16] As CUDA is used for an architecture which supports massively data parallel programming, programming using CUDA is also different from traditional programming on the CPU. The processing flow for CUDA applications is shown in Figure 12. As the GPU does not have direct access to main memory, processing data must be copied from the main memory to GPU memory through the PCI Express lanes. The GPU is also not a totally self-control processor and needs the CPU to instruct the processing. The GPU then executes the data processing in parallel on its core using its hardware scheduler. When the processing is done, processed data is copied back the main memory. All the memory copy operations are much more expensive than computation operations so it is best to keep the processing on the GPU for as long as possible to avoid expensive memory copies. C for CUDA extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads. A kernel is defined using the __global__ declaration specifier and the number of CUDA threads for each call is specified using a new >> syntax. The first parameter for the >> call is the number of blocks and the second one is the block size. For example: // Kernel definition __global__ void VecAdd(float* A, float* B, float* C) { int i = threadIdx.x; C[i] = A[i] + B[i]; } int main() { // Kernel invocation VecAdd>>(A, B, C); } All files containing CUDA syntax must be compiled with the Nvidia nvcc compiler. It can be used together with both Visual C++, GCC and other compilers. All the CUDA threads run the same kernel when invoked. Threads are grouped into blocks of the same size and blocks together form a grid. Both grid and block can be one, two or three-dimensional. Through the CUDA extensions blockIdx, blockDim and threadIdx, the kernel implementer can get the co-ordinate of the thread in the grid and process the corresponding data element. Thread blocks are required to execute independently which means it must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores, enabling programmers to write code that is scalable with the increasing number of cores. The number of blocks in a grid is typically governed by the size of the data being processed rather than by the number of stream processors in the system, which it can greatly exceed. CUDA’s hierarchy of threads is mapped to the hierarchy of processors on the GPU. A GPU executes one or more kernel grids. A streaming multiprocessor (SM) executes one or more thread blocks, and CUDA cores and other execution units in the SM execute threads. The SM executes threads in groups of 32 threads called a warp. While programmers can generally ignore warp execution for functional correctness and think of programming one thread, greatly improved performance can be achieved by having threads in a warp execute the same code path and access memory in nearby addresses. Figure 13 shows an example of thread grids with each block in the grids containing executable threads. For a problem of certain size, the number of threads per block and the number of blocks and the grid and block dimension can be manually controlled by the programmer at kernel launch. Choosing these numbers appropriately can have great effects on execution speed. As threads are scheduled in warps of 32 threads each, the number of threads per block should be a multiple of 32. Each generation of processor also has maximum limitations on number of threads, number of threads per stream processor and number of blocks per stream processor. Careful considerations of these parameters can enable the kernel to utilize the full potential of the GPU. For example the G80 architecture can schedule up to 8 blocks and 768 threads per stream processor. If there are 64 threads per block, there are 12 blocks but only 8 blocks can go into the stream processor so only 512 threads is scheduled. With 256 threads per block, each stream processor can execute three blocks and achieve full capacity. For 1024 threads per block, even one block can not fit into a stream processor and cannot be scheduled. Choosing the right parameter for kernel launch is a difficult problem which has different solutions for each situation and optimal execution can only be obtained with some amount of experiments. Figure 13 Grid of thread blocks [8] The CUDA memory hierarchy is also different from that of traditional C programming with much more complexities. Each thread has a private local memory called registers. Registers are the fastest form of memory in the GPU but only accessible by the thread and has the life time of the thread. Each thread block has a shared memory visible to all threads of the block and with the same lifetime as the block. Shared memory can be as fast as registers when there are no bank conflicts or when reading from the same address. Finally, all threads have access to the same global memory. Global memory is much slower than shared memory and registers, requiring 400 to 600 clock cycles for one access. As a result, programmers should strive to utilize registers and shared memory to increase bandwidth. When reading from global memory, coalesced memory access also help avoid performance penalties of global memory access. There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages. Constant memory is read only from kernels and is optimized for the case when threads read from the same memory location. Texture memory can offer the ability to cache global memory and interact with the dedicated graphics hardware of the GPU. In appropriate cases, using texture memory can provide considerable performance gain over global memory access. However, this depends on particular situations and using constant memory and texture memory effectively is still an expert topic in CUDA as careless uses of these advanced features can lead to even slower performance, erroneous execution or reduced code clarity. Figure 14 The CUDA Memory Hierarchy [8] The CUDA programming model also assumes that the host (CPU) and the device (GPU) maintain separate memory as host memory and device memory. A CUDA application has to manually manage all the memory related to the device including memory allocation and deallocation using cudaMalloc and cudaFree; and data transfer between host and device using cudaMemcpy. Since CUDA 2.2, there is also cudaHostAlloc which allows host memory to be into CUDA address space providing asynchronous transparent access to data without explicit copy using cudaMemcpy. A notable property that applications using CUDA have is the memory sharing between CUDA memory and OpenGL or DirectX memory. This makes it quite convenient for scientific simulations that can push computed results on GPU shaders directly to the display device by calling appropriate APIs. This feature is also useful in games and other graphic tasks. Further information about CUDA programming can be found in the Nvidia CUDA Programming Guide [8]. 1.2.3 Heterogeneous programming and OpenCL The next platform is OpenCL (Open Computing Language) [24]. It was proposed by Apple and has now become a standard maintained by the Khronos group just like OpenGL. The first version of OpenCL came out in December 2008. OpenCL shares many similarities with CUDA but support heterogeneous computing on a wide range of platforms, including Intel and AMD CPUs; Nvidia and ATI GPUs; Cell processors, FPGAs, digital signal processors and other platforms. At the time of this writing, there are OpenCL implementations on most major parallel platforms although features and performance still varies. OpenCL has a lower-level interface than CUDA, like the CUDA driver API. Programmers still write compute kernels but have to load them into the device command queue. This also makes OpenCL more flexible than CUDA as it can support both task parallelism and data parallelism. Data movements between the host and compute devices, as well as OpenCL tasks, are coordinated via command queues. Command queues provide a general way of specifying relationships between tasks, ensuring that tasks are executed in an order that satisfies the natural dependences in the computation. The OpenCL runtime is free to execute tasks in parallel if their dependencies are satisfied, which provides a general-purpose task parallel execution model. Tasks themselves can be comprised of data-parallel kernels, which apply a single function over a range of data elements, in parallel, allowing only restricted synchronization and communication during the execution of a kernel. OpenCL’s data parallelism is quite like CUDA. There are many similar concepts to CUDA but with different names. Instead of grid and blocks, the kernel is executed over a similar index space with equal workgroups. The memory hierarchy also share many similarities with CUDA. While the division of a kernel into work-items and work-groups supports data-parallelism, task parallelism is supported via the command queue. Different kernels representing different tasks may be enqueued. Dependency between tasks can also be specified. Multiple command queues can also be created. Figure 15 OpenCL data parallelism and task parallelism Below is an example OpenCL code in the Nvidia GPU Computing SDK for computing dot product: __kernel void DotProduct (__global float* a, __global float* b, __global float* c, int iNumElements) { // find position in global arrays int iGID = get_global_id(0); // bound check (equivalent to the limit on a 'for' loop for standard/serial C code if (iGID >= iNumElements) { return; } // process int iInOffset = iGID << 2; c[iGID] = a[iInOffset] * b[iInOffset] + a[iInOffset + 1] * b[iInOffset + 1] + a[iInOffset + 2] * b[iInOffset + 2] + a[iInOffset + 3] * b[iInOffset + 3]; } Compared to CUDA, OpenCL has both task parallelism and data parallelism while CUDA has limited support for task parallelism. However, this makes the OpenCL have a steeper learning curve. Writing portable OpenCL code with good performance on all architectures is still hard as there are many differences which call for platform specific optimizations. For example, while Nvidia hardware encourages the use of scalar code, AMD GPUs is designed for vector types such as float4 to get good performance. The efficiency of OpenCL model on CPUs versus CPU-only technologies such as OpenMP or TBB is a problem that needs further experiment. OpenCL drivers and support tools are also not as mature as established technologies. Support libraries is also a factor worth considering. While CUDA has cuBlas, cuFFT and many other third-party libraries, OpenCL has virtually none of these. As a result, OpenCL has not seen widespread adoption and the market share is still low compared to CUDA and other technologies. The mixture of data parallelism and task parallelism is also a high learning curve for programmers coming from other homogeneous parallel platforms. However, OpenCL is an open standard and in the long term, with more quality implementations, better support and the natural industrial trend towards open standards, it should be able to gain more acceptances and receive broader use than now. It is also highly possible that new parallel frameworks will be built on top of OpenCL to offer higher level parallel programming as well as provide additional utilities for various common tasks such as scientific computing, image processing, etc. Such frameworks are being considered by most major parallel software providers and would provide a great boost to OpenCL adoption. Third-party libraries may also see big improvements as well. Nevertheless, understanding OpenCL at its current level would make the programmer use these future tools more efficiently. Chapter 2. The Forward Problem in Resis

Các file đính kèm theo tài liệu này:

Nguyen Hoang Vu_K51KHMT_Khoa luan tot nghiep dai hoc.doc