
# Using Parallel Architectures for Business Research

## Parallel vs. Serial Code

• Performance
• Why Parallelization?
• Technological limitations on how fast a single process can run
• Progress in CPU speed has slowed down a lot
• Pentium 4 (2000): 3.8 GHz
• Core i7 (2011): 3.5 GHz (but much faster per clock!)
• Many physical limitations (power, heating, …)
• Progress in memory speed has slowed down even more
• Progress in disk speed has slowed down more still
• Solution: parallelization
• Parallel vs. Serial code
• What to Expect (Amdahl's Law)
• $$\text{speedup} = \frac{1}{\frac{P}{N}+(1-P)}$$ where $P$ is the fraction of the code that runs in parallel and $N$ is the number of CPUs
• 8 CPUs, 90% parallel $\Rightarrow$ speedup = 4.7
• 1000 CPUs, 90% parallel $\Rightarrow$ speedup = 9.9
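The two example numbers follow directly from the formula; a minimal Python sketch (the function name is ours, not part of the course material):

```python
def amdahl_speedup(parallel_fraction, num_cpus):
    """Amdahl's law: overall speedup when only a fraction of the work parallelizes."""
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (parallel_fraction / num_cpus + sequential_fraction)

print(round(amdahl_speedup(0.90, 8), 1))     # -> 4.7
print(round(amdahl_speedup(0.90, 1000), 1))  # -> 9.9
```

Note how the sequential 10% caps the speedup near 10 no matter how many CPUs are added.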

## Facilities at Northwestern

• Quest: 7056 Intel cores, ~40GB RAM
• C, Fortran, R, Matlab
• SSCC: 132 AMD cores, up to 64GB RAM
• all of the above + lots of statistical software!
• Multi-core PCs
• GPUs

## What is inside a typical PC (node)?

• CPUs
• Cores
• Cache (L1, L2, L3, …)
• Control logic
• RAM
• I/O: disk, network
• GPU (graphics card)

## Types of Parallelism

• By physical location
• Parallelism within a CPU core: vectorization
• Multiple cores in a CPU: shared memory
• Multiple CPUs in a node (PC): shared memory
• Multiple nodes in a cluster: distributed memory
• Data parallelism (aka domain decomposition, SPMD)
• Run the same analysis for different stocks/days
• Monte-Carlo: run the same simulation with different random sequences
• Solve the same model with different parameters
• Partially parallelizable:
• Solve a problem iteratively on a grid
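The Monte-Carlo case above is the simplest kind of data parallelism: the same simulation runs on several cores with different random sequences. A minimal Python sketch (the `simulate` function and its toy integrand are our own illustration):

```python
import random
from multiprocessing import Pool

def simulate(seed, n_draws=100_000):
    """One Monte-Carlo replication: estimate E[X^2] for X ~ Uniform(0,1)."""
    rng = random.Random(seed)  # independent, reproducible random sequence per task
    return sum(rng.random() ** 2 for _ in range(n_draws)) / n_draws

if __name__ == "__main__":
    seeds = range(8)  # same simulation, 8 different random sequences
    with Pool() as pool:
        estimates = pool.map(simulate, seeds)  # one replication per worker
    print(sum(estimates) / len(estimates))  # close to the true value 1/3
```

Because the replications never communicate, this pattern scales almost perfectly, which is why Amdahl's law is so favorable for Monte-Carlo work.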

## Software for Computational Economics

• Statistical languages (SAS, Stata, R)
• Regressions, statistical analysis, statistical graphics…
• Matrix languages (Matlab, Ox, Gauss)
• Simulations, signal processing…
• Symbolic software (Maple, Mathematica)
• Computing closed-form solutions
• General-purpose interpreters (Python/numpy/scipy)
• General-purpose compilers (Fortran, C, C++)
• usually the fastest and most flexible
• Fortran 90 is very similar in syntax to Matlab!
• No "comprehensive" toolbox system as in Matlab
• Most libraries are not as well documented
• A few semi-automated code conversion tools exist

## Different ways to implement parallelization

• Vectorization
• Usually done for you already by the compiler or library. Writing vectorized code helps.
• Automatic parallelization (Stata/MP, SAS, Matlab)
• Specify the parallel resources available, and the computer does everything else for you
• Ideal case! But only if it works…
• Parallelized library functions (Gauss, Matlab, cuBLAS, ScaLAPACK)
• Likely to be highly optimized
• Might not be available, or not the most efficient for your problem
• Job-level parallelization
• Run multiple copies of the code on different cores or nodes
• Easy for data-parallel tasks (but I/O matters!)
• We have a set of functions to simplify this in Matlab
• Guided parallelization (OpenMP, MPI, Matlab Parallel Toolbox, PGI directives)
• Explicitly tell software how to parallelize the code
• Can be as easy as adding a comment or replacing a keyword
• Gets tricky in complicated cases
• Low-level parallelization: do everything manually
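To make the "writing vectorized code helps" point concrete, a sketch in Python/NumPy: the same reduction written as a scalar loop and as a single vectorized library call (timings omitted, but the vectorized form is typically orders of magnitude faster):

```python
import numpy as np

# Scalar loop: one element per iteration, with interpreter overhead each time
def sum_of_squares_loop(a):
    total = 0.0
    for v in a:
        total += v * v
    return total

# Vectorized: one library call over the whole array; the library can use the
# CPU's SIMD (vector) units internally
def sum_of_squares_vec(a):
    return float(np.dot(a, a))

x = np.arange(10_000, dtype=np.float64)
print(sum_of_squares_vec(x))  # same result as the loop, much faster
```

The same idea carries over to Matlab, R, and Fortran array syntax: express the computation as whole-array operations and let the runtime exploit the hardware.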

## Package-specific details

• What is available varies a lot
• Lowest-level languages (C, Fortran) have the most options
• Higher-level languages sometimes offer easy facilities
• Combining languages is often optimal
• If the job can be split into parts, also very easy
• All languages have easy functionality to:
• save a dataset in a text format
• execute a program in any other language
• load results from a file
• Sometimes you can directly read a file from another program
• If saving/loading is not an option, things get trickier…
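The save/exec/load pattern above can be sketched in Python; here the "other program" is just a second Python process standing in for, say, a Stata batch job, and the file names are illustrative:

```python
import csv
import os
import subprocess
import sys
import tempfile

workdir = tempfile.mkdtemp()

# 1. save: write the dataset in a plain text format any package can read
infile = os.path.join(workdir, "data.csv")
with open(infile, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "x"])
    writer.writeheader()
    writer.writerows([{"id": 1, "x": 2.5}, {"id": 2, "x": 3.1}])

# 2. exec: launch the other program on the file; a real job would be
#    something like `stata -b do job.do`. Here a trivial second Python
#    process simply copies the input to the output.
outfile = os.path.join(workdir, "results.csv")
subprocess.run(
    [sys.executable, "-c",
     "import shutil, sys; shutil.copy(sys.argv[1], sys.argv[2])",
     infile, outfile],
    check=True,
)

# 3. load: read the results back in (CSV values come back as strings)
with open(outfile, newline="") as f:
    results = list(csv.DictReader(f))
print(results)  # [{'id': '1', 'x': '2.5'}, {'id': '2', 'x': '3.1'}]
```

Running several such exec steps at once, one per core or node, is exactly the job-level parallelization described earlier.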

## High-level languages (SAS, Stata, Matlab, etc.)

### Stata/MP

• Most functions are parallelized
• You have to pay (a lot) more for parallel licenses!
• Max. 8 cores
• Combining with other languages:
• Directly calling C code (via plugins) is very easy
• But inefficient when called many times (e.g. nl/mle)
• save/exec/load is easy and works with any Stata

### SAS

• Some procedures are parallelized across multiple cores
• sometimes the gain is small
• specify `options threads cpucount=actual;`
• Procedures: SORT, SUMMARY, MEANS, REPORT, TABULATE, SQL, GLM, LOESS, REG, ROBUSTREG
• Procedures in other languages:
• possible but quite tricky
• save/exec/load is typically preferable

### Matlab

• Requires Matlab Parallel Toolbox
• Available in SSCC, Quest, Kellogg desktop installations
• Limited to max. 8 cores
• can do more, but with a (very) expensive server
• Invoke "matlabpool open" to enable parallel processing
• Some operations (matrix multiplication, optimization routines) are partially parallelized
• Implements several approaches to parallelization (parfor, coarrays)
• But, somewhat inefficient and sometimes tricky
• Also has GPU functionality

## Low-level languages (C, Fortran)

• Trivial: a compiler option to ask it to try parallelizing loops
• Usually fails: resulting parallel code can be slower than sequential
• Easy: OpenMP
• Tell the compiler which loops to parallelize
• Requires very minimal code modification
• Works well, but limited to single machine
• Less easy: Parallelized libraries
• Replace matrix multiplications etc. in your code with library calls
• MKL (Intel PC), ACML (AMD PC, AMD GPU), ScaLAPACK (clusters), cuBLAS (Nvidia GPU), etc.
• May also help in sequential code
• Sometimes tricky to install or link
• gfortran has an option to auto-generate BLAS calls
• Trickier: MPI
• Scales to large clusters
• Relies on process communication

## GPU computing

• Designed for graphics processing in video games
• Has a lot of cores and very fast memory
• Each core is somewhat primitive
• Works best for applying the same function to a large array
• Speedups of 200x vs. CPU have been claimed (but treat such claims with care!)
• Major manufacturers: Nvidia (GeForce/Tesla) and ATI (Radeon)
• Nvidia is far more popular for general-purpose computing
• Programming GPUs efficiently is very tricky!
• CUDA C or PGI Fortran
• A lot of different GPU models
• Code optimized for one can fail or work slowly on another
• High-level (easier) options:
• Matlab Parallel Toolbox
• Matlab/Jacket
• C/Fortran: PGI directives
• Third-party packages and libraries (R, cuBLAS, etc.)

## Parallelization Methods: Summary

| Method | Single machine | Cluster | GPU |
|---|---|---|---|
| **Automatic parallelization** | | | |
| - SAS, Matlab, Stata/MP | limited | - | - |
| - C, Fortran | works poorly | - | - |
| Parallel libraries | x | some | some |
| Job-level parallelization | x | x | - |
| **Manual parallelization** | | | |
| - Matlab | max. 8 cores | Matlab Server | x |
| - Matlab/PBS script | - | x | - |
| - Gauss | x | - | - |
| - R | packages | packages | packages |
| - Mathematica | x | - | plugin |
| - OpenMP (C, Fortran) | x | - | - |
| - PGI directives (C/Fortran) | - | - | x |
| - MPI (C, Fortran) | x | x | - |

© 2001-2011 Kellogg School of Management, Northwestern University