# Using Parallel Architectures for Business Research

## Contents

- Parallel vs. Serial Code
- Facilities at Northwestern
- What is inside a typical PC (node)?
- Types of Parallelism
- Software for Computational Economics
- Different ways to implement parallelization
- Package-specific details
- High-level languages (SAS, Stata, Matlab, etc.)
- Low-level languages (C, Fortran)
- Online library directories and collections
- GPU computing
- Parallelization Methods: Summary

## Parallel vs. Serial Code

- Performance
- Why Parallelization?
- Technological limitations on how fast a single process can run
- Progress in CPU speed has slowed down a lot
- Pentium 4 (2000): 3.8 GHz
- Core i7 (2011): 3.5 GHz (but much faster per clock!)
- Many physical limitations (power, heating, …)

- Progress in memory speed has slowed down more
- Progress in disk speed has slowed down more yet
- Solution: parallelization

- Parallel vs. Serial code
- What to Expect (Amdahl's Law)
- \(\text{speedup} = \frac{1}{\frac{P}{N} + (1 - P)}\), where \(P\) is the parallel fraction of the code and \(N\) the number of CPUs
- 8 CPUs, 90% parallel \(\Rightarrow\) speedup ≈ 4.7
- 1000 CPUs, 90% parallel \(\Rightarrow\) speedup ≈ 9.9
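The formula is easy to check numerically; a minimal Python sketch (the function name is mine):

```python
def amdahl_speedup(parallel_fraction, num_cpus):
    """Amdahl's law: the parallel part shrinks with more CPUs,
    but the sequential part does not."""
    return 1.0 / (parallel_fraction / num_cpus + (1.0 - parallel_fraction))

print(round(amdahl_speedup(0.9, 8), 1))     # -> 4.7
print(round(amdahl_speedup(0.9, 1000), 1))  # -> 9.9
```

Note how the sequential 10% caps the speedup near 10 no matter how many CPUs are added.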

## Facilities at Northwestern

- Quest: 7056 Intel cores, ~40GB RAM
- C, Fortran, R, Matlab

- SSCC: 132 AMD cores, up to 64GB RAM
- all of the above + lots of statistical software!

- Multi-core PCs
- skew4: 8 cores + 64GB RAM
- skew5: 24 cores + 256GB RAM
- Your desktop!
- For Windows, use CPU-Z to find out: http://www.cpuid.com/softwares/cpu-z.html

- GPUs
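Without installing anything, the logical core count can also be checked on any machine that has Python:

```python
import os

# Logical cores visible to the OS (hyperthreaded CPUs report each thread)
print(os.cpu_count())
```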

## What is inside a typical PC (node)?

- CPUs
- Cores
- Cache (L1, L2, L3, …)
- Control logic

- RAM
- I/O: disk, network
- GPU (graphics card)

## Types of Parallelism

- By physical location
- Parallelism within a CPU core: vectorization
- Multiple cores in a CPU: shared memory
- Multiple CPUs in a node (PC): shared memory
- Multiple nodes in a cluster: distributed memory

- Data parallelism (aka domain decomposition, SPMD)
- Run the same analysis for different stocks/days

- Task parallelism
- Monte-Carlo: run the same simulation with different random sequences
- Solve the same model with different parameters

- Partially parallelizable:
- Solve a problem iteratively on a grid
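A minimal sketch of the Monte-Carlo case using Python's standard library: the same (toy, placeholder) simulation runs in several worker processes, each with its own random seed.

```python
import random
from multiprocessing import Pool

def simulate(seed, n_draws=10_000):
    """Toy Monte-Carlo task: estimate the mean of Uniform(0,1) draws."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n_draws)) / n_draws

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # Same simulation, different random sequences -- task parallelism
        results = pool.map(simulate, [1, 2, 3, 4])
    print(results)
```

Because the tasks are independent, the speedup is close to the number of workers (up to Amdahl's sequential overhead).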

## Software for Computational Economics

- Statistical languages (SAS, Stata, R)
- Regressions, statistical analysis, statistical graphics…

- Matrix languages (Matlab, Ox, Gauss)
- Simulations, signal processing…

- Symbolic software (Maple, Mathematica)
- Computing closed-form solutions

- General-purpose interpreters (Python/numpy/scipy)
- General-purpose compilers (Fortran, C, C++)
- usually the fastest and most flexible
- Fortran 90 is **very** similar in syntax to Matlab!
- No "comprehensive" toolbox system as in Matlab
- Most libraries are not as well documented

- A few semi-automated code conversion tools exist

## Different ways to implement parallelization

- Vectorization
- Usually done for you already; writing vectorized code helps

- Automatic parallelization (Stata/MP, SAS, Matlab)
- Specify the parallel resources available, and the computer does everything else for you
- Ideal case! But only if it works…

- Parallelized library functions (Gauss, Matlab, cuBLAS, ScaLAPACK)
- Likely to be highly optimized
- Might not be available, or not the most efficient for your problem

- Job-level parallelization
- Run multiple copies of the code on different cores or nodes
- Easy for data-parallel tasks (but I/O matters!)
- We have a set of functions to simplify this in Matlab

- Guided parallelization (OpenMP, MPI, Matlab Parallel Toolbox, PGI directives)
- Explicitly tell software how to parallelize the code
- Can be as easy as adding a comment or replacing a keyword
- Gets tricky in complicated cases

- Low-level parallelization: do everything manually
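To illustrate the first point (vectorization), a numpy sketch: replacing an interpreted element-by-element loop with a single whole-array call hands the work to compiled, often SIMD-vectorized, code.

```python
import numpy as np

x = np.arange(100_000, dtype=np.float64)

# Interpreted loop: one element at a time
total_loop = 0.0
for value in x:
    total_loop += value * value

# Vectorized: a single call into compiled (often SIMD) code
total_vec = float(np.dot(x, x))

print(np.isclose(total_loop, total_vec))  # -> True
```

The two results agree, but the vectorized call is typically orders of magnitude faster in interpreted languages.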

## Package-specific details

- What is available varies a lot
- Lowest-level languages (C, Fortran) have the most options
- Higher-level languages sometimes offer easy facilities
- Combining languages is often optimal
- If the job can be split into parts, also very easy
- All languages have easy functionality to:
- save a dataset in a text format
- execute a program in any other language
- load results from a file

- Sometimes you can directly read a file from another program
- If saving/loading is not an option, things get trickier…
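The save/exec/load pattern can be sketched in Python; the child program below is a stand-in for code in any other language, and the file names are arbitrary:

```python
import csv
import os
import subprocess
import sys
import tempfile

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "input.csv")
out_path = os.path.join(workdir, "output.txt")

# 1. Save a dataset in a text format
with open(in_path, "w", newline="") as f:
    csv.writer(f).writerows([[1, 2], [3, 4]])

# 2. Execute a program in another language (a second Python process stands in here)
child = (
    "import csv, sys\n"
    "rows = list(csv.reader(open(sys.argv[1])))\n"
    "total = sum(int(v) for row in rows for v in row)\n"
    "open(sys.argv[2], 'w').write(str(total))\n"
)
subprocess.run([sys.executable, "-c", child, in_path, out_path], check=True)

# 3. Load the results back
result = int(open(out_path).read())
print(result)  # -> 10
```

The same three steps work across any pair of languages that can read and write text files.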

## High-level languages (SAS, Stata, Matlab, etc.)

### Stata/MP

- Most functions are parallelized
- You have to pay (a lot) more for parallel licenses!
- Max. 8 cores

- Combining with other languages:
- Directly calling C code (via plugins) is very easy
- But inefficient when called many times (e.g. nl/mle)

- save/exec/load is easy and works with any Stata

### SAS

- Some procedures are parallelized across multiple cores
- sometimes the gain is small

- specify `options threads cpucount=actual;`
- Procedures: SORT, SUMMARY, MEANS, REPORT, TABULATE, SQL, GLM, LOESS, REG, ROBUSTREG
- Procedures in other languages:
- possible but quite tricky
- save/exec/load typically preferable

### Others

- Gauss:
- Both automatic and manual parallelization
- Several parallel libraries (Maximum Likelihood, Optimization, …)
- Limited to the cores in a single machine
- http://www.aptech.com/g11_threadingtutorial_1.html

- R:
- No native parallel facilities
- Many user-provided packages for multicore, GPU, clusters…
- Not very well documented
- Calling C or Fortran routines is very easy!
- http://cran.r-project.org/web/views/HighPerformanceComputing.html

- Python:
- mostly through additional packages
- http://www.scipy.org/ParallelProgramming

### Matlab

- Requires Matlab Parallel Toolbox
- Available in SSCC, Quest, Kellogg desktop installations
- Limited to max. 8 cores
- can do more, but with a (very) expensive server

- Invoke `matlabpool open` to enable parallel processing
- Some operations (matrix multiplication, optimization routines) are partially parallelized
- Implements several approaches to parallelization (parfor, codistributed arrays)
- But, somewhat inefficient and sometimes tricky

- Also has GPU functionality

## Low-level languages (C, Fortran)

- Trivial: a compiler flag asks the compiler to parallelize loops automatically
- Usually fails: resulting parallel code can be slower than sequential

- Easy: OpenMP
- Tell the compiler which loops to parallelize
- Requires very minimal code modification
- Works well, but limited to single machine

- Less easy: Parallelized libraries
- Replace matrix multiplications etc. in your code with library calls
- MKL (Intel PC), ACML (AMD PC, AMD GPU), ScaLAPACK (clusters), cuBLAS (Nvidia GPU), etc.
- May also help in sequential code
- Sometimes tricky to install or link
- gfortran has an option to auto-generate BLAS calls

- Intel MKL link line advisor: http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/

- Trickier: MPI
- Scales to large clusters
- Relies on process communication
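MPI itself requires an MPI library and a launcher, but the message-passing idea can be sketched with Python's standard library: two processes that share no memory exchange data explicitly over a pipe (a loose analogy to MPI_Send/MPI_Recv, not actual MPI).

```python
from multiprocessing import Pipe, Process

def worker(conn):
    # Receive a task, compute, send the result back -- all communication
    # is explicit, since the processes share no memory
    data = conn.recv()
    conn.send(sum(data))
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send([1, 2, 3, 4])
    print(parent_conn.recv())  # -> 10
    p.join()
```

In real MPI the same pattern scales from two processes to thousands of cluster nodes.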

## Online library directories and collections

- NIST: http://math.nist.gov/
- Digital Library of Mathematical Functions: http://dlmf.nist.gov/
- GAMS (Guide to Available Mathematical Software): http://gams.nist.gov/

- StatLib at Carnegie-Mellon: http://lib.stat.cmu.edu/
- StatCodes: http://www2.astro.psu.edu/statcodes/
- STARPAC (Standard Time Series and Regression Package) : http://www.cisl.ucar.edu/softlib/STARPAC.html
- GNU GSL: http://www.gnu.org/software/gsl/
- TAO (Toolkit for Advanced Optimization): http://www.mcs.anl.gov/research/projects/tao/index.html
- PETSc: http://www.mcs.anl.gov/petsc/petsc-2/index.html
- Gilli et al. (2002): option pricing

- http://www.nhse.org/hpc-netlib/
- http://www.indiana.edu/~statmath/bysubject/numerics.html
- PDE: http://www.mathcom.com/corpdir/techinfo.mdir/q260.html

## GPU computing

- Designed for graphics processing in video games
- Has **a lot** of cores and very fast memory
- Each core is somewhat primitive
- Works best for applying the same function to a large array
- Speedups of 200x vs. CPU have been claimed (but treat such claims with care!)

- Major manufacturers: Nvidia (GeForce/Tesla) and ATI (Radeon)
- Nvidia is far more popular for general-purpose computing

- Programming GPUs efficiently is *very* tricky!
- CUDA C or PGI Fortran
- A lot of different GPU models
- Code optimized for one can fail or work slowly on another

- High-level (easier) options:
- Matlab Parallel Toolbox
- Matlab/Jacket
- C/Fortran: PGI directives
- Third-party packages and libraries (R, cuBLAS, etc.)

## Parallelization Methods: Summary

| Method | Single machine | Cluster | GPU |
|---|---|---|---|
| **Automatic parallelization** | | | |
| SAS, Matlab, Stata/MP | limited | - | - |
| C, Fortran | works poorly | - | - |
| **Parallel libraries** | x | some | some |
| **Job-level parallelization** | x | x | - |
| **Manual parallelization** | | | |
| Matlab | max. 8 cores | Matlab Server | x |
| Matlab/PBS script | - | x | - |
| Gauss | x | - | - |
| R | packages | packages | packages |
| Mathematica | x | - | plugin |
| OpenMP (C, Fortran) | x | - | - |
| PGI directives (C/Fortran) | - | - | x |
| MPI (C, Fortran) | x | x | - |