
Using Parallel Architectures for Business Research

Parallel vs. Serial Code

  • Performance
  • Why Parallelization?
    • Technological limitations on how fast a single process can run
    • Progress in CPU speed has slowed down a lot
      • Pentium 4 (2004): 3.8 GHz
      • Core i7 (2011): 3.5 GHz (but much faster per clock!)
      • Many physical limitations (power, heating, …)
    • Progress in memory speed has slowed down even more
    • Progress in disk speed has slowed down more still
    • Solution: parallelization
  • Parallel vs. Serial code
  • What to Expect (Amdahl's Law)
    • \(\text{speedup} = \frac{1}{\frac{P}{N} + (1 - P)}\), where \(P\) is the parallel fraction of the code and \(N\) is the number of CPUs
    • 8 CPUs, 90% parallel \(\Rightarrow\) speedup = 4.7
    • 1000 CPUs, 90% parallel \(\Rightarrow\) speedup = 9.9
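The two speedup figures above follow directly from the formula; here is a minimal Python sketch (the function name `amdahl_speedup` is ours, not a standard API):

```python
def amdahl_speedup(parallel_fraction, num_cpus):
    """Amdahl's law: overall speedup when the parallel fraction of a
    program runs on num_cpus and the rest stays sequential."""
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (parallel_fraction / num_cpus + sequential_fraction)

print(round(amdahl_speedup(0.90, 8), 1))     # 4.7
print(round(amdahl_speedup(0.90, 1000), 1))  # 9.9
```

Note how the sequential 10% caps the speedup near 10 no matter how many CPUs are added.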

Facilities at Northwestern

  • Quest: 7056 Intel cores, ~40GB RAM
    • C, Fortran, R, Matlab
  • SSCC: 132 AMD cores, up to 64GB RAM
    • all of the above + lots of statistical software!
  • Multi-core PCs
  • GPUs

What is inside a typical PC (node)?

  • CPUs
    • Cores
    • Cache (L1, L2, L3, …)
    • Control logic
  • RAM
  • I/O: disk, network
  • GPU (graphics card)

Types of Parallelism

  • By physical location
    • Parallelism within a CPU core: vectorization
    • Multiple cores in a CPU: shared memory
    • Multiple CPUs in a node (PC): shared memory
    • Multiple nodes in a cluster: distributed memory
  • Data parallelism (aka domain decomposition, SPMD)
    • Run the same analysis for different stocks/days
  • Task parallelism
    • Monte-Carlo: run the same simulation with different random sequences
    • Solve the same model with different parameters
  • Partially parallelizable:
    • Solve a problem iteratively on a grid
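Data parallelism can be sketched in a few lines of Python with the standard-library multiprocessing module; the `analyze` function and the ticker list below are placeholders for a real per-stock analysis:

```python
from multiprocessing import Pool

def analyze(ticker):
    # Placeholder for the per-stock analysis; here just a toy statistic.
    return ticker, len(ticker)

if __name__ == "__main__":
    tickers = ["AAPL", "MSFT", "IBM", "GE"]
    with Pool(processes=4) as pool:
        # Same code, different data: each worker gets its own ticker.
        results = pool.map(analyze, tickers)
    print(dict(results))
```

Task parallelism looks identical in code: map the same solver over a list of parameter vectors or random seeds instead of a list of stocks.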

Software for Computational Economics

  • Statistical languages (SAS, Stata, R)
    • Regressions, statistical analysis, statistical graphics…
  • Matrix languages (Matlab, Ox, Gauss)
    • Simulations, signal processing…
  • Symbolic software (Maple, Mathematica)
    • Computing closed-form solutions
  • General-purpose interpreters (Python/numpy/scipy)
  • General-purpose compilers (Fortran, C, C++)
    • usually the fastest and most flexible
    • Fortran 90 is very similar in syntax to Matlab!
      • No "comprehensive" toolbox system as in Matlab
      • Most libraries are not as well documented
    • A few semi-automated code conversion tools exist

Different ways to implement parallelization

  • Vectorization
    • Usually done for you already by the compiler or libraries; writing vectorized code helps it along.
  • Automatic parallelization (Stata/MP, SAS, Matlab)
    • Specify the parallel resources available, and the computer does everything else for you
    • Ideal case! But only if it works…
  • Parallelized library functions (Gauss, Matlab, cuBLAS, ScaLAPACK)
    • Likely to be highly optimized
    • Might not be available, or not the most efficient for your problem
  • Job-level parallelization
    • Run multiple copies of the code on different cores or nodes
    • Easy for data-parallel tasks (but I/O matters!)
    • We have a set of functions to simplify this in Matlab
  • Guided parallelization (OpenMP, MPI, Matlab Parallel Toolbox, PGI directives)
    • Explicitly tell software how to parallelize the code
    • Can be as easy as adding a comment or replacing a keyword
    • Gets tricky in complicated cases
  • Low-level parallelization: do everything manually
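Job-level parallelization needs nothing beyond the standard library. In the sketch below the worker is inlined with `-c` for illustration; in practice each copy would be your actual estimation program, running on its own core or cluster node with its own slice of the data:

```python
import subprocess
import sys

# Each "job" is an independent copy of the same worker program.
worker = "import sys; print(sum(range(int(sys.argv[1]), int(sys.argv[2]))))"

# Split the work into independent chunks, one per job.
chunks = [(0, 250), (250, 500), (500, 750), (750, 1000)]
procs = [
    subprocess.Popen([sys.executable, "-c", worker, str(a), str(b)],
                     stdout=subprocess.PIPE, text=True)
    for a, b in chunks
]
# Collect the partial results from each job's output.
partial = [int(p.communicate()[0]) for p in procs]
print(sum(partial))  # 499500 == sum(range(1000))
```

On a cluster, the same pattern is usually expressed through the scheduler (e.g. submitting one batch job per chunk) rather than Popen.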

Package-specific details

  • What is available varies a lot
  • Lowest-level languages (C, Fortran) have the most options
  • Higher-level languages sometimes offer easy facilities
  • Combining languages is often optimal
    • If the job can be split into independent parts, this is also very easy
    • All languages have easy functionality to:
      • save a dataset in a text format
      • execute a program in any other language
      • load results from a file
    • Sometimes you can directly read a file from another program
    • If saving/loading is not an option, things get trickier…
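The save/exec/load pattern above can be sketched in Python; the file names are illustrative, and the "other program" is simulated here with a second Python process, where in practice it would be Stata, SAS, a compiled Fortran binary, etc.:

```python
import csv
import os
import subprocess
import sys
import tempfile

# 1. Save a small dataset in a plain-text format any package can read.
workdir = tempfile.mkdtemp()
infile = os.path.join(workdir, "input.csv")
outfile = os.path.join(workdir, "output.csv")
with open(infile, "w", newline="") as f:
    csv.writer(f).writerows([[1], [2], [3]])

# 2. Execute the other program.  This stand-in squares each value.
other_program = (
    "import csv, sys\n"
    "with open(sys.argv[1]) as fin, open(sys.argv[2], 'w', newline='') as fout:\n"
    "    rows = [[int(r[0]) ** 2] for r in csv.reader(fin)]\n"
    "    csv.writer(fout).writerows(rows)\n"
)
subprocess.run([sys.executable, "-c", other_program, infile, outfile],
               check=True)

# 3. Load the results back.
with open(outfile, newline="") as f:
    results = [int(r[0]) for r in csv.reader(f)]
print(results)  # [1, 4, 9]
```

The same three steps work between any pair of languages, which is what makes this the most portable way to combine packages.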

High-level languages (SAS, Stata, Matlab, etc.)


Stata

  • Most functions are parallelized
    • You have to pay (a lot) more for parallel licenses!
    • Max. 8 cores
  • Combining with other languages:
    • Directly calling C code (via plugins) is very easy
    • But inefficient when called many times (e.g. nl/mle)
  • save/exec/load is easy and works with any Stata


SAS

  • Some procedures are parallelized across multiple cores
    • sometimes the gain is small
  • specify "options threads cpucount=actual;"
  • Procedures in other languages:
    • possible but quite tricky
    • save/exec/load typically preferable



Matlab

  • Requires Matlab Parallel Toolbox
    • Available in SSCC, Quest, Kellogg desktop installations
    • Limited to max. 8 cores
      • can do more, but with a (very) expensive server
  • Invoke "matlabpool open" to enable parallel processing
  • Some operations (matrix multiplication, optimization routines) are partially parallelized
  • Implements several approaches to parallelization (parfor, coarrays)
    • But, somewhat inefficient and sometimes tricky
  • Also has GPU functionality

Low-level languages (C, Fortran)

  • Trivial: a compiler option that asks it to auto-parallelize loops
    • Usually fails: resulting parallel code can be slower than sequential
  • Easy: OpenMP
    • Tell the compiler which loops to parallelize
    • Requires very minimal code modification
    • Works well, but limited to single machine
  • Less easy: Parallelized libraries
    • Replace matrix multiplications etc. in your code with library calls
    • MKL (Intel PC), ACML (AMD PC, AMD GPU), ScaLAPACK (clusters), cuBLAS (Nvidia GPU), etc.
    • May also help in sequential code
    • Sometimes tricky to install or link
      • gfortran has an option to auto-generate BLAS calls
    • Intel MKL link line advisor
  • Trickier: MPI
    • Scales to large clusters
    • Relies on process communication

Online library directories and collections

GPU computing

  • Designed for graphics processing in video games
  • Has a lot of cores and very fast memory
    • Each core is somewhat primitive
    • Works best for applying the same function to a large array
    • Speedups of 200x vs. CPU have been claimed (but treat such claims with care!)
  • Major manufacturers: Nvidia (GeForce/Tesla) and ATI (Radeon)
    • Nvidia is far more popular for general-purpose computing
  • Programming GPUs efficiently is very tricky!
    • CUDA C or PGI Fortran
    • A lot of different GPU models
    • Code optimized for one can fail or work slowly on another
  • High-level (easier) options:
    • Matlab Parallel Toolbox
    • Matlab/Jacket
    • C/Fortran: PGI directives
    • Third-party packages and libraries (R, cuBLAS, etc.)

Parallelization Methods: Summary

Method                       Single machine  Cluster        GPU
Automatic parallelization
  SAS, Matlab, Stata/MP      limited         -              -
  C, Fortran                 works poorly    -              -
Parallel libraries           x               some           some
Job-level parallelization    x               x              -
Manual parallelization
  Matlab                     max. 8 cores    Matlab Server  x
  Matlab/PBS script          -               x              -
  OpenMP (C, Fortran)        x               -              -
  PGI directives (C/Fortran) -               -              x
  MPI (C, Fortran)           x               x              -

© 2001-2011 Kellogg School of Management, Northwestern University