Research Computing >> Training and Publications >> Parallelization for an RBC Model

Parallelization for an RBC Model

The model

  • Aldrich, Fernandez-Villaverde, Gallant, Rubio-Ramirez (2010): "Tapping the Supercomputer under Your Desk: Solving Dynamic Equilibrium Models with Graphics Processors".

Households

Households choose consumption \(c_t\) and capital \(k_t\) to maximize $$\text{E}_0 \sum_{t=0}^{\infty} \beta_t \frac{c^{1-\eta}_t}{1-\eta}$$ subject to a budget constraint $$c_t+i_t = w_t + r_t k_t$$ law of motion for capital: $$k_{t+1} = (1-\delta) k_t + i_t$$

Production

Representative firm with technology $$y_t = z_t k_t^\alpha$$ Technological shock: $$\log z_t = \rho \log z_{t-1} + \varepsilon_t, \>\>\> \varepsilon_t \sim N(0,\sigma^2)$$

Recursive formulation:

Select next period capital to maximize the value function: $$V(k,z) = \max_{k'} u(c(k',k,z))+\beta \text{E}\{V(k',z') | z\}$$ where $$c(k',k,z) = zk^\alpha + (1-\delta)k - k'$$ Notation: define continuation value \(\psi(k',z)\) as the present expected value of investing \(k'\) for the next period, given that we are today in state \(z\): $$\psi(k', z) \equiv \beta \text{E}\{V(k',z') | z\}$$ the optimization problem then becomes $$V(k,z) = \max_{k'} u(c(k',k,z)) + \psi(k',z)$$ …If we knew the continuation value, we could solve this for $V$!

Numerical Solution

Value function iteration: theory

  • Guess the value funciton at \(t=0\) and proceed backwards
    • i.e. to time t = -1, -2, -3, …
    • if we go far enough, initial guess does not matter any more!
  • Define \(V_0(k,z) \equiv 0\)
  • For all $t=-1, -2, …$ :
    • Compute \(\psi_{t-1}(k', z) \equiv \beta \text{E}_{t-1}\{V_t(k',z') | z\}\)
      • This is the present expected value of investing \(k'\), given that today we are in state \(z\)
    • Compute \(V_{t-1}(k,z) = \max_{k'} u\left(c(k',k,z)\right) + \psi(k',z)\)
    • Repeat until \(\max_{k,z}|V_t(k,z) - V_{t-1}(k,z)| < \varepsilon\)
  • Once we have the value function, we have everything!

Value function iteration: implementation

  • Problem is discretized and solved on a grid of \(k\) and \(z\)
  • Grid search each period to find the maximum
  • Advantages:
    • Simple to implement
    • An approximate model, solved exactly
      • as opposed to an exact model, solved approximately
    • Does not rely on interpolation
    • Does not rely on convex solvers
      • Robust to local maxima
    • Robust to kinks and jumps (e.g. liquidity constraints)
  • Disadvantages: slow!
    • Many ways to improve the speed by a lot
    • But let us try parallelization first…
  • This is an illustrative example, not the optimal method!

Testing System

  • Dual Intel Xeon X5670 (2x6 = 12c at 2.8GHz), 24G RAM
  • NVidia GFX480 GPU (480 CUDA cores at 700MHz, 1.5G RAM)
  • Fedora 13, Intel compiler suite 11.1, GCC 4.4.5

Languages: Matlab and Fortran

Matlab

  • Easy to use
  • Great documentation, built-in graphics
  • Extensive toolbox library
  • Requires Matlab Parallel Toolbox for parallelization
    • Available in SSCC, Quest, Kellogg desktop installations
    • Limited to max. 12 cores (as of R2011b)
      • can do more, but with a (very) expensive server
    • Invoke matlabpool open to enable parallel processing
  • Some operations (matrix multiplication, optimization routines) are partially parallelized
  • Also has (very limited) GPU functionality

Fortran basics

  • Fortran is one of the oldest programming languages
  • Many existing codes, libraries, etc.
  • Substantial additions and updates over the years (ongoing)
    • Fortran 66, 77, 90, 95, 2003, 2008, …
  • Syntax similar to Matlab
    • Especially Fortran 90 and later
    • Built-in array types, array math, and vector indexing
    • Array indexes can start anywhere: A(-5:2, 13:18)
    • First index changes fastest (i.e. storage by column)
  • Popular compilers: Intel (icc), gfortran, PGI (pgf90)
  • Can be combined with Matlab or other languages
    • save a file in Matlab, process in Fortran, read computed results back into Matlab for plotting, etc.

Fortran vs. Matlab: matrix algebra

Here, A, B,… are matrices; a,b,… are vectors:

MatlabFortran 95
A.*BA*B
A*Bmatmul(A,B)
A'transpose(A)
sum(a*b')dotproduct(a,b)
a(a>0)pack(a,a>0)
A(:, 2:end)A(:, 2:)
A = 2 + 3*BA = 2 + 3*B
A = 2*ones(size(A))A = 2
a = [2 3 4]a = [2, 3, 4]
A = reshape(B,[2 3])A = reshape(B, [2,3])

Fortran vs. Matlab: control statements

MatlabFortran 95
for i=1:10; …; enddo i=1,10 …. end do
if i==j; …; else … ; endif(i==j) then… ; else… ; end if
x = f(y)x = f(y)
function x = f(y)real function f(y)
real :: y
[x,y] = f(z)call f(x,y,z)
function [x,y] = f(z)subroutine f(x,y,z)
real :: x,y,z
global x y zcommon x,y,z

Sequential code

Many options

  • Languages, compilers and libraries matter
  • Matlab: great documentation, can run in parallel
    • OK for "vectorized" programs, slow in loops
  • Fortran 90: syntax similar to Matlab
  • C: the classic
  • R: itself slow, but has many packages, easy to combine with C and

Floating-point precision

  • Most common format is double precision (8 bytes)
    • Matlab format is double precision
  • For some models, single precision (4 bytes) is enough
    • C: "float" vs. "double"
    • Fortran: "real" vs. "double precision"
      • Force double: -r8 (ifort, PGI), -fdefault-real-8 (gfortran)
  • The gain is higher on GPUs
    • especially older models
  • Fortran also has "quad precision" (16 bytes, slow)
  • Make sure you choose precision that is sufficient for your problem!

Matlab implementation

Initialization parameters

%% Initialize the parameters
beta = 0.984;      % Discount rate
eta = 2;           % Risk aversion parameter
alpha = 0.35;      % Technology parameter
delta = 0.01;      % Depreciation rate
rho = 0.95;        % Tech. shock persistence
sigma = 0.005;     % Tech. shock st. dev.

Discretizing productivity

  • Method: Tauchen ("Finite State Markov-Chain Approximations…", Economic Letters, 1986)
  • Gaussian random variable \(z\) is approximated by a 4-state Markov chain
%% Grid for productivity z
nz = 4;            % Grid size
zmin=-0.0480384;  
zmax=0.0480384;    
% [1 x 4] grid of values for z
zgrid = exp(zmin:((zmax-zmin)/(nz-1)):zmax)
% [4 x 4] Markov transition matrix of z
tran_z = [0.996757 0.00324265 0 0; ...
    0.000385933 0.998441 0.00117336 0 ;...
    0 0.00117336 0.998441 0.000385933 ;...
    0 0 0.00324265 0.996757 ] ;

Discretizing capital

  • Grid for capital is the primary determinant of accuracy
  • Baseline: \(nk = 4800\) points
    • Somewhat of an overkill in this case
    • In slightly more complex problems, grids can easily get much larger!
    • with \(N\) states we have an \(N\)-dimensional grid
    • Why 4800? It is nice to have the grid size divisible by the number of cores in the CPU (12) or GPU (480).
%% Grid for capital k
nk = 4800;     % Number of points for capital
kmin = 0.95*(1/(alpha*zgrid(1)))*((1/beta)-1+delta)^(1/(alpha-1));
kmax = 1.05*(1/(alpha*zgrid(end)))*((1/beta)-1+delta)^(1/(alpha-1));
kstep = (kmax - kmin)/(nk-1);
% [1 x 4800] grid of possible values of k
kgrid = kmin:kstep:kmax;

Initial values

%% Compute the solution: Sequential 
tic                   % <----- Start the timer 
v = zeros(nk,nz);     % Value Function
v0 = zeros(nk,nz);    % v at the previous iteration
ev = zeros(nk,nz);    % continuation value
c0 = zeros(nk,nz);    % total wealth
c = zeros(nk,1);      % c(k'), given (k,z)

% Compute initial wealth c0(k,z)
for iz=1:nz;
  c0(:,iz) = zgrid(iz)*kgrid.^alpha + (1-delta)*kgrid;
end

% Define the utility function
uf = @(c) c.^(1-eta)/(1-eta);

Solve the problem

tol = 1e-8;                       % Tolerance for V
cnt = 1;                          % Iteration counter
while(1)    
    for iz=1:nz;                  % For all values of z 
        for ik = 1:nk;            % For all values of k
            c = c0(ik,iz)-kgrid;  % Compute c(k')
            ind = c>0;            % Check if c(k') > 0
            % Grid search to find v(k,z):
            v(ik,iz) = max(uf(c(ind))+ev(ind,iz)');                      
        end
    end
    ev = beta*v*tran_z';          % Compute the expectation
    % Check convergence:
    dif = max(abs(v(:)-v0(:)));
    v0 = v;                       % Save the value function
    if(~mod(cnt,100)) fprintf('%d : %f\n',cnt,dif); end;
    cnt = cnt+1;
    if(dif<tol) break; end;
end
toc  % <----- Stop the timer

Results

  • Time: 3718.57 seconds
  • What if we need to compute this for different values of parameters?

Fortran implementation

Declarations

Need to declare variables first:

program rbc
  use omp_lib     ! For timing
  implicit none

  real, parameter :: beta = 0.984, eta = 2, alpha = 0.35, delta = 0.01, &
       rho = 0.95, sigma = 0.005, zmin=-0.0480384, zmax=0.0480384;
  integer, parameter :: nz = 4, nk=4800;
  real :: zgrid(nz), kgrid(nk), t_tran_z(nz,nz), tran_z(nz,nz);
  real :: kmax, kmin, tol, dif, c(nk), r(nk), w(nk);
  real, dimension(nz,nk) :: v=0., v0=0., ev=0., c0=0.;
  integer :: i, iz, ik, cnt;
  logical :: ind(nk);
  real(kind=8) :: start, finish   ! For timing
  real :: tmpmax, c1  

Main loop

MatlabFortran
while(1)do while(dif>tol)
for iz=1:nz;do iz=1,nz;
for ik = 1:nk;do ik = 1,nk;
c = c0(ik,iz)-kgrid;c = c0(ik,iz)-kgrid;
ind = c>0;ind = c>0;
v(ik,iz) = max(uf(c(ind))+ev(ind,iz)');v(ik,iz) = maxval(pack(c,ind)**(1-eta)/(1-eta)+pack(ev(:,iz), ind));
endend do
endend do
ev = beta*v*tranz';ev = beta*matmul(v,ttranz)
dif = max(abs(v(:)-v0(:)));dif = maxval(abs(v-v0))
v0 = v;v0 = v
if(~mod(cnt,100)) fprintf('%d : %f\n',cnt,dif); end;if(mod(cnt,100)==0) write(*,*) cnt, ':', dif
cnt = cnt+1;cnt = cnt+1
if(dif<tol) break; end;
endend do

Loops are faster in Fortran!

This "vectorized" piece is valid Fortran code:

c = c0(ik,iz)-kgrid;                  
ind = c>0;
v(ik,iz) = maxval(pack(c,ind)**(1-eta)/(1-eta)+pack(ev(:,iz), ind));

An explicit loop over \(k'\) can avoid computation for \(k' > c0\):

tmpmax = -huge(0.)
do i = 1,nk
   c1 = c0(ik,iz) - kgrid(i)
   if(c1<0) exit
   c1 = c1**(1-eta)/(1-eta)+ev(i,iz)
   if(tmpmax<c1) tmpmax = c1
end do
v(ik,iz) = tmpmax

This would be (much) slower in Matlab, but faster in Fortran.

Timings

There are many Fortran and C compilers…

Flagifort/iccpgf95/pgccgfortran/gcc
Default double (Fort. only)-r8-r8-fdefault-real-8
Auto-optimize for speed-fast-fast
Extra optimizations-O3-O3-O3
Optimize for recent Intel-mtune=core2
Optimize for recent AMD-mtune=amdfam10
Use "fast math"-ffast-math
Link math library (C only)-ml
Use OpenMP-openmp-mp-fopenmp
  • Compiler string:
    • ifort -r8 -fast -O3 rbc.f90
  • Time: 707.87 sec (5.25x Matlab)

Parallelized code

Matlab implementation

Matlab parallel computing and parfor

  • Matlab parallel toolbox:
    • Allows to run code in parallel across multiple CPUs/cores
    • skew4, Quest, SSCC, Kellogg desktop licenses
    • Without Matlab server, limited to 8 "local" threads
  • Easiest way to parallelize: use parfor in place of for
    • The index of the parfor loop cannot be combined with any other!
    • Creates some communication overhead
    • requires additional programming
  • parfor is just for when running sequentially
    • But overhead is still there…

Matlab: parallelizing code

Parallelized code requires some adjustments in Matlab:

parfor ik=1:nk;            % Put the longer loop first
    vtmp = zeros(nz,1);    % Temporary
    cc = c0(ik,:);         % Temporary
    c = zeros(nk,1);       % Fill with zeroes
    ind = zeros(nk,1);     % Fill with zeroes
    for iz = 1:nz;
        c = cc(iz)-kgrid;
        ind = c>0;
        vtmp(iz) = max(uf(c(ind))+ev(ind,iz)');                      
    end
    v(ik,:) = vtmp;        % Assign to v
end
  • Time: 478.94 sec (7.8x sequential Matlab, 1.5x seq. Fortran)
  • On a multi-core machine, faster than (sequential) compiled code!

Fortran implementation

Fortran and OpenMP

  • An industry standard for parallelization in C and Fortran
  • Implemented as comments with compiler directives
  • Supported by all major compilers
  • Easy to program
  • Allows sharing variables between threads
  • Limited to shared-memory computers (e.g. a single node)

Fortran: parallelizing code (OpenMP)

The only difference is the two comment lines!

!$omp parallel do default(shared) private(ik,iz,i,tmpmax,c1)
do ik=1,nk;        
   do iz = 1,nz;
      tmpmax = -huge(0.)
      do i = 1,nk
         c1 = c0(ik,iz) - kgrid(i)
         if(c1<0) exit
         c1 = c1**(1-eta)/(1-eta)+ev(i,iz)
         if(tmpmax<c1) tmpmax = c1
      end do
      v(ik,iz) = tmpmax
   end do
end do
!$omp end parallel do

We need to tell the compiler that we want to parallelize with OpenMP:

  • ifort -r8 -fast -O3 -openmp rbc.f90
  • gfortran -fdefault-real-8 -O3 -fopenmp -ffast-math rbc.f90
  • pgf95 -r8 -fast -O3 -mp rbc.f90

Timings

  • Sequential Matlab: 3718.6 sec
  • Sequential Fortran: 707.9 sec
  • Parallel Matlab: 478.94 sec
  • Parallel Fortran: 66.22 sec

GPU computation

PGI directives

We'll use PGI compiler directives. The PGI compilers are available on SSCC. Note that we only had to add four extra lines!!

!$acc region
!$acc do parallel
do ik = 1,nk;
   !$acc do parallel
   do iz=1,nz;        
      tmpmax = -huge(0.)
      do i = 1,nk
         c1 = c0(ik,iz) - kgrid(i)
         if(c1<0) exit
         c1 = c1**(1-eta)/(1-eta)+ev(i,iz)
         if(tmpmax<c1) tmpmax = c1
      end do
      v(ik,iz) = tmpmax
   end do
end do
!$acc end region

Timings

CodeTimeSpeedup
Sequential Matlab3718.61
Sequential Fortran707.95.3
Sequential Fortran (single prec.)343.710.8
Parallel Matlab478.97.8
Parallel Fortran66.256.1
Parallel Fortran (single prec.)34.0109.5
GPU Fortran19.0195.7
GPU Fortran (single prec.)7.0531.2

© 2001-2011 Kellogg School of Management, Northwestern University