Research Computing >> Training and Publications >> Parallelization for an RBC Model

# Parallelization for an RBC Model

## The model

• Aldrich, Fernandez-Villaverde, Gallant, Rubio-Ramirez (2010): "Tapping the Supercomputer under Your Desk: Solving Dynamic Equilibrium Models with Graphics Processors".

### Households

Households choose consumption $$c_t$$ and capital $$k_t$$ to maximize $$\text{E}_0 \sum_{t=0}^{\infty} \beta_t \frac{c^{1-\eta}_t}{1-\eta}$$ subject to a budget constraint $$c_t+i_t = w_t + r_t k_t$$ law of motion for capital: $$k_{t+1} = (1-\delta) k_t + i_t$$

### Production

Representative firm with technology $$y_t = z_t k_t^\alpha$$ Technological shock: $$\log z_t = \rho \log z_{t-1} + \varepsilon_t, \>\>\> \varepsilon_t \sim N(0,\sigma^2)$$

### Recursive formulation:

Select next period capital to maximize the value function: $$V(k,z) = \max_{k'} u(c(k',k,z))+\beta \text{E}\{V(k',z') | z\}$$ where $$c(k',k,z) = zk^\alpha + (1-\delta)k - k'$$ Notation: define continuation value $$\psi(k',z)$$ as the present expected value of investing $$k'$$ for the next period, given that we are today in state $$z$$: $$\psi(k', z) \equiv \beta \text{E}\{V(k',z') | z\}$$ the optimization problem then becomes $$V(k,z) = \max_{k'} u(c(k',k,z)) + \psi(k',z)$$ …If we knew the continuation value, we could solve this for $V$!

## Numerical Solution

### Value function iteration: theory

• Guess the value funciton at $$t=0$$ and proceed backwards
• i.e. to time t = -1, -2, -3, …
• if we go far enough, initial guess does not matter any more!
• Define $$V_0(k,z) \equiv 0$$
• For all $t=-1, -2, …$ :
• Compute $$\psi_{t-1}(k', z) \equiv \beta \text{E}_{t-1}\{V_t(k',z') | z\}$$
• This is the present expected value of investing $$k'$$, given that today we are in state $$z$$
• Compute $$V_{t-1}(k,z) = \max_{k'} u\left(c(k',k,z)\right) + \psi(k',z)$$
• Repeat until $$\max_{k,z}|V_t(k,z) - V_{t-1}(k,z)| < \varepsilon$$
• Once we have the value function, we have everything!

### Value function iteration: implementation

• Problem is discretized and solved on a grid of $$k$$ and $$z$$
• Grid search each period to find the maximum
• Simple to implement
• An approximate model, solved exactly
• as opposed to an exact model, solved approximately
• Does not rely on interpolation
• Does not rely on convex solvers
• Robust to local maxima
• Robust to kinks and jumps (e.g. liquidity constraints)
• Many ways to improve the speed by a lot
• But let us try parallelization first…
• This is an illustrative example, not the optimal method!

### Testing System

• Dual Intel Xeon X5670 (2x6 = 12c at 2.8GHz), 24G RAM
• NVidia GFX480 GPU (480 CUDA cores at 700MHz, 1.5G RAM)
• Fedora 13, Intel compiler suite 11.1, GCC 4.4.5

## Languages: Matlab and Fortran

### Matlab

• Easy to use
• Great documentation, built-in graphics
• Extensive toolbox library
• Requires Matlab Parallel Toolbox for parallelization
• Available in SSCC, Quest, Kellogg desktop installations
• Limited to max. 12 cores (as of R2011b)
• can do more, but with a (very) expensive server
• Invoke matlabpool open to enable parallel processing
• Some operations (matrix multiplication, optimization routines) are partially parallelized
• Also has (very limited) GPU functionality

### Fortran basics

• Fortran is one of the oldest programming languages
• Many existing codes, libraries, etc.
• Fortran 66, 77, 90, 95, 2003, 2008, …
• Syntax similar to Matlab
• Especially Fortran 90 and later
• Built-in array types, array math, and vector indexing
• Array indexes can start anywhere: A(-5:2, 13:18)
• First index changes fastest (i.e. storage by column)
• Popular compilers: Intel (icc), gfortran, PGI (pgf90)
• Can be combined with Matlab or other languages
• save a file in Matlab, process in Fortran, read computed results back into Matlab for plotting, etc.

### Fortran vs. Matlab: matrix algebra

Here, A, B,… are matrices; a,b,… are vectors:

MatlabFortran 95
A.*BA*B
A*Bmatmul(A,B)
A'transpose(A)
sum(a*b')dotproduct(a,b)
a(a>0)pack(a,a>0)
A(:, 2:end)A(:, 2:)
A = 2 + 3*BA = 2 + 3*B
A = 2*ones(size(A))A = 2
a = [2 3 4]a = [2, 3, 4]
A = reshape(B,[2 3])A = reshape(B, [2,3])

### Fortran vs. Matlab: control statements

MatlabFortran 95
for i=1:10; …; enddo i=1,10 …. end do
if i==j; …; else … ; endif(i==j) then… ; else… ; end if
x = f(y)x = f(y)
function x = f(y)real function f(y)
real :: y
[x,y] = f(z)call f(x,y,z)
function [x,y] = f(z)subroutine f(x,y,z)
real :: x,y,z
global x y zcommon x,y,z

## Sequential code

### Many options

• Languages, compilers and libraries matter
• Matlab: great documentation, can run in parallel
• OK for "vectorized" programs, slow in loops
• Fortran 90: syntax similar to Matlab
• C: the classic
• R: itself slow, but has many packages, easy to combine with C and

### Floating-point precision

• Most common format is double precision (8 bytes)
• Matlab format is double precision
• For some models, single precision (4 bytes) is enough
• C: "float" vs. "double"
• Fortran: "real" vs. "double precision"
• Force double: -r8 (ifort, PGI), -fdefault-real-8 (gfortran)
• The gain is higher on GPUs
• especially older models
• Fortran also has "quad precision" (16 bytes, slow)
• Make sure you choose precision that is sufficient for your problem!

### Matlab implementation

#### Initialization parameters

%% Initialize the parameters
beta = 0.984;      % Discount rate
eta = 2;           % Risk aversion parameter
alpha = 0.35;      % Technology parameter
delta = 0.01;      % Depreciation rate
rho = 0.95;        % Tech. shock persistence
sigma = 0.005;     % Tech. shock st. dev.


#### Discretizing productivity

• Method: Tauchen ("Finite State Markov-Chain Approximations…", Economic Letters, 1986)
• Gaussian random variable $$z$$ is approximated by a 4-state Markov chain
%% Grid for productivity z
nz = 4;            % Grid size
zmin=-0.0480384;
zmax=0.0480384;
% [1 x 4] grid of values for z
zgrid = exp(zmin:((zmax-zmin)/(nz-1)):zmax)
% [4 x 4] Markov transition matrix of z
tran_z = [0.996757 0.00324265 0 0; ...
0.000385933 0.998441 0.00117336 0 ;...
0 0.00117336 0.998441 0.000385933 ;...
0 0 0.00324265 0.996757 ] ;


#### Discretizing capital

• Grid for capital is the primary determinant of accuracy
• Baseline: $$nk = 4800$$ points
• Somewhat of an overkill in this case
• In slightly more complex problems, grids can easily get much larger!
• with $$N$$ states we have an $$N$$-dimensional grid
• Why 4800? It is nice to have the grid size divisible by the number of cores in the CPU (12) or GPU (480).
%% Grid for capital k
nk = 4800;     % Number of points for capital
kmin = 0.95*(1/(alpha*zgrid(1)))*((1/beta)-1+delta)^(1/(alpha-1));
kmax = 1.05*(1/(alpha*zgrid(end)))*((1/beta)-1+delta)^(1/(alpha-1));
kstep = (kmax - kmin)/(nk-1);
% [1 x 4800] grid of possible values of k
kgrid = kmin:kstep:kmax;


#### Initial values

%% Compute the solution: Sequential
tic                   % <----- Start the timer
v = zeros(nk,nz);     % Value Function
v0 = zeros(nk,nz);    % v at the previous iteration
ev = zeros(nk,nz);    % continuation value
c0 = zeros(nk,nz);    % total wealth
c = zeros(nk,1);      % c(k'), given (k,z)

% Compute initial wealth c0(k,z)
for iz=1:nz;
c0(:,iz) = zgrid(iz)*kgrid.^alpha + (1-delta)*kgrid;
end

% Define the utility function
uf = @(c) c.^(1-eta)/(1-eta);


#### Solve the problem

tol = 1e-8;                       % Tolerance for V
cnt = 1;                          % Iteration counter
while(1)
for iz=1:nz;                  % For all values of z
for ik = 1:nk;            % For all values of k
c = c0(ik,iz)-kgrid;  % Compute c(k')
ind = c>0;            % Check if c(k') > 0
% Grid search to find v(k,z):
v(ik,iz) = max(uf(c(ind))+ev(ind,iz)');
end
end
ev = beta*v*tran_z';          % Compute the expectation
% Check convergence:
dif = max(abs(v(:)-v0(:)));
v0 = v;                       % Save the value function
if(~mod(cnt,100)) fprintf('%d : %f\n',cnt,dif); end;
cnt = cnt+1;
if(dif<tol) break; end;
end
toc  % <----- Stop the timer


#### Results

• Time: 3718.57 seconds
• What if we need to compute this for different values of parameters?

### Fortran implementation

#### Declarations

Need to declare variables first:

program rbc
use omp_lib     ! For timing
implicit none

real, parameter :: beta = 0.984, eta = 2, alpha = 0.35, delta = 0.01, &
rho = 0.95, sigma = 0.005, zmin=-0.0480384, zmax=0.0480384;
integer, parameter :: nz = 4, nk=4800;
real :: zgrid(nz), kgrid(nk), t_tran_z(nz,nz), tran_z(nz,nz);
real :: kmax, kmin, tol, dif, c(nk), r(nk), w(nk);
real, dimension(nz,nk) :: v=0., v0=0., ev=0., c0=0.;
integer :: i, iz, ik, cnt;
logical :: ind(nk);
real(kind=8) :: start, finish   ! For timing
real :: tmpmax, c1


#### Main loop

MatlabFortran
while(1)do while(dif>tol)
for iz=1:nz;do iz=1,nz;
for ik = 1:nk;do ik = 1,nk;
c = c0(ik,iz)-kgrid;c = c0(ik,iz)-kgrid;
ind = c>0;ind = c>0;
v(ik,iz) = max(uf(c(ind))+ev(ind,iz)');v(ik,iz) = maxval(pack(c,ind)**(1-eta)/(1-eta)+pack(ev(:,iz), ind));
endend do
endend do
ev = beta*v*tranz';ev = beta*matmul(v,ttranz)
dif = max(abs(v(:)-v0(:)));dif = maxval(abs(v-v0))
v0 = v;v0 = v
if(~mod(cnt,100)) fprintf('%d : %f\n',cnt,dif); end;if(mod(cnt,100)==0) write(*,*) cnt, ':', dif
cnt = cnt+1;cnt = cnt+1
if(dif<tol) break; end;
endend do

#### Loops are faster in Fortran!

This "vectorized" piece is valid Fortran code:

c = c0(ik,iz)-kgrid;
ind = c>0;
v(ik,iz) = maxval(pack(c,ind)**(1-eta)/(1-eta)+pack(ev(:,iz), ind));


An explicit loop over $$k'$$ can avoid computation for $$k' > c0$$:

tmpmax = -huge(0.)
do i = 1,nk
c1 = c0(ik,iz) - kgrid(i)
if(c1<0) exit
c1 = c1**(1-eta)/(1-eta)+ev(i,iz)
if(tmpmax<c1) tmpmax = c1
end do
v(ik,iz) = tmpmax


This would be (much) slower in Matlab, but faster in Fortran.

### Timings

There are many Fortran and C compilers…

Flagifort/iccpgf95/pgccgfortran/gcc
Default double (Fort. only)-r8-r8-fdefault-real-8
Auto-optimize for speed-fast-fast
Extra optimizations-O3-O3-O3
Optimize for recent Intel-mtune=core2
Optimize for recent AMD-mtune=amdfam10
Use "fast math"-ffast-math
Use OpenMP-openmp-mp-fopenmp
• Compiler string:
• ifort -r8 -fast -O3 rbc.f90
• Time: 707.87 sec (5.25x Matlab)

## Parallelized code

### Matlab implementation

#### Matlab parallel computing and parfor

• Matlab parallel toolbox:
• Allows to run code in parallel across multiple CPUs/cores
• skew4, Quest, SSCC, Kellogg desktop licenses
• Without Matlab server, limited to 8 "local" threads
• Easiest way to parallelize: use parfor in place of for
• The index of the parfor loop cannot be combined with any other!
• parfor is just for when running sequentially
• But overhead is still there…

#### Matlab: parallelizing code

Parallelized code requires some adjustments in Matlab:

parfor ik=1:nk;            % Put the longer loop first
vtmp = zeros(nz,1);    % Temporary
cc = c0(ik,:);         % Temporary
c = zeros(nk,1);       % Fill with zeroes
ind = zeros(nk,1);     % Fill with zeroes
for iz = 1:nz;
c = cc(iz)-kgrid;
ind = c>0;
vtmp(iz) = max(uf(c(ind))+ev(ind,iz)');
end
v(ik,:) = vtmp;        % Assign to v
end

• Time: 478.94 sec (7.8x sequential Matlab, 1.5x seq. Fortran)
• On a multi-core machine, faster than (sequential) compiled code!

### Fortran implementation

#### Fortran and OpenMP

• An industry standard for parallelization in C and Fortran
• Implemented as comments with compiler directives
• Supported by all major compilers
• Easy to program
• Allows sharing variables between threads
• Limited to shared-memory computers (e.g. a single node)

#### Fortran: parallelizing code (OpenMP)

The only difference is the two comment lines!

!$omp parallel do default(shared) private(ik,iz,i,tmpmax,c1) do ik=1,nk; do iz = 1,nz; tmpmax = -huge(0.) do i = 1,nk c1 = c0(ik,iz) - kgrid(i) if(c1<0) exit c1 = c1**(1-eta)/(1-eta)+ev(i,iz) if(tmpmax<c1) tmpmax = c1 end do v(ik,iz) = tmpmax end do end do !$omp end parallel do


We need to tell the compiler that we want to parallelize with OpenMP:

• ifort -r8 -fast -O3 -openmp rbc.f90
• gfortran -fdefault-real-8 -O3 -fopenmp -ffast-math rbc.f90
• pgf95 -r8 -fast -O3 -mp rbc.f90

### Timings

• Sequential Matlab: 3718.6 sec
• Sequential Fortran: 707.9 sec
• Parallel Matlab: 478.94 sec
• Parallel Fortran: 66.22 sec

## GPU computation

### PGI directives

We'll use PGI compiler directives. The PGI compilers are available on SSCC. Note that we only had to add four extra lines!!

!$acc region !$acc do parallel
do ik = 1,nk;
!$acc do parallel do iz=1,nz; tmpmax = -huge(0.) do i = 1,nk c1 = c0(ik,iz) - kgrid(i) if(c1<0) exit c1 = c1**(1-eta)/(1-eta)+ev(i,iz) if(tmpmax<c1) tmpmax = c1 end do v(ik,iz) = tmpmax end do end do !$acc end region


### Timings

CodeTimeSpeedup
Sequential Matlab3718.61
Sequential Fortran707.95.3
Sequential Fortran (single prec.)343.710.8
Parallel Matlab478.97.8
Parallel Fortran66.256.1
Parallel Fortran (single prec.)34.0109.5
GPU Fortran19.0195.7
GPU Fortran (single prec.)7.0531.2

© 2001-2011 Kellogg School of Management, Northwestern University