Accelerating Seismic Wave Propagation Simulations | GTC 2013

Transcript

Accelerating Seismic Wave Propagation Simulations on Petascale Heterogeneous Architectures
Yifeng Cui, University of California, San Diego
GTC 2013
Southern California Earthquake Center (SCEC) Computational Pathways
(Diagram, image courtesy of SCEC: the computational pathways connect standard seismic hazard analysis, ground-motion simulation, dynamic rupture modeling, and the ground-motion inverse problem. Other data, geology and geodesy feed the structural representation; the earthquake rupture forecast and an "extended" earthquake rupture forecast drive KFR and DFR sources, AWP propagation, NSR site response, attenuation relationships, and intensity measures. KFR = Kinematic Fault Rupture, DFR = Dynamic Fault Rupture, AWP = Anelastic Wave Propagation, NSR = Nonlinear Site Response. The overall theme is improvement of models, moving from empirical models toward physics-based simulations.)

M8 – Largest-ever Earthquake Simulation
Why GPUs? SCEC CyberShake 3.0 Statewide Hazard Model
•  California statewide wave-propagation-based PSHA map up to 1 Hz planned
•  4,240 sites, f < 1.0 Hz, to be computed using the AWP-Graves and AWP-ODC codes:
   -  750 million CPU hours
   -  3.6 billion jobs
   -  48 PB of total output data
   -  2.6 PB of stored data
   -  77 TB of archived data
(CyberShake seismogram, LA region; source: SCEC. CyberShake 1.0 hazard map, PoE = 2% in 50 yrs.)

GPU-Based CyberShake SGT Simulations for Reduced Time-to-Solution and Power Efficiency

CyberShake 3.0       CPU          GPU          CPU+GPU**
Cores                14,400       900          14,400+900
Hours per SGT run    2.56         0.51         0.44
Total power cost*    $805,000     $72,000      $202,000
Memory (TB)          28.8         5.4          29.9
Flop/s               15.4 Tflops  100 Tflops   ~115 Tflops
*  Power consumption based on Cray XK6, assuming 5,000 sites and two strain Green tensor runs per site
** Estimated

AWP-ODC
•  Started as a personal research code (Olsen 1994)
•  Solves the 3D velocity-stress wave equations

   \partial_t v = \frac{1}{\rho}\,\nabla\cdot\sigma
   \partial_t \sigma = \lambda\,(\nabla\cdot v)\,I + \mu\,(\nabla v + \nabla v^{T})

•  Memory-variable formulation of inelastic relaxation, using the coarse-grained representation of Day (1998):

   \sigma(t) = M_u\left(\varepsilon(t) - \sum_{i=1}^{N}\zeta_i(t)\right)
   \tau_i\,\frac{d\zeta_i(t)}{dt} + \zeta_i(t) = \lambda_i\,\frac{\delta M}{M_u}\,\varepsilon(t)
   Q^{-1}(\omega) \approx \frac{\delta M}{M_u}\sum_{i=1}^{N}\frac{\lambda_i\,\omega\tau_i}{\omega^{2}\tau_i^{2} + 1}

•  Dynamic rupture by the staggered-grid split-node (SGSN) method (Dalguer and Day 2007)
•  Solved by an explicit staggered-grid 4th-order finite-difference scheme
•  Absorbing boundary conditions by perfectly matched layers (PML) (Marcinkovich and Olsen 2003) and the Cerjan et al. approach
(Figures: distribution of relaxation times over the coarse-grained unit cells of inelastic relaxation variables for the memory-variable ODEs in AWP-ODC; staggered-grid unit cell showing the velocity and stress components on the x1/x2/x3 axes, after Bielak et al. 2009.)
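To make the discretization concrete, the following is a minimal CUDA sketch of a 4th-order staggered-grid velocity update of the kind the AWP-ODC kernels perform. The kernel name (update_vx), array layout, halo handling and thread mapping are illustrative assumptions, not the production AWP-ODC-GPU code.

// Hypothetical sketch: 4th-order staggered-grid FD update of one velocity
// component, vx += (dt / rho) * (d(sxx)/dx + d(sxy)/dy + d(sxz)/dz).
#include <cuda_runtime.h>

// Linear index into a contiguous (nx, ny, nz) array, k fastest (assumption).
#define IDX(i, j, k, ny, nz) (((size_t)(i) * (ny) + (j)) * (nz) + (k))

__global__ void update_vx(float *vx,
                          const float *sxx, const float *sxy, const float *sxz,
                          const float *rho,
                          int nx, int ny, int nz, float dt, float dh)
{
    // 4th-order staggered-grid difference coefficients.
    const float c1 = 9.0f / 8.0f, c2 = -1.0f / 24.0f;

    int i = blockIdx.z * blockDim.z + threadIdx.z + 2;   // skip the 2-point halo
    int j = blockIdx.y * blockDim.y + threadIdx.y + 2;
    int k = blockIdx.x * blockDim.x + threadIdx.x + 2;
    if (i >= nx - 2 || j >= ny - 2 || k >= nz - 2) return;

    float dsxx_dx = c1 * (sxx[IDX(i, j, k, ny, nz)]     - sxx[IDX(i - 1, j, k, ny, nz)])
                  + c2 * (sxx[IDX(i + 1, j, k, ny, nz)] - sxx[IDX(i - 2, j, k, ny, nz)]);
    float dsxy_dy = c1 * (sxy[IDX(i, j, k, ny, nz)]     - sxy[IDX(i, j - 1, k, ny, nz)])
                  + c2 * (sxy[IDX(i, j + 1, k, ny, nz)] - sxy[IDX(i, j - 2, k, ny, nz)]);
    float dsxz_dz = c1 * (sxz[IDX(i, j, k, ny, nz)]     - sxz[IDX(i, j, k - 1, ny, nz)])
                  + c2 * (sxz[IDX(i, j, k + 1, ny, nz)] - sxz[IDX(i, j, k - 2, ny, nz)]);

    size_t c = IDX(i, j, k, ny, nz);
    vx[c] += dt / (rho[c] * dh) * (dsxx_dx + dsxy_dy + dsxz_dz);
}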
Flops-to-Bytes Ratio of AWP-ODC Kernels

Three most time-consuming kernels   Reads   Writes   Flops   Flops/Bytes
Velocity Comp.                      51      3        86      0.398
Stress-1 Comp.                      85      12       221     0.569
Total                               136     15       307     0.508
(The ratios are consistent with counting reads and writes as 4-byte words, i.e. flops / (4 × (reads + writes)).)

Decomposition on CPU and GPU
•  Two-layer 3D domain decomposition on CPU-GPU based heterogeneous supercomputers: a first-step X/Y decomposition across CPUs, and a second-step Y/Z decomposition across GPU SMs.

Single-GPU Optimizations
✓  Step 2: GPU 2D decomposition in y/z rather than x/y
✓  Step 3: Global memory optimization — coalesced global memory accesses, texture memory for six 3D constant variables, constant memory for scalar constants
✓  Step 4: Register optimization — pipelined register copies to reduce global memory accesses (see the sketch after this list)
✓  Step 5: L1/L2 cache vs. shared memory — rely on L1/L2 cache rather than on-chip shared memory
(Zhou, J. et al., ICCS 2012)
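The sketch below shows one way the Step-4 "pipelined register copy" idea can be realized for a single derivative term, under the assumption that each thread owns one (j,k) column and walks it along x; the names and layout are again assumptions, not the actual AWP-ODC-GPU kernel.

// Hypothetical register-pipelining sketch: keep a 4-point window of sxx in
// registers so each grid point needs only one new global load instead of four.
#define IDX(i, j, k, ny, nz) (((size_t)(i) * (ny) + (j)) * (nz) + (k))  // same indexing as above

__global__ void update_vx_pipelined(float *vx, const float *sxx, const float *rho,
                                    int nx, int ny, int nz, float dt, float dh)
{
    const float c1 = 9.0f / 8.0f, c2 = -1.0f / 24.0f;

    int j = blockIdx.y * blockDim.y + threadIdx.y + 2;
    int k = blockIdx.x * blockDim.x + threadIdx.x + 2;
    if (j >= ny - 2 || k >= nz - 2) return;

    // Prime the register window with sxx at x-indices 0..3.
    float m2 = sxx[IDX(0, j, k, ny, nz)];   // sxx[i-2]
    float m1 = sxx[IDX(1, j, k, ny, nz)];   // sxx[i-1]
    float p0 = sxx[IDX(2, j, k, ny, nz)];   // sxx[i]
    float p1 = sxx[IDX(3, j, k, ny, nz)];   // sxx[i+1]

    for (int i = 2; i < nx - 2; ++i) {
        size_t c = IDX(i, j, k, ny, nz);
        float dsxx_dx = c1 * (p0 - m1) + c2 * (p1 - m2);
        vx[c] += dt / (rho[c] * dh) * dsxx_dx;   // x-derivative term only

        // Shift the window: one new global load per step.
        m2 = m1;  m1 = p0;  p0 = p1;
        p1 = sxx[IDX(i + 2, j, k, ny, nz)];
    }
}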
Communication Reduction
•  Extend the ghost cell region with two extra layers and compute, rather than communicate, the ghost cell region updates before the stress computation.
•  The 2D XY plane represents the 3D sub-domain, as no communication in the Z direction is required due to the 2D decomposition for GPUs.

GPU-GPU Communication
                         Velocity                       Stress
                         Freq.   Message size           Freq.   Message size
Before comm reduction    4       6*(nx+ny)*NZ           4       12*(nx+ny)*NZ
After comm reduction     4       12*(nx+ny+4)*NZ        -       no communication

In-order communication fills the 4-layer ghost cells via communication reduction and keeps the MPI communication frequency at 4x per iteration.
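A small host-side helper, using only the message sizes quoted in the table above (in 4-byte words), illustrates the net effect of the communication reduction; the sub-domain dimensions are placeholder values, not figures from the talk.

// Hypothetical bookkeeping for the ghost-cell exchange volume per timestep.
#include <stdio.h>

int main(void)
{
    long nx = 512, ny = 512, NZ = 512;   // example local sub-domain size (assumption)

    // Before communication reduction: velocity and stress are both exchanged.
    long words_before = 6  * (nx + ny) * NZ      // velocity ghost layers
                      + 12 * (nx + ny) * NZ;     // stress ghost layers

    // After: only velocity is exchanged, with a 4-layer ghost region, so each
    // velocity message roughly doubles while the stress exchange disappears.
    long words_after  = 12 * (nx + ny + 4) * NZ;

    printf("words per exchange: before=%ld after=%ld (ratio %.2f)\n",
           words_before, words_after, (double)words_before / words_after);
    // With nx = ny = 512 the ratio is about 1.5: roughly one third less data
    // moved, and half as many MPI messages per timestep.
    return 0;
}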
Computing and Communication Overlapping

AWP-ODC-GPU Main Loop:
Do T = timestep 0 to timestep N:
    Pre-post MPI_Irecv calls waiting for V1, V2, V3 and V4 of (vx, vy, vz)
    Compute V1 for (vx, vy, vz) on the GPU
    Compute V2/V3 for (vx, vy, vz) on the GPU and initiate the transfer of V1 from GPU to CPU
    Compute V4/V5 for (vx, vy, vz) on the GPU
    Compute S5 for (xx, yy, zz, xy, yz, xz) on the GPU
    Wait for the V1/V2 data transfer to finish, then initiate MPI_Isend for M1/M2
    Wait for MPI message M1 to be received, then initiate the transfer of G1 from CPU to GPU
    Wait for the G1 data transfer to finish, then initiate the transfer of V3 from GPU to CPU
    Wait for the V3 data transfer to finish, then initiate MPI_Isend for M3
    Wait for MPI message M2 to be received, then initiate the transfer of G2 from CPU to GPU
    Wait for the G2 data transfer to finish, then initiate the transfer of V4 from GPU to CPU
    Wait for the V4 data transfer to finish, then initiate MPI_Isend for M4
    Wait for MPI messages M3/M4 to be received, then initiate the transfer of G3/G4 from CPU to GPU
    Compute the rest of the stress computation, S1-S4, for (xx, yy, zz, xy, yz, xz) on the GPU
End Do
(Zhou, J. et al., ICCS 2013)
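The following is a simplified, hypothetical skeleton of the overlap pattern above, using two CUDA streams and non-blocking MPI; the kernels, buffer layout and two-neighbor topology are placeholders rather than the AWP-ODC-GPU implementation.

// Boundary work and halo traffic on one stream, interior work on another.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void velocity_boundary() { /* placeholder boundary velocity update */ }
__global__ void velocity_interior() { /* placeholder interior velocity update */ }
__global__ void stress_update()     { /* placeholder stress update            */ }

// d_halo: device staging buffer (>= 2*halo_count floats: outgoing face first,
// then room for the received ghost layers). h_send/h_recv: pinned host buffers.
void timestep(float *d_halo, float *h_send, float *h_recv, size_t halo_count,
              int north, int south, MPI_Comm comm,
              cudaStream_t s_comm, cudaStream_t s_comp)
{
    dim3 grid(64, 64), block(32, 8);
    MPI_Request req[4];

    // 1. Pre-post receives for the incoming velocity ghost layers.
    MPI_Irecv(h_recv,              (int)halo_count, MPI_FLOAT, north, 0, comm, &req[0]);
    MPI_Irecv(h_recv + halo_count, (int)halo_count, MPI_FLOAT, south, 1, comm, &req[1]);

    // 2. Update the boundary velocities first and start copying them off the GPU.
    velocity_boundary<<<grid, block, 0, s_comm>>>();
    cudaMemcpyAsync(h_send, d_halo, halo_count * sizeof(float),
                    cudaMemcpyDeviceToHost, s_comm);

    // 3. The interior velocity update overlaps with the halo traffic.
    velocity_interior<<<grid, block, 0, s_comp>>>();

    // 4. When the boundary copy completes, send it to both neighbors
    //    (a real code packs a distinct face buffer per neighbor).
    cudaStreamSynchronize(s_comm);
    MPI_Isend(h_send, (int)halo_count, MPI_FLOAT, north, 1, comm, &req[2]);
    MPI_Isend(h_send, (int)halo_count, MPI_FLOAT, south, 0, comm, &req[3]);

    // 5. Once the ghost data arrives, push it back to the GPU and run the
    //    stress update that depends on it.
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    cudaMemcpyAsync(d_halo, h_recv, 2 * halo_count * sizeof(float),
                    cudaMemcpyHostToDevice, s_comm);
    cudaStreamSynchronize(s_comm);
    stress_update<<<grid, block, 0, s_comp>>>();
}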
Parallel Multi-GPU Performance
(Figure: speedup and sustained Tflop/s versus number of GPUs, from 2 to more than 2,000: ideal speedup, NCCS Titan speedup and Flop/s, Blue Waters Flop/s, and Keeneland Flop/s with 1 GPU/node and 3 GPUs/node.)

Verification of Surface Velocity
•  Comparison of visualizations of surface velocity in the X direction, in units of m/s. Four snapshots are taken at 6, 7, 10 and 25 seconds (after 6,000, 7,000, 10,000 and 25,000 timesteps, respectively). The AWP-ODC-CPU code was extensively verified and validated.

Wave Propagation Simulation up to 2.5 Hz for the 2008 Mw 5.4 Chino Hills Earthquake
Science: Kim Olsen, William Savran, SDSU; Philip Maechling and Thomas Jordan, USC. Keeneland support: Jeffrey Vetter and team. Simulation: Efecan Poyraz, Jun Zhou and Yifeng Cui, SDSC. Visualization: Efecan Poyraz. Map image: Google map, provided by Tim Scheitlin of NCAR.

Next movie, simulating the same dataset: Wave Propagation for the Mw 5.4 Chino Hills, CA, Earthquake, including a Statistical Model of Small-Scale Heterogeneities
Visualization: Tim Scheitlin and Perry Domingo of NCAR. Map image: supplied by Google map. Simulation: Efecan Poyraz and Yifeng Cui of SDSC. Science: Kim Olsen, William Savran, Philip Maechling and Thomas Jordan of SCEC. Supercomputer: Yellowstone, NCAR.

Wave Propagation Simulation for the Mw 5.4 Chino Hills Earthquake, including a Statistical Model of Small-Scale Heterogeneities
Image: Tim Scheitlin and Perry Domingo of NCAR. Map image: Google map. Simulation: Efecan Poyraz and Yifeng Cui of SDSC. Science: Kim Olsen, Philip Maechling and Thomas Jordan of SCEC.

Conclusions & Future Work
•  The GPU implementation of AWP-ODC scales linearly up to the 4K GPUs tested. The code achieves a 5.2x speedup on a Tesla K20X over a 16-core AMD Opteron CPU, and 630 Tflop/s was recorded using 4,096 GPUs.
•  The short-term plan is to add strain Green tensor calculations to the kernel and to tune the GPU code on Kepler-based systems, preparing production capabilities that can be used to calculate statewide deterministic and probabilistic seismic hazard in California.
•  The ultimate goal is to support development of a CyberShake model that can assimilate information during earthquake cascades for operational forecasting and early warning.

Acknowledgements
•  The GPU implementation team:
   –  Jun Zhou, Efecan Poyraz, Dongju Choi and Yifeng Cui
•  This work is supported by:
   –  XSEDE ECSS Advanced Support for Applications (OCI-1053575)
   –  NCSA direct PRAC support funding
   –  SCEC 2012 Core Program, Geoinformatics (EAR-1226343) and SEISM (OCI-1148493)
•  Implementation and tests were carried out on six M2050 and M2090 GPUs donated by NVIDIA. Benchmarks were performed on the XSEDE Tesla M2090 Keeneland, NCSA Blue Waters Tesla K20X, and NCCS TitanDev M2090 and Titan K20X systems.
•  Thanks to:
   –  Kim Olsen, San Diego State University
   –  Philip Maechling and Thomas Jordan, USC/SCEC
   –  Carl Ponder, Cyril Zeller and Stanley Posey, NVIDIA
   –  Sreeram Potluri, Karen Tomko and DK Panda, OSU
   –  Amit Chourasia, SDSC
   –  Jeffrey Vetter, Mitch D. Horton, Graham Lopez and Dick Glassbrook, NICS/Georgia Tech Keeneland team
   –  Jay Alameda, NCSA