When simulating low-density parity-check (LDPC) codes on a general-purpose graphics processing unit (GPGPU), two factors significantly affect performance: efficient use of the limited cache memory and appropriate assignment of threads. Because GPGPUs are designed for compute-intensive applications, they are not optimized for data caching or control management. LDPC codes, however, have parity-check matrices H of various sizes, and H is accessed at every node update, so efficient memory access to that matrix is essential. From that point of view, cyclic and quasi-cyclic codes are well suited to a GPGPU-based simulator, because their cyclic property allows the whole of H to be derived from a small part of it. In our experiments, NVIDIA's compute unified device architecture (CUDA) is used. With the (1057, 813) and (4161, 3431) PG–LDPC codes, the CUDA-based LDPC decoder achieves a throughput of 3.8 Mbps, a speedup of 17 to 42 times over a decoder running on a high-performance personal computer (PC).
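The memory advantage of the cyclic property can be illustrated with a minimal sketch. It assumes that H is a single n × n circulant, so that only the column positions of the ones in the first row need to be stored and row i is recovered by adding i modulo n. The names `row_support`, `syndrome`, `N` and `FIRST_ROW` are hypothetical, and the length-7 matrix built from the set {0, 1, 3} is a toy example, not one of the codes used in the experiments.

```python
# Toy circulant parity-check matrix (an assumption for illustration only):
# the ones in row 0 sit in columns 0, 1 and 3 of a 7 x 7 matrix.
N = 7
FIRST_ROW = [0, 1, 3]


def row_support(first_row, n, i):
    """Return the column indices of the ones in row i of the circulant.

    Row i is row 0 shifted cyclically by i positions, so only the
    first row has to be kept in memory.
    """
    return sorted((c + i) % n for c in first_row)


def syndrome(first_row, n, word):
    """Compute H * word^T over GF(2) without ever storing H."""
    return [sum(word[(c + i) % n] for c in first_row) % 2 for i in range(n)]


if __name__ == "__main__":
    for i in range(N):
        print(i, row_support(FIRST_ROW, N, i))
    # A word with a single one in position 0.
    print(syndrome(FIRST_ROW, N, [1, 0, 0, 0, 0, 0, 0]))
```

In this sketch the stored description of H grows with the row weight rather than with n², which is the kind of saving that makes a small, fast memory region usable. How the index table is mapped onto GPU memory and threads is not shown here.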