In this paper, we propose an efficient architecture for implementing the 2D FFT on large input data, based on a novel 2D decomposition algorithm. The architecture achieves very high throughput by exploiting the inherent parallelism of the transform and the row-wise burst access pattern of the external memory. A custom-designed high-throughput memory interface block enables maximum utilization of the memory bandwidth. In addition, an automatic system generator is provided for mapping the architecture onto reconfigurable platforms based on Xilinx Virtex-4 or Virtex-5 devices. For a 2K × 2K input, the proposed architecture is 1.96x faster than a row-column (RC) decomposition-based implementation under the same memory constraints, and it also outperforms existing 2D FFT implementations.
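For orientation, the RC decomposition used as the comparison baseline can be sketched in a few lines of Python. This is only an illustrative software model of the standard row-column method, not the proposed 2D decomposition or the hardware design; the function names (dft_1d, fft2d_row_column, dft_2d_direct) are hypothetical, and a naive O(n^2) DFT stands in for a 1D FFT kernel. The sketch shows why the second pass is awkward for row-major external memory: it walks columns, so consecutive accesses are strided rather than contiguous.

```python
import cmath


def dft_1d(x):
    """Naive O(n^2) 1D DFT; a stand-in (assumption) for a 1D FFT kernel."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def fft2d_row_column(a):
    """Row-column 2D DFT of a rectangular matrix given as a list of rows."""
    # Pass 1: transform every row. In row-major memory these reads are
    # contiguous, which matches a row-wise burst access pattern.
    rows = [dft_1d(r) for r in a]
    # Pass 2: transform every column. In row-major memory these reads
    # are strided by the row length, so bursts are poorly utilized.
    cols = [dft_1d(list(c)) for c in zip(*rows)]
    # Transpose back so the result is indexed as out[u][v].
    return [list(r) for r in zip(*cols)]


def dft_2d_direct(a):
    """Direct evaluation of the 2D DFT definition, used only as a check."""
    n1, n2 = len(a), len(a[0])
    return [
        [
            sum(
                a[m][n] * cmath.exp(-2j * cmath.pi * (u * m / n1 + v * n / n2))
                for m in range(n1)
                for n in range(n2)
            )
            for v in range(n2)
        ]
        for u in range(n1)
    ]
```

Calling fft2d_row_column on a small matrix and comparing it element by element with dft_2d_direct confirms that the two passes together compute the full 2D transform; a hardware realization would replace dft_1d with a pipelined FFT core.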