We explore the comparative performance of the Cray XMT and XMT-2 massively multithreaded supercomputers. We use benchmarks to evaluate memory accesses for various types of loops. We also compare the performance of these machines on matrix multiply and on three previously implemented dynamic programming algorithms. It is shown that the relative performance of these machines is dependent on the size (number of processors) of the configuration, as well as the size of the problem being evaluated. In particular, small configurations of the original XMT can sometimes show slightly better performance than larger configurations of the XMT-2, for the same problem size. We note that, under heavy memory load, performance of loops can saturate well before the maximum number of processors available. This suggests that it may not always be useful to use the maximum number of processors for a specific run. We also show that manual restructuring of nested loops, including decreasing the parallelism, can result in major improvements in performance. The results in this paper indicate that careful exploration of the space of problem sizes, number of processors used, and choices of loop parallelization can yield substantial improvements in performance. These improvements can be very significant for production codes that run for extended periods of time.
Tech report OSUBMI-TR-2012-n01 (pdf, 1.1MB)