
Linpack benchmark report table 1.
Question: I managed to get a step forward! I ran the mp_linpack with the following line:

The matrix A is randomly generated for each test. The following scaled residual check will be computed: the relative machine precision (eps) is taken to be 1.110223e-16, and computational tests pass if scaled residuals are less than 16.0.

It seems like the results are close to theoretical. So the Intel mp_linpack treats each node (a 2-socket E5 with 10 cores per socket) as -n 1; since I have 12 nodes, I only have to use -n 12. Do these results look right?

On each compute node (dual-socket E5-2650 v3, 10 cores / 20 threads per socket), when I open htop it shows 40 CPUs. When I run mp_linpack, only CPUs 1 to 20 go to 100% usage; CPUs 21 to 40 stay at 0%. Is Linpack using only half the compute power? Does it only go by the physical core count? How do I run Linpack so it will use all 40 threads?

Answer: According to Table 3 of "Intel Xeon Processor E5 v3 Product Family: Processor Specification Update" (document 330785, revision 009, August 2015), the Xeon E5-2650 v3 has a minimum "Intel AVX Core Frequency" of 2.0 GHz.
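The scaled residual check quoted from the benchmark output above can be sketched in plain Python. This is a minimal reproduction of the HPL-style check, not Intel's implementation; the problem size `n` and the pure-Python Gaussian elimination are illustrative assumptions.

```python
import random

def scaled_residual(n=120, seed=0):
    """Solve a random dense system with partial-pivoting Gaussian
    elimination and return an HPL-style scaled residual:
        ||A*x - b||_inf / (eps * (||A||_inf * ||x||_inf + ||b||_inf) * n)
    """
    random.seed(seed)
    # "The matrix A is randomly generated for each test."
    A = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    b = [random.uniform(-0.5, 0.5) for _ in range(n)]

    # Factorize an augmented copy so A and b stay available for the residual.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # partial pivoting
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]

    eps = 2.0 ** -53  # relative machine precision, 1.110223e-16
    r_inf = max(abs(sum(A[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
    a_inf = max(sum(abs(v) for v in row) for row in A)  # infinity norm of A
    x_inf = max(abs(v) for v in x)
    b_inf = max(abs(v) for v in b)
    return r_inf / (eps * (a_inf * x_inf + b_inf) * n)

# "Computational tests pass if scaled residuals are less than 16.0"
print(scaled_residual() < 16.0)
```

For a well-conditioned random system the scaled residual comes out orders of magnitude below the 16.0 pass threshold, which is why a passing run says little about performance, only about numerical correctness.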


The problem size you are running is a big one - it will be easier to debug performance issues if you back the size down to something that runs more quickly. Using a Xeon E5-2680 v3 (12-core, 2.5 GHz), you can get pretty good parallel scaling within a single node with problem sizes as small as 10,000. This should run in slightly over 1 second on your system (it takes 0.96 seconds on a single node on my system).

For a single node you can get some additional diagnostic information by running the command under "perf stat". Part of the output will be "CPUs utilized", which will tell you whether you are actually using all the cores that you think you have requested. It is a good idea to use Intel's scripts instead of invoking mpirun directly - there are a number of environment variables and other controls (such as numactl) that need to be set up correctly for best performance.

You will get best performance (once things are configured correctly) with one thread per physical core. Using both "logical processors" is not a disaster, but it does run a bit slower (maybe 5%-10%) in that configuration. Part of the slowdown is due to the cache blocking - the code is configured so that each thread expects to use the whole L2 cache - and part is due to increased interference from OS processes that no longer have free "logical processors" to run on.
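As a sanity check on the "close to theoretical" claim, the peak can be estimated from the numbers in the thread. This sketch assumes Haswell-EP cores retire 16 double-precision FLOPs per cycle (two 256-bit FMA units) and uses the 2.0 GHz minimum AVX core frequency quoted from the specification update; hyperthreads add no extra floating-point units, which is why one thread per physical core is the right count.

```python
# Back-of-the-envelope peak for the cluster described in the thread.
# Assumption: 16 DP FLOPs/cycle/core (Haswell, two 256-bit FMA units).
FLOPS_PER_CYCLE = 16
AVX_GHZ = 2.0            # minimum AVX core frequency from the spec update
CORES_PER_NODE = 2 * 10  # dual-socket E5-2650 v3, 10 physical cores per socket
NODES = 12

node_peak = CORES_PER_NODE * AVX_GHZ * FLOPS_PER_CYCLE  # GFLOPS per node
cluster_peak = node_peak * NODES                        # GFLOPS for 12 nodes

print(f"per-node peak: {node_peak:.0f} GFLOPS")
print(f"cluster peak:  {cluster_peak / 1000:.2f} TFLOPS")
```

Comparing the HPL result against this figure (rather than against the nominal non-AVX base clock) gives the realistic efficiency number.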
