Unfortunately, each execution of mpirun has no knowledge of where the procs
have been placed and bound by another execution of mpirun. So what is
happening is that the procs of the two jobs are being bound to the same
cores, thus causing contention.
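
One way to check this on a Linux node is to compare the cores the kernel allows for the ranks of each job. The sketch below is not part of Open MPI: the helper name `show_allowed_cpus` is hypothetical, and it assumes the `Cpus_allowed_list` field of `/proc/<pid>/status` is available.

```shell
#!/bin/sh
# show_allowed_cpus (hypothetical helper): for each PID given, print the
# PID and the cores it is allowed to run on, as reported by the kernel.
show_allowed_cpus() {
    for pid in "$@"; do
        # Cpus_allowed_list holds the affinity as a range list, e.g. "0-3".
        cpus=$(awk '/^Cpus_allowed_list:/ {print $2}' "/proc/$pid/status")
        echo "$pid $cpus"
    done
}

# Example: inspect the current shell itself.
show_allowed_cpus $$
```

If the ranks of jobA and jobB report the same cores, they are competing for them. Adding `--report-bindings` to the mpirun cmd line also prints each rank's binding at launch.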
If you truly want to run two jobs at the same time on the same nodes, then
you should add "--bind-to none" on the cmd line. Each job will see a
performance impact relative to running bound on its own, but jobs that
share nodes will run much better than they do with the default binding.
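
As a rough sketch of how that could look in a job script: the wrapper name `run_unbound`, the `MPIRUN` override, and the application name `./my_app` are placeholders for illustration, not anything from the original post.

```shell
#!/bin/sh
# Hypothetical body of a Torque job script for jobA or jobB.
# --bind-to none leaves the ranks unbound, so the OS scheduler is free
# to spread the ranks of both jobs across the node's cores.
run_unbound() {
    np=$1; shift
    # MPIRUN may be overridden (e.g. MPIRUN="echo mpirun") for a dry run.
    ${MPIRUN:-mpirun} --bind-to none -n "$np" "$@"
}

# Dry run: print the command instead of launching it.
# prints: mpirun --bind-to none -n 4 ./my_app
MPIRUN="echo mpirun" run_unbound 4 ./my_app
```

Dropping the `MPIRUN` override launches the job for real; both jobA and jobB would need the option for the contention to go away.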
On Thu, Apr 17, 2014 at 10:37 AM, Alfonso Sanchez wrote:
> Hi all,
> I've compiled OMPI 1.8 on an x64 Linux cluster using the PGI compilers
> v14.1 (I've tried it with PGI v11.10 and get the same result). I'm able to
> compile with the resulting mpicc/mpifort/etc. When running the codes,
> everything seems to be working fine when there's only one job running on a
> given computing node. However, whenever a second job gets assigned the same
> computing node, the CPU load of every process gets divided by 2. I'm using
> pbs torque. As an example:
> -Submit jobA using torque to node1 using mpirun -n 4
> -All 4 processes of jobA show 100% CPU load.
> -Submit jobB using torque to node1 using mpirun -n 4
> -All 8 processes ( 4 from jobA & 4 from jobB ) show 50% CPU load.
> Moreover, whilst jobA or jobB would run in 30 mins by itself, when both
> jobs are on the same node they've gone 14 hrs without completing.
> I'm attaching config.log & the output of ompi_info --all (bzipped)
> Some more info:
> $> ompi_info | grep tm
> MCA ess: tm (MCA v2.0, API v3.0, Component v1.8)
> MCA plm: tm (MCA v2.0, API v2.0, Component v1.8)
> MCA ras: tm (MCA v2.0, API v2.0, Component v1.8)
> Sorry if this is a common problem; I've tried searching for posts
> discussing similar problems but haven't been able to find any.
> Thanks for your help,