Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] -npersocket in 1.6
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-11-09 23:18:26


I believe the problem is the default value for the number of sockets on a node. In the 1.7 series and beyond we discover it automatically, but in the 1.6 series the default got set to zero (I believe it defaults to 1 in 1.4).

Try adding "-mca orte_num_sockets N -mca orte_num_cores M" to your command line, where N is the number of sockets on each node and M is the number of cores on each socket.
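
For the nodes described in the report quoted below (two sockets per node, four cores per socket), that would presumably mean N=2 and M=4. The following is an untested sketch: it only prints the resulting command line and checks the slot arithmetic, it does not launch anything. The variable names are illustrative, and the reading that -npersocket capacity is nodes x sockets x npersocket is an assumption based on the error text, not a confirmed description of the 1.6 mapper.

```shell
# Assumed hardware, taken from the quoted report: 2 nodes, each with
# 2 sockets of 4 cores. Variable names are illustrative only.
nodes=2
sockets=2      # N: sockets per node
cores=4        # M: cores per socket
npersocket=3
np=12

# Assumed capacity under -npersocket: nodes * sockets * npersocket.
# With the 1.6 default of 0 sockets this would be 0, which would explain
# why a request for 12 processes is rejected.
capacity=$((nodes * sockets * npersocket))
echo "capacity=$capacity requested=$np"

# Build and print the suggested command instead of running it.
cmd="mpirun -mca orte_num_sockets $sockets -mca orte_num_cores $cores --report-bindings -npersocket $npersocket -np $np ./numa162"
echo "$cmd"
```

If the printed capacity is at least the requested -np, the mapper should have enough sockets to place every process once the socket count is supplied explicitly.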

On Nov 7, 2012, at 1:32 PM, David Singleton <David.Singleton_at_[hidden]> wrote:

>
> There appears to have been a change in the behaviour of -npersocket from
> 1.4.3 to 1.6.x (tested with 1.6.2). Below is what I see on a pair of
> dual-socket, quad-core Nehalem nodes running under PBS. Is this expected?
>
> Thanks
> David
>
>
> [dbs900_at_v482 ~/MPI]$ mpirun -V
> mpirun (Open MPI) 1.4.3
> ...
> [dbs900_at_v482 ~/MPI]$ mpirun --report-bindings -npersocket 3 -np 12 ./numa143
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],0] to socket 0 cpus 0001
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],1] to socket 0 cpus 0002
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],2] to socket 0 cpus 0004
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],3] to socket 1 cpus 0010
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],4] to socket 1 cpus 0020
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],5] to socket 1 cpus 0040
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],6] to socket 0 cpus 0001
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],7] to socket 0 cpus 0002
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],8] to socket 0 cpus 0004
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],9] to socket 1 cpus 0010
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],10] to socket 1 cpus 0020
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],11] to socket 1 cpus 0040
> ...
>
> [dbs900_at_v482 ~/MPI]$ mpirun -V
> mpirun (Open MPI) 1.6.2
> ...
> [dbs900_at_v482 ~/MPI]$ mpirun --report-bindings -npersocket 3 -np 12 ./numa162
> --------------------------------------------------------------------------
> Your job has requested a conflicting number of processes for the
> application:
>
> App: ./numa162
> number of procs: 12
>
> This is more processes than we can launch under the following
> additional directives and conditions:
>
> number of sockets: 0
> npersocket: 3
>
> Please revise the conflict and try again.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------