Subject: Re: [OMPI users] hostfile issue of openmpi-1.7.4rc2
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-17 12:18:50


I'm sorry, but I'm really confused, so let me try to understand the situation.

You use Torque to get an allocation, so you are running in a managed environment.

You then use mpirun to start the job, but pass it a hostfile as shown below.

Somehow, ORTE believes that there is only one slot on each host, and you believe the code you've identified is resetting the slot counts.

Is that a correct summary of the situation?

Thanks
Ralph

On Jan 16, 2014, at 4:00 PM, tmishima_at_[hidden] wrote:

>
> Hi Ralph,
>
> I encountered the hostfile issue again, where slots are counted by
> listing the node multiple times. This was supposed to be fixed by r29765
> - Fix hostfile parsing for the case where RMs count slots ....
>
> The difference is whether an RM is used or not. At that time, I executed mpirun
> through the Torque manager. This time I executed it directly from the command line
> as shown at the bottom, where node05 and node06 each have 8 cores.
>
> Then, I checked the source files around it and found that lines 151-160 in
> plm_base_launch_support.c cause this issue. As node->slots is already
> counted in hostfile.c @ r29765 even when node->slots_given is false,
> I think this part of plm_base_launch_support.c is unnecessary.
>
> orte/mca/plm/base/plm_base_launch_support.c @ 30189:
> 151    } else {
> 152        /* set any non-specified slot counts to 1 */
> 153        for (i=0; i < orte_node_pool->size; i++) {
> 154            if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
> 155                continue;
> 156            }
> 157            if (!node->slots_given) {
> 158                node->slots = 1;
> 159            }
> 160        }
> 161    }
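>
> To make the interaction concrete, here is a minimal standalone sketch (simplified
> stand-ins, not the real ORTE types or functions) of the logic described above:
> the hostfile parser as of r29765 bumps node->slots for each repeated entry but
> leaves slots_given false, and the block quoted above then resets the count to 1:
>
> #include <stdbool.h>
> #include <stdio.h>
> #include <string.h>
>
> /* simplified stand-in for orte_node_t (illustration only) */
> typedef struct {
>     char name[64];
>     int  slots;
>     bool slots_given;
> } node_t;
>
> static node_t pool[16];
> static int    npool = 0;
>
> /* hostfile parsing, simplified: each repeated line for a host bumps its
>  * slot count, but slots_given stays false because the hostfile never
>  * contained an explicit "slots=N" keyword */
> static void add_hostfile_line(const char *host)
> {
>     for (int i = 0; i < npool; i++) {
>         if (0 == strcmp(pool[i].name, host)) {
>             pool[i].slots++;
>             return;
>         }
>     }
>     strncpy(pool[npool].name, host, sizeof(pool[npool].name) - 1);
>     pool[npool].slots = 1;
>     pool[npool].slots_given = false;
>     npool++;
> }
>
> int main(void)
> {
>     for (int i = 0; i < 8; i++) add_hostfile_line("node05");
>     for (int i = 0; i < 8; i++) add_hostfile_line("node06");
>
>     /* the quoted block then clobbers the accumulated counts back to 1 */
>     for (int i = 0; i < npool; i++) {
>         if (!pool[i].slots_given) {
>             pool[i].slots = 1;
>         }
>     }
>
>     for (int i = 0; i < npool; i++)
>         printf("%s: slots=%d\n", pool[i].name, pool[i].slots);  /* prints slots=1 */
>     return 0;
> }
>
> Running the sketch prints slots=1 for both nodes, which is the symptom described above.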
>
> With this part removed, it works very well, and the functionality of
> orte_set_default_slots is still preserved. I think this would be better for
> backward compatibility with openmpi-1.7.3.
>
> Regards,
> Tetsuya Mishima
>
> [mishima_at_manage work]$ cat pbs_hosts
> node05
> node05
> node05
> node05
> node05
> node05
> node05
> node05
> node06
> node06
> node06
> node06
> node06
> node06
> node06
> node06
> [mishima_at_manage work]$ mpirun -np 4 -hostfile pbs_hosts -cpus-per-proc 4 -report-bindings myprog
> [node05.cluster:22287] MCW rank 2 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node05.cluster:22287] MCW rank 3 is not bound (or bound to all available processors)
> [node05.cluster:22287] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> [node05.cluster:22287] MCW rank 1 is not bound (or bound to all available processors)
> Hello world from process 0 of 4
> Hello world from process 1 of 4
> Hello world from process 3 of 4
> Hello world from process 2 of 4
>