
Subject: Re: [OMPI users] hostfile issue of openmpi-1.7.4rc2
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-17 12:18:50


I'm sorry, but I'm really confused, so let me try to understand the situation.

You use Torque to get an allocation, so you are running in a managed environment.

You then use mpirun to start the job, but pass it a hostfile as shown below.

Somehow, ORTE believes that there is only one slot on each host, and you believe the code you've identified is resetting the slot counts.

Is that a correct summary of the situation?

Thanks
Ralph

On Jan 16, 2014, at 4:00 PM, tmishima_at_[hidden] wrote:

>
> Hi Ralph,
>
> I encountered the hostfile issue again, where slots are counted by
> listing the node multiple times. This should have been fixed by r29765
> ("Fix hostfile parsing for the case where RMs count slots ....").
>
> The difference is whether an RM is used or not. At that time, I executed
> mpirun through the Torque manager; this time I executed it directly from the
> command line, as shown at the bottom, where node05 and node06 each have 8 cores.
>
> Then I checked the source files around it and found that lines 151-160 of
> plm_base_launch_support.c cause this issue. Since node->slots is already
> counted in hostfile.c @ r29765 even when node->slots_given is false,
> I think this part of plm_base_launch_support.c is unnecessary.
>
> orte/mca/plm/base/plm_base_launch_support.c @ 30189:
> 151    } else {
> 152        /* set any non-specified slot counts to 1 */
> 153        for (i=0; i < orte_node_pool->size; i++) {
> 154            if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
> 155                continue;
> 156            }
> 157            if (!node->slots_given) {
> 158                node->slots = 1;
> 159            }
> 160        }
> 161    }
>
> With this part removed, it works very well, and the function of
> orte_set_default_slots is still in effect. I think this would be better for
> keeping compatibility with openmpi-1.7.3.
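>
> To illustrate the counting behaviour in question, here is a minimal
> stand-alone sketch (not the ORTE sources; the names node_info,
> hostfile[] and so on are made up for illustration). Each repetition of
> a hostname adds one slot, so a later pass that resets any
> "non-specified" count back to 1 would simply discard what was counted:
>
> #include <stdio.h>
> #include <string.h>
>
> #define MAX_NODES 16
> #define NAME_LEN  64
>
> struct node_info {
>     char name[NAME_LEN];
>     int  slots;
>     int  slots_given;   /* set only when the hostfile gives an explicit slots=N */
> };
>
> int main(void)
> {
>     /* stands in for the pbs_hosts file shown at the bottom */
>     const char *hostfile[] = {
>         "node05", "node05", "node05", "node05",
>         "node05", "node05", "node05", "node05",
>         "node06", "node06", "node06", "node06",
>         "node06", "node06", "node06", "node06",
>     };
>     struct node_info nodes[MAX_NODES];
>     int nnodes = 0;
>     size_t i;
>     int j;
>
>     for (i = 0; i < sizeof(hostfile) / sizeof(hostfile[0]); i++) {
>         for (j = 0; j < nnodes; j++) {
>             if (0 == strcmp(nodes[j].name, hostfile[i])) {
>                 nodes[j].slots++;            /* repeated listing => one more slot */
>                 break;
>             }
>         }
>         if (j == nnodes && nnodes < MAX_NODES) {
>             strncpy(nodes[nnodes].name, hostfile[i], NAME_LEN - 1);
>             nodes[nnodes].name[NAME_LEN - 1] = '\0';
>             nodes[nnodes].slots = 1;         /* first appearance counts as one slot */
>             nodes[nnodes].slots_given = 0;   /* no explicit slots=N was parsed */
>             nnodes++;
>         }
>     }
>
>     /* A later pass doing "if (!slots_given) slots = 1;" at this point
>      * would discard the counted values (8 and 8) - that is the problem
>      * described above. */
>     for (j = 0; j < nnodes; j++) {
>         printf("%s: slots=%d\n", nodes[j].name, nodes[j].slots);
>     }
>     return 0;
> }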
>
> Regards,
> Tetsuya Mishima
>
> [mishima_at_manage work]$ cat pbs_hosts
> node05
> node05
> node05
> node05
> node05
> node05
> node05
> node05
> node06
> node06
> node06
> node06
> node06
> node06
> node06
> node06
> [mishima_at_manage work]$ mpirun -np 4 -hostfile pbs_hosts -cpus-per-proc 4 -report-bindings myprog
> [node05.cluster:22287] MCW rank 2 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node05.cluster:22287] MCW rank 3 is not bound (or bound to all available processors)
> [node05.cluster:22287] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> [node05.cluster:22287] MCW rank 1 is not bound (or bound to all available processors)
> Hello world from process 0 of 4
> Hello world from process 1 of 4
> Hello world from process 3 of 4
> Hello world from process 2 of 4
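>
> (For reference, myprog is just a plain MPI hello-world; its actual source
> was not posted here, but an assumed minimal version along the following
> lines, built with "mpicc -o myprog myprog.c", produces output of the form
> shown above.)
>
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char *argv[])
> {
>     int rank, size;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     printf("Hello world from process %d of %d\n", rank, size);
>
>     MPI_Finalize();
>     return 0;
> }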
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users