Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] hostfile issue of openmpi-1.7.4rc2
From: tmishima_at_[hidden]
Date: 2014-01-17 15:31:36


No, I didn't use Torque this time.

This issue occurs only when the job is not running in a managed
environment - namely, when orte_managed_allocation is false
(and orte_set_slots is NULL).

Under Torque management, it works fine.
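
To show what I mean, here is a small standalone sketch (plain C, not the
actual Open MPI source; the node struct and the hard-coded values are only
for illustration) of how I understand the two steps interact in the
unmanaged case:

/* Standalone sketch (not Open MPI code): one node listed 8 times in the
 * hostfile with no explicit "slots=" keyword, then the default-to-1 pass. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    const char *name;
    int         slots;
    bool        slots_given;   /* true only when "slots=" was written explicitly */
} node_t;

int main(void)
{
    node_t node = { "node05", 0, false };

    /* hostfile parsing: r29765 counts one slot per listing of the node */
    for (int listing = 0; listing < 8; listing++) {
        node.slots++;
    }
    printf("after hostfile parsing: %s slots=%d\n", node.name, node.slots);

    /* launch setup: unmanaged run, no default slot count given */
    bool managed_allocation = false;   /* stands in for orte_managed_allocation */
    const char *set_slots   = NULL;    /* stands in for orte_set_slots */

    if (!managed_allocation && NULL == set_slots) {
        if (!node.slots_given) {
            node.slots = 1;            /* the reset done in plm_base_launch_support.c */
        }
    }
    printf("after launch setup:     %s slots=%d\n", node.name, node.slots);
    return 0;
}

Compiling and running this prints slots=8 after the hostfile step and
slots=1 after the reset, which matches the single slot per node that ORTE
ends up using in my unmanaged run.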

I hope you can understand the situation.

Tetsuya Mishima

> I'm sorry, but I'm really confused, so let me try to understand the situation.
>
> You use Torque to get an allocation, so you are running in a managed environment.
>
> You then use mpirun to start the job, but pass it a hostfile as shown below.
>
> Somehow, ORTE believes that there is only one slot on each host, and you believe the code you've identified is resetting the slot counts.
>
> Is that a correct summary of the situation?
>
> Thanks
> Ralph
>
> On Jan 16, 2014, at 4:00 PM, tmishima_at_[hidden] wrote:
>
> >
> > Hi Ralph,
> >
> > I encountered the hostfile issue again, where slots are counted by
> > listing the node multiple times. This should have been fixed by r29765
> > - Fix hostfile parsing for the case where RMs count slots ....
> >
> > The difference is whether an RM is used or not. At that time, I executed
> > mpirun through the Torque manager. This time I executed it directly from
> > the command line as shown at the bottom, where node05 and node06 each have
> > 8 cores.
> >
> > Then, I checked the source files around it and found that lines 151-160 in
> > plm_base_launch_support.c caused this issue. As node->slots is already
> > counted in hostfile.c @ r29765 even when node->slots_given is false,
> > I think this part of plm_base_launch_support.c is unnecessary.
> >
> > orte/mca/plm/base/plm_base_launch_support.c @ 30189:
> > 151     } else {
> > 152         /* set any non-specified slot counts to 1 */
> > 153         for (i=0; i < orte_node_pool->size; i++) {
> > 154             if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
> > 155                 continue;
> > 156             }
> > 157             if (!node->slots_given) {
> > 158                 node->slots = 1;
> > 159             }
> > 160         }
> > 161     }
> >
> > After removing this part, it works very well, and the functionality of
> > orte_set_default_slots is still intact. I think this would be better for
> > compatibility with openmpi-1.7.3.
> >
> > Regards,
> > Tetsuya Mishima
> >
> > [mishima_at_manage work]$ cat pbs_hosts
> > node05
> > node05
> > node05
> > node05
> > node05
> > node05
> > node05
> > node05
> > node06
> > node06
> > node06
> > node06
> > node06
> > node06
> > node06
> > node06
> > [mishima_at_manage work]$ mpirun -np 4 -hostfile pbs_hosts -cpus-per-proc 4 -report-bindings myprog
> > [node05.cluster:22287] MCW rank 2 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> > [node05.cluster:22287] MCW rank 3 is not bound (or bound to all available processors)
> > [node05.cluster:22287] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> > [node05.cluster:22287] MCW rank 1 is not bound (or bound to all available processors)
> > Hello world from process 0 of 4
> > Hello world from process 1 of 4
> > Hello world from process 3 of 4
> > Hello world from process 2 of 4
> >