Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] hosfile issue of openmpi-1.7.4rc2
From: tmishima_at_[hidden]
Date: 2014-01-17 17:00:54


Hi Ralph,

I'm sorry, my explanation was not clear enough.
Here is a summary of my situation:

1. I manually created the hostfile shown below.

2. I use mpirun to start the job without Torque, which means I'm running in
an un-managed environment.

3. First, ORTE detects 8 slots on each host (probably in
"orte_ras_base_allocate"):
    node05: slots=8 max_slots=0 slots_inuse=0
    node06: slots=8 max_slots=0 slots_inuse=0

4. Then the code I identified resets the slot counts:
    node05: slots=1 max_slots=0 slots_inuse=0
    node06: slots=1 max_slots=0 slots_inuse=0

5. Therefore, ORTE believes that there is only one slot on each host.
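
To make steps 3 to 5 concrete, here is a minimal C sketch of the behavior as I understand it. It is a simplified model, not the ORTE code: node_t, node_pool_t, count_slots_from_hostfile and reset_unspecified_slots are hypothetical names, and only the slots and slots_given fields mirror what the real source uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_NODES 16

/* Hypothetical stand-in for orte_node_t; only the fields discussed here. */
typedef struct {
    const char *name;
    int slots;
    bool slots_given;   /* true only for an explicit "slots=N" entry */
} node_t;

typedef struct {
    node_t nodes[MAX_NODES];
    size_t count;
} node_pool_t;

/* Step 3 (assumed model): every repetition of a hostname adds one slot,
 * but slots_given stays false because no "slots=N" was written. */
static void count_slots_from_hostfile(node_pool_t *pool,
                                      const char *const *lines,
                                      size_t nlines)
{
    for (size_t i = 0; i < nlines; i++) {
        node_t *found = NULL;
        for (size_t j = 0; j < pool->count; j++) {
            if (strcmp(pool->nodes[j].name, lines[i]) == 0) {
                found = &pool->nodes[j];
                break;
            }
        }
        if (found == NULL) {
            assert(pool->count < MAX_NODES);
            found = &pool->nodes[pool->count++];
            found->name = lines[i];
            found->slots = 0;
            found->slots_given = false;
        }
        found->slots++;
    }
}

/* Step 4: same logic as the quoted loop, applied to the model above.
 * Any node without an explicit slot count is forced back to 1. */
static void reset_unspecified_slots(node_pool_t *pool)
{
    for (size_t i = 0; i < pool->count; i++) {
        if (!pool->nodes[i].slots_given) {
            pool->nodes[i].slots = 1;
        }
    }
}
```

Feeding the 16 lines of pbs_hosts to count_slots_from_hostfile gives 8 slots for each of node05 and node06. Calling reset_unspecified_slots afterwards drops both back to 1, which matches what I see in step 4.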

Regards,
Tetsuya Mishima

> No, I didn't use Torque this time.
>
> This issue occurs only outside a managed environment,
> namely when orte_managed_allocation is false
> (and orte_set_slots is NULL).
>
> Under Torque management, it works fine.
>
> I hope you can understand the situation.
>
> Tetsuya Mishima
>
> > I'm sorry, but I'm really confused, so let me try to understand the
> > situation.
> >
> > You use Torque to get an allocation, so you are running in a managed
> > environment.
> >
> > You then use mpirun to start the job, but pass it a hostfile as shown
> > below.
> >
> > Somehow, ORTE believes that there is only one slot on each host, and you
> > believe the code you've identified is resetting the slot counts.
> >
> > Is that a correct summary of the situation?
> >
> > Thanks
> > Ralph
> >
> > On Jan 16, 2014, at 4:00 PM, tmishima_at_[hidden] wrote:
> >
> > >
> > > Hi Ralph,
> > >
> > > > I encountered the hostfile issue again, where slots are counted by
> > > > listing the node multiple times. This should have been fixed by r29765
> > > > - Fix hostfile parsing for the case where RMs count slots ....
> > > >
> > > > The difference is whether an RM is used or not. At that time, I
> > > > executed mpirun through the Torque manager. This time I executed it
> > > > directly from the command line as shown at the bottom, where node05
> > > > and node06 each have 8 cores.
> > > >
> > > > Then I checked the source files around it and found that lines 151-160
> > > > in plm_base_launch_support.c caused this issue. As node->slots is
> > > > already counted in hostfile.c @ r29765 even when node->slots_given is
> > > > false, I think this part of plm_base_launch_support.c is unnecessary.
> > >
> > > > orte/mca/plm/base/plm_base_launch_support.c @ 30189:
> > > > 151     } else {
> > > > 152         /* set any non-specified slot counts to 1 */
> > > > 153         for (i=0; i < orte_node_pool->size; i++) {
> > > > 154             if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
> > > > 155                 continue;
> > > > 156             }
> > > > 157             if (!node->slots_given) {
> > > > 158                 node->slots = 1;
> > > > 159             }
> > > > 160         }
> > > > 161     }
> > >
> > > > With this part removed, it works very well, and the
> > > > orte_set_default_slots function still works. I think this would be
> > > > better as a compatible extension of openmpi-1.7.3.
> > >
> > > Regards,
> > > Tetsuya Mishima
> > >
> > > [mishima_at_manage work]$ cat pbs_hosts
> > > node05
> > > node05
> > > node05
> > > node05
> > > node05
> > > node05
> > > node05
> > > node05
> > > node06
> > > node06
> > > node06
> > > node06
> > > node06
> > > node06
> > > node06
> > > node06
> > > > [mishima_at_manage work]$ mpirun -np 4 -hostfile pbs_hosts -cpus-per-proc 4 -report-bindings myprog
> > > > [node05.cluster:22287] MCW rank 2 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> > > > [node05.cluster:22287] MCW rank 3 is not bound (or bound to all available processors)
> > > > [node05.cluster:22287] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> > > > [node05.cluster:22287] MCW rank 1 is not bound (or bound to all available processors)
> > > Hello world from process 0 of 4
> > > Hello world from process 1 of 4
> > > Hello world from process 3 of 4
> > > Hello world from process 2 of 4
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users