Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8
From: tmishima_at_[hidden]
Date: 2013-03-22 00:06:47


Hi Ralph,

I tried the patch to trunk/orte/mca/plm/base/plm_base_launch_support.c.

I didn't touch the debugging part of plm_base_launch_support.c, nor any of
trunk/orte/mca/rmaps/base/rmaps_base_support_fns.c, since the changes to
rmaps_base_support_fns.c appear to be debugging updates only.

Then, it works! Here is the result.

Regards,
Tetsuya Mishima

mpirun -v -np 8 -hostfile pbs_hosts -x OMP_NUM_THREADS --display-allocation
-mca ras_base_verbose 5 -mca rmaps_base_verbose 5
/home/mishima/Ducom/testbed/mPre m02-ld
[node05.cluster:22522] mca:base:select:( ras) Querying component
[loadleveler]
[node05.cluster:22522] [[58229,0],0] ras:loadleveler: NOT available for
selection
[node05.cluster:22522] mca:base:select:( ras) Skipping component
[loadleveler]. Query failed to return a module
[node05.cluster:22522] mca:base:select:( ras) Querying component
[simulator]
[node05.cluster:22522] mca:base:select:( ras) Skipping component
[simulator]. Query failed to return a module
[node05.cluster:22522] mca:base:select:( ras) Querying component [slurm]
[node05.cluster:22522] [[58229,0],0] ras:slurm: NOT available for selection
[node05.cluster:22522] mca:base:select:( ras) Skipping component [slurm].
Query failed to return a module
[node05.cluster:22522] mca:base:select:( ras) Querying component [tm]
[node05.cluster:22522] mca:base:select:( ras) Query of component [tm] set
priority to 100
[node05.cluster:22522] mca:base:select:( ras) Selected component [tm]
[node05.cluster:22522] mca:rmaps:select: checking available component ppr
[node05.cluster:22522] mca:rmaps:select: Querying component [ppr]
[node05.cluster:22522] mca:rmaps:select: checking available component
rank_file
[node05.cluster:22522] mca:rmaps:select: Querying component [rank_file]
[node05.cluster:22522] mca:rmaps:select: checking available component
resilient
[node05.cluster:22522] mca:rmaps:select: Querying component [resilient]
[node05.cluster:22522] mca:rmaps:select: checking available component
round_robin
[node05.cluster:22522] mca:rmaps:select: Querying component [round_robin]
[node05.cluster:22522] mca:rmaps:select: checking available component seq
[node05.cluster:22522] mca:rmaps:select: Querying component [seq]
[node05.cluster:22522] [[58229,0],0]: Final mapper priorities
[node05.cluster:22522] Mapper: ppr Priority: 90
[node05.cluster:22522] Mapper: seq Priority: 60
[node05.cluster:22522] Mapper: resilient Priority: 40
[node05.cluster:22522] Mapper: round_robin Priority: 10
[node05.cluster:22522] Mapper: rank_file Priority: 0
[node05.cluster:22522] [[58229,0],0] ras:base:allocate
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
node05
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: not found --
added to list
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
node05
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
bumped slots to 2
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
node05
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
bumped slots to 3
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
node05
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
bumped slots to 4
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
node04
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: not found --
added to list
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
node04
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
bumped slots to 2
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
node04
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
bumped slots to 3
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
node04
[node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
bumped slots to 4
[node05.cluster:22522] [[58229,0],0] ras:base:node_insert inserting 2 nodes
[node05.cluster:22522] [[58229,0],0] ras:base:node_insert updating HNP info
to 4 slots
[node05.cluster:22522] [[58229,0],0] ras:base:node_insert node node04

====================== ALLOCATED NODES ======================

 Data for node: node05 Num slots: 4 Max slots: 0
 Data for node: node04 Num slots: 4 Max slots: 0

=================================================================
[node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node05 VS LIST NODE
node04
[node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node05 VS LIST NODE
node05
[node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node04 VS LIST NODE
node04
[node05.cluster:22522] mca:rmaps: mapping job [58229,1]
[node05.cluster:22522] mca:rmaps: creating new map for job [58229,1]
[node05.cluster:22522] mca:rmaps:ppr: job [58229,1] not using ppr mapper
[node05.cluster:22522] [[58229,0],0] rmaps:seq mapping job [58229,1]
[node05.cluster:22522] mca:rmaps:seq: job [58229,1] not using seq mapper
[node05.cluster:22522] mca:rmaps:resilient: cannot perform initial map of
job [58229,1] - no fault groups
[node05.cluster:22522] mca:rmaps:rr: mapping job [58229,1]
[node05.cluster:22522] [[58229,0],0] Starting with 2 nodes in list
[node05.cluster:22522] [[58229,0],0] Filtering thru apps
[node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node05 VS LIST NODE
node05
[node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node04 VS LIST NODE
node04
[node05.cluster:22522] [[58229,0],0] Retained 2 nodes in list
[node05.cluster:22522] AVAILABLE NODES FOR MAPPING:
[node05.cluster:22522] node: node05 daemon: 0
[node05.cluster:22522] node: node04 daemon: 1
[node05.cluster:22522] [[58229,0],0] Starting bookmark at node node05
[node05.cluster:22522] [[58229,0],0] Starting at node node05
[node05.cluster:22522] mca:rmaps:rr: mapping by slot for job [58229,1]
slots 8 num_procs 8
[node05.cluster:22522] mca:rmaps:rr:slot working node node05
[node05.cluster:22522] mca:rmaps:rr:slot working node node04
[node05.cluster:22522] mca:rmaps:base: computing vpids by slot for job
[58229,1]
[node05.cluster:22522] mca:rmaps:base: assigning rank 0 to node node05
[node05.cluster:22522] mca:rmaps:base: assigning rank 1 to node node05
[node05.cluster:22522] mca:rmaps:base: assigning rank 2 to node node05
[node05.cluster:22522] mca:rmaps:base: assigning rank 3 to node node05
[node05.cluster:22522] mca:rmaps:base: assigning rank 4 to node node04
[node05.cluster:22522] mca:rmaps:base: assigning rank 5 to node node04
[node05.cluster:22522] mca:rmaps:base: assigning rank 6 to node node04
[node05.cluster:22522] mca:rmaps:base: assigning rank 7 to node node04
[node05.cluster:22522] [[58229,0],0] rmaps:base:compute_usage
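For readers following the verbose output above, here is a minimal Python sketch (not the actual ORTE C code; the function names are invented for illustration) of the two behaviors visible in this log: the tm RAS component bumping a node's slot count once per repeated hostname in the PBS node list, and the round_robin mapper assigning ranks by slot.

```python
def discover_slots(pbs_hostnames):
    """Mimic ras:tm:allocate:discover: each repeat of a hostname bumps its slots."""
    nodes = {}  # insertion order preserved, like the node list in the log
    for name in pbs_hostnames:
        if name not in nodes:
            nodes[name] = 1       # "not found -- added to list"
        else:
            nodes[name] += 1      # "found -- bumped slots to N"
    return nodes


def map_by_slot(nodes, num_procs):
    """Mimic mca:rmaps:rr mapping by slot: fill a node's slots, then move on."""
    ranks = {}
    rank = 0
    for name, slots in nodes.items():
        for _ in range(slots):
            if rank >= num_procs:
                return ranks
            ranks[rank] = name    # "assigning rank R to node NAME"
            rank += 1
    return ranks


# The allocation shown in the log: four entries per node in the PBS node list.
hosts = ["node05"] * 4 + ["node04"] * 4
allocation = discover_slots(hosts)    # {'node05': 4, 'node04': 4}
mapping = map_by_slot(allocation, 8)  # ranks 0-3 on node05, 4-7 on node04
```

Mapping by slot fills each node up to its slot count before moving to the next, which is why ranks 0-3 land on node05 and ranks 4-7 on node04 in the output above.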

> Okay, I found it - fix coming in a bit.
>
> Thanks!
> Ralph
>
> On Mar 21, 2013, at 4:02 PM, tmishima_at_[hidden] wrote:
>
> >
> >
> > Hi Ralph,
> >
> > Sorry for the late reply. Here is my result.
> >
> > mpirun -v -np 8 -hostfile pbs_hosts -x OMP_NUM_THREADS --display-allocation
> > -mca ras_base_verbose 5 -mca rmaps_base_verbose 5
> > /home/mishima/Ducom/testbed/mPre m02-ld
> > [node04.cluster:28175] mca:base:select:( ras) Querying component
> > [loadleveler]
> > [node04.cluster:28175] [[29518,0],0] ras:loadleveler: NOT available for
> > selection
> > [node04.cluster:28175] mca:base:select:( ras) Skipping component
> > [loadleveler]. Query failed to return a module
> > [node04.cluster:28175] mca:base:select:( ras) Querying component
> > [simulator]
> > [node04.cluster:28175] mca:base:select:( ras) Skipping component
> > [simulator]. Query failed to return a module
> > [node04.cluster:28175] mca:base:select:( ras) Querying component [slurm]
> > [node04.cluster:28175] [[29518,0],0] ras:slurm: NOT available for selection
> > [node04.cluster:28175] mca:base:select:( ras) Skipping component [slurm].
> > Query failed to return a module
> > [node04.cluster:28175] mca:base:select:( ras) Querying component [tm]
> > [node04.cluster:28175] mca:base:select:( ras) Query of component [tm] set
> > priority to 100
> > [node04.cluster:28175] mca:base:select:( ras) Selected component [tm]
> > [node04.cluster:28175] mca:rmaps:select: checking available component ppr
> > [node04.cluster:28175] mca:rmaps:select: Querying component [ppr]
> > [node04.cluster:28175] mca:rmaps:select: checking available component
> > rank_file
> > [node04.cluster:28175] mca:rmaps:select: Querying component [rank_file]
> > [node04.cluster:28175] mca:rmaps:select: checking available component
> > resilient
> > [node04.cluster:28175] mca:rmaps:select: Querying component [resilient]
> > [node04.cluster:28175] mca:rmaps:select: checking available component
> > round_robin
> > [node04.cluster:28175] mca:rmaps:select: Querying component [round_robin]
> > [node04.cluster:28175] mca:rmaps:select: checking available component seq
> > [node04.cluster:28175] mca:rmaps:select: Querying component [seq]
> > [node04.cluster:28175] [[29518,0],0]: Final mapper priorities
> > [node04.cluster:28175] Mapper: ppr Priority: 90
> > [node04.cluster:28175] Mapper: seq Priority: 60
> > [node04.cluster:28175] Mapper: resilient Priority: 40
> > [node04.cluster:28175] Mapper: round_robin Priority: 10
> > [node04.cluster:28175] Mapper: rank_file Priority: 0
> > [node04.cluster:28175] [[29518,0],0] ras:base:allocate
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname
> > node04
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: not found --
> > added to list
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname
> > node04
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found --
> > bumped slots to 2
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname
> > node04
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found --
> > bumped slots to 3
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname
> > node04
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found --
> > bumped slots to 4
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname
> > node03
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: not found --
> > added to list
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname
> > node03
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found --
> > bumped slots to 2
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname
> > node03
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found --
> > bumped slots to 3
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname
> > node03
> > [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found --
> > bumped slots to 4
> > [node04.cluster:28175] [[29518,0],0] ras:base:node_insert inserting 2 nodes
> > [node04.cluster:28175] [[29518,0],0] ras:base:node_insert updating HNP info
> > to 4 slots
> > [node04.cluster:28175] [[29518,0],0] ras:base:node_insert node node03
> >
> > ====================== ALLOCATED NODES ======================
> >
> > Data for node: node04 Num slots: 4 Max slots: 0
> > Data for node: node03 Num slots: 4 Max slots: 0
> >
> > =================================================================
> > [node04.cluster:28175] HOSTFILE: CHECKING FILE NODE node04 VS LIST NODE
> > node03
> >
> > --------------------------------------------------------------------------
> > A hostfile was provided that contains at least one node not
> > present in the allocation:
> >
> > hostfile: pbs_hosts
> > node: node04
> >
> > If you are operating in a resource-managed environment, then only
> > nodes that are in the allocation can be used in the hostfile. You
> > may find relative node syntax to be a useful alternative to
> > specifying absolute node names; see the orte_hosts man page for
> > further information.
> >
> > --------------------------------------------------------------------------
> >
> > Regards,
> > Tetsuya Mishima
> >
> >> Hmmm...okay, let's try one more thing. Can you please add the following
> >> to your command line:
> >>
> >> -mca ras_base_verbose 5 -mca rmaps_base_verbose 5
> >>
> >> Appreciate your patience. For some reason, we are losing your head node
> >> from the allocation when we start trying to map processes. I'm trying to
> >> track down where this is happening so we can figure out why.
> >>
> >>
> >> On Mar 20, 2013, at 10:32 PM, tmishima_at_[hidden] wrote:
> >>
> >>>
> >>>
> >>> Hi Ralph,
> >>>
> >>> Here is the result on patched openmpi-1.7rc8.
> >>>
> >>> mpirun -v -np 8 -hostfile pbs_hosts -x OMP_NUM_THREADS
> >>> --display-allocation /home/mishima/Ducom/testbed/mPre m02-ld
> >>>
> >>> ====================== ALLOCATED NODES ======================
> >>>
> >>> Data for node: node06 Num slots: 4 Max slots: 0
> >>> Data for node: node05 Num slots: 4 Max slots: 0
> >>>
> >>> =================================================================
> >>> [node06.cluster:21149] HOSTFILE: CHECKING FILE NODE node06 VS LIST NODE
> >>> node05
> >>>
> >>> --------------------------------------------------------------------------
> >>> A hostfile was provided that contains at least one node not
> >>> present in the allocation:
> >>>
> >>> hostfile: pbs_hosts
> >>> node: node06
> >>>
> >>> If you are operating in a resource-managed environment, then only
> >>> nodes that are in the allocation can be used in the hostfile. You
> >>> may find relative node syntax to be a useful alternative to
> >>> specifying absolute node names; see the orte_hosts man page for
> >>> further information.
> >>>
> >>> --------------------------------------------------------------------------
> >>>
> >>> Regards,
> >>> Tetsuya
> >>>
> >>> _______________________________________________
> >>> users mailing list
> >>> users_at_[hidden]
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >>
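A note on the "relative node syntax" suggested by the quoted error message: the actual contents of the user's pbs_hosts file are not shown in this thread, but per the orte_hosts man page a hostfile using that syntax could look like the following sketch, where `+n#` names the Nth node of the resource manager's allocation rather than an absolute hostname.

```text
# Refer to nodes by their position in the allocation
# rather than by absolute hostname:
+n0 slots=4   # first node in the allocation
+n1 slots=4   # second node in the allocation
```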