
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-03-22 00:13:01


Thanks - yes, the problem was in the launch_support.c code. I'll mark this as checked and apply it to the v1.7.0 release.

Thanks for the help!
Ralph

On Mar 21, 2013, at 9:06 PM, tmishima_at_[hidden] wrote:

>
>
> Hi Ralph,
>
> I tried to patch trunk/orte/mca/plm/base/plm_base_launch_support.c.
>
> I didn't touch debugging part of plm_base_launch_support.c and whole of
> trunk/orte/mca/rmaps/base/rmaps_base_support_fns.c, because
> rmaps_base_support_fns.c seems to include only updates for debugging.
>
> Then, it works! Here is the result.
>
> Regards,
> Tetsuya Mishima
>
> mpirun -v -np 8 -hostfile pbs_hosts -x OMP_NUM_THREADS --display-allocation
> -mca ras_base_verbose 5 -mca rmaps_base_verbose 5 /home/mishima/Ducom/testbed/mPre m02-ld
> [node05.cluster:22522] mca:base:select:( ras) Querying component
> [loadleveler]
> [node05.cluster:22522] [[58229,0],0] ras:loadleveler: NOT available for
> selection
> [node05.cluster:22522] mca:base:select:( ras) Skipping component
> [loadleveler]. Query failed to return a module
> [node05.cluster:22522] mca:base:select:( ras) Querying component
> [simulator]
> [node05.cluster:22522] mca:base:select:( ras) Skipping component
> [simulator]. Query failed to return a module
> [node05.cluster:22522] mca:base:select:( ras) Querying component [slurm]
> [node05.cluster:22522] [[58229,0],0] ras:slurm: NOT available for selection
> [node05.cluster:22522] mca:base:select:( ras) Skipping component [slurm].
> Query failed to return a module
> [node05.cluster:22522] mca:base:select:( ras) Querying component [tm]
> [node05.cluster:22522] mca:base:select:( ras) Query of component [tm] set
> priority to 100
> [node05.cluster:22522] mca:base:select:( ras) Selected component [tm]
> [node05.cluster:22522] mca:rmaps:select: checking available component ppr
> [node05.cluster:22522] mca:rmaps:select: Querying component [ppr]
> [node05.cluster:22522] mca:rmaps:select: checking available component
> rank_file
> [node05.cluster:22522] mca:rmaps:select: Querying component [rank_file]
> [node05.cluster:22522] mca:rmaps:select: checking available component
> resilient
> [node05.cluster:22522] mca:rmaps:select: Querying component [resilient]
> [node05.cluster:22522] mca:rmaps:select: checking available component
> round_robin
> [node05.cluster:22522] mca:rmaps:select: Querying component [round_robin]
> [node05.cluster:22522] mca:rmaps:select: checking available component seq
> [node05.cluster:22522] mca:rmaps:select: Querying component [seq]
> [node05.cluster:22522] [[58229,0],0]: Final mapper priorities
> [node05.cluster:22522] Mapper: ppr Priority: 90
> [node05.cluster:22522] Mapper: seq Priority: 60
> [node05.cluster:22522] Mapper: resilient Priority: 40
> [node05.cluster:22522] Mapper: round_robin Priority: 10
> [node05.cluster:22522] Mapper: rank_file Priority: 0
> [node05.cluster:22522] [[58229,0],0] ras:base:allocate
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
> node05
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: not found --
> added to list
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
> node05
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
> bumped slots to 2
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
> node05
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
> bumped slots to 3
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
> node05
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
> bumped slots to 4
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
> node04
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: not found --
> added to list
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
> node04
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
> bumped slots to 2
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
> node04
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
> bumped slots to 3
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: got hostname
> node04
> [node05.cluster:22522] [[58229,0],0] ras:tm:allocate:discover: found --
> bumped slots to 4
> [node05.cluster:22522] [[58229,0],0] ras:base:node_insert inserting 2 nodes
> [node05.cluster:22522] [[58229,0],0] ras:base:node_insert updating HNP info
> to 4 slots
> [node05.cluster:22522] [[58229,0],0] ras:base:node_insert node node04
>
> ====================== ALLOCATED NODES ======================
>
> Data for node: node05 Num slots: 4 Max slots: 0
> Data for node: node04 Num slots: 4 Max slots: 0
>
> =================================================================
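The ras:tm discover trace above counts one slot per occurrence of each hostname in the PBS nodefile, which is why each repeated "got hostname" line bumps the slot count. A minimal sketch of that counting behavior (illustrative Python, not the actual ORTE C code; the function and variable names are invented):

```python
def discover_slots(nodefile_lines):
    """Count one slot per occurrence of each hostname, preserving
    first-seen order, mirroring the ras:tm:allocate:discover trace."""
    nodes = {}  # insertion-ordered in Python 3.7+
    for line in nodefile_lines:
        host = line.strip()
        if not host:
            continue
        if host in nodes:
            nodes[host] += 1   # "found -- bumped slots to N"
        else:
            nodes[host] = 1    # "not found -- added to list"
    return nodes

# A PBS nodefile for this run would list each node once per slot:
pbs_hosts = ["node05"] * 4 + ["node04"] * 4
print(discover_slots(pbs_hosts))  # -> {'node05': 4, 'node04': 4}
```

This reproduces the "Num slots: 4" entries shown in the ALLOCATED NODES summary.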
> [node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node05 VS LIST NODE
> node04
> [node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node05 VS LIST NODE
> node05
> [node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node04 VS LIST NODE
> node04
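The HOSTFILE: CHECKING lines above show the allocation being filtered against the hostfile, each allocated node compared against the hostfile entries until a match is found. A rough sketch of that filtering step (illustrative Python only; the real logic lives in ORTE's C sources, and these names are invented):

```python
def filter_by_hostfile(allocated, hostfile_nodes):
    """Retain only allocated nodes that also appear in the hostfile,
    preserving allocation order."""
    kept = []
    for node in allocated:
        for file_node in hostfile_nodes:
            # corresponds to "HOSTFILE: CHECKING FILE NODE <file_node>
            # VS LIST NODE <node>" in the trace
            if file_node == node:
                kept.append(node)
                break
    return kept

# Fixed behavior: both allocated nodes survive the filter.
print(filter_by_hostfile(["node05", "node04"], ["node05", "node04"]))
```

In the failing runs quoted further below, the head node was effectively dropped from the list before this comparison, so the filter reported a hostfile node "not present in the allocation".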
> [node05.cluster:22522] mca:rmaps: mapping job [58229,1]
> [node05.cluster:22522] mca:rmaps: creating new map for job [58229,1]
> [node05.cluster:22522] mca:rmaps:ppr: job [58229,1] not using ppr mapper
> [node05.cluster:22522] [[58229,0],0] rmaps:seq mapping job [58229,1]
> [node05.cluster:22522] mca:rmaps:seq: job [58229,1] not using seq mapper
> [node05.cluster:22522] mca:rmaps:resilient: cannot perform initial map of
> job [58229,1] - no fault groups
> [node05.cluster:22522] mca:rmaps:rr: mapping job [58229,1]
> [node05.cluster:22522] [[58229,0],0] Starting with 2 nodes in list
> [node05.cluster:22522] [[58229,0],0] Filtering thru apps
> [node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node05 VS LIST NODE
> node05
> [node05.cluster:22522] HOSTFILE: CHECKING FILE NODE node04 VS LIST NODE
> node04
> [node05.cluster:22522] [[58229,0],0] Retained 2 nodes in list
> [node05.cluster:22522] AVAILABLE NODES FOR MAPPING:
> [node05.cluster:22522] node: node05 daemon: 0
> [node05.cluster:22522] node: node04 daemon: 1
> [node05.cluster:22522] [[58229,0],0] Starting bookmark at node node05
> [node05.cluster:22522] [[58229,0],0] Starting at node node05
> [node05.cluster:22522] mca:rmaps:rr: mapping by slot for job [58229,1]
> slots 8 num_procs 8
> [node05.cluster:22522] mca:rmaps:rr:slot working node node05
> [node05.cluster:22522] mca:rmaps:rr:slot working node node04
> [node05.cluster:22522] mca:rmaps:base: computing vpids by slot for job
> [58229,1]
> [node05.cluster:22522] mca:rmaps:base: assigning rank 0 to node node05
> [node05.cluster:22522] mca:rmaps:base: assigning rank 1 to node node05
> [node05.cluster:22522] mca:rmaps:base: assigning rank 2 to node node05
> [node05.cluster:22522] mca:rmaps:base: assigning rank 3 to node node05
> [node05.cluster:22522] mca:rmaps:base: assigning rank 4 to node node04
> [node05.cluster:22522] mca:rmaps:base: assigning rank 5 to node node04
> [node05.cluster:22522] mca:rmaps:base: assigning rank 6 to node node04
> [node05.cluster:22522] mca:rmaps:base: assigning rank 7 to node node04
> [node05.cluster:22522] [[58229,0],0] rmaps:base:compute_usage
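The rank assignments above follow by-slot round-robin: each node's slots are filled before moving to the next node. A small sketch of that mapping (illustrative Python under assumed inputs, not Open MPI's implementation):

```python
def map_by_slot(nodes, num_procs):
    """nodes: list of (name, slots) pairs in allocation order.
    Fill each node's slots in turn, as mca:rmaps:rr does when
    mapping by slot."""
    assignment = []
    rank = 0
    for name, slots in nodes:
        for _ in range(slots):
            if rank == num_procs:
                return assignment
            assignment.append((rank, name))  # "assigning rank R to node N"
            rank += 1
    return assignment

# Reproduces the trace: ranks 0-3 on node05, ranks 4-7 on node04.
print(map_by_slot([("node05", 4), ("node04", 4)], 8))
```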
>
>
>> Okay, I found it - fix coming in a bit.
>>
>> Thanks!
>> Ralph
>>
>> On Mar 21, 2013, at 4:02 PM, tmishima_at_[hidden] wrote:
>>
>>>
>>>
>>> Hi Ralph,
>>>
>>> Sorry for late reply. Here is my result.
>>>
>>> mpirun -v -np 8 -hostfile pbs_hosts -x OMP_NUM_THREADS --display-allocation
>>> -mca ras_base_verbose 5 -mca rmaps_base_verbose 5 /home/mishima/Ducom/testbed/mPre m02-ld
>>> [node04.cluster:28175] mca:base:select:( ras) Querying component
>>> [loadleveler]
>>> [node04.cluster:28175] [[29518,0],0] ras:loadleveler: NOT available for
>>> selection
>>> [node04.cluster:28175] mca:base:select:( ras) Skipping component
>>> [loadleveler]. Query failed to return a module
>>> [node04.cluster:28175] mca:base:select:( ras) Querying component
>>> [simulator]
>>> [node04.cluster:28175] mca:base:select:( ras) Skipping component
>>> [simulator]. Query failed to return a module
>>> [node04.cluster:28175] mca:base:select:( ras) Querying component [slurm]
>>> [node04.cluster:28175] [[29518,0],0] ras:slurm: NOT available for selection
>>> [node04.cluster:28175] mca:base:select:( ras) Skipping component [slurm].
>>> Query failed to return a module
>>> [node04.cluster:28175] mca:base:select:( ras) Querying component [tm]
>>> [node04.cluster:28175] mca:base:select:( ras) Query of component [tm] set
>>> priority to 100
>>> [node04.cluster:28175] mca:base:select:( ras) Selected component [tm]
>>> [node04.cluster:28175] mca:rmaps:select: checking available component ppr
>>> [node04.cluster:28175] mca:rmaps:select: Querying component [ppr]
>>> [node04.cluster:28175] mca:rmaps:select: checking available component
>>> rank_file
>>> [node04.cluster:28175] mca:rmaps:select: Querying component [rank_file]
>>> [node04.cluster:28175] mca:rmaps:select: checking available component
>>> resilient
>>> [node04.cluster:28175] mca:rmaps:select: Querying component [resilient]
>>> [node04.cluster:28175] mca:rmaps:select: checking available component
>>> round_robin
>>> [node04.cluster:28175] mca:rmaps:select: Querying component [round_robin]
>>> [node04.cluster:28175] mca:rmaps:select: checking available component seq
>>> [node04.cluster:28175] mca:rmaps:select: Querying component [seq]
>>> [node04.cluster:28175] mca:rmaps:select: Querying component [seq]
>>> [node04.cluster:28175] [[29518,0],0]: Final mapper priorities
>>> [node04.cluster:28175] Mapper: ppr Priority: 90
>>> [node04.cluster:28175] Mapper: seq Priority: 60
>>> [node04.cluster:28175] Mapper: resilient Priority: 40
>>> [node04.cluster:28175] Mapper: round_robin Priority: 10
>>> [node04.cluster:28175] Mapper: rank_file Priority: 0
>>> [node04.cluster:28175] [[29518,0],0] ras:base:allocate
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node04
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: not found -- added to list
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node04
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 2
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node04
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 3
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node04
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 4
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node03
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: not found -- added to list
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node03
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 2
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node03
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 3
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: got hostname node03
>>> [node04.cluster:28175] [[29518,0],0] ras:tm:allocate:discover: found -- bumped slots to 4
>>> [node04.cluster:28175] [[29518,0],0] ras:base:node_insert inserting 2 nodes
>>> [node04.cluster:28175] [[29518,0],0] ras:base:node_insert updating HNP info
>>> to 4 slots
>>> [node04.cluster:28175] [[29518,0],0] ras:base:node_insert node node03
>>>
>>> ====================== ALLOCATED NODES ======================
>>>
>>> Data for node: node04 Num slots: 4 Max slots: 0
>>> Data for node: node03 Num slots: 4 Max slots: 0
>>>
>>> =================================================================
>>> [node04.cluster:28175] HOSTFILE: CHECKING FILE NODE node04 VS LIST NODE node03
>>> --------------------------------------------------------------------------
>>> A hostfile was provided that contains at least one node not
>>> present in the allocation:
>>>
>>> hostfile: pbs_hosts
>>> node: node04
>>>
>>> If you are operating in a resource-managed environment, then only
>>> nodes that are in the allocation can be used in the hostfile. You
>>> may find relative node syntax to be a useful alternative to
>>> specifying absolute node names see the orte_hosts man page for
>>> further information.
>>> --------------------------------------------------------------------------
>>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>>> Hmmm...okay, let's try one more thing. Can you please add the following
>>>> to your command line:
>>>>
>>>> -mca ras_base_verbose 5 -mca rmaps_base_verbose 5
>>>>
>>>> Appreciate your patience. For some reason, we are losing your head node
>>>> from the allocation when we start trying to map processes. I'm trying to
>>>> track down where this is happening so we can figure out why.
>>>>
>>>>
>>>> On Mar 20, 2013, at 10:32 PM, tmishima_at_[hidden] wrote:
>>>>
>>>>>
>>>>>
>>>>> Hi Ralph,
>>>>>
>>>>> Here is the result on patched openmpi-1.7rc8.
>>>>>
>>>>> mpirun -v -np 8 -hostfile pbs_hosts -x OMP_NUM_THREADS
>>>>> --display-allocation /home/mishima/Ducom/testbed/mPre m02-ld
>>>>>
>>>>> ====================== ALLOCATED NODES ======================
>>>>>
>>>>> Data for node: node06 Num slots: 4 Max slots: 0
>>>>> Data for node: node05 Num slots: 4 Max slots: 0
>>>>>
>>>>> =================================================================
>>>>> [node06.cluster:21149] HOSTFILE: CHECKING FILE NODE node06 VS LIST NODE node05
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> A hostfile was provided that contains at least one node not
>>>>> present in the allocation:
>>>>>
>>>>> hostfile: pbs_hosts
>>>>> node: node06
>>>>>
>>>>> If you are operating in a resource-managed environment, then only
>>>>> nodes that are in the allocation can be used in the hostfile. You
>>>>> may find relative node syntax to be a useful alternative to
>>>>> specifying absolute node names see the orte_hosts man page for
>>>>> further information.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Regards,
>>>>> Tetsuya
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users