Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] new map-by-obj has a problem
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-02-27 22:07:19


I'm having trouble seeing why it is failing, so I added some more debug output. Could you run the failure case again with -mca rmaps_base_verbose 10?
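
For reference, that would be the failing invocation from your earlier mail with the verbosity flag added, i.e. something like:

  mpirun -np 8 -host node05,node06 -mca rmaps_base_verbose 10 \
      -report-bindings -map-by socket:pe=2 -display-map \
      ~/mis/openmpi/demos/myprog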

Thanks
Ralph

On Feb 27, 2014, at 6:11 PM, tmishima_at_[hidden] wrote:

>
>
> Just checking the difference; it has no particular significance...
>
> Anyway, I guess it's due to the behavior when the slot count is missing
> (each node is regarded as slots=1), so the nodes get oversubscribed
> unintentionally.
>
> I'm going out now, so I can't verify it quickly. If I provide the
> correct slot counts, it will work, I guess. What do you think?
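>
> Concretely, the hostfile I have in mind would look like this, so that
> each node advertises its 8 cores instead of defaulting to a single slot:
>
>   node05 slots=8
>   node06 slots=8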
>
> Tetsuya
>
>> "restore" in what sense?
>>
>> On Feb 27, 2014, at 4:10 PM, tmishima_at_[hidden] wrote:
>>
>>>
>>>
>>> Hi Ralph, this is just for your information.
>>>
>>> I tried restoring the previous orte_rmaps_rr_byobj. Then I get the result
>>> below with this command line:
>>>
>>> mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2
>>> -display-map -bind-to core:overload-allowed ~/mis/openmpi/demos/myprog
>>> Data for JOB [31184,1] offset 0
>>>
>>> ======================== JOB MAP ========================
>>>
>>> Data for node: node05 Num slots: 1 Max slots: 0 Num procs: 7
>>> Process OMPI jobid: [31184,1] App: 0 Process rank: 0
>>> Process OMPI jobid: [31184,1] App: 0 Process rank: 2
>>> Process OMPI jobid: [31184,1] App: 0 Process rank: 4
>>> Process OMPI jobid: [31184,1] App: 0 Process rank: 6
>>> Process OMPI jobid: [31184,1] App: 0 Process rank: 1
>>> Process OMPI jobid: [31184,1] App: 0 Process rank: 3
>>> Process OMPI jobid: [31184,1] App: 0 Process rank: 5
>>>
>>> Data for node: node06 Num slots: 1 Max slots: 0 Num procs: 1
>>> Process OMPI jobid: [31184,1] App: 0 Process rank: 7
>>>
>>> =============================================================
>>> [node06.cluster:18857] MCW rank 7 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>> [node05.cluster:21399] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
>>> [node05.cluster:21399] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>> [node05.cluster:21399] MCW rank 5 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
>>> [node05.cluster:21399] MCW rank 6 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>> [node05.cluster:21399] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>> [node05.cluster:21399] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
>>> [node05.cluster:21399] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>> ....
>>>
>>>
>>> Then I add "-hostfile pbs_hosts" and the result is:
>>>
>>> [mishima_at_manage work]$cat pbs_hosts
>>> node05 slots=8
>>> node06 slots=8
>>> [mishima_at_manage work]$ mpirun -np 8 -hostfile ~/work/pbs_hosts
>>> -report-bindings -map-by socket:pe=2 -display-map
>>> ~/mis/openmpi/demos/myprog
>>> Data for JOB [30254,1] offset 0
>>>
>>> ======================== JOB MAP ========================
>>>
>>> Data for node: node05 Num slots: 8 Max slots: 0 Num procs: 4
>>> Process OMPI jobid: [30254,1] App: 0 Process rank: 0
>>> Process OMPI jobid: [30254,1] App: 0 Process rank: 2
>>> Process OMPI jobid: [30254,1] App: 0 Process rank: 1
>>> Process OMPI jobid: [30254,1] App: 0 Process rank: 3
>>>
>>> Data for node: node06 Num slots: 8 Max slots: 0 Num procs: 4
>>> Process OMPI jobid: [30254,1] App: 0 Process rank: 4
>>> Process OMPI jobid: [30254,1] App: 0 Process rank: 6
>>> Process OMPI jobid: [30254,1] App: 0 Process rank: 5
>>> Process OMPI jobid: [30254,1] App: 0 Process rank: 7
>>>
>>> =============================================================
>>> [node05.cluster:21501] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>> [node05.cluster:21501] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
>>> [node05.cluster:21501] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>> [node05.cluster:21501] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
>>> [node06.cluster:18935] MCW rank 6 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>> [node06.cluster:18935] MCW rank 7 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
>>> [node06.cluster:18935] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>> [node06.cluster:18935] MCW rank 5 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
>>> ....
>>>
>>>
>>> I think the previous version's behavior is close to what I expect.
>>>
>>> Tetsuya
>>>
>>>> Each node has 2 sockets with 4 cores/socket, 4 x 2 = 8 cores in total.
>>>>
>>>> Here is the output of lstopo.
>>>>
>>>> [mishima_at_manage round_robin]$ rsh node05
>>>> Last login: Tue Feb 18 15:10:15 from manage
>>>> [mishima_at_node05 ~]$ lstopo
>>>> Machine (32GB)
>>>>   NUMANode L#0 (P#0 16GB) + Socket L#0 + L3 L#0 (6144KB)
>>>>     L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
>>>>     L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
>>>>     L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
>>>>     L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
>>>>   NUMANode L#1 (P#1 16GB) + Socket L#1 + L3 L#1 (6144KB)
>>>>     L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
>>>>     L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
>>>>     L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
>>>>     L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
>>>> ....
>>>>
>>>> I focused on byobj_span and bynode. I didn't notice byobj was modified,
>>>> sorry.
>>>>
>>>> Tetsuya
>>>>
>>>>> Hmmm... what does your node look like again (sockets and cores)?
>>>>>
>>>>> On Feb 27, 2014, at 3:19 PM, tmishima_at_[hidden] wrote:
>>>>>
>>>>>>
>>>>>> Hi Ralph, I'm afraid your new "map-by obj" causes another problem.
>>>>>>
>>>>>> I get an overload message with this command line, as shown below:
>>>>>>
>>>>>> mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2 -display-map ~/mis/openmpi/demos/myprog
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> A request was made to bind to that would result in binding more
>>>>>> processes than cpus on a resource:
>>>>>>
>>>>>> Bind to: CORE
>>>>>> Node: node05
>>>>>> #processes: 2
>>>>>> #cpus: 1
>>>>>>
>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>> option to your binding directive.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> Then, I add "-bind-to core:overload-allowed" to see what happens.
>>>>>>
>>>>>> mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2 -display-map -bind-to core:overload-allowed ~/mis/openmpi/demos/myprog
>>>>>> Data for JOB [14398,1] offset 0
>>>>>>
>>>>>> ======================== JOB MAP ========================
>>>>>>
>>>>>> Data for node: node05 Num slots: 1 Max slots: 0 Num procs: 4
>>>>>> Process OMPI jobid: [14398,1] App: 0 Process rank: 0
>>>>>> Process OMPI jobid: [14398,1] App: 0 Process rank: 1
>>>>>> Process OMPI jobid: [14398,1] App: 0 Process rank: 2
>>>>>> Process OMPI jobid: [14398,1] App: 0 Process rank: 3
>>>>>>
>>>>>> Data for node: node06 Num slots: 1 Max slots: 0 Num procs: 4
>>>>>> Process OMPI jobid: [14398,1] App: 0 Process rank: 4
>>>>>> Process OMPI jobid: [14398,1] App: 0 Process rank: 5
>>>>>> Process OMPI jobid: [14398,1] App: 0 Process rank: 6
>>>>>> Process OMPI jobid: [14398,1] App: 0 Process rank: 7
>>>>>>
>>>>>> =============================================================
>>>>>> [node06.cluster:18443] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>>>>> [node05.cluster:20901] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>>>>> [node06.cluster:18443] MCW rank 7 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>>>>> [node05.cluster:20901] MCW rank 3 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>>>>> [node06.cluster:18443] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>>>>> [node05.cluster:20901] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>>>>> [node06.cluster:18443] MCW rank 5 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>>>>> [node05.cluster:20901] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>>>>> Hello world from process 4 of 8
>>>>>> Hello world from process 2 of 8
>>>>>> Hello world from process 6 of 8
>>>>>> Hello world from process 0 of 8
>>>>>> Hello world from process 5 of 8
>>>>>> Hello world from process 1 of 8
>>>>>> Hello world from process 7 of 8
>>>>>> Hello world from process 3 of 8
>>>>>>
>>>>>> When I add "map-by obj:span", it works fine:
>>>>>>
>>>>>> mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2,span -display-map ~/mis/openmpi/demos/myprog
>>>>>> Data for JOB [14703,1] offset 0
>>>>>>
>>>>>> ======================== JOB MAP ========================
>>>>>>
>>>>>> Data for node: node05 Num slots: 1 Max slots: 0 Num procs: 4
>>>>>> Process OMPI jobid: [14703,1] App: 0 Process rank: 0
>>>>>> Process OMPI jobid: [14703,1] App: 0 Process rank: 2
>>>>>> Process OMPI jobid: [14703,1] App: 0 Process rank: 1
>>>>>> Process OMPI jobid: [14703,1] App: 0 Process rank: 3
>>>>>>
>>>>>> Data for node: node06 Num slots: 1 Max slots: 0 Num procs: 4
>>>>>> Process OMPI jobid: [14703,1] App: 0 Process rank: 4
>>>>>> Process OMPI jobid: [14703,1] App: 0 Process rank: 6
>>>>>> Process OMPI jobid: [14703,1] App: 0 Process rank: 5
>>>>>> Process OMPI jobid: [14703,1] App: 0 Process rank: 7
>>>>>>
>>>>>> =============================================================
>>>>>> [node06.cluster:18491] MCW rank 6 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>>>>> [node05.cluster:20949] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>>>>> [node06.cluster:18491] MCW rank 7 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
>>>>>> [node05.cluster:20949] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
>>>>>> [node06.cluster:18491] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>>>>> [node05.cluster:20949] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>>>>> [node06.cluster:18491] MCW rank 5 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
>>>>>> [node05.cluster:20949] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
>>>>>> ....
>>>>>>
>>>>>> So, byobj_span would be okay. Of course, bynode and byslot should be okay.
>>>>>> Could you take a look at orte_rmaps_rr_byobj again?
>>>>>>
>>>>>> Regards,
>>>>>> Tetsuya Mishima
>>>>>>