
Subject: Re: [OMPI users] LAMA of openmpi-1.7.3 is unstable
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-11-07 19:19:40


Okay, so the problem is a bug in LAMA itself. I'll file a ticket and let the LAMA folks look into it.

On Nov 7, 2013, at 4:18 PM, tmishima_at_[hidden] wrote:

>
>
> Hi Ralph,
>
> I quickly tried 2 runs:
>
> mpirun -report-bindings -bind-to core Myprog
> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>
> It works fine in both cases on node03 and node08.
>
> Regards,
> Tetsuya Mishima
>
>> What happens if you drop the LAMA request and instead run
>>
>> mpirun -report-bindings -bind-to core Myprog
>>
>> This would do the same thing - does it work? If so, then we know it is a
>> problem in the LAMA mapper. If not, then it is likely a problem in a
>> different section of the code.
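Myprog's source is not shown in this thread; any small MPI program serves for these binding checks. A minimal hypothetical stand-in (assuming Linux/glibc, where sched_getcpu() is available) that lets each rank report its own placement alongside the -report-bindings output:

/* myprog.c - hypothetical stand-in for Myprog; each rank prints the
 * CPU it is currently running on, so the reported bindings can be
 * cross-checked from inside the job.
 * Build: mpicc -o Myprog myprog.c (sched_getcpu() is glibc-specific). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d running on cpu %d\n", rank, size, sched_getcpu());
    MPI_Finalize();
    return 0;
}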
>>
>>
>>
>> On Nov 7, 2013, at 3:43 PM, tmishima_at_[hidden] wrote:
>>
>>>
>>>
>>> Dear openmpi developers,
>>>
>>> I tried the new LAMA feature of openmpi-1.7.3 and
>>> unfortunately it is not stable in my environment, which
>>> is built with Torque.
>>>
>>> (1) I used 4 scripts as shown below to clarify the problem:
>>>
>>> (COMMON PART)
>>> #!/bin/sh
>>> #PBS -l nodes=node03:ppn=8 / nodes=node08:ppn=8
>>> export OMP_NUM_THREADS=1
>>> cd $PBS_O_WORKDIR
>>> cp $PBS_NODEFILE pbs_hosts
>>> NPROCS=`wc -l < pbs_hosts`
>>>
>>> (SCRIPT1)
>>> mpirun -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1c Myprog
>>> (SCRIPT2)
>>> mpirun -oversubscribe -report-bindings -mca rmaps lama \
>>> -mca rmaps_lama_bind 1c Myprog
>>> (SCRIPT3)
>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -oversubscribe \
>>> -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1c Myprog
>>> (SCRIPT4)
>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -oversubscribe \
>>> -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1c \
>>> -mca rmaps_lama_map Nsbnch \
>>> -mca ess ^tm -mca plm ^tm -mca ras ^tm Myprog
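For reference, the -mca ess ^tm -mca plm ^tm -mca ras ^tm arguments in SCRIPT4 exclude the Torque (TM) components from the ess, plm, and ras frameworks, so mpirun ignores the Torque allocation and launch machinery and relies on the machinefile instead. Whether those components were built into this installation can be checked with, for example:

ompi_info | grep tm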
>>>
>>> (2) The results are as follows:
>>>
>>>            NODE03 (32 cores)   NODE08 (8 cores)
>>> SCRIPT1    *ERROR1             *ERROR1
>>> SCRIPT2    OK                  OK
>>> SCRIPT3    **ABORT             OK
>>> SCRIPT4    **ABORT             **ABORT
>>>
>>> (*)ERROR1 means:
>>> --------------------------------------------------------------------------
>>> RMaps LAMA detected oversubscription after mapping 1 of 8 processes.
>>> Since you have asked not to oversubscribe the resources the job will not
>>> be launched. If you would instead like to oversubscribe the resources
>>> try using the --oversubscribe option to mpirun.
>>> --------------------------------------------------------------------------
>>> [node08.cluster:28849] [[50428,0],0] ORTE_ERROR_LOG: Error in file
>>> rmaps_lama_module.c at line 310
>>> [node08.cluster:28849] [[50428,0],0] ORTE_ERROR_LOG: Error in file
>>> base/rmaps_base_map_job.c at line 166
>>>
>>> (**)ABORT means the run hangs ("stuck and no answer") until it is forcibly terminated.
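For the runs marked ABORT, the mapper can usually be made to report where it stalled. A diagnostic sketch (assuming the standard rmaps_base_verbose MCA parameter; ompi_info --param rmaps all lists what the installed release accepts):

mpirun -machinefile pbs_hosts -np ${NPROCS} -oversubscribe \
-mca rmaps_base_verbose 10 \
-report-bindings -mca rmaps lama -mca rmaps_lama_bind 1c Myprog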
>>>
>>>
>>> (3) openmpi-1.7.3 configuration (with PGI compiler)
>>>
>>> ./configure \
>>> --with-tm \
>>> --with-verbs \
>>> --disable-ipv6 \
>>> CC=pgcc CFLAGS="-fast -tp k8-64e" \
>>> CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
>>> F77=pgfortran FFLAGS="-fast -tp k8-64e" \
>>> FC=pgfortran FCFLAGS="-fast -tp k8-64e"
>>>
>>>
>>> (4) Cluster information:
>>>
>>> 32 cores AMD based node (node03):
>>> Machine (126GB)
>>>   Socket L#0 (32GB)
>>>     NUMANode L#0 (P#0 16GB) + L3 L#0 (5118KB)
>>>       L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
>>>       L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
>>>       L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
>>>       L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
>>>     NUMANode L#1 (P#1 16GB) + L3 L#1 (5118KB)
>>>       L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
>>>       L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
>>>       L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
>>>       L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
>>>   Socket L#1 (32GB)
>>>     NUMANode L#2 (P#6 16GB) + L3 L#2 (5118KB)
>>>       L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8 (P#8)
>>>       L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9 (P#9)
>>>       L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU L#10 (P#10)
>>>       L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU L#11 (P#11)
>>>     NUMANode L#3 (P#7 16GB) + L3 L#3 (5118KB)
>>>       L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 + PU L#12 (P#12)
>>>       L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 + PU L#13 (P#13)
>>>       L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 + PU L#14 (P#14)
>>>       L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 + PU L#15 (P#15)
>>>   Socket L#2 (32GB)
>>>     NUMANode L#4 (P#4 16GB) + L3 L#4 (5118KB)
>>>       L2 L#16 (512KB) + L1d L#16 (64KB) + L1i L#16 (64KB) + Core L#16 + PU L#16 (P#16)
>>>       L2 L#17 (512KB) + L1d L#17 (64KB) + L1i L#17 (64KB) + Core L#17 + PU L#17 (P#17)
>>>       L2 L#18 (512KB) + L1d L#18 (64KB) + L1i L#18 (64KB) + Core L#18 + PU L#18 (P#18)
>>>       L2 L#19 (512KB) + L1d L#19 (64KB) + L1i L#19 (64KB) + Core L#19 + PU L#19 (P#19)
>>>     NUMANode L#5 (P#5 16GB) + L3 L#5 (5118KB)
>>>       L2 L#20 (512KB) + L1d L#20 (64KB) + L1i L#20 (64KB) + Core L#20 + PU L#20 (P#20)
>>>       L2 L#21 (512KB) + L1d L#21 (64KB) + L1i L#21 (64KB) + Core L#21 + PU L#21 (P#21)
>>>       L2 L#22 (512KB) + L1d L#22 (64KB) + L1i L#22 (64KB) + Core L#22 + PU L#22 (P#22)
>>>       L2 L#23 (512KB) + L1d L#23 (64KB) + L1i L#23 (64KB) + Core L#23 + PU L#23 (P#23)
>>>   Socket L#3 (32GB)
>>>     NUMANode L#6 (P#2 16GB) + L3 L#6 (5118KB)
>>>       L2 L#24 (512KB) + L1d L#24 (64KB) + L1i L#24 (64KB) + Core L#24 + PU L#24 (P#24)
>>>       L2 L#25 (512KB) + L1d L#25 (64KB) + L1i L#25 (64KB) + Core L#25 + PU L#25 (P#25)
>>>       L2 L#26 (512KB) + L1d L#26 (64KB) + L1i L#26 (64KB) + Core L#26 + PU L#26 (P#26)
>>>       L2 L#27 (512KB) + L1d L#27 (64KB) + L1i L#27 (64KB) + Core L#27 + PU L#27 (P#27)
>>>     NUMANode L#7 (P#3 16GB) + L3 L#7 (5118KB)
>>>       L2 L#28 (512KB) + L1d L#28 (64KB) + L1i L#28 (64KB) + Core L#28 + PU L#28 (P#28)
>>>       L2 L#29 (512KB) + L1d L#29 (64KB) + L1i L#29 (64KB) + Core L#29 + PU L#29 (P#29)
>>>       L2 L#30 (512KB) + L1d L#30 (64KB) + L1i L#30 (64KB) + Core L#30 + PU L#30 (P#30)
>>>       L2 L#31 (512KB) + L1d L#31 (64KB) + L1i L#31 (64KB) + Core L#31 + PU L#31 (P#31)
>>>   HostBridge L#0
>>>     PCIBridge
>>>       PCI 14e4:1639
>>>         Net L#0 "eth0"
>>>       PCI 14e4:1639
>>>         Net L#1 "eth1"
>>>     PCIBridge
>>>       PCI 14e4:1639
>>>         Net L#2 "eth2"
>>>       PCI 14e4:1639
>>>         Net L#3 "eth3"
>>>     PCIBridge
>>>       PCIBridge
>>>         PCIBridge
>>>           PCI 1000:0072
>>>             Block L#4 "sdb"
>>>             Block L#5 "sda"
>>>     PCI 1002:4390
>>>       Block L#6 "sr0"
>>>     PCIBridge
>>>       PCI 102b:0532
>>>   HostBridge L#7
>>>     PCIBridge
>>>       PCI 15b3:6274
>>>         Net L#7 "ib0"
>>>         OpenFabrics L#8 "mthca0"
>>>
>>> 8 cores AMD based node (node08):
>>> Machine (32GB)
>>>   NUMANode L#0 (P#0 16GB) + Socket L#0 + L3 L#0 (6144KB)
>>>     L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
>>>     L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
>>>     L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
>>>     L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
>>>   NUMANode L#1 (P#1 16GB) + Socket L#1 + L3 L#1 (6144KB)
>>>     L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
>>>     L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
>>>     L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
>>>     L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
>>>   HostBridge L#0
>>>     PCI 10de:036e
>>>     PCI 10de:037f
>>>       Block L#0 "sda"
>>>     PCI 10de:037f
>>>     PCI 10de:037f
>>>     PCIBridge
>>>       PCI 18ca:0020
>>>     PCIBridge
>>>       PCI 15b3:6274
>>>         Net L#1 "ib0"
>>>         OpenFabrics L#2 "mthca0"
>>>
>>> Regards,
>>> Tetsuya Mishima