
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Latest SVN failures
From: Mostyn Lewis (Mostyn.Lewis_at_[hidden])
Date: 2009-03-10 16:13:35


Maybe I know why now, but it's not pleasant. For example, two machines in the same
cluster have their Ethernet interfaces configured as follows:

Machine s0157

eth2 Link encap:Ethernet HWaddr 00:1E:68:DA:74:A8
           BROADCAST MULTICAST MTU:1500 Metric:1
           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
           Interrupt:233 Base address:0x6000

eth3 Link encap:Ethernet HWaddr 00:1E:68:DA:74:A9
           inet addr:10.173.128.13 Bcast:10.173.255.255 Mask:255.255.0.0
           inet6 addr: fe80::21e:68ff:feda:74a9/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
           RX packets:43777910 errors:16 dropped:0 overruns:0 frame:16
           TX packets:21148848 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:5780065692 (5512.3 Mb) TX bytes:59140357016 (56400.6 Mb)
           Interrupt:50 Base address:0x8000

Machine s0158

eth0 Link encap:Ethernet HWaddr 00:23:8B:42:10:A9
           inet addr:7.8.82.158 Bcast:7.8.255.255 Mask:255.255.0.0
           UP BROADCAST MULTICAST MTU:1500 Metric:1
           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
           Interrupt:233 Base address:0x6000

eth1 Link encap:Ethernet HWaddr 00:23:8B:42:10:AA
           inet addr:10.173.128.14 Bcast:10.173.255.255 Mask:255.255.0.0
           inet6 addr: fe80::223:8bff:fe42:10aa/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
           RX packets:39281716 errors:2 dropped:0 overruns:0 frame:2
           TX packets:2674296 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:5879861483 (5607.4 Mb) TX bytes:2406041840 (2294.5 Mb)
           Interrupt:50 Base address:0x8000

Apart from the interfaces having different names (which happens when installing SuSE SLES 10 SP2
on apparently similar machines), I notice there's a private Ethernet on s0158 at IP
7.8.82.158 - I guess this was used. How do I exclude it when the eth names vary?
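One option (a sketch only, not tested on this setup) is to select interfaces by subnet rather than by name, so the varying ethN names stop mattering. This assumes 10.173.0.0/16 is the network Open MPI should use, and that your Open MPI build accepts CIDR notation for these MCA parameters (newer versions do; older ones may only take interface names, in which case per-host names would be needed):

```shell
# Pin both the out-of-band (oob) and TCP BTL traffic to the 10.173.x.x net,
# identified by subnet instead of by eth name:
mpirun --mca oob_tcp_if_include 10.173.0.0/16 \
       --mca btl_tcp_if_include 10.173.0.0/16 \
       ...

# Or, equivalently, exclude the stray private network (and loopback):
mpirun --mca oob_tcp_if_exclude 7.8.0.0/16,lo \
       --mca btl_tcp_if_exclude 7.8.0.0/16,lo \
       ...
```

Including by the subnet you trust is usually safer than excluding, since a third unexpected network would still be picked up by an exclude list.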

DM

On Tue, 10 Mar 2009, Ralph Castain wrote:

> Not really. I've run much bigger jobs than this without problem, so I don't
> think there is a fundamental issue here.
>
> It looks like the TCP fabric between the various nodes is breaking down. I
> note in the enclosed messages that the problems are all with comm between
> daemons 4 and 21. We keep trying to get through, but failing.
>
> I can fix things so we don't endlessly loop when that happens (IIRC, I think
> we are already supposed to abort, but it appears that isn't working). But the
> real question is why the comm fails in the first place.
>
>
> On Mar 10, 2009, at 10:50 AM, Mostyn Lewis wrote:
>
>> Latest status - 1.4a1r20757 (yesterday);
>> the job now starts with a little output but quickly runs into trouble with
>> a lot of
>> 'oob-tcp: Communication retries exceeded. Can not communicate with peer '
>> errors.
>>
>> e.g.
>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>
>> The nodes are O.K. ...
>>
>> Any ideas folks?
>>
>> DM
>>
>> On Sat, 28 Feb 2009, Ralph Castain wrote:
>>
>>> I think I have this figured out - will fix on Monday. I'm not sure why
>>> Jeff's conditions are all required, especially the second one. However,
>>> the fundamental problem is that we pull information from two sources
>>> regarding the number of procs in the job when unpacking a buffer, and the
>>> two sources appear to be out-of-sync with each other in certain scenarios.
>>>
>>> The details are beyond the user list. I'll respond here again once I get
>>> it fixed.
>>>
>>> Ralph
>>>
>>> On Feb 27, 2009, at 4:14 PM, Jeff Squyres wrote:
>>>
>>>> Unfortunately, I think I have reproduced the problem as well -- with SVN
>>>> trunk HEAD (r20655):
>>>> [15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
>>>> [svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack
>>>> failed in file base/odls_base_default_fns.c at line 566
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --------------------------------------------------------------------------
>>>> Notice that I'm not trying to run an MPI app -- it's just "uptime".
>>>> The following things seem to be necessary to make this error occur for
>>>> me:
>>>> 1. --bynode
>>>> 2. set some mca parameter (any mca parameter)
>>>> 3. -np value less than the size of my slurm allocation
>>>> If I remove any of those, it seems to run fine.
>>>> On Feb 27, 2009, at 5:05 PM, Rolf Vandevaart wrote:
>>>>> With further investigation, I have reproduced this problem. I think I
>>>>> was originally testing against a version that was not recent enough. I
>>>>> do not see it with r20594 which is from February 19. So, something must
>>>>> have happened over the last 8 days. I will try and narrow down the
>>>>> issue.
>>>>> Rolf
>>>>> On 02/27/09 09:34, Rolf Vandevaart wrote:
>>>>>> I just tried trunk-1.4a1r20458 and I did not see this error, although
>>>>>> my configuration was rather different. I ran across 100 2-CPU sparc
>>>>>> nodes, np=256, connected with TCP.
>>>>>> Hopefully George's comment helps out with this issue.
>>>>>> One other thought: to see whether SGE has anything to do with this,
>>>>>> create a hostfile and run outside of SGE.
>>>>>> Rolf
>>>>>> On 02/26/09 22:10, Ralph Castain wrote:
>>>>>>> FWIW: I tested the trunk tonight using both SLURM and rsh launchers,
>>>>>>> and everything checks out fine. However, this is running under SGE and
>>>>>>> thus using qrsh, so it is possible the SGE support is having a
>>>>>>> problem.
>>>>>>> Perhaps one of the Sun OMPI developers can help here?
>>>>>>> Ralph
>>>>>>> On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:
>>>>>>>> It looks like the system doesn't know what nodes the procs are to be
>>>>>>>> placed upon. Can you run this with --display-devel-map? That will
>>>>>>>> tell us where the system thinks it is placing things.
>>>>>>>> Thanks
>>>>>>>> Ralph
>>>>>>>> On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
>>>>>>>>> Maybe it's my pine mailer.
>>>>>>>>> This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
>>>>>>>>> Shanghai nodes running a standard benchmark called stmv.
>>>>>>>>> The basic error message, which occurs 31 times is like:
>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file
>>>>>>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>> The mpirun command has long paths in it, sorry. It's invoking a
>>>>>>>>> special binding
>>>>>>>>> script which in turn launches the NAMD run. This works on an older
>>>>>>>>> SVN at
>>>>>>>>> level 1.4a1r20123 (for 16, 32, 64, 128 and 512 procs) but not for this
>>>>>>>>> 256 proc run where
>>>>>>>>> the older SVN hangs indefinitely polling some completion (sm or
>>>>>>>>> openib). So, I was trying
>>>>>>>>> later SVNs with this 256 proc run, hoping the error would go away.
>>>>>>>>> Here's some of the invocation again. Hope you can read it:
>>>>>>>>> EAGER_SIZE=32767
>>>>>>>>> export OMPI_MCA_btl_openib_use_eager_rdma=0
>>>>>>>>> export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
>>>>>>>>> export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
>>>>>>>>> export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
>>>>>>>>> and, unexpanded
>>>>>>>>> mpirun --prefix $PREFIX -np %PE% $MCA -x
>>>>>>>>> OMPI_MCA_btl_openib_use_eager_rdma -x
>>>>>>>>> OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x
>>>>>>>>> OMPI_MCA_btl_sm_eager_limit -machinefile $HOSTS $MPI_BINDER $NAMD2
>>>>>>>>> stmv.namd
>>>>>>>>> and, expanded
>>>>>>>>> mpirun --prefix
>>>>>>>>> /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>>>>>>>>> -np 256 --mca btl sm,openib,self -x
>>>>>>>>> OMPI_MCA_btl_openib_use_eager_rdma -x
>>>>>>>>> OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x
>>>>>>>>> OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48292.1.all.q/newhosts
>>>>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>>>>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>>>>>>>>> stmv.namd
>>>>>>>>> This is all via Sun Grid Engine.
>>>>>>>>> The OS as indicated above is SuSE SLES 10 SP2.
>>>>>>>>> DM
>>>>>>>>> On Thu, 26 Feb 2009, Ralph Castain wrote:
>>>>>>>>>> I'm sorry, but I can't make any sense of this message. Could you
>>>>>>>>>> provide a
>>>>>>>>>> little explanation of what you are doing, what the system looks
>>>>>>>>>> like, what is
>>>>>>>>>> supposed to happen, etc? I can barely parse your cmd line...
>>>>>>>>>> Thanks
>>>>>>>>>> Ralph
>>>>>>>>>> On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
>>>>>>>>>>> Today's and yesterday's.
>>>>>>>>>>> 1.4a1r20643_svn
>>>>>>>>>>> + mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron -np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48269.1.all.q/newhosts /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 stmv.namd
>>>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>> [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>> [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>> [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>> [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>> Made with INTEL compilers 10.1.015.
>>>>>>>>>>> Regards,
>>>>>>>>>>> Mostyn
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> users mailing list
>>>>>>>>>>> users_at_[hidden]
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> --
>>>>> =========================
>>>>> rolf.vandevaart_at_[hidden]
>>>>> 781-442-3043
>>>>> =========================
>>>> --
>>>> Jeff Squyres
>>>> Cisco Systems
>>>
>