
Subject: Re: [OMPI users] Latest SVN failures
From: Mostyn Lewis (Mostyn.Lewis_at_[hidden])
Date: 2009-03-11 14:51:16


Yes, -mca oob_tcp_if_exclude eth0 worked O.K., even though some
machines have no eth0.

Thanks,
DM

On Tue, 10 Mar 2009, Ralph Castain wrote:

> Ick. We don't have a way currently to allow you to ignore an interface on a
> node-by-node basis. If you do:
>
> -mca oob_tcp_if_exclude eth0
>
> we will exclude that private Ethernet. The catch is that we will exclude
> "eth0" on -every- node. On the two machines you note here, that will still
> let us work - but I don't know if we will catch an "eth0" on another node
> where we need it.
>
> Can you give it a try and see if it works?
> Ralph
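For reference, a minimal sketch of the two equivalent ways this parameter can be set; the hostfile name here is illustrative and the process count is just an example:

  # on the mpirun command line, applied to every node in the job
  mpirun -mca oob_tcp_if_exclude eth0 -np 256 -machinefile myhosts ./namd2 stmv.namd

  # or via the environment, using the same OMPI_MCA_* mechanism as the
  # btl eager-limit settings quoted later in this thread
  export OMPI_MCA_oob_tcp_if_exclude=eth0
  mpirun -np 256 -machinefile myhosts ./namd2 stmv.namd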
>
> On Mar 10, 2009, at 2:13 PM, Mostyn Lewis wrote:
>
>> Maybe I know why now, but it's not pleasant; e.g., 2 machines in the same
>> cluster have their Ethernet interfaces set up like this:
>>
>> Machine s0157
>>
>> eth2 Link encap:Ethernet HWaddr 00:1E:68:DA:74:A8
>> BROADCAST MULTICAST MTU:1500 Metric:1
>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:1000
>> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
>> Interrupt:233 Base address:0x6000
>>
>> eth3 Link encap:Ethernet HWaddr 00:1E:68:DA:74:A9
>> inet addr:10.173.128.13 Bcast:10.173.255.255 Mask:255.255.0.0
>> inet6 addr: fe80::21e:68ff:feda:74a9/64 Scope:Link
>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>> RX packets:43777910 errors:16 dropped:0 overruns:0 frame:16
>> TX packets:21148848 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:1000
>> RX bytes:5780065692 (5512.3 Mb) TX bytes:59140357016 (56400.6 Mb)
>> Interrupt:50 Base address:0x8000
>>
>> Machine s0158
>>
>> eth0 Link encap:Ethernet HWaddr 00:23:8B:42:10:A9
>> inet addr:7.8.82.158 Bcast:7.8.255.255 Mask:255.255.0.0
>> UP BROADCAST MULTICAST MTU:1500 Metric:1
>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:1000
>> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
>> Interrupt:233 Base address:0x6000
>>
>> eth1 Link encap:Ethernet HWaddr 00:23:8B:42:10:AA
>> inet addr:10.173.128.14 Bcast:10.173.255.255 Mask:255.255.0.0
>> inet6 addr: fe80::223:8bff:fe42:10aa/64 Scope:Link
>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>> RX packets:39281716 errors:2 dropped:0 overruns:0 frame:2
>> TX packets:2674296 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:1000
>> RX bytes:5879861483 (5607.4 Mb) TX bytes:2406041840 (2294.5 Mb)
>> Interrupt:50 Base address:0x8000
>>
>> Apart from the eths having different names (which happens when installing SuSE
>> SLES 10 SP2) on apparently similar machines, I notice there's a private Ethernet
>> on s0158 at IP 7.8.82.158 - I guess this was the one used. How do you exclude it
>> when the eth names vary?
>>
>> DM
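A quick way to check which interface names actually exist on each node before picking an exclude value is to loop over the machine file with ssh; a minimal sketch, assuming $HOSTS is the machine file passed to -machinefile elsewhere in this thread and contains one hostname per line:

  for h in $(sort -u $HOSTS); do
      echo "== $h =="
      ssh $h "/sbin/ifconfig | grep 'Link encap'"
  done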
>>
>>
>> On Tue, 10 Mar 2009, Ralph Castain wrote:
>>
>>> Not really. I've run much bigger jobs than this without problem, so I
>>> don't think there is a fundamental issue here.
>>>
>>> It looks like the TCP fabric between the various nodes is breaking down. I
>>> note in the enclosed messages that the problems are all with comm between
>>> daemons 4 and 21. We keep trying to get through, but failing.
>>>
>>> I can fix things so we don't endlessly loop when that happens (IIRC, I
>>> think we are already supposed to abort, but it appears that isn't
>>> working). But the real question is why the comm fails in the first place.
>>>
>>>
>>> On Mar 10, 2009, at 10:50 AM, Mostyn Lewis wrote:
>>>
>>>> Latest status - 1.4a1r20757 (yesterday):
>>>> the job now starts with a little output but quickly runs into trouble with
>>>> a lot of 'oob-tcp: Communication retries exceeded. Can not communicate with
>>>> peer' errors, e.g.:
>>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> The nodes are O.K. ...
>>>> Any ideas folks?
>>>> DM
>>>> On Sat, 28 Feb 2009, Ralph Castain wrote:
>>>>> I think I have this figured out - will fix on Monday. I'm not sure why
>>>>> Jeff's conditions are all required, especially the second one. However,
>>>>> the fundamental problem is that we pull information from two sources
>>>>> regarding the number of procs in the job when unpacking a buffer, and
>>>>> the two sources appear to be out-of-sync with each other in certain
>>>>> scenarios.
>>>>> The details are beyond the user list. I'll respond here again once I get
>>>>> it fixed.
>>>>> Ralph
>>>>> On Feb 27, 2009, at 4:14 PM, Jeff Squyres wrote:
>>>>>> Unfortunately, I think I have reproduced the problem as well -- with
>>>>>> SVN trunk HEAD (r20655):
>>>>>> [15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
>>>>>> [svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack
>>>>>> failed in file base/odls_base_default_fns.c at line 566
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>> that caused that situation.
>>>>>> --------------------------------------------------------------------------
>>>>>> Notice that I'm not trying to run an MPI app -- it's just "uptime".
>>>>>> The following things seem to be necessary to make this error occur for
>>>>>> me:
>>>>>> 1. --bynode
>>>>>> 2. set some mca parameter (any mca parameter)
>>>>>> 3. -np value less than the size of my slurm allocation
>>>>>> If I remove any of those, it seems to run fine.
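Putting those together, a minimal sketch of the reproducer; the allocation size is illustrative, anything larger than the -np value should do:

  # inside a SLURM allocation that is larger than -np, e.g. salloc -N 4
  mpirun --mca bogus foo --bynode -np 2 uptime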
>>>>>> On Feb 27, 2009, at 5:05 PM, Rolf Vandevaart wrote:
>>>>>>> With further investigation, I have reproduced this problem. I think I
>>>>>>> was originally testing against a version that was not recent enough.
>>>>>>> I do not see it with r20594 which is from February 19. So, something
>>>>>>> must have happened over the last 8 days. I will try and narrow down
>>>>>>> the issue.
>>>>>>> Rolf
>>>>>>> On 02/27/09 09:34, Rolf Vandevaart wrote:
>>>>>>>> I just tried trunk-1.4a1r20458 and I did not see this error, although
>>>>>>>> my configuration was rather different. I ran across 100 2-CPU sparc
>>>>>>>> nodes, np=256, connected with TCP.
>>>>>>>> Hopefully George's comment helps out with this issue.
>>>>>>>> One other thought: to see whether SGE has anything to do with this, create a
>>>>>>>> hostfile and run the job outside of SGE.
>>>>>>>> Rolf
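A minimal sketch of that test; the hostnames are the ones from the ifconfig listings above, and the slot counts are illustrative:

  # hostfile: one node per line, slots = processes allowed on that node
  cat > myhosts <<EOF
  s0157 slots=8
  s0158 slots=8
  EOF

  # launching outside SGE uses the rsh/ssh launcher instead of qrsh
  mpirun --hostfile myhosts -np 16 ./namd2 stmv.namd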
>>>>>>>> On 02/26/09 22:10, Ralph Castain wrote:
>>>>>>>>> FWIW: I tested the trunk tonight using both SLURM and rsh launchers,
>>>>>>>>> and everything checks out fine. However, this is running under SGE
>>>>>>>>> and thus using qrsh, so it is possible the SGE support is having a
>>>>>>>>> problem.
>>>>>>>>> Perhaps one of the Sun OMPI developers can help here?
>>>>>>>>> Ralph
>>>>>>>>> On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:
>>>>>>>>>> It looks like the system doesn't know what nodes the procs are to
>>>>>>>>>> be placed upon. Can you run this with --display-devel-map? That
>>>>>>>>>> will tell us where the system thinks it is placing things.
>>>>>>>>>> Thanks
>>>>>>>>>> Ralph
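In other words, the same invocation with the flag added; a sketch based on the unexpanded command quoted below, with the -x options omitted for brevity:

  mpirun --display-devel-map --prefix $PREFIX -np %PE% $MCA -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd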
>>>>>>>>>> On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
>>>>>>>>>>> Maybe it's my pine mailer.
>>>>>>>>>>> This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
>>>>>>>>>>> Shanghai nodes running a standard benchmark called stmv.
>>>>>>>>>>> The basic error message, which occurs 31 times, is like:
>>>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file
>>>>>>>>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>> The mpirun command has long paths in it, sorry. It's invoking a special
>>>>>>>>>>> binding script which in turn launches the NAMD run. This works on an older
>>>>>>>>>>> SVN at level 1.4a1r20123 (for 16, 32, 64, 128 and 512 procs) but not for
>>>>>>>>>>> this 256 proc run, where the older SVN hangs indefinitely polling some
>>>>>>>>>>> completion (sm or openib). So I was trying later SVNs with this 256 proc
>>>>>>>>>>> run, hoping the error would go away.
>>>>>>>>>>> Here's some of the invocation again. Hope you can read it:
>>>>>>>>>>> EAGER_SIZE=32767
>>>>>>>>>>> export OMPI_MCA_btl_openib_use_eager_rdma=0
>>>>>>>>>>> export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
>>>>>>>>>>> export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
>>>>>>>>>>> export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
>>>>>>>>>>> and, unexpanded
>>>>>>>>>>> mpirun --prefix $PREFIX -np %PE% $MCA -x
>>>>>>>>>>> OMPI_MCA_btl_openib_use_eager_rdma -x
>>>>>>>>>>> OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit
>>>>>>>>>>> -x OMPI_MCA_btl_sm_eager_limit -machinefile $HOSTS $MPI_BINDER
>>>>>>>>>>> $NAMD2 stmv.namd
>>>>>>>>>>> and, expanded
>>>>>>>>>>> mpirun --prefix
>>>>>>>>>>> /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>>>>>>>>>>> -np 256 --mca btl sm,openib,self -x
>>>>>>>>>>> OMPI_MCA_btl_openib_use_eager_rdma -x
>>>>>>>>>>> OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit
>>>>>>>>>>> -x OMPI_MCA_btl_sm_eager_limit -machinefile
>>>>>>>>>>> /tmp/48292.1.all.q/newhosts
>>>>>>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>>>>>>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>>>>>>>>>>> stmv.namd
>>>>>>>>>>> This is all via Sun Grid Engine.
>>>>>>>>>>> The OS as indicated above is SuSE SLES 10 SP2.
>>>>>>>>>>> DM
>>>>>>>>>>> On Thu, 26 Feb 2009, Ralph Castain wrote:
>>>>>>>>>>>> I'm sorry, but I can't make any sense of this message. Could you
>>>>>>>>>>>> provide a
>>>>>>>>>>>> little explanation of what you are doing, what the system looks
>>>>>>>>>>>> like, what is
>>>>>>>>>>>> supposed to happen, etc? I can barely parse your cmd line...
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Ralph
>>>>>>>>>>>> On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
>>>>>>>>>>>>> Today's and yesterday's.
>>>>>>>>>>>>> 1.4a1r20643_svn
>>>>>>>>>>>>> + mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>>>>>>>>>>>>> -np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma
>>>>>>>>>>>>> -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit
>>>>>>>>>>>>> -x OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48269.1.all.q/newhosts
>>>>>>>>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>>>>>>>>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>>>>>>>>>>>>> stmv.namd
>>>>>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>>> [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>>> [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>>> [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>>> [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>>> Made with INTEL compilers 10.1.015.
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Mostyn
>>>>>>> --
>>>>>>> =========================
>>>>>>> rolf.vandevaart_at_[hidden]
>>>>>>> 781-442-3043
>>>>>>> =========================
>>>>>> --
>>>>>> Jeff Squyres
>>>>>> Cisco Systems