Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Latest SVN failures
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-03-10 16:26:39


Ick. We don't currently have a way to let you ignore an interface
on a node-by-node basis. If you do:

-mca oob_tcp_if_exclude eth0

we will exclude that private Ethernet. The catch is that we will
exclude "eth0" on -every- node. On the two machines you note here,
that will still let us work - but I don't know if we will catch an
"eth0" on another node where we need it.

Can you give it a try and see if it works?
Ralph

On Mar 10, 2009, at 2:13 PM, Mostyn Lewis wrote:

> Maybe I know why now, but it's not pleasant. E.g., two machines in
> the same cluster have their Ethernet interfaces set up like this:
>
> Machine s0157
>
> eth2      Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A8
>           BROADCAST MULTICAST  MTU:1500  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>           Interrupt:233 Base address:0x6000
>
> eth3      Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A9
>           inet addr:10.173.128.13  Bcast:10.173.255.255  Mask:255.255.0.0
>           inet6 addr: fe80::21e:68ff:feda:74a9/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:43777910 errors:16 dropped:0 overruns:0 frame:16
>           TX packets:21148848 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:5780065692 (5512.3 Mb)  TX bytes:59140357016 (56400.6 Mb)
>           Interrupt:50 Base address:0x8000
>
> Machine s0158
>
> eth0      Link encap:Ethernet  HWaddr 00:23:8B:42:10:A9
>           inet addr:7.8.82.158  Bcast:7.8.255.255  Mask:255.255.0.0
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>           Interrupt:233 Base address:0x6000
>
> eth1      Link encap:Ethernet  HWaddr 00:23:8B:42:10:AA
>           inet addr:10.173.128.14  Bcast:10.173.255.255  Mask:255.255.0.0
>           inet6 addr: fe80::223:8bff:fe42:10aa/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:39281716 errors:2 dropped:0 overruns:0 frame:2
>           TX packets:2674296 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:5879861483 (5607.4 Mb)  TX bytes:2406041840 (2294.5 Mb)
>           Interrupt:50 Base address:0x8000
>
> Apart from the interfaces having different names on apparently
> similar machines (this happens when installing SuSE SLES 10 SP2),
> I notice there's a private Ethernet on s0158 at IP 7.8.82.158 -
> I guess this was the one used. How do you exclude it when the eth
> names vary?
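>
> A quick way to check each node for an interface on that private
> subnet (a sketch; assumes ifconfig output like the above):
>
> for h in s0157 s0158; do
>     ssh $h "/sbin/ifconfig | grep -B1 'inet addr:7\.8\.'"
> done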
>
> DM
>
>
> On Tue, 10 Mar 2009, Ralph Castain wrote:
>
>> Not really. I've run much bigger jobs than this without problem, so
>> I don't think there is a fundamental issue here.
>>
>> It looks like the TCP fabric between the various nodes is breaking
>> down. I note in the enclosed messages that the problems are all
>> with comm between daemons 4 and 21. We keep trying to get through,
>> but failing.
>>
>> I can fix things so we don't endlessly loop when that happens
>> (IIRC, I think we are already supposed to abort, but it appears
>> that isn't working). But the real question is why the comm fails in
>> the first place.
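>>
>> If you want to experiment in the meantime, the retry limit is an MCA
>> param - a sketch, assuming your build exposes oob_tcp_peer_retries
>> (check with ompi_info):
>>
>> ompi_info --param oob tcp | grep retries
>> mpirun -mca oob_tcp_peer_retries 120 ...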
>>
>>
>> On Mar 10, 2009, at 10:50 AM, Mostyn Lewis wrote:
>>
>>> Latest status - 1.4a1r20757 (yesterday):
>>> the job now starts with a little output but quickly runs into
>>> trouble with a lot of 'oob-tcp: Communication retries exceeded.
>>> Can not communicate with peer' errors, e.g.:
>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>> The nodes are O.K. ...
>>> Any ideas folks?
>>> DM
>>> On Sat, 28 Feb 2009, Ralph Castain wrote:
>>>> I think I have this figured out - will fix on Monday. I'm not
>>>> sure why Jeff's conditions are all required, especially the
>>>> second one. However, the fundamental problem is that we pull
>>>> information from two sources regarding the number of procs in the
>>>> job when unpacking a buffer, and the two sources appear to be out-
>>>> of-sync with each other in certain scenarios.
>>>> The details are beyond the user list. I'll respond here again
>>>> once I get it fixed.
>>>> Ralph
>>>> On Feb 27, 2009, at 4:14 PM, Jeff Squyres wrote:
>>>>> Unfortunately, I think I have reproduced the problem as well --
>>>>> with SVN trunk HEAD (r20655):
>>>>> [15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
>>>>> [svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack failed in file base/odls_base_default_fns.c at line 566
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>> --------------------------------------------------------------------------
>>>>> Notice that I'm not trying to run an MPI app -- it's just
>>>>> "uptime".
>>>>> The following things seem to be necessary to make this error
>>>>> occur for me:
>>>>> 1. --bynode
>>>>> 2. set some mca parameter (any mca parameter)
>>>>> 3. -np value less than the size of my slurm allocation
>>>>> If I remove any of those, it seems to run fine.
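>>>>>
>>>>> For the record, a sketch of the repro recipe (the 4-node SLURM
>>>>> allocation is just an example size):
>>>>>
>>>>> salloc -N 4
>>>>> mpirun --mca bogus foo --bynode -np 2 uptime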
>>>>> On Feb 27, 2009, at 5:05 PM, Rolf Vandevaart wrote:
>>>>>> With further investigation, I have reproduced this problem. I
>>>>>> think I was originally testing against a version that was not
>>>>>> recent enough. I do not see it with r20594 which is from
>>>>>> February 19. So, something must have happened over the last 8
>>>>>> days. I will try and narrow down the issue.
>>>>>> Rolf
>>>>>> On 02/27/09 09:34, Rolf Vandevaart wrote:
>>>>>>> I just tried trunk-1.4a1r20458 and I did not see this error,
>>>>>>> although my configuration was rather different. I ran across
>>>>>>> 100 2-CPU sparc nodes, np=256, connected with TCP.
>>>>>>> Hopefully George's comment helps out with this issue.
>>>>>>> One other thought: to see whether SGE has anything to do with
>>>>>>> this, create a hostfile and run outside of SGE, e.g.:
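>>>>>>>
>>>>>>> A sketch (hostnames and slot counts are placeholders):
>>>>>>>
>>>>>>> cat > myhosts <<EOF
>>>>>>> s0157 slots=8
>>>>>>> s0158 slots=8
>>>>>>> EOF
>>>>>>> mpirun -np 16 -machinefile myhosts hostname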
>>>>>>> Rolf
>>>>>>> On 02/26/09 22:10, Ralph Castain wrote:
>>>>>>>> FWIW: I tested the trunk tonight using both SLURM and rsh
>>>>>>>> launchers, and everything checks out fine. However, this is
>>>>>>>> running under SGE and thus using qrsh, so it is possible the
>>>>>>>> SGE support is having a problem.
>>>>>>>> Perhaps one of the Sun OMPI developers can help here?
>>>>>>>> Ralph
>>>>>>>> On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:
>>>>>>>>> It looks like the system doesn't know what nodes the procs
>>>>>>>>> are to be placed upon. Can you run this with --display-devel-
>>>>>>>>> map? That will tell us where the system thinks it is placing
>>>>>>>>> things.
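>>>>>>>>>
>>>>>>>>> Something like this (a sketch - just your existing cmd line
>>>>>>>>> with the option added):
>>>>>>>>>
>>>>>>>>> mpirun --display-devel-map -np 256 -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd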
>>>>>>>>> Thanks
>>>>>>>>> Ralph
>>>>>>>>> On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
>>>>>>>>>> Maybe it's my pine mailer.
>>>>>>>>>> This is a NAMD run on 256 procs across 32 dual-socket
>>>>>>>>>> quad-core AMD Shanghai nodes running a standard benchmark
>>>>>>>>>> called stmv.
>>>>>>>>>> The basic error message, which occurs 31 times, is like:
>>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>> The mpirun command has long paths in it, sorry. It's invoking
>>>>>>>>>> a special binding script which in turn launches the NAMD run.
>>>>>>>>>> This works on an older SVN at level 1.4a1r20123 (for 16, 32,
>>>>>>>>>> 64, 128 and 512 procs) but not for this 256-proc run, where
>>>>>>>>>> the older SVN hangs indefinitely polling some completion
>>>>>>>>>> (sm or openib). So, I was trying later SVNs with this 256-proc
>>>>>>>>>> run, hoping the error would go away.
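>>>>>>>>>>
>>>>>>>>>> (For context, the binder does roughly the following - a
>>>>>>>>>> sketch, not the real mpi_binder.MRL, and it assumes the
>>>>>>>>>> launcher exports OMPI_COMM_WORLD_LOCAL_RANK:)
>>>>>>>>>>
>>>>>>>>>> #!/bin/sh
>>>>>>>>>> # Pin this local rank to one of the 8 cores of a dual-socket
>>>>>>>>>> # quad-core node, then exec the real application.
>>>>>>>>>> CORE=$((OMPI_COMM_WORLD_LOCAL_RANK % 8))
>>>>>>>>>> exec taskset -c $CORE "$@"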
>>>>>>>>>> Here's some of the invocation again. Hope you can read it:
>>>>>>>>>> EAGER_SIZE=32767
>>>>>>>>>> export OMPI_MCA_btl_openib_use_eager_rdma=0
>>>>>>>>>> export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
>>>>>>>>>> export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
>>>>>>>>>> export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
>>>>>>>>>> and, unexpanded
>>>>>>>>>> mpirun --prefix $PREFIX -np %PE% $MCA \
>>>>>>>>>>     -x OMPI_MCA_btl_openib_use_eager_rdma \
>>>>>>>>>>     -x OMPI_MCA_btl_openib_eager_limit \
>>>>>>>>>>     -x OMPI_MCA_btl_self_eager_limit \
>>>>>>>>>>     -x OMPI_MCA_btl_sm_eager_limit \
>>>>>>>>>>     -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd
>>>>>>>>>> and, expanded
>>>>>>>>>> mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron \
>>>>>>>>>>     -np 256 --mca btl sm,openib,self \
>>>>>>>>>>     -x OMPI_MCA_btl_openib_use_eager_rdma \
>>>>>>>>>>     -x OMPI_MCA_btl_openib_eager_limit \
>>>>>>>>>>     -x OMPI_MCA_btl_self_eager_limit \
>>>>>>>>>>     -x OMPI_MCA_btl_sm_eager_limit \
>>>>>>>>>>     -machinefile /tmp/48292.1.all.q/newhosts \
>>>>>>>>>>     /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL \
>>>>>>>>>>     /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 stmv.namd
>>>>>>>>>> This is all via Sun Grid Engine.
>>>>>>>>>> The OS as indicated above is SuSE SLES 10 SP2.
>>>>>>>>>> DM
>>>>>>>>>> On Thu, 26 Feb 2009, Ralph Castain wrote:
>>>>>>>>>>> I'm sorry, but I can't make any sense of this message.
>>>>>>>>>>> Could you provide a
>>>>>>>>>>> little explanation of what you are doing, what the system
>>>>>>>>>>> looks like, what is
>>>>>>>>>>> supposed to happen, etc? I can barely parse your cmd line...
>>>>>>>>>>> Thanks
>>>>>>>>>>> Ralph
>>>>>>>>>>> On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
>>>>>>>>>>>> Today's and yesterday's.
>>>>>>>>>>>> 1.4a1r20643_svn
>>>>>>>>>>>> + mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron -np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48269.1.all.q/newhosts /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 stmv.namd
>>>>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>> [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>> [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>> [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>> [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>> Made with INTEL compilers 10.1.015.
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Mostyn
>>>>>> --
>>>>>> =========================
>>>>>> rolf.vandevaart_at_[hidden]
>>>>>> 781-442-3043
>>>>>> =========================
>>>>> --
>>>>> Jeff Squyres
>>>>> Cisco Systems