Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Latest SVN failures
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-03-11 14:59:28


No problem - glad we could help!

However, I am going to file this as a bug. The oob is supposed to
cycle through -all- the available interfaces when attempting to form a
connection to a remote process, and select the one that allows it to
connect. It shouldn't have "fixated" on the first one in your list
(eth0) and hung - it should have tried it, failed to connect, and then
tried eth1, which would have succeeded.

So I apologize for the problem, and appreciate your patience in
helping to identify what is indeed a bug in the code.
Ralph

On Mar 11, 2009, at 12:51 PM, Mostyn Lewis wrote:

> Yes, -mca oob_tcp_if_exclude eth0 worked O.K., even though some
> machines have no eth0.
>
> Thanks,
> DM
>
> On Tue, 10 Mar 2009, Ralph Castain wrote:
>
>> Ick. We don't currently have a way to let you ignore an
>> interface on a node-by-node basis. If you do:
>>
>> -mca oob_tcp_if_exclude eth0
>>
>> we will exclude that private Ethernet. The catch is that we will
>> exclude "eth0" on -every- node. On the two machines you note here,
>> that will still let us work - but I don't know if we will catch an
>> "eth0" on another node where we need it.
>>
>> Can you give it a try and see if it works?
>> Ralph
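A sketch of how this workaround could cover the mixed naming shown in the
listings below, assuming the parameter accepts a comma-separated list of
interface names and that no node uses eth0 or eth2 for its cluster network:

    # Exclude every private or unused interface name seen anywhere in
    # the cluster (applies globally, on every node):
    mpirun -mca oob_tcp_if_exclude eth0,eth2 -np 256 ...

    # Or pin the OOB to the known-good names instead; oob_tcp_if_include
    # is assumed here to take the same comma-separated form:
    mpirun -mca oob_tcp_if_include eth1,eth3 -np 256 ...

Either form is global, so it only helps if the listed names are
consistently safe (or consistently good) on every node in the job.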
>>
>> On Mar 10, 2009, at 2:13 PM, Mostyn Lewis wrote:
>>
>>> Maybe I know why now, but it's not pleasant; e.g., two machines in
>>> the same cluster have their Ethernet interfaces set up like this:
>>> Machine s0157
>>> eth2  Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A8
>>>       BROADCAST MULTICAST  MTU:1500  Metric:1
>>>       RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>       TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>       collisions:0 txqueuelen:1000
>>>       RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>>       Interrupt:233 Base address:0x6000
>>> eth3  Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A9
>>>       inet addr:10.173.128.13  Bcast:10.173.255.255  Mask:255.255.0.0
>>>       inet6 addr: fe80::21e:68ff:feda:74a9/64 Scope:Link
>>>       UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>       RX packets:43777910 errors:16 dropped:0 overruns:0 frame:16
>>>       TX packets:21148848 errors:0 dropped:0 overruns:0 carrier:0
>>>       collisions:0 txqueuelen:1000
>>>       RX bytes:5780065692 (5512.3 Mb)  TX bytes:59140357016 (56400.6 Mb)
>>>       Interrupt:50 Base address:0x8000
>>> Machine s0158
>>> eth0  Link encap:Ethernet  HWaddr 00:23:8B:42:10:A9
>>>       inet addr:7.8.82.158  Bcast:7.8.255.255  Mask:255.255.0.0
>>>       UP BROADCAST MULTICAST  MTU:1500  Metric:1
>>>       RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>       TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>       collisions:0 txqueuelen:1000
>>>       RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>>       Interrupt:233 Base address:0x6000
>>> eth1  Link encap:Ethernet  HWaddr 00:23:8B:42:10:AA
>>>       inet addr:10.173.128.14  Bcast:10.173.255.255  Mask:255.255.0.0
>>>       inet6 addr: fe80::223:8bff:fe42:10aa/64 Scope:Link
>>>       UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>       RX packets:39281716 errors:2 dropped:0 overruns:0 frame:2
>>>       TX packets:2674296 errors:0 dropped:0 overruns:0 carrier:0
>>>       collisions:0 txqueuelen:1000
>>>       RX bytes:5879861483 (5607.4 Mb)  TX bytes:2406041840 (2294.5 Mb)
>>>       Interrupt:50 Base address:0x8000
>>> Apart from the eths ending up with different names (which happens
>>> when installing SuSE SLES 10 SP2) on apparently similar machines, I
>>> notice there's a private Ethernet on s0158 at IP 7.8.82.158 - I
>>> guess this is the one that was used. How do I exclude it when the
>>> eth names vary?
>>> DM
>>> On Tue, 10 Mar 2009, Ralph Castain wrote:
>>>> Not really. I've run much bigger jobs than this without problem,
>>>> so I don't think there is a fundamental issue here.
>>>> It looks like the TCP fabric between the various nodes is
>>>> breaking down. I note in the enclosed messages that the problems
>>>> are all with comm between daemons 4 and 21. We keep trying to get
>>>> through, but failing.
>>>> I can fix things so we don't endlessly loop when that happens
>>>> (IIRC, we are already supposed to abort, but it appears that
>>>> isn't working). But the real question is why the comm fails
>>>> in the first place.
>>>> On Mar 10, 2009, at 10:50 AM, Mostyn Lewis wrote:
>>>>> Latest status - 1.4a1r20757 (yesterday):
>>>>> the job now starts with a little output but quickly runs into
>>>>> trouble with a lot of
>>>>> 'oob-tcp: Communication retries exceeded. Can not communicate
>>>>> with peer' errors, e.g.:
>>>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>>> The nodes are O.K. ...
>>>>> Any ideas, folks?
>>>>> DM
>>>>> On Sat, 28 Feb 2009, Ralph Castain wrote:
>>>>>> I think I have this figured out - will fix on Monday. I'm not
>>>>>> sure why Jeff's conditions are all required, especially the
>>>>>> second one. However, the fundamental problem is that we pull
>>>>>> information from two sources regarding the number of procs in
>>>>>> the job when unpacking a buffer, and the two sources appear to
>>>>>> be out-of-sync with each other in certain scenarios.
>>>>>> The details are beyond the user list. I'll respond here again
>>>>>> once I get it fixed.
>>>>>> Ralph
>>>>>> On Feb 27, 2009, at 4:14 PM, Jeff Squyres wrote:
>>>>>>> Unfortunately, I think I have reproduced the problem as well
>>>>>>> -- with SVN trunk HEAD (r20655):
>>>>>>> [15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
>>>>>>> [svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack failed in file base/odls_base_default_fns.c at line 566
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>>> process
>>>>>>> that caused that situation.
>>>>>>> --------------------------------------------------------------------------
>>>>>>> Notice that I'm not trying to run an MPI app -- it's just
>>>>>>> "uptime".
>>>>>>> The following things seem to be necessary to make this error
>>>>>>> occur for me:
>>>>>>> 1. --bynode
>>>>>>> 2. set some mca parameter (any mca parameter)
>>>>>>> 3. -np value less than the size of my slurm allocation
>>>>>>> If I remove any of those, it seems to run fine.
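A sketch of those three toggles, reusing Jeff's command from above; the
4-node SLURM allocation size is a hypothetical stand-in for his environment:

    salloc -N 4                                    # SLURM allocation of 4 nodes
    mpirun --mca bogus foo --bynode -np 2 uptime   # all three conditions met: fails
    mpirun --bynode -np 2 uptime                   # no MCA parameter set: expected to run
    mpirun --mca bogus foo --bynode -np 4 uptime   # np matches the allocation: expected to run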
>>>>>>> On Feb 27, 2009, at 5:05 PM, Rolf Vandevaart wrote:
>>>>>>>> With further investigation, I have reproduced this problem.
>>>>>>>> I think I was originally testing against a version that was
>>>>>>>> not recent enough. I do not see it with r20594, which is from
>>>>>>>> February 19. So, something must have happened over the last
>>>>>>>> 8 days. I will try to narrow down the issue.
>>>>>>>> Rolf
>>>>>>>> On 02/27/09 09:34, Rolf Vandevaart wrote:
>>>>>>>>> I just tried trunk-1.4a1r20458 and I did not see this error,
>>>>>>>>> although my configuration was rather different. I ran
>>>>>>>>> across 100 2-CPU SPARC nodes, np=256, connected with TCP.
>>>>>>>>> Hopefully George's comment helps out with this issue.
>>>>>>>>> One other thought: to see whether SGE has anything to do with
>>>>>>>>> this, create a hostfile and run outside of SGE.
>>>>>>>>> Rolf
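A minimal sketch of that check, reusing node names and the -machinefile
option that appear elsewhere in this thread:

    # hostfile: one node per line
    cat > myhosts <<EOF
    s0157
    s0158
    EOF

    # launch directly from a login shell, bypassing SGE's qrsh:
    mpirun -np 4 -machinefile myhosts hostname

If the ORTE_ERROR_LOG failure still appears with SGE out of the picture,
the SGE support is not the culprit.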
>>>>>>>>> On 02/26/09 22:10, Ralph Castain wrote:
>>>>>>>>>> FWIW: I tested the trunk tonight using both SLURM and rsh
>>>>>>>>>> launchers, and everything checks out fine. However, this is
>>>>>>>>>> running under SGE and thus using qrsh, so it is possible
>>>>>>>>>> the SGE support is having a problem.
>>>>>>>>>> Perhaps one of the Sun OMPI developers can help here?
>>>>>>>>>> Ralph
>>>>>>>>>> On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:
>>>>>>>>>>> It looks like the system doesn't know what nodes the procs
>>>>>>>>>>> are to be placed upon. Can you run this with
>>>>>>>>>>> --display-devel-map? That will tell us where the system
>>>>>>>>>>> thinks it is placing things.
>>>>>>>>>>> Thanks
>>>>>>>>>>> Ralph
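For reference, a sketch of adding that flag to the unexpanded command
quoted further down (the -x exports are elided here for brevity; the
variables come from Mostyn's own launch script):

    mpirun --display-devel-map --prefix $PREFIX -np %PE% $MCA \
        -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd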
>>>>>>>>>>> On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
>>>>>>>>>>>> Maybe it's my pine mailer.
>>>>>>>>>>>> This is a NAMD run on 256 procs across 32 dual-socket
>>>>>>>>>>>> quad-core AMD Shanghai nodes running a standard benchmark
>>>>>>>>>>>> called stmv.
>>>>>>>>>>>> The basic error message, which occurs 31 times is like:
>>>>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>> The mpirun command has long paths in it, sorry. It's
>>>>>>>>>>>> invoking a special binding script which in turn launches
>>>>>>>>>>>> the NAMD run. This works on an older SVN at level
>>>>>>>>>>>> 1.4a1r20123 (for 16, 32, 64, 128, and 512 procs) but not
>>>>>>>>>>>> for this 256 proc run, where the older SVN hangs
>>>>>>>>>>>> indefinitely polling some completion (sm or openib). So, I
>>>>>>>>>>>> was trying later SVNs with this 256 proc run, hoping the
>>>>>>>>>>>> error would go away.
>>>>>>>>>>>> Here's some of the invocation again. Hope you can read it:
>>>>>>>>>>>> EAGER_SIZE=32767
>>>>>>>>>>>> export OMPI_MCA_btl_openib_use_eager_rdma=0
>>>>>>>>>>>> export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
>>>>>>>>>>>> export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
>>>>>>>>>>>> export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
>>>>>>>>>>>> and, unexpanded
>>>>>>>>>>>> mpirun --prefix $PREFIX -np %PE% $MCA -x
>>>>>>>>>>>> OMPI_MCA_btl_openib_use_eager_rdma -x
>>>>>>>>>>>> OMPI_MCA_btl_openib_eager_limit -x
>>>>>>>>>>>> OMPI_MCA_btl_self_eager_limit -x
>>>>>>>>>>>> OMPI_MCA_btl_sm_eager_limit -machinefile $HOSTS
>>>>>>>>>>>> $MPI_BINDER $NAMD2 stmv.namd
>>>>>>>>>>>> and, expanded
>>>>>>>>>>>> mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron \
>>>>>>>>>>>>     -np 256 --mca btl sm,openib,self \
>>>>>>>>>>>>     -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit \
>>>>>>>>>>>>     -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit \
>>>>>>>>>>>>     -machinefile /tmp/48292.1.all.q/newhosts \
>>>>>>>>>>>>     /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL \
>>>>>>>>>>>>     /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 \
>>>>>>>>>>>>     stmv.namd
>>>>>>>>>>>> This is all via Sun Grid Engine.
>>>>>>>>>>>> The OS as indicated above is SuSE SLES 10 SP2.
>>>>>>>>>>>> DM
>>>>>>>>>>>> On Thu, 26 Feb 2009, Ralph Castain wrote:
>>>>>>>>>>>>> I'm sorry, but I can't make any sense of this message.
>>>>>>>>>>>>> Could you provide a
>>>>>>>>>>>>> little explanation of what you are doing, what the
>>>>>>>>>>>>> system looks like, what is
>>>>>>>>>>>>> supposed to happen, etc? I can barely parse your cmd
>>>>>>>>>>>>> line...
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>> On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
>>>>>>>>>>>>>> Today's and yesterday's.
>>>>>>>>>>>>>> 1.4a1r20643_svn
>>>>>>>>>>>>>> + mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron \
>>>>>>>>>>>>>>     -np 256 --mca btl sm,openib,self \
>>>>>>>>>>>>>>     -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit \
>>>>>>>>>>>>>>     -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit \
>>>>>>>>>>>>>>     -machinefile /tmp/48269.1.all.q/newhosts \
>>>>>>>>>>>>>>     /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL \
>>>>>>>>>>>>>>     /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 \
>>>>>>>>>>>>>>     stmv.namd
>>>>>>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>>>> [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>>>> [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>>>> [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>>>> [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>>>> Made with INTEL compilers 10.1.015.
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Mostyn
>>>>>>>> --
>>>>>>>> =========================
>>>>>>>> rolf.vandevaart_at_[hidden]
>>>>>>>> 781-442-3043
>>>>>>>> =========================
>>>>>>> --
>>>>>>> Jeff Squyres
>>>>>>> Cisco Systems