Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Latest SVN failures
From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2009-03-11 15:08:12


Ugh! If you don't get to this by Friday and I'm able to get the XGrid bug
knocked out quickly, I'll take a look. I remember being worried about
that case when I fixed up the OOB connection code, but I thought I had
convinced myself it was right. Apparently not - I wonder if I got a loop
wrong and it tries eth0 N times before trying eth1 ;).

Brian

On Wed, 11 Mar 2009, Ralph Castain wrote:

> No problem - glad we could help!
>
> However, I am going to file this as a bug. The oob is supposed to cycle
> through -all- the available interfaces when attempting to form a connection
> to a remote process, and select the one that allows it to connect. It
> shouldn't have "fixated" on the first one in your list (eth0) and hung - it
> should have tried it, failed to connect, and then tried eth1, which would
> have succeeded.
>
> So I apologize for the problem, and appreciate your patience in helping to
> identify what is indeed a bug in the code.
> Ralph
>
> On Mar 11, 2009, at 12:51 PM, Mostyn Lewis wrote:
>
>> Yes, -mca oob_tcp_if_exclude eth0, worked O.K., even though some
>> machines have no eth0.
>>
>> Thanks,
>> DM
>>
>> On Tue, 10 Mar 2009, Ralph Castain wrote:
>>
>>> Ick. We don't have a way currently to allow you to ignore an interface on
>>> a node-by-node basis. If you do:
>>>
>>> -mca oob_tcp_if_exclude eth0
>>>
>>> we will exclude that private Ethernet. The catch is that we will exclude
>>> "eth0" on -every- node. On the two machines you note here, that will still
>>> let us work - but I don't know if we will catch an "eth0" on another node
>>> where we need it.
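>>>
>>> For example, something along these lines (an illustrative sketch only --
>>> $PREFIX, $HOSTS, $MPI_BINDER and $NAMD2 stand for the values already used
>>> in your launch script, and everything else stays as it is):
>>>
>>> mpirun --prefix $PREFIX -np 256 --mca btl sm,openib,self --mca oob_tcp_if_exclude eth0 -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd
>>>
>>> Since you already pass MCA parameters through the environment, exporting
>>> OMPI_MCA_oob_tcp_if_exclude=eth0 before the run should have the same effect.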
>>>
>>> Can you give it a try and see if it works?
>>> Ralph
>>>
>>> On Mar 10, 2009, at 2:13 PM, Mostyn Lewis wrote:
>>>
>>>> Maybe I know why now, but it's not pleasant; e.g., 2 machines in the same
>>>> cluster have their Ethernet interfaces configured like this:
>>>> Machine s0157
>>>> eth2 Link encap:Ethernet HWaddr 00:1E:68:DA:74:A8
>>>> BROADCAST MULTICAST MTU:1500 Metric:1
>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>> collisions:0 txqueuelen:1000
>>>> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
>>>> Interrupt:233 Base address:0x6000
>>>> eth3 Link encap:Ethernet HWaddr 00:1E:68:DA:74:A9
>>>> inet addr:10.173.128.13 Bcast:10.173.255.255 Mask:255.255.0.0
>>>> inet6 addr: fe80::21e:68ff:feda:74a9/64 Scope:Link
>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>> RX packets:43777910 errors:16 dropped:0 overruns:0 frame:16
>>>> TX packets:21148848 errors:0 dropped:0 overruns:0 carrier:0
>>>> collisions:0 txqueuelen:1000
>>>> RX bytes:5780065692 (5512.3 Mb) TX bytes:59140357016 (56400.6 Mb)
>>>> Interrupt:50 Base address:0x8000
>>>> Machine s0158
>>>> eth0 Link encap:Ethernet HWaddr 00:23:8B:42:10:A9
>>>> inet addr:7.8.82.158 Bcast:7.8.255.255 Mask:255.255.0.0
>>>> UP BROADCAST MULTICAST MTU:1500 Metric:1
>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>> collisions:0 txqueuelen:1000
>>>> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
>>>> Interrupt:233 Base address:0x6000
>>>> eth1 Link encap:Ethernet HWaddr 00:23:8B:42:10:AA
>>>> inet addr:10.173.128.14 Bcast:10.173.255.255 Mask:255.255.0.0
>>>> inet6 addr: fe80::223:8bff:fe42:10aa/64 Scope:Link
>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>> RX packets:39281716 errors:2 dropped:0 overruns:0 frame:2
>>>> TX packets:2674296 errors:0 dropped:0 overruns:0 carrier:0
>>>> collisions:0 txqueuelen:1000
>>>> RX bytes:5879861483 (5607.4 Mb) TX bytes:2406041840 (2294.5 Mb)
>>>> Interrupt:50 Base address:0x8000
>>>> Apart from the eths having different names (which happens when installing
>>>> SuSE SLES 10 SP2) on apparently similar machines, I notice there's a private
>>>> Ethernet on s0158 at IP 7.8.82.158 - I guess this was the one used. How do I
>>>> exclude it when the eth names vary?
>>>> DM
>>>> On Tue, 10 Mar 2009, Ralph Castain wrote:
>>>>> Not really. I've run much bigger jobs than this without problem, so I
>>>>> don't think there is a fundamental issue here.
>>>>> It looks like the TCP fabric between the various nodes is breaking down.
>>>>> I note in the enclosed messages that the problems are all with comm
>>>>> between daemons 4 and 21. We keep trying to get through, but failing.
>>>>> I can fix things so we don't endlessly loop when that happens (IIRC, I
>>>>> think we are already supposed to abort, but it appears that isn't
>>>>> working). But the real question is why the comm fails in the first
>>>>> place.
>>>>> On Mar 10, 2009, at 10:50 AM, Mostyn Lewis wrote:
>>>>>> Latest status - 1.4a1r20757 (yesterday): the job now starts with a little
>>>>>> output, but quickly runs into trouble with a lot of
>>>>>> 'oob-tcp: Communication retries exceeded. Can not communicate with peer'
>>>>>> errors, e.g.:
>>>>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>>>> The nodes are O.K. ...
>>>>>> Any ideas folks?
>>>>>> DM
>>>>>> On Sat, 28 Feb 2009, Ralph Castain wrote:
>>>>>>> I think I have this figured out - will fix on Monday. I'm not sure why
>>>>>>> Jeff's conditions are all required, especially the second one.
>>>>>>> However, the fundamental problem is that we pull information from two
>>>>>>> sources regarding the number of procs in the job when unpacking a
>>>>>>> buffer, and the two sources appear to be out-of-sync with each other
>>>>>>> in certain scenarios.
>>>>>>> The details are beyond the user list. I'll respond here again once I
>>>>>>> get it fixed.
>>>>>>> Ralph
>>>>>>> On Feb 27, 2009, at 4:14 PM, Jeff Squyres wrote:
>>>>>>>> Unfortunately, I think I have reproduced the problem as well -- with
>>>>>>>> SVN trunk HEAD (r20655):
>>>>>>>> [15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
>>>>>>>> [svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack
>>>>>>>> failed in file base/odls_base_default_fns.c at line 566
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>>>> process
>>>>>>>> that caused that situation.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> Notice that I'm not trying to run an MPI app -- it's just "uptime".
>>>>>>>> The following things seem to be necessary to make this error occur
>>>>>>>> for me:
>>>>>>>> 1. --bynode
>>>>>>>> 2. set some mca parameter (any mca parameter)
>>>>>>>> 3. -np value less than the size of my slurm allocation
>>>>>>>> If I remove any of those, it seems to run fine.
>>>>>>>> On Feb 27, 2009, at 5:05 PM, Rolf Vandevaart wrote:
>>>>>>>>> With further investigation, I have reproduced this problem. I think
>>>>>>>>> I was originally testing against a version that was not recent
>>>>>>>>> enough. I do not see it with r20594 which is from February 19. So,
>>>>>>>>> something must have happened over the last 8 days. I will try and
>>>>>>>>> narrow down the issue.
>>>>>>>>> Rolf
>>>>>>>>> On 02/27/09 09:34, Rolf Vandevaart wrote:
>>>>>>>>>> I just tried trunk-1.4a1r20458 and I did not see this error,
>>>>>>>>>> although my configuration was rather different. I ran across 100
>>>>>>>>>> 2-CPU sparc nodes, np=256, connected with TCP.
>>>>>>>>>> Hopefully George's comment helps out with this issue.
>>>>>>>>>> One other thought, to see whether SGE has anything to do with this,
>>>>>>>>>> is to create a hostfile and run the job outside of SGE.
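>>>>>>>>>> Something along these lines, for example (illustrative only; the
>>>>>>>>>> hostnames and slot counts are placeholders to adapt to the cluster):
>>>>>>>>>> cat > myhosts <<EOF
>>>>>>>>>> s0157 slots=8
>>>>>>>>>> s0158 slots=8
>>>>>>>>>> EOF
>>>>>>>>>> mpirun -np 16 --hostfile myhosts hostname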
>>>>>>>>>> Rolf
>>>>>>>>>> On 02/26/09 22:10, Ralph Castain wrote:
>>>>>>>>>>> FWIW: I tested the trunk tonight using both SLURM and rsh
>>>>>>>>>>> launchers, and everything checks out fine. However, this is
>>>>>>>>>>> running under SGE and thus using qrsh, so it is possible the SGE
>>>>>>>>>>> support is having a problem.
>>>>>>>>>>> Perhaps one of the Sun OMPI developers can help here?
>>>>>>>>>>> Ralph
>>>>>>>>>>> On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:
>>>>>>>>>>>> It looks like the system doesn't know what nodes the procs are to
>>>>>>>>>>>> be placed upon. Can you run this with --display-devel-map? That
>>>>>>>>>>>> will tell us where the system thinks it is placing things.
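>>>>>>>>>>>> For instance (illustrative only; just add the flag to the mpirun
>>>>>>>>>>>> line you already use, leaving everything else unchanged):
>>>>>>>>>>>> mpirun --display-devel-map --prefix $PREFIX -np 256 --mca btl sm,openib,self -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd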
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Ralph
>>>>>>>>>>>> On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
>>>>>>>>>>>>> Maybe it's my pine mailer.
>>>>>>>>>>>>> This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
>>>>>>>>>>>>> Shanghai nodes running a standard benchmark called stmv.
>>>>>>>>>>>>> The basic error message, which occurs 31 times, is like:
>>>>>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>>> The mpirun command has long paths in it, sorry. It's invoking a special
>>>>>>>>>>>>> binding script which in turn launches the NAMD run. This works on an older
>>>>>>>>>>>>> SVN at level 1.4a1r20123 (for 16, 32, 64, 128 and 512 procs) but not for
>>>>>>>>>>>>> this 256 proc run, where the older SVN hangs indefinitely polling some
>>>>>>>>>>>>> completion (sm or openib). So I was trying later SVNs with this 256 proc
>>>>>>>>>>>>> run, hoping the error would go away.
>>>>>>>>>>>>> Here's some of the invocation again. Hope you can read it:
>>>>>>>>>>>>> EAGER_SIZE=32767
>>>>>>>>>>>>> export OMPI_MCA_btl_openib_use_eager_rdma=0
>>>>>>>>>>>>> export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
>>>>>>>>>>>>> export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
>>>>>>>>>>>>> export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
>>>>>>>>>>>>> and, unexpanded
>>>>>>>>>>>>> mpirun --prefix $PREFIX -np %PE% $MCA -x
>>>>>>>>>>>>> OMPI_MCA_btl_openib_use_eager_rdma -x
>>>>>>>>>>>>> OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit
>>>>>>>>>>>>> -x OMPI_MCA_btl_sm_eager_limit -machinefile $HOSTS $MPI_BINDER
>>>>>>>>>>>>> $NAMD2 stmv.namd
>>>>>>>>>>>>> and, expanded
>>>>>>>>>>>>> mpirun --prefix
>>>>>>>>>>>>> /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>>>>>>>>>>>>> -np 256 --mca btl sm,openib,self -x
>>>>>>>>>>>>> OMPI_MCA_btl_openib_use_eager_rdma -x
>>>>>>>>>>>>> OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit
>>>>>>>>>>>>> -x OMPI_MCA_btl_sm_eager_limit -machinefile
>>>>>>>>>>>>> /tmp/48292.1.all.q/newhosts
>>>>>>>>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>>>>>>>>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>>>>>>>>>>>>> stmv.namd
>>>>>>>>>>>>> This is all via Sun Grid Engine.
>>>>>>>>>>>>> The OS as indicated above is SuSE SLES 10 SP2.
>>>>>>>>>>>>> DM
>>>>>>>>>>>>> On Thu, 26 Feb 2009, Ralph Castain wrote:
>>>>>>>>>>>>>> I'm sorry, but I can't make any sense of this message. Could
>>>>>>>>>>>>>> you provide a
>>>>>>>>>>>>>> little explanation of what you are doing, what the system looks
>>>>>>>>>>>>>> like, what is
>>>>>>>>>>>>>> supposed to happen, etc? I can barely parse your cmd line...
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>> On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
>>>>>>>>>>>>>>> Today's and yesterday's.
>>>>>>>>>>>>>>> 1.4a1r20643_svn
>>>>>>>>>>>>>>> + mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron -np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48269.1.all.q/newhosts /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 stmv.namd
>>>>>>>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>>>>> [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>>>>> [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>>>>> [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>>>>> [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>>>>> Made with INTEL compilers 10.1.015.
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Mostyn
>>>>>>>>> --
>>>>>>>>> =========================
>>>>>>>>> rolf.vandevaart_at_[hidden]
>>>>>>>>> 781-442-3043
>>>>>>>>> =========================
>>>>>>>> --
>>>>>>>> Jeff Squyres
>>>>>>>> Cisco Systems