Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Latest SVN failures
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-03-10 16:32:35


You *could* have a per-machine MCA param config file that is locally
staged on each machine and set up with the exclude for whatever you
need on *that* node. Ugly, but it could work...?
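
For example (an untested sketch; $PREFIX here means the Open MPI
install prefix on that node, and I'm assuming eth0 is the unwanted
interface only on s0158):

# staged locally on s0158 only
echo "oob_tcp_if_exclude = eth0" >> $PREFIX/etc/openmpi-mca-params.conf

Each node's Open MPI reads its own copy of that file at startup, so
every machine can carry a different exclude list.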

On Mar 10, 2009, at 4:26 PM, Ralph Castain wrote:

> Ick. We don't currently have a way to let you ignore an interface
> on a node-by-node basis. If you do:
>
> -mca oob_tcp_if_exclude eth0
>
> we will exclude that private Ethernet. The catch is that we will
> exclude "eth0" on -every- node. On the two machines you note here,
> that will still let us work - but I don't know if there is an
> "eth0" on another node where we actually need it.
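>
> For example, folded into the launch line from further down this
> thread (same placeholders; just a sketch):
>
> mpirun -mca oob_tcp_if_exclude eth0 --prefix $PREFIX -np %PE% \
>   -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd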
>
> Can you give it a try and see if it works?
> Ralph
>
> On Mar 10, 2009, at 2:13 PM, Mostyn Lewis wrote:
>
> > Maybe I know why now, but it's not pleasant; e.g., 2 machines in
> > the same cluster have their Ethernet interfaces set up like this:
> >
> > Machine s0157
> >
> > eth2 Link encap:Ethernet HWaddr 00:1E:68:DA:74:A8
> > BROADCAST MULTICAST MTU:1500 Metric:1
> > RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:1000
> > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
> > Interrupt:233 Base address:0x6000
> >
> > eth3 Link encap:Ethernet HWaddr 00:1E:68:DA:74:A9
> > inet addr:10.173.128.13 Bcast:10.173.255.255 Mask:255.255.0.0
> > inet6 addr: fe80::21e:68ff:feda:74a9/64 Scope:Link
> > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > RX packets:43777910 errors:16 dropped:0 overruns:0 frame:16
> > TX packets:21148848 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:1000
> > RX bytes:5780065692 (5512.3 Mb) TX bytes:59140357016 (56400.6 Mb)
> > Interrupt:50 Base address:0x8000
> >
> > Machine s0158
> >
> > eth0 Link encap:Ethernet HWaddr 00:23:8B:42:10:A9
> > inet addr:7.8.82.158 Bcast:7.8.255.255 Mask:255.255.0.0
> > UP BROADCAST MULTICAST MTU:1500 Metric:1
> > RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:1000
> > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
> > Interrupt:233 Base address:0x6000
> >
> > eth1 Link encap:Ethernet HWaddr 00:23:8B:42:10:AA
> > inet addr:10.173.128.14 Bcast:10.173.255.255 Mask:255.255.0.0
> > inet6 addr: fe80::223:8bff:fe42:10aa/64 Scope:Link
> > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > RX packets:39281716 errors:2 dropped:0 overruns:0 frame:2
> > TX packets:2674296 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:1000
> > RX bytes:5879861483 (5607.4 Mb) TX bytes:2406041840 (2294.5 Mb)
> > Interrupt:50 Base address:0x8000
> >
> > Apart from the eths having different names on apparently similar
> > machines (happens when installing SuSE SLES 10 SP2), I notice
> > there's a private Ethernet on s0158 at IP 7.8.82.158 - I guess this
> > is the one that was used. How do you exclude an interface when the
> > eth names vary from node to node?
> >
> > DM
> >
> >
> > On Tue, 10 Mar 2009, Ralph Castain wrote:
> >
> >> Not really. I've run much bigger jobs than this without problem, so
> >> I don't think there is a fundamental issue here.
> >>
> >> It looks like the TCP fabric between the various nodes is breaking
> >> down. I note in the enclosed messages that the problems are all
> >> with comm between daemons 4 and 21. We keep trying to get through,
> >> but failing.
> >>
> >> I can fix things so we don't endlessly loop when that happens
> >> (IIRC, I think we are already supposed to abort, but it appears
> >> that isn't working). But the real question is why the comm fails in
> >> the first place.
> >>
> >>
> >> On Mar 10, 2009, at 10:50 AM, Mostyn Lewis wrote:
> >>
> >>> Latest status - 1.4a1r20757 (yesterday):
> >>> the job now starts with a little output but quickly runs into
> >>> trouble with a lot of
> >>> 'oob-tcp: Communication retries exceeded. Can not communicate with peer'
> >>> errors, e.g.:
> >>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
> >>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
> >>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
> >>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
> >>> The nodes are O.K. ...
> >>> Any ideas folks?
> >>> DM
> >>> On Sat, 28 Feb 2009, Ralph Castain wrote:
> >>>> I think I have this figured out - will fix on Monday. I'm not
> >>>> sure why Jeff's conditions are all required, especially the
> >>>> second one. However, the fundamental problem is that we pull
> >>>> information from two sources regarding the number of procs in the
> >>>> job when unpacking a buffer, and the two sources appear to be
> >>>> out-of-sync with each other in certain scenarios.
> >>>> The details are beyond the user list. I'll respond here again
> >>>> once I get it fixed.
> >>>> Ralph
> >>>> On Feb 27, 2009, at 4:14 PM, Jeff Squyres wrote:
> >>>>> Unfortunately, I think I have reproduced the problem as well --
> >>>>> with SVN trunk HEAD (r20655):
> >>>>> [15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
> >>>>> [svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack failed in file base/odls_base_default_fns.c at line 566
> >>>>> --------------------------------------------------------------------------
> >>>>> mpirun noticed that the job aborted, but has no info as to the
> >>>>> process that caused that situation.
> >>>>> --------------------------------------------------------------------------
> >>>>> Notice that I'm not trying to run an MPI app -- it's just
> >>>>> "uptime".
> >>>>> The following things seem to be necessary to make this error
> >>>>> occur for me:
> >>>>> 1. --bynode
> >>>>> 2. set some mca parameter (any mca parameter)
> >>>>> 3. -np value less than the size of my slurm allocation
> >>>>> If I remove any of those, it seems to run fine.
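> >>>>>
> >>>>> For instance (a sketch, assuming a 4-node SLURM allocation purely
> >>>>> for illustration):
> >>>>>
> >>>>> salloc -N 4
> >>>>> mpirun --mca bogus foo --bynode -np 2 uptime   # fails as above
> >>>>> mpirun --bynode -np 2 uptime                   # no MCA param set: fine
> >>>>> mpirun --mca bogus foo --bynode -np 4 uptime   # np == allocation: fine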
> >>>>> On Feb 27, 2009, at 5:05 PM, Rolf Vandevaart wrote:
> >>>>>> With further investigation, I have reproduced this problem. I
> >>>>>> think I was originally testing against a version that was not
> >>>>>> recent enough. I do not see it with r20594 which is from
> >>>>>> February 19. So, something must have happened over the last 8
> >>>>>> days. I will try and narrow down the issue.
> >>>>>> Rolf
> >>>>>> On 02/27/09 09:34, Rolf Vandevaart wrote:
> >>>>>>> I just tried trunk-1.4a1r20458 and I did not see this error,
> >>>>>>> although my configuration was rather different. I ran across
> >>>>>>> 100 2-CPU sparc nodes, np=256, connected with TCP.
> >>>>>>> Hopefully George's comment helps out with this issue.
> >>>>>>> One other thought: to see whether SGE has anything to do with
> >>>>>>> this, create a hostfile and run outside of SGE, along the lines
> >>>>>>> of the sketch below.
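> >>>>>>>
> >>>>>>> (A minimal sketch; the hostnames come from this thread and the
> >>>>>>> slot counts just match the 8-core nodes described below:
> >>>>>>>
> >>>>>>> cat > myhosts <<EOF
> >>>>>>> s0157 slots=8
> >>>>>>> s0158 slots=8
> >>>>>>> EOF
> >>>>>>> mpirun --hostfile myhosts -np 16 uptime
> >>>>>>> )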
> >>>>>>> Rolf
> >>>>>>> On 02/26/09 22:10, Ralph Castain wrote:
> >>>>>>>> FWIW: I tested the trunk tonight using both SLURM and rsh
> >>>>>>>> launchers, and everything checks out fine. However, this is
> >>>>>>>> running under SGE and thus using qrsh, so it is possible the
> >>>>>>>> SGE support is having a problem.
> >>>>>>>> Perhaps one of the Sun OMPI developers can help here?
> >>>>>>>> Ralph
> >>>>>>>> On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:
> >>>>>>>>> It looks like the system doesn't know what nodes the procs
> >>>>>>>>> are to be placed upon. Can you run this with
> >>>>>>>>> --display-devel-map? That will tell us where the system
> >>>>>>>>> thinks it is placing things.
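> >>>>>>>>> (I.e., just add the flag to the front of your existing cmd
> >>>>>>>>> line: mpirun --display-devel-map --prefix ... -np 256 ... )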
> >>>>>>>>> Thanks
> >>>>>>>>> Ralph
> >>>>>>>>> On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
> >>>>>>>>>> Maybe it's my pine mailer.
> >>>>>>>>>> This is a NAMD run on 256 procs across 32 dual-socket
> >>>>>>>>>> quad-core AMD Shanghai nodes running a standard benchmark
> >>>>>>>>>> called stmv.
> >>>>>>>>>> The basic error message, which occurs 31 times, is like:
> >>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file
> >>>>>>>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
> >>>>>>>>>> The mpirun command has long paths in it, sorry. It's
> >>>>>>>>>> invoking a special binding script which in turn launches the
> >>>>>>>>>> NAMD run. This works on an older SVN at level 1.4a1r20123
> >>>>>>>>>> (for 16, 32, 64, 128 and 512 procs) but not for this 256 proc
> >>>>>>>>>> run, where the older SVN hangs indefinitely polling some
> >>>>>>>>>> completion (sm or openib). So, I was trying later SVNs with
> >>>>>>>>>> this 256 proc run, hoping the error would go away.
> >>>>>>>>>> Here's some of the invocation again. Hope you can read it:
> >>>>>>>>>> EAGER_SIZE=32767
> >>>>>>>>>> export OMPI_MCA_btl_openib_use_eager_rdma=0
> >>>>>>>>>> export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
> >>>>>>>>>> export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
> >>>>>>>>>> export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
> >>>>>>>>>> and, unexpanded
> >>>>>>>>>> mpirun --prefix $PREFIX -np %PE% $MCA -x
> >>>>>>>>>> OMPI_MCA_btl_openib_use_eager_rdma -x
> >>>>>>>>>> OMPI_MCA_btl_openib_eager_limit -x
> >>>>>>>>>> OMPI_MCA_btl_self_eager_limit -x
> >>>>>>>>>> OMPI_MCA_btl_sm_eager_limit -machinefile $HOSTS $MPI_BINDER
> >>>>>>>>>> $NAMD2 stmv.namd
> >>>>>>>>>> and, expanded
> >>>>>>>>>> mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
> >>>>>>>>>> -np 256 --mca btl sm,openib,self
> >>>>>>>>>> -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit
> >>>>>>>>>> -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit
> >>>>>>>>>> -machinefile /tmp/48292.1.all.q/newhosts
> >>>>>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
> >>>>>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
> >>>>>>>>>> stmv.namd
> >>>>>>>>>> This is all via Sun Grid Engine.
> >>>>>>>>>> The OS as indicated above is SuSE SLES 10 SP2.
> >>>>>>>>>> DM
> >>>>>>>>>> On Thu, 26 Feb 2009, Ralph Castain wrote:
> >>>>>>>>>>> I'm sorry, but I can't make any sense of this message.
> >>>>>>>>>>> Could you provide a
> >>>>>>>>>>> little explanation of what you are doing, what the system
> >>>>>>>>>>> looks like, what is
> >>>>>>>>>>> supposed to happen, etc? I can barely parse your cmd line...
> >>>>>>>>>>> Thanks
> >>>>>>>>>>> Ralph
> >>>>>>>>>>> On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
> >>>>>>>>>>>> Today's and yesterday's.
> >>>>>>>>>>>> 1.4a1r20643_svn
> >>>>>>>>>>>> + mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
> >>>>>>>>>>>> -np 256 --mca btl sm,openib,self
> >>>>>>>>>>>> -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit
> >>>>>>>>>>>> -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit
> >>>>>>>>>>>> -machinefile /tmp/48269.1.all.q/newhosts
> >>>>>>>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
> >>>>>>>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
> >>>>>>>>>>>> stmv.namd
> >>>>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
> >>>>>>>>>>>> [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
> >>>>>>>>>>>> [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
> >>>>>>>>>>>> [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
> >>>>>>>>>>>> [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
> >>>>>>>>>>>> Made with INTEL compilers 10.1.015.
> >>>>>>>>>>>> Regards,
> >>>>>>>>>>>> Mostyn
> >>>>>> --
> >>>>>> =========================
> >>>>>> rolf.vandevaart_at_[hidden]
> >>>>>> 781-442-3043
> >>>>>> =========================
> >>>>> --
> >>>>> Jeff Squyres
> >>>>> Cisco Systems

-- 
Jeff Squyres
Cisco Systems