
Subject: Re: [OMPI users] Latest SVN failures
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-03-10 16:32:35


You *could* have a per-machine mca param config file that could be
locally staged on each machine and set up with the exclude for whatever
you need on *that* node. Ugly, but it could work...?
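A minimal sketch of that idea, assuming the interface to hide on a given
node is eth0: Open MPI reads MCA parameters from $HOME/.openmpi/mca-params.conf
and from $prefix/etc/openmpi-mca-params.conf on each node, so a file staged
only on the nodes that have the private interface could contain just:

    # mca-params.conf, staged only on the node whose private NIC should
    # be ignored; the interface name can differ from node to node
    oob_tcp_if_exclude = eth0

Because each node reads its own local copy, the excluded name can be eth0 on
one node and eth2 on another.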

On Mar 10, 2009, at 4:26 PM, Ralph Castain wrote:

> Ick. We don't have a way currently to allow you to ignore an interface
> on a node-by-node basis. If you do:
>
> -mca oob_tcp_if_exclude eth0
>
> we will exclude that private Ethernet. The catch is that we will
> exclude "eth0" on -every- node. On the two machines you note here,
> that will still let us work - but I don't know if we will catch an
> "eth0" on another node where we need it.
>
> Can you give it a try and see if it works?
> Ralph
>
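For concreteness, a sketch (not from the original messages) of how the
suggested flag would slot into the mpirun line quoted further down this
thread, reusing Mostyn's own shell variables:

    mpirun -mca oob_tcp_if_exclude eth0 --prefix $PREFIX -np %PE% $MCA \
           -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit \
           -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit \
           -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd

The caveat above still applies: set on the command line, the exclusion of
"eth0" is global to the job rather than per node.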
> On Mar 10, 2009, at 2:13 PM, Mostyn Lewis wrote:
>
> > Maybe I know why now, but it's not pleasant. E.g., two machines in the
> > same cluster have their Ethernet interfaces configured like this:
> >
> > Machine s0157
> >
> > eth2      Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A8
> >           BROADCAST MULTICAST  MTU:1500  Metric:1
> >           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
> >           Interrupt:233 Base address:0x6000
> >
> > eth3      Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A9
> >           inet addr:10.173.128.13  Bcast:10.173.255.255  Mask:255.255.0.0
> >           inet6 addr: fe80::21e:68ff:feda:74a9/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:43777910 errors:16 dropped:0 overruns:0 frame:16
> >           TX packets:21148848 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:5780065692 (5512.3 Mb)  TX bytes:59140357016 (56400.6 Mb)
> >           Interrupt:50 Base address:0x8000
> >
> > Machine s0158
> >
> > eth0      Link encap:Ethernet  HWaddr 00:23:8B:42:10:A9
> >           inet addr:7.8.82.158  Bcast:7.8.255.255  Mask:255.255.0.0
> >           UP BROADCAST MULTICAST  MTU:1500  Metric:1
> >           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
> >           Interrupt:233 Base address:0x6000
> >
> > eth1      Link encap:Ethernet  HWaddr 00:23:8B:42:10:AA
> >           inet addr:10.173.128.14  Bcast:10.173.255.255  Mask:255.255.0.0
> >           inet6 addr: fe80::223:8bff:fe42:10aa/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:39281716 errors:2 dropped:0 overruns:0 frame:2
> >           TX packets:2674296 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:5879861483 (5607.4 Mb)  TX bytes:2406041840 (2294.5 Mb)
> >           Interrupt:50 Base address:0x8000
> >
> > Apart from the interfaces having different names on apparently similar
> > machines (this happens when installing SuSE SLES 10 SP2), I notice
> > there's a private Ethernet on s0158 at IP 7.8.82.158 - I guess this was
> > used. How do you exclude it when the eth names vary?
> >
> > DM
> >
> >
> > On Tue, 10 Mar 2009, Ralph Castain wrote:
> >
> >> Not really. I've run much bigger jobs than this without problem, so
> >> I don't think there is a fundamental issue here.
> >>
> >> It looks like the TCP fabric between the various nodes is breaking
> >> down. I note in the enclosed messages that the problems are all
> >> with comm between daemons 4 and 21. We keep trying to get through,
> >> but failing.
> >>
> >> I can fix things so we don't endlessly loop when that happens
> >> (IIRC, I think we are already supposed to abort, but it appears
> >> that isn't working). But the real question is why the comm fails in
> >> the first place.
> >>
> >>
> >> On Mar 10, 2009, at 10:50 AM, Mostyn Lewis wrote:
> >>
> >>> Latest status - 1.4a1r20757 (yesterday);
> >>> the job now starts with a little output but quickly runs into trouble
> >>> with a lot of 'oob-tcp: Communication retries exceeded. Can not
> >>> communicate with peer' errors, e.g.:
> >>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
> >>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
> >>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
> >>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
> >>> The nodes are O.K. ...
> >>> Any ideas folks?
> >>> DM
> >>> On Sat, 28 Feb 2009, Ralph Castain wrote:
> >>>> I think I have this figured out - will fix on Monday. I'm not sure
> >>>> why Jeff's conditions are all required, especially the second one.
> >>>> However, the fundamental problem is that we pull information from two
> >>>> sources regarding the number of procs in the job when unpacking a
> >>>> buffer, and the two sources appear to be out-of-sync with each other
> >>>> in certain scenarios.
> >>>> The details are beyond the user list. I'll respond here again once I
> >>>> get it fixed.
> >>>> Ralph
> >>>> On Feb 27, 2009, at 4:14 PM, Jeff Squyres wrote:
> >>>>> Unfortunately, I think I have reproduced the problem as well -- with
> >>>>> SVN trunk HEAD (r20655):
> >>>>> [15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
> >>>>> [svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack failed in file base/odls_base_default_fns.c at line 566
> >>>>> --------------------------------------------------------------------------
> >>>>> mpirun noticed that the job aborted, but has no info as to the
> >>>>> process that caused that situation.
> >>>>> --------------------------------------------------------------------------
> >>>>> Notice that I'm not trying to run an MPI app -- it's just "uptime".
> >>>>> The following things seem to be necessary to make this error occur
> >>>>> for me:
> >>>>> 1. --bynode
> >>>>> 2. set some mca parameter (any mca parameter)
> >>>>> 3. -np value less than the size of my slurm allocation
> >>>>> If I remove any of those, it seems to run fine.
> >>>>> On Feb 27, 2009, at 5:05 PM, Rolf Vandevaart wrote:
> >>>>>> With further investigation, I have reproduced this problem. I
> >>>>>> think I was originally testing against a version that was not
> >>>>>> recent enough. I do not see it with r20594 which is from
> >>>>>> February 19. So, something must have happened over the last 8
> >>>>>> days. I will try and narrow down the issue.
> >>>>>> Rolf
> >>>>>> On 02/27/09 09:34, Rolf Vandevaart wrote:
> >>>>>>> I just tried trunk-1.4a1r20458 and I did not see this error,
> >>>>>>> although my configuration was rather different. I ran across
> >>>>>>> 100 2-CPU sparc nodes, np=256, connected with TCP.
> >>>>>>> Hopefully George's comment helps out with this issue.
> >>>>>>> One other thought to see whether SGE has anything to do with this
> >>>>>>> is to create a hostfile and run it outside of SGE.
> >>>>>>> Rolf
> >>>>>>> On 02/26/09 22:10, Ralph Castain wrote:
> >>>>>>>> FWIW: I tested the trunk tonight using both SLURM and rsh
> >>>>>>>> launchers, and everything checks out fine. However, this is
> >>>>>>>> running under SGE and thus using qrsh, so it is possible the
> >>>>>>>> SGE support is having a problem.
> >>>>>>>> Perhaps one of the Sun OMPI developers can help here?
> >>>>>>>> Ralph
> >>>>>>>> On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:
> >>>>>>>>> It looks like the system doesn't know what nodes the procs are to
> >>>>>>>>> be placed upon. Can you run this with --display-devel-map? That
> >>>>>>>>> will tell us where the system thinks it is placing things.
> >>>>>>>>> Thanks
> >>>>>>>>> Ralph
> >>>>>>>>> On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
> >>>>>>>>>> Maybe it's my pine mailer.
> >>>>>>>>>> This is a NAMD run on 256 procs across 32 dual-socket quad-core
> >>>>>>>>>> AMD Shanghai nodes running a standard benchmark called stmv.
> >>>>>>>>>> The basic error message, which occurs 31 times, is like:
> >>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
> >>>>>>>>>> The mpirun command has long paths in it, sorry. It's invoking a
> >>>>>>>>>> special binding script which in turn launches the NAMD run. This
> >>>>>>>>>> works on an older SVN at level 1.4a1r20123 (for 16, 32, 64, 128
> >>>>>>>>>> and 512 procs) but not for this 256 proc run, where the older SVN
> >>>>>>>>>> hangs indefinitely polling some completion (sm or openib). So, I
> >>>>>>>>>> was trying later SVNs with this 256 proc run, hoping the error
> >>>>>>>>>> would go away.
> >>>>>>>>>> Here's some of the invocation again. Hope you can read it:
> >>>>>>>>>> EAGER_SIZE=32767
> >>>>>>>>>> export OMPI_MCA_btl_openib_use_eager_rdma=0
> >>>>>>>>>> export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
> >>>>>>>>>> export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
> >>>>>>>>>> export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
> >>>>>>>>>> and, unexpanded
> >>>>>>>>>> mpirun --prefix $PREFIX -np %PE% $MCA -x
> >>>>>>>>>> OMPI_MCA_btl_openib_use_eager_rdma -x
> >>>>>>>>>> OMPI_MCA_btl_openib_eager_limit -x
> >>>>>>>>>> OMPI_MCA_btl_self_eager_limit -x
> >>>>>>>>>> OMPI_MCA_btl_sm_eager_limit -machinefile $HOSTS $MPI_BINDER
> >>>>>>>>>> $NAMD2 stmv.namd
> >>>>>>>>>> and, expanded
> >>>>>>>>>> mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
> >>>>>>>>>>   -np 256 --mca btl sm,openib,self
> >>>>>>>>>>   -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit
> >>>>>>>>>>   -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit
> >>>>>>>>>>   -machinefile /tmp/48292.1.all.q/newhosts
> >>>>>>>>>>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
> >>>>>>>>>>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 stmv.namd
> >>>>>>>>>> This is all via Sun Grid Engine.
> >>>>>>>>>> The OS as indicated above is SuSE SLES 10 SP2.
> >>>>>>>>>> DM
> >>>>>>>>>> On Thu, 26 Feb 2009, Ralph Castain wrote:
> >>>>>>>>>>> I'm sorry, but I can't make any sense of this message. Could you
> >>>>>>>>>>> provide a little explanation of what you are doing, what the
> >>>>>>>>>>> system looks like, what is supposed to happen, etc? I can barely
> >>>>>>>>>>> parse your cmd line...
> >>>>>>>>>>> Thanks
> >>>>>>>>>>> Ralph
> >>>>>>>>>>> On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
> >>>>>>>>>>>> Today's and yesterday's.
> >>>>>>>>>>>> 1.4a1r20643_svn
> >>>>>>>>>>>> + mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
> >>>>>>>>>>>>   -np 256 --mca btl sm,openib,self
> >>>>>>>>>>>>   -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit
> >>>>>>>>>>>>   -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit
> >>>>>>>>>>>>   -machinefile /tmp/48269.1.all.q/newhosts
> >>>>>>>>>>>>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
> >>>>>>>>>>>>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 stmv.namd
> >>>>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
> >>>>>>>>>>>> [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
> >>>>>>>>>>>> [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
> >>>>>>>>>>>> [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
> >>>>>>>>>>>> [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
> >>>>>>>>>>>> Made with INTEL compilers 10.1.015.
> >>>>>>>>>>>> Regards,
> >>>>>>>>>>>> Mostyn
> >>>>>> --
> >>>>>> =========================
> >>>>>> rolf.vandevaart_at_[hidden]
> >>>>>> 781-442-3043
> >>>>>> =========================
> >>>>> --
> >>>>> Jeff Squyres
> >>>>> Cisco Systems
> >>
>

-- 
Jeff Squyres
Cisco Systems