Open MPI User's Mailing List Archives

From: Jeff Stuart (cpunerd_at_[hidden])
Date: 2007-04-04 17:44:43


[stuart_at_cortex ~]$ mpirun -V
mpirun (Open MPI) 1.2

On 4/4/07, users-request_at_[hidden] <users-request_at_[hidden]> wrote:
> Send users mailing list submissions to
> users_at_[hidden]
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
> users-request_at_[hidden]
>
> You can reach the person managing the list at
> users-owner_at_[hidden]
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
>
> Today's Topics:
>
> 1. Re: "Address not mapped" error on user defined MPI_OP
> function (Eric Thibodeau)
> 2. MPI 1.2 stuck in pthread_condition_wait ( hpetit_at_[hidden] )
> 3. Re: "Address not mapped" error on user defined MPI_OP
> function (Eric Thibodeau)
> 4. Re: problem with MPI_Bcast over ethernet (Jeff Squyres)
> 5. Re: btl_tcp_endpoint errors (Jeff Squyres)
> 6. Re: problems with profile.d scripts generated using
> openmpi.spec (Jeff Squyres)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 4 Apr 2007 12:31:46 -0400
> From: Eric Thibodeau <kyron_at_[hidden]>
> Subject: Re: [OMPI users] "Address not mapped" error on user defined
> MPI_OP function
> To: users_at_[hidden]
> Message-ID: <200704041231.46356.kyron_at_[hidden]>
> Content-Type: text/plain; charset="iso-8859-1"
>
> I completely forgot to mention which version of Open MPI I am using; I'll gladly post additional info if required:
>
> kyron_at_kyron ~/openmpi-1.2 $ ompi_info |head
> Open MPI: 1.2
> Open MPI SVN revision: r14027
> Open RTE: 1.2
> Open RTE SVN revision: r14027
> OPAL: 1.2
> OPAL SVN revision: r14027
> Prefix: /home/kyron/openmpi_i686
> Configured architecture: i686-pc-linux-gnu
> Configured by: kyron
> Configured on: Wed Apr 4 10:21:34 EDT 2007
>
> On Wednesday, April 4, 2007 at 11:47, Eric Thibodeau wrote:
> > Hello all,
> >
> > First off, please excuse the attached code as I may be naïve in my attempts to implement my own MPI_OP.
> >
> > I am attempting to create my own MPI_OP to use with MPI_Allreduce. I have been able to find very few examples on the net of creating MPI_OPs. My present references are "MPI: The Complete Reference, Volume 1, 2nd edition" and some rather good slides I found at http://www.mpi-hd.mpg.de/personalhomes/stiff/MPI/ . I am attaching my "proof of concept" code, which fails with:
> >
> > [kyron:14074] *** Process received signal ***
> > [kyron:14074] Signal: Segmentation fault (11)
> > [kyron:14074] Signal code: Address not mapped (1)
> > [kyron:14074] Failing at address: 0x801da600
> > [kyron:14074] [ 0] [0x6ffa6440]
> > [kyron:14074] [ 1] /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_recursivedoubling+0x700) [0x6fbb0dd0]
> > [kyron:14074] [ 2] /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0xb2) [0x6fbae9a2]
> > [kyron:14074] [ 3] /home/kyron/openmpi_i686/lib/libmpi.so.0(PMPI_Allreduce+0x1a6) [0x6ff61e86]
> > [kyron:14074] [ 4] AllReduceTest(main+0x180) [0x8048ee8]
> > [kyron:14074] [ 5] /lib/libc.so.6(__libc_start_main+0xe3) [0x6fcbd823]
> > [kyron:14074] *** End of error message ***
> >
> >
> > Eric Thibodeau
> -------------- next part --------------
> HTML attachment scrubbed and removed
>
> ------------------------------
>
> Message: 2
> Date: Wed, 4 Apr 2007 18:50:38 +0200
> From: " hpetit_at_[hidden] " <hpetit_at_[hidden]>
> Subject: [OMPI users] MPI 1.2 stuck in pthread_condition_wait
> To: " users " <users_at_[hidden]>
> Message-ID: <JFZG4E$41584250C17E66D5AFE2EAFA16558974_at_[hidden]>
> Content-Type: text/plain; charset=iso-8859-1
>
> Hi,
>
> I have a problem with MPI 1.2.0rc being locked in a "pthread_condition_wait" call.
> This happens with any application when Open MPI has been compiled with multi-thread support.
>
> The full "configure" options are
> "./configure --prefix=/usr/local/Mpi/openmpi-1.2 --enable-mpi-threads
> --enable-progress-threads --with-threads=posix --enable-smp-lock"
>
> An example GDB session is provided below:
>
> -------------------------------------------------------------------------------------------------------------
> >GNU gdb 6.3-debian
> >Copyright 2004 Free Software Foundation, Inc.
> >GDB is free software, covered by the GNU General Public License, and
> >you are welcome to change it and/or distribute copies of it under certain
> >conditions.
> >Type "show copying" to see the conditions.
> >There is absolutely no warranty for GDB. Type "show warranty" for
> >details.
> >This GDB was configured as "i386-linux"...Using host libthread_db
> >library "/lib/tls/libthread_db.so.1".
> >
> >(gdb) run -np 1 spawn6
> >Starting program: /usr/local/openmpi-1.2.0/bin/mpirun -np 1 spawn6
> >[Thread debugging using libthread_db enabled]
> >[New Thread 1076191360 (LWP 29006)]
> >[New Thread 1084808112 (LWP 29009)]
> >main*******************************
> >main : Lancement MPI*
> >
> >Program received signal SIGINT, Interrupt.
> >[Switching to Thread 1084808112 (LWP 29009)]
> >0x401f0523 in poll () from /lib/tls/libc.so.6
> >(gdb) where
> >#0 0x401f0523 in poll () from /lib/tls/libc.so.6
> >#1 0x40081c7c in opal_poll_dispatch () from
> >/usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
> >#2 0x4007e4f1 in opal_event_base_loop () from
> >/usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
> >#3 0x4007e36b in opal_event_loop () from
> >/usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
> >#4 0x4007f423 in opal_event_run () from
> >/usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
> >#5 0x40115b63 in start_thread () from /lib/tls/libpthread.so.0
> >#6 0x401f918a in clone () from /lib/tls/libc.so.6
> >(gdb) bt
> >#0 0x401f0523 in poll () from /lib/tls/libc.so.6
> >#1 0x40081c7c in opal_poll_dispatch () from
> >/usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
> >#2 0x4007e4f1 in opal_event_base_loop () from
> >/usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
> >#3 0x4007e36b in opal_event_loop () from
> >/usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
> >#4 0x4007f423 in opal_event_run () from
> >/usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
> >#5 0x40115b63 in start_thread () from /lib/tls/libpthread.so.0
> >#6 0x401f918a in clone () from /lib/tls/libc.so.6
> >(gdb) info threads
> >* 2 Thread 1084808112 (LWP 29009) 0x401f0523 in poll () from
> >/lib/tls/libc.so.6
> > 1 Thread 1076191360 (LWP 29006) 0x40118295 in
> >pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
> >(gdb) thread 1
> >[Switching to thread 1 (Thread 1076191360 (LWP 29006))]#0 0x40118295
> >in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
> >(gdb) bt
> >#0 0x40118295 in pthread_cond_wait@@GLIBC_2.3.2 () from
> >/lib/tls/libpthread.so.0
> >#1 0x0804cb68 in opal_condition_wait (c=0x8050e4c, m=0x8050e28) at
> >condition.h:64
> >#2 0x0804a4fe in orterun (argc=4, argv=0xbffff844) at orterun.c:436
> >#3 0x0804a046 in main (argc=4, argv=0xbffff844) at main.c:13
> >(gdb) where
> >#0 0x40118295 in pthread_cond_wait@@GLIBC_2.3.2 () from
> >/lib/tls/libpthread.so.0
> >#1 0x0804cb68 in opal_condition_wait (c=0x8050e4c, m=0x8050e28) at
> >condition.h:64
> >#2 0x0804a4fe in orterun (argc=4, argv=0xbffff844) at orterun.c:436
> >#3 0x0804a046 in main (argc=4, argv=0xbffff844) at main.c:13
>
> -------------------------------------------------------------------------------------------------------------
>
> I have read the other threads related to multi-thread support. I understand that multi-thread support will not be a priority before the end of the year.
>
> The thing is, this locking problem has only appeared since the 1.1.2 Open MPI release, and as it is a locking problem, I was wondering if you could make an exception and try to analyse this one before the end of the year.
>
> Thanks,
>
> Herve
>
> P.S.: my OS is Debian sarge.
>
>
>
>
>
>
>
>
> ------------------------------
>
> Message: 3
> Date: Wed, 4 Apr 2007 13:32:15 -0400
> From: Eric Thibodeau <kyron_at_[hidden]>
> Subject: Re: [OMPI users] "Address not mapped" error on user defined
> MPI_OP function
> To: users_at_[hidden]
> Message-ID: <200704041332.15575.kyron_at_[hidden]>
> Content-Type: text/plain; charset="iso-8859-1"
>
> hehe... don't we all love it when a problem "fixes" itself? I was missing a line in my type creation to realign the elements correctly:
>
> // Displacement is RELATIVE to its first structure element!
> for(i=2; i >= 0; i--) Displacement[i] -= Displacement[0];
>
> I'm attaching the functional code so that others can maybe use this one as an example ;)
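
A minimal sketch, for reference, of the pattern being discussed: a struct
datatype whose displacements are made relative to the first member, plus a
user-defined op handed to MPI_Allreduce. The struct layout and the reduction
logic below are illustrative only; this is not the attached AllReduceTest.c.

#include <mpi.h>

/* Illustrative payload; not the struct from the attachment. */
typedef struct {
    double value;
    int    rank;
    int    tag;
} Elem;

/* User-defined reduction: keep the element with the larger value. */
static void elem_max(void *invec, void *inoutvec, int *len,
                     MPI_Datatype *dtype)
{
    Elem *in = (Elem *)invec, *inout = (Elem *)inoutvec;
    for (int i = 0; i < *len; i++)
        if (in[i].value > inout[i].value)
            inout[i] = in[i];
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Describe Elem to MPI. */
    int          blocklens[3] = { 1, 1, 1 };
    MPI_Datatype types[3]     = { MPI_DOUBLE, MPI_INT, MPI_INT };
    MPI_Aint     disp[3];
    Elem sample;
    MPI_Get_address(&sample.value, &disp[0]);
    MPI_Get_address(&sample.rank,  &disp[1]);
    MPI_Get_address(&sample.tag,   &disp[2]);
    /* The fix above: make displacements relative to the first member. */
    for (int i = 2; i >= 0; i--)
        disp[i] -= disp[0];

    MPI_Datatype elem_type;
    MPI_Type_create_struct(3, blocklens, disp, types, &elem_type);
    MPI_Type_commit(&elem_type);

    MPI_Op op;
    MPI_Op_create(elem_max, 1 /* commutative */, &op);

    Elem local = { rank * 1.5, rank, 0 }, global;
    MPI_Allreduce(&local, &global, 1, elem_type, op, MPI_COMM_WORLD);

    MPI_Op_free(&op);
    MPI_Type_free(&elem_type);
    MPI_Finalize();
    return 0;
}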
>
> On Wednesday, April 4, 2007 at 11:47, Eric Thibodeau wrote:
> > Hello all,
> >
> > First off, please excuse the attached code as I may be naïve in my attempts to implement my own MPI_OP.
> >
> > I am attempting to create my own MPI_OP to use with MPI_Allreduce. I have been able to find very few examples on the net of creating MPI_OPs. My present references are "MPI: The Complete Reference, Volume 1, 2nd edition" and some rather good slides I found at http://www.mpi-hd.mpg.de/personalhomes/stiff/MPI/ . I am attaching my "proof of concept" code, which fails with:
> >
> > [kyron:14074] *** Process received signal ***
> > [kyron:14074] Signal: Segmentation fault (11)
> > [kyron:14074] Signal code: Address not mapped (1)
> > [kyron:14074] Failing at address: 0x801da600
> > [kyron:14074] [ 0] [0x6ffa6440]
> > [kyron:14074] [ 1] /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_recursivedoubling+0x700) [0x6fbb0dd0]
> > [kyron:14074] [ 2] /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0xb2) [0x6fbae9a2]
> > [kyron:14074] [ 3] /home/kyron/openmpi_i686/lib/libmpi.so.0(PMPI_Allreduce+0x1a6) [0x6ff61e86]
> > [kyron:14074] [ 4] AllReduceTest(main+0x180) [0x8048ee8]
> > [kyron:14074] [ 5] /lib/libc.so.6(__libc_start_main+0xe3) [0x6fcbd823]
> > [kyron:14074] *** End of error message ***
> >
> >
> > Eric Thibodeau
> >
>
> --
> Eric Thibodeau
> Neural Bucket Solutions Inc.
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: AllReduceTest.c
> Type: text/x-csrc
> Size: 3170 bytes
> Desc: not available
> Url : http://www.open-mpi.org/MailArchives/users/attachments/20070404/69383002/attachment.bin
>
> ------------------------------
>
> Message: 4
> Date: Wed, 4 Apr 2007 15:16:56 -0400
> From: Jeff Squyres <jsquyres_at_[hidden]>
> Subject: Re: [OMPI users] problem with MPI_Bcast over ethernet
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <ECA5445B-727D-4E68-9917-BF9FBF323DD8_at_[hidden]>
> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>
> There is nothing known in the current release (1.2) that would cause
> this. What version are you using?
>
> On Apr 2, 2007, at 4:34 PM, Jeff Stuart wrote:
>
> > For some reason, I am getting intermittent process crashes in
> > MPI_Bcast. I run my program, which distributes some data via lots
> > (thousands or more) of 64k MPI_Bcast calls. The program that is
> > crashing is fairly big, and it would take some time to whittle down a
> > small example program. I *am* willing to do this; I just wanted to
> > make sure there wasn't an already known problem about this first.
> >
> > thanks in advance,
> > -jeff
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> Cisco Systems
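
A minimal sketch of the chunked-broadcast pattern described in the quoted
message above: rank 0 pushing a large buffer out in 64k MPI_Bcast calls. The
buffer size and names are illustrative, not the poster's program.

#include <mpi.h>
#include <stdlib.h>

#define CHUNK (64 * 1024)   /* 64k per broadcast, as described above */

/* Broadcast 'total' bytes from rank 0 in CHUNK-sized MPI_Bcast calls. */
static void bcast_chunked(char *buf, long total, MPI_Comm comm)
{
    for (long off = 0; off < total; off += CHUNK) {
        int n = (total - off < CHUNK) ? (int)(total - off) : CHUNK;
        MPI_Bcast(buf + off, n, MPI_BYTE, 0, comm);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    long total = 256L * 1024 * 1024;   /* illustrative data size */
    char *buf = malloc(total);
    /* rank 0 would fill buf with real data here */
    bcast_chunked(buf, total, MPI_COMM_WORLD);
    free(buf);
    MPI_Finalize();
    return 0;
}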
>
>
>
> ------------------------------
>
> Message: 5
> Date: Wed, 4 Apr 2007 15:28:14 -0400
> From: Jeff Squyres <jsquyres_at_[hidden]>
> Subject: Re: [OMPI users] btl_tcp_endpoint errors
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <BC6C67E2-1172-4B00-83A5-F5C9C3E0FA88_at_[hidden]>
> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>
> On Apr 3, 2007, at 1:22 PM, Heywood, Todd wrote:
>
> > ssh: connect to host blade45 port 22: No route to host
> > [blade1:05832] ERROR: A daemon on node blade45 failed to start as
> > expected.
> > [blade1:05832] ERROR: There may be more information available from
> > [blade1:05832] ERROR: the remote shell (see above).
> > [blade1:05832] ERROR: The daemon exited unexpectedly with status 1.
> > [blade1:05832] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > ../../../../orte/mca/pls/base/pls_base_orted_cmds.c at line 188
> > [blade1:05832] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > ../../../../../orte/mca/pls/rsh/pls_rsh_module.c at line 1187
> >
> > I can understand this arising from an ssh bottleneck, with a
> > timeout. So, a
> > question to the OMPI folks: could the "no route to host" (113)
> > error in
> > btl_tcp_endpoint.c:572 also result from a timeout?
>
> I think it *could*, but it's really an OS-level question. OMPI is
> simply reporting what errno is giving us back from a failed TCP
> connect() API call.
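
For reference, this is roughly what "reporting errno from a failed connect()"
means at the C level; the address below is an illustrative TEST-NET address,
and EHOSTUNREACH (errno 113 on Linux) is the "No route to host" case seen
above.

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(1024);
    inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);  /* illustrative, unreachable */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        /* "No route to host" arrives here as EHOSTUNREACH (113 on Linux). */
        printf("connect() failed: errno=%d (%s)\n", errno, strerror(errno));
    }
    close(fd);
    return 0;
}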
>
> The timeout shown in the error message above is really an ORTE
> timeout, meaning that we waited for a daemon to start that didn't, so
> we timed out and gave up. It's on the "to do" list to recognize
> that an ssh start failed (or that any of the other starters failed --
> SLURM/srun failures behave similarly to ssh failures right now)
> faster than waiting for a timeout, but probably not until at least
> the 1.3 timeframe.
>
> --
> Jeff Squyres
> Cisco Systems
>
>
>
> ------------------------------
>
> Message: 6
> Date: Wed, 4 Apr 2007 17:39:57 -0400
> From: Jeff Squyres <jsquyres_at_[hidden]>
> Subject: Re: [OMPI users] problems with profile.d scripts generated
> using openmpi.spec
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <EE226B9A-FBDE-41EA-B9F5-71DDAB9FC312_at_[hidden]>
> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>
> On Apr 4, 2007, at 8:44 AM, Marcin Dulak wrote:
>
> > Thank you for your comments.
> > 1) I'm using
> > GNU bash, version 3.00.15(1)-release (i686-redhat-linux-gnu)
> > To see the problem with the original
> > eval "set %{configure_options}", I start the configure_options with
> > -- in buildrpm.sh, like this:
> > configure_options="--with-tm=/usr/local FC=pgf90 F77=pgf90 CC=pgcc
> > CXX=pgCC CFLAGS=-Msignextend CXXFLAGS=-Msignextend
> > --with-wrapper-cflags=-Msignextend --with-wrapper-cxxflags=-Msignextend
> > FFLAGS=-Msignextend FCFLAGS=-Msignextend
> > --with-wrapper-fflags=-Msignextend --with-wrapper-fcflags=-Msignextend"
> > Or, to see the problem directly, I go to the shell:
> > sh
> > set --w
> > sh: set: --: invalid option
> > set: usage: set [--abefhkmnptuvxBCHP] [-o option] [arg ...]
>
> (wow, my mail client really munged your formatting... :-\ )
>
> I see why I didn't run into this before. I did all my testing within
> the context of the OFED 1.2 installer, and we always pass in
> configure_options that start with a token that does not start with
> --. Hence, "set" knew to ignore the -- prefixed options.
>
> So it looks like a slightly less intrusive fix would actually be to
> use the following:
>
> eval "set -- %{configure_options}"
>
> > 2) if ("\$LD_LIBRARY_PATH" !~ *%{_libdir}*) then is the only
> > possibility which works for me. I'm using tcsh 6.13.00 (Astron)
> > 2004-05-19 (i386-intel-linux), options
> > 8b,nls,dl,al,kan,rh,color,dspm,filec. If I use "%{_libdir}", then
> > every time I source /opt/openmpi/1.2/bin/mpivars-1.2.csh a new
> > openmpi entry is prepended, so LD_LIBRARY_PATH keeps growing.
> > The same happens if I use "*%{_libdir}*"; it seems that with the
> > double quotes the shell uses literal matching despite the pattern
> > comparison requested by !~.
>
> I just went and read the man page on this (should have done this
> before): it says that the =~ and !~ operators are glob-style
> matching. So the * prefix and suffix is correct -- thanks for
> pointing that out.
>
> I was trying to use "" to protect multi-word strings, but I can't
> seem to find a syntax that works for multi-word strings on the right
> hand side. Oh well; there's probably other stuff in OMPI that will
> break if you use spaces in the prefix -- I'm ok with this for now.
>
> I'll fix up these in SVN.
>
> > 3) using setenv MANPATH %{_mandir}: (with the colon (:) included),
> > if I start from empty MANPATH
> >
> > unsetenv MANPATH
> >
> > and run
> > source /opt/openmpi/1.2/bin/mpivars-1.2.csh
> > I get
> > echo $MANPATH
> >
> > /opt/openmpi/1.2/man:
>
> Right.
>
> > I tried to google for something like
> > "also include the default MANPATH" but I cannot find anything. What is
> > the meaning of this colon at the end?
>
> I believe that I found this option long ago by trial and error in the
> OSCAR project. I just trolled through the man documentation right
> now and [still] can't find it documented anywhere. :-\
>
> The trailing : means "put all the options listed in man.conf here".
> If you don't do that, then the contents of MANPATH wholly replace
> what is listed in man.conf. For example (I'm a C shell kind of guy):
>
> # With no $MANPATH
> shell% man ls
> ...get ls man page...
>
> # Set MANPATH to a directory with no trailing :
> shell% setenv MANPATH /opt/intel/9.1/man
> shell% man icc
> ...get icc man page...
> shell% man ls
> No manual entry for ls
>
> # Set MANPATH to a directory with a trailing :
> shell% setenv MANPATH /opt/intel/9.1/man:
> shell% man icc
> ...get icc man page...
> shell% man ls
> ...get ls man page...
>
> Thanks for the bug reports and your persistence!
>
> --
> Jeff Squyres
> Cisco Systems
>
>
>
> ------------------------------
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> End of users Digest, Vol 550, Issue 5
> *************************************
>