Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-04-04 17:48:25


Can you send the information listed on this web page:

     http://www.open-mpi.org/community/help/

On Apr 4, 2007, at 5:44 PM, Jeff Stuart wrote:

> [stuart_at_cortex ~]$ mpirun -V
> mpirun (Open MPI) 1.2
>
>
> On 4/4/07, users-request_at_[hidden] <users-request_at_[hidden]>
> wrote:
>> Send users mailing list submissions to
>> users_at_[hidden]
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> or, via email, send a message with subject or body 'help' to
>> users-request_at_[hidden]
>>
>> You can reach the person managing the list at
>> users-owner_at_[hidden]
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of users digest..."
>>
>>
>> Today's Topics:
>>
>> 1. Re: "Address not mapped" error on user defined MPI_OP
>> function (Eric Thibodeau)
>> 2. MPI 1.2 stuck in pthread_condition_wait ( hpetit_at_[hidden] )
>> 3. Re: "Address not mapped" error on user defined MPI_OP
>> function (Eric Thibodeau)
>> 4. Re: problem with MPI_Bcast over ethernet (Jeff Squyres)
>> 5. Re: btl_tcp_endpoint errors (Jeff Squyres)
>> 6. Re: problems with profile.d scripts generated using
>> openmpi.spec (Jeff Squyres)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Wed, 4 Apr 2007 12:31:46 -0400
>> From: Eric Thibodeau <kyron_at_[hidden]>
>> Subject: Re: [OMPI users] "Address not mapped" error on user defined
>> MPI_OP function
>> To: users_at_[hidden]
>> Message-ID: <200704041231.46356.kyron_at_[hidden]>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> I completely forgot to mention which version of Open MPI I am
>> using; I'll gladly post additional info if required:
>>
>> kyron_at_kyron ~/openmpi-1.2 $ ompi_info |head
>> Open MPI: 1.2
>> Open MPI SVN revision: r14027
>> Open RTE: 1.2
>> Open RTE SVN revision: r14027
>> OPAL: 1.2
>> OPAL SVN revision: r14027
>> Prefix: /home/kyron/openmpi_i686
>> Configured architecture: i686-pc-linux-gnu
>> Configured by: kyron
>> Configured on: Wed Apr 4 10:21:34 EDT 2007
>>
>> On Apr 4, 2007, at 11:47 AM, Eric Thibodeau wrote:
>>> Hello all,
>>>
>>> First off, please excuse the attached code as I may be naïve
>>> in my attempts to implement my own MPI_OP.
>>>
>>> I am attempting to create my own MPI_OP to use with
>>> MPI_Allreduce. I have been able to find very few examples on the
>>> net of creating MPI_OPs. My present references are "MPI: The
>>> Complete Reference, Volume 1, 2nd edition" and some rather good
>>> slides I found at http://www.mpi-hd.mpg.de/personalhomes/stiff/MPI/ .
>>> I am attaching my "proof of concept" code, which fails with:
>>>
>>> [kyron:14074] *** Process received signal ***
>>> [kyron:14074] Signal: Segmentation fault (11)
>>> [kyron:14074] Signal code: Address not mapped (1)
>>> [kyron:14074] Failing at address: 0x801da600
>>> [kyron:14074] [ 0] [0x6ffa6440]
>>> [kyron:14074] [ 1] /home/kyron/openmpi_i686/lib/openmpi/
>>> mca_coll_tuned.so
>>> (ompi_coll_tuned_allreduce_intra_recursivedoubling+0x700)
>>> [0x6fbb0dd0]
>>> [kyron:14074] [ 2] /home/kyron/openmpi_i686/lib/openmpi/
>>> mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0xb2)
>>> [0x6fbae9a2]
>>> [kyron:14074] [ 3] /home/kyron/openmpi_i686/lib/libmpi.so.0
>>> (PMPI_Allreduce+0x1a6) [0x6ff61e86]
>>> [kyron:14074] [ 4] AllReduceTest(main+0x180) [0x8048ee8]
>>> [kyron:14074] [ 5] /lib/libc.so.6(__libc_start_main+0xe3)
>>> [0x6fcbd823]
>>> [kyron:14074] *** End of error message ***
>>>
>>>
>>> Eric Thibodeau
>> -------------- next part --------------
>> HTML attachment scrubbed and removed
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Wed, 4 Apr 2007 18:50:38 +0200
>> From: " hpetit_at_[hidden] " <hpetit_at_[hidden]>
>> Subject: [OMPI users] MPI 1.2 stuck in pthread_condition_wait
>> To: " users " <users_at_[hidden]>
>> Message-ID: <JFZG4E$41584250C17E66D5AFE2EAFA16558974_at_[hidden]>
>> Content-Type: text/plain; charset=iso-8859-1
>>
>> Hi,
>>
>> I have a problem with MPI 1.2.0rc getting stuck in a
>> "pthread_condition_wait" call.
>> This happens with any application when Open MPI has been
>> compiled with multi-thread support.
>>
>> The full "configure" options are
>> "./configure --prefix=/usr/local/Mpi/openmpi-1.2 --enable-mpi-threads
>> --enable-progress-threads --with-threads=posix --enable-smp-lock"
>>
>> An example GDB session is provided below:
>>
>> ---------------------------------------------------------------------
>> ----------------------------------------
>>> GNU gdb 6.3-debian
>>> Copyright 2004 Free Software Foundation, Inc.
>>> GDB is free software, covered by the GNU General Public License, and
>>> you are welcome to change it and/or distribute copies of it under
>>> certain
>>> conditions.
>>> Type "show copying" to see the conditions.
>>> There is absolutely no warranty for GDB. Type "show warranty" for
>>> details.
>>> This GDB was configured as "i386-linux"...Using host libthread_db
>>> library "/lib/tls/libthread_db.so.1".
>>>
>>> (gdb) run -np 1 spawn6
>>> Starting program: /usr/local/openmpi-1.2.0/bin/mpirun -np 1 spawn6
>>> [Thread debugging using libthread_db enabled]
>>> [New Thread 1076191360 (LWP 29006)]
>>> [New Thread 1084808112 (LWP 29009)]
>>> main*******************************
>>> main : Lancement MPI*
>>>
>>> Program received signal SIGINT, Interrupt.
>>> [Switching to Thread 1084808112 (LWP 29009)]
>>> 0x401f0523 in poll () from /lib/tls/libc.so.6
>>> (gdb) where
>>> #0 0x401f0523 in poll () from /lib/tls/libc.so.6
>>> #1 0x40081c7c in opal_poll_dispatch () from
>>> /usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
>>> #2 0x4007e4f1 in opal_event_base_loop () from
>>> /usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
>>> #3 0x4007e36b in opal_event_loop () from
>>> /usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
>>> #4 0x4007f423 in opal_event_run () from
>>> /usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
>>> #5 0x40115b63 in start_thread () from /lib/tls/libpthread.so.0
>>> #6 0x401f918a in clone () from /lib/tls/libc.so.6
>>> (gdb) bt
>>> #0 0x401f0523 in poll () from /lib/tls/libc.so.6
>>> #1 0x40081c7c in opal_poll_dispatch () from
>>> /usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
>>> #2 0x4007e4f1 in opal_event_base_loop () from
>>> /usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
>>> #3 0x4007e36b in opal_event_loop () from
>>> /usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
>>> #4 0x4007f423 in opal_event_run () from
>>> /usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
>>> #5 0x40115b63 in start_thread () from /lib/tls/libpthread.so.0
>>> #6 0x401f918a in clone () from /lib/tls/libc.so.6
>>> (gdb) info threads
>>> * 2 Thread 1084808112 (LWP 29009) 0x401f0523 in poll () from
>>> /lib/tls/libc.so.6
>>> 1 Thread 1076191360 (LWP 29006) 0x40118295 in
>>> pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
>>> (gdb) thread 1
>>> [Switching to thread 1 (Thread 1076191360 (LWP 29006))]#0 0x40118295
>>> in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
>>> (gdb) bt
>>> #0 0x40118295 in pthread_cond_wait@@GLIBC_2.3.2 () from
>>> /lib/tls/libpthread.so.0
>>> #1 0x0804cb68 in opal_condition_wait (c=0x8050e4c, m=0x8050e28) at
>>> condition.h:64
>>> #2 0x0804a4fe in orterun (argc=4, argv=0xbffff844) at orterun.c:436
>>> #3 0x0804a046 in main (argc=4, argv=0xbffff844) at main.c:13
>>> (gdb) where
>>> #0 0x40118295 in pthread_cond_wait@@GLIBC_2.3.2 () from
>>> /lib/tls/libpthread.so.0
>>> #1 0x0804cb68 in opal_condition_wait (c=0x8050e4c, m=0x8050e28) at
>>> condition.h:64
>>> #2 0x0804a4fe in orterun (argc=4, argv=0xbffff844) at orterun.c:436
>>> #3 0x0804a046 in main (argc=4, argv=0xbffff844) at main.c:13
>>
>> ---------------------------------------------------------------------
>> ----------------------------------------
>>
>> I have read the other threads related to multi-thread support, and I
>> understand that multi-thread support will not be a priority before
>> the end of the year.
>>
>> The thing is, this locking problem has only appeared since the 1.1.2
>> Open MPI release, and as it is a locking problem, I was wondering if
>> you could make an exception and try to analyse this one before the
>> end of the year.
>>
>> Thanks,
>>
>> Herve
>>
>> P.S.: my OS is Debian sarge.
>>
>>
>>
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Wed, 4 Apr 2007 13:32:15 -0400
>> From: Eric Thibodeau <kyron_at_[hidden]>
>> Subject: Re: [OMPI users] "Address not mapped" error on user defined
>> MPI_OP function
>> To: users_at_[hidden]
>> Message-ID: <200704041332.15575.kyron_at_[hidden]>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> Hehe... don't we all love it when a problem "fixes" itself? I was
>> missing a line in my type creation to realign the elements
>> correctly:
>>
>> // Displacements are RELATIVE to the first structure element!
>> for(i=2; i >= 0; i--) Displacement[i] -= Displacement[0];
>>
>> I'm attaching the functional code so that others can maybe see
>> this one as an example ;)
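>>
>> In outline, the working version looks roughly like this (a sketch
>> only -- the struct and field names are made up for illustration,
>> they are not the ones in the attached AllReduceTest.c):
>>
>> #include <mpi.h>
>>
>> typedef struct { double a, b, c; } Triple;
>>
>> /* Signature required by MPI_Op_create: element-wise max of the fields. */
>> static void triple_max(void *invec, void *inoutvec, int *len,
>>                        MPI_Datatype *dtype)
>> {
>>     Triple *in = (Triple *)invec, *io = (Triple *)inoutvec;
>>     for (int i = 0; i < *len; i++) {
>>         if (in[i].a > io[i].a) io[i].a = in[i].a;
>>         if (in[i].b > io[i].b) io[i].b = in[i].b;
>>         if (in[i].c > io[i].c) io[i].c = in[i].c;
>>     }
>> }
>>
>> int main(int argc, char **argv)
>> {
>>     int blocklen[3] = { 1, 1, 1 };
>>     MPI_Aint disp[3];
>>     MPI_Datatype types[3] = { MPI_DOUBLE, MPI_DOUBLE, MPI_DOUBLE };
>>     MPI_Datatype triple_type;
>>     MPI_Op max_op;
>>     Triple local = { 1.0, 2.0, 3.0 }, global;
>>
>>     MPI_Init(&argc, &argv);
>>
>>     MPI_Get_address(&local.a, &disp[0]);
>>     MPI_Get_address(&local.b, &disp[1]);
>>     MPI_Get_address(&local.c, &disp[2]);
>>     /* The fix above: make each displacement relative to the first
>>        element instead of leaving it as an absolute address. */
>>     for (int i = 2; i >= 0; i--) disp[i] -= disp[0];
>>
>>     MPI_Type_create_struct(3, blocklen, disp, types, &triple_type);
>>     MPI_Type_commit(&triple_type);
>>     MPI_Op_create(triple_max, 1 /* commutative */, &max_op);
>>
>>     MPI_Allreduce(&local, &global, 1, triple_type, max_op,
>>                   MPI_COMM_WORLD);
>>
>>     MPI_Op_free(&max_op);
>>     MPI_Type_free(&triple_type);
>>     MPI_Finalize();
>>     return 0;
>> }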
>>
>> On Apr 4, 2007, at 11:47 AM, Eric Thibodeau wrote:
>>> Hello all,
>>>
>>> First off, please excuse the attached code as I may be naïve
>>> in my attempts to implement my own MPI_OP.
>>>
>>> I am attempting to create my own MPI_OP to use with
>>> MPI_Allreduce. I have been able to find very few examples on the
>>> net of creating MPI_OPs. My present references are "MPI: The
>>> Complete Reference, Volume 1, 2nd edition" and some rather good
>>> slides I found at http://www.mpi-hd.mpg.de/personalhomes/stiff/MPI/ .
>>> I am attaching my "proof of concept" code, which fails with:
>>>
>>> [kyron:14074] *** Process received signal ***
>>> [kyron:14074] Signal: Segmentation fault (11)
>>> [kyron:14074] Signal code: Address not mapped (1)
>>> [kyron:14074] Failing at address: 0x801da600
>>> [kyron:14074] [ 0] [0x6ffa6440]
>>> [kyron:14074] [ 1] /home/kyron/openmpi_i686/lib/openmpi/
>>> mca_coll_tuned.so
>>> (ompi_coll_tuned_allreduce_intra_recursivedoubling+0x700)
>>> [0x6fbb0dd0]
>>> [kyron:14074] [ 2] /home/kyron/openmpi_i686/lib/openmpi/
>>> mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0xb2)
>>> [0x6fbae9a2]
>>> [kyron:14074] [ 3] /home/kyron/openmpi_i686/lib/libmpi.so.0
>>> (PMPI_Allreduce+0x1a6) [0x6ff61e86]
>>> [kyron:14074] [ 4] AllReduceTest(main+0x180) [0x8048ee8]
>>> [kyron:14074] [ 5] /lib/libc.so.6(__libc_start_main+0xe3)
>>> [0x6fcbd823]
>>> [kyron:14074] *** End of error message ***
>>>
>>>
>>> Eric Thibodeau
>>>
>>
>> --
>> Eric Thibodeau
>> Neural Bucket Solutions Inc.
>> -------------- next part --------------
>> A non-text attachment was scrubbed...
>> Name: AllReduceTest.c
>> Type: text/x-csrc
>> Size: 3170 bytes
>> Desc: not available
>> Url : http://www.open-mpi.org/MailArchives/users/attachments/20070404/69383002/attachment.bin
>>
>> ------------------------------
>>
>> Message: 4
>> Date: Wed, 4 Apr 2007 15:16:56 -0400
>> From: Jeff Squyres <jsquyres_at_[hidden]>
>> Subject: Re: [OMPI users] problem with MPI_Bcast over ethernet
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID: <ECA5445B-727D-4E68-9917-BF9FBF323DD8_at_[hidden]>
>> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>>
>> There is nothing known in the current release (1.2) that would cause
>> this. What version are you using?
>>
>> On Apr 2, 2007, at 4:34 PM, Jeff Stuart wrote:
>>
>>> For some reason, I am getting intermittent process crashes in
>>> MPI_Bcast. I run my program, which distributes some data via lots
>>> (thousands or more) of 64k MPI_Bcast calls. The program that is
>>> crashing is fairly big, and it would take some time to whittle down
>>> a small example program. I *am* willing to do this; I just wanted to
>>> make sure there wasn't an already known problem about this first.
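>>>
>>> For what it's worth, the pattern is roughly just this (a sketch --
>>> the buffer size, chunk count, and root are placeholders, not my
>>> real code):
>>>
>>> #include <mpi.h>
>>> #include <stdlib.h>
>>> #include <string.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     const int chunk = 64 * 1024;   /* ~64k per broadcast */
>>>     const int nchunks = 10000;     /* "thousands or more" */
>>>     char *buf;
>>>
>>>     MPI_Init(&argc, &argv);
>>>     buf = malloc(chunk);
>>>     memset(buf, 0, chunk);
>>>     /* Distribute the data as many small broadcasts from rank 0. */
>>>     for (int i = 0; i < nchunks; i++)
>>>         MPI_Bcast(buf, chunk, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>     free(buf);
>>>     MPI_Finalize();
>>>     return 0;
>>> }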
>>>
>>> thanks in advance,
>>> -jeff
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>>
>>
>> ------------------------------
>>
>> Message: 5
>> Date: Wed, 4 Apr 2007 15:28:14 -0400
>> From: Jeff Squyres <jsquyres_at_[hidden]>
>> Subject: Re: [OMPI users] btl_tcp_endpoint errors
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID: <BC6C67E2-1172-4B00-83A5-F5C9C3E0FA88_at_[hidden]>
>> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>>
>> On Apr 3, 2007, at 1:22 PM, Heywood, Todd wrote:
>>
>>> ssh: connect to host blade45 port 22: No route to host
>>> [blade1:05832] ERROR: A daemon on node blade45 failed to start as
>>> expected.
>>> [blade1:05832] ERROR: There may be more information available from
>>> [blade1:05832] ERROR: the remote shell (see above).
>>> [blade1:05832] ERROR: The daemon exited unexpectedly with status 1.
>>> [blade1:05832] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>> ../../../../orte/mca/pls/base/pls_base_orted_cmds.c at line 188
>>> [blade1:05832] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>> ../../../../../orte/mca/pls/rsh/pls_rsh_module.c at line 1187
>>>
>>> I can understand this arising from an ssh bottleneck, with a
>>> timeout. So, a
>>> question to the OMPI folks: could the "no route to host" (113)
>>> error in
>>> btl_tcp_endpoint.c:572 also result from a timeout?
>>
>> I think it *could*, but it's really an OS-level question. OMPI is
>> simply reporting what errno is giving us back from a failed TCP
>> connect() API call.
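>>
>> In other words, roughly this (a sketch only, not the actual
>> btl_tcp_endpoint.c code):
>>
>> #include <stdio.h>
>> #include <string.h>
>> #include <errno.h>
>> #include <unistd.h>
>> #include <sys/socket.h>
>> #include <netinet/in.h>
>> #include <arpa/inet.h>
>>
>> /* Try a TCP connect and report whatever errno the OS hands back;
>>    on Linux, "No route to host" is EHOSTUNREACH (113). */
>> static int try_connect(const char *ip, int port)
>> {
>>     struct sockaddr_in sa;
>>     int fd = socket(AF_INET, SOCK_STREAM, 0);
>>
>>     memset(&sa, 0, sizeof(sa));
>>     sa.sin_family = AF_INET;
>>     sa.sin_port = htons(port);
>>     inet_pton(AF_INET, ip, &sa.sin_addr);
>>
>>     if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
>>         fprintf(stderr, "connect() failed: errno %d (%s)\n",
>>                 errno, strerror(errno));
>>         close(fd);
>>         return -1;
>>     }
>>     return fd;
>> }
>>
>> int main(void)
>> {
>>     /* 192.0.2.1 is a documentation/test address used here only as a
>>        placeholder for an unreachable host. */
>>     return try_connect("192.0.2.1", 22) < 0 ? 1 : 0;
>> }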
>>
>> The timeout shown in the error message above is really an ORTE
>> timeout, meaning that we waited for a daemon to start that didn't, so
>> we timed out and gave up. It's on the "to do" list to recognize that
>> an ssh launch failed (or any of the other starters failed --
>> SLURM/srun failures behave similarly to ssh failures right now)
>> faster than waiting for a timeout, but that probably won't happen
>> until at least the 1.3 timeframe.
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>>
>>
>> ------------------------------
>>
>> Message: 6
>> Date: Wed, 4 Apr 2007 17:39:57 -0400
>> From: Jeff Squyres <jsquyres_at_[hidden]>
>> Subject: Re: [OMPI users] problems with profile.d scripts generated
>> using openmpi.spec
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID: <EE226B9A-FBDE-41EA-B9F5-71DDAB9FC312_at_[hidden]>
>> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>>
>> On Apr 4, 2007, at 8:44 AM, Marcin Dulak wrote:
>>
>>> Thank you for the comments.
>>>
>>> 1) I'm using GNU bash, version 3.00.15(1)-release
>>> (i686-redhat-linux-gnu). To see the problem with the original
>>> eval "set %{configure_options}", I start configure_options with
>>> -- in buildrpm.sh, like this:
>>>
>>> configure_options="--with-tm=/usr/local FC=pgf90 F77=pgf90 CC=pgcc
>>> CXX=pgCC CFLAGS=-Msignextend CXXFLAGS=-Msignextend
>>> --with-wrapper-cflags=-Msignextend --with-wrapper-cxxflags=-Msignextend
>>> FFLAGS=-Msignextend FCFLAGS=-Msignextend
>>> --with-wrapper-fflags=-Msignextend --with-wrapper-fcflags=-Msignextend"
>>>
>>> Or, to see the problem directly, I go to the shell:
>>>
>>> sh
>>> set --w
>>> sh: set: --: invalid option
>>> set: usage: set [--abefhkmnptuvxBCHP] [-o option] [arg ...]
>>
>> (wow, my mail client really munged your formatting... :-\ )
>>
>> I see why I didn't run into this before. I did all my testing within
>> the context of the OFED 1.2 installer, and we always pass in
>> configure_options that start with a token that does not start with
>> --. Hence, "set" knew to ignore the -- prefixed options.
>>
>> So it looks like a slightly less intrusive fix would actually be to
>> use the following:
>>
>> eval "set -- %{configure_options}"
>>
>>> 2) if ("\$LD_LIBRARY_PATH" !~ *%{_libdir}*) then is the only
>>> possibility which works for me. I'm using tcsh 6.13.00 (Astron)
>>> 2004-05-19 (i386-intel-linux), options
>>> 8b,nls,dl,al,kan,rh,color,dspm,filec. If I use "%{_libdir}", then
>>> every time I source /opt/openmpi/1.2/bin/mpivars-1.2.csh a new
>>> openmpi entry is prepended, so the LD_LIBRARY_PATH keeps growing.
>>> The same happens if I use "*%{_libdir}*"; it seems that with the
>>> double quotes the shell uses literal matching despite the pattern
>>> comparison requested by !~.
>>
>> I just went and read the man page on this (should have done this
>> before): it says that the =~ and !~ operators do glob-style
>> matching. So the * prefix and suffix are correct -- thanks for
>> pointing that out.
>>
>> I was trying to use "" to protect multi-word strings, but I can't
>> seem to find a syntax that works for multi-word strings on the right
>> hand side. Oh well; there's probably other stuff in OMPI that will
>> break if you use spaces in the prefix -- I'm ok with this for now.
>>
>> I'll fix up these in SVN.
>>
>>> 3) Using setenv MANPATH %{_mandir}: (with the colon (:) included):
>>> if I start from an empty MANPATH
>>>
>>> unsetenv MANPATH
>>>
>>> and run
>>> source /opt/openmpi/1.2/bin/mpivars-1.2.csh
>>> I get
>>> echo $MANPATH
>>>
>>> /opt/openmpi/1.2/man:
>>
>> Right.
>>
>>> I tried to google for something like "also include the default
>>> MANPATH" but I cannot find anything. What is the meaning of this
>>> colon at the end?
>>
>> I believe that I found this option long ago by trial and error in the
>> OSCAR project. I just trolled through the man documentation right
>> now and [still] can't find it documented anywhere. :-\
>>
>> The trailing : means "put all the paths listed in man.conf here".
>> If you don't do that, then the contents of MANPATH wholly replace
>> what is listed in man.conf. For example (I'm a C shell kind of guy):
>>
>> # With no $MANPATH
>> shell% man ls
>> ...get ls man page...
>>
>> # Set MANPATH to a directory with no trailing :
>> shell% setenv MANPATH /opt/intel/9.1/man
>> shell% man icc
>> ...get icc man page...
>> shell% man ls
>> No manual entry for ls
>>
>> # Set MANPATH to a directory with a trailing :
>> shell% setenv MANPATH /opt/intel/9.1/man:
>> shell% man icc
>> ...get icc man page...
>> shell% man ls
>> ...get ls man page...
>>
>> Thanks for the bug reports and your persistence!
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>>
>>
>> ------------------------------
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> End of users Digest, Vol 550, Issue 5
>> *************************************
>>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems