Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-03-04 14:24:36


On Mar 3, 2006, at 9:07 AM, Jose Pedro Garcia Mahedero wrote:

> Cluster master machine:
> eth0, mpihosts_out --> for outside use (gets its own IP via DHCP)
> eth1, mpihosts_cluster --> for cluster use (serves IPs to the
> cluster nodes)
>
> ------- TESTS 1,2 - openmpi-1.0.2a9 -------
>
> 1.- cd openmpi-1.0.1
> 2.- make clean
> 3.- cd openmpi-1.0.2a9
> 4.- ./configure
> 5.- make all install
>
> no parameters set in /usr/local/etc/openmpi-mca-params.conf
> mpirun -np 2 --hostfile mpihosts_cluster ping_pong
> mpirun -np 2 --hostfile mpihosts_out ping_pong
>
> Both give the same results:
>
> Signal:11 info.si_errno:0 (Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x6
> *** End of error message ***
> [0] func:/usr/local/lib/libopal.so.0 [0x40101cb2]
> [1] func:[0xffffe440]
> [2] func:/usr/local/lib/openmpi/mca_btl_tcp.so [0x404541d6]
> [3] func:/usr/local/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_add_procs
> +0x149) [0x404502f9]

Yoinks -- whatever we do, we should not be seg faulting. :-( It is
apparently dying in the mca_btl_tcp_add_procs() function, which is
where we're creating MPI networking mappings from one TCP peer to
another.

I am unable to reproduce this error (I tried it on a cluster with a
setup similar to yours). Can you recompile the TCP BTL with
debugging symbols so that we can get a little more information?

Do the following:

cd top_of_your_open_mpi_source_tree
cd ompi/mca/btl/tcp
make CFLAGS=-g clean all install
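
As an optional sanity check that the rebuilt plugin really got
installed, you can look at the timestamp on the installed component --
the path below is the one from your backtrace:

ls -l /usr/local/lib/openmpi/mca_btl_tcp.so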

Then run the test again (you shouldn't need to recompile your
application; the above just recompiles and re-installs the TCP BTL
plugin). The mca_btl_tcp frames in the stack trace should now
include line numbers and tell us exactly where it is dying. If you
get a corefile, can you load it up in gdb and send the output of
"bt full"?

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/