Open MPI User's Mailing List Archives


From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2006-06-02 18:38:15


Troy and I talked about this off-list and resolved that the issue was
with the TCP setup on the nodes.

But it is worth noting that we had previously fixed a bug in Open MPI's
TCP code that caused the SEGVs Troy was seeing with 1.0.2; hence, when
he tested the 1.0.3 prerelease tarballs, there were no SEGVs.
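
As a side note for anyone hitting the errno=113 failures quoted below:
on Linux, errno 113 is EHOSTUNREACH ("No route to host"), which points
at routing or firewall problems between the nodes rather than at Open
MPI itself. The small standalone probe below (a sketch only; the peer
address and port are placeholders for a node and listening port on your
own cluster) does a plain connect() and prints the resulting errno, so
you can check whether the same failure shows up outside of MPI.

/* tcpprobe.c -- minimal TCP connect probe (sketch; peer/port are
 * placeholders).  Prints the errno from a plain connect(), e.g.
 * "No route to host" for errno 113 (EHOSTUNREACH) on Linux. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    const char *peer = (argc > 1) ? argv[1] : "192.168.0.2"; /* placeholder peer node */
    int port = (argc > 2) ? atoi(argv[2]) : 22;              /* placeholder listening port */
    struct sockaddr_in addr;
    int fd;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, peer, &addr.sin_addr) != 1) {
        fprintf(stderr, "bad address: %s\n", peer);
        return 1;
    }

    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
        fprintf(stderr, "connect() to %s:%d failed with errno=%d (%s)\n",
                peer, port, errno, strerror(errno));
        close(fd);
        return 1;
    }
    printf("connect() to %s:%d succeeded\n", peer, port);
    close(fd);
    return 0;
}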

> -----Original Message-----
> From: users-bounces_at_[hidden]
> [mailto:users-bounces_at_[hidden]] On Behalf Of Troy Telford
> Sent: Thursday, June 01, 2006 4:35 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Open MPI 1.0.2 and np >=64
>
> On Wed, 31 May 2006 20:17:33 -0600, Brian Barrett
> <brbarret_at_[hidden]>
> wrote:
>
> > Did you happen to have a chance to try to run the 1.0.3 or 1.1
> > nightly tarballs? I'm 50/50 on whether we've fixed these issues
> > already.
>
> For Ticket #41:
>
> Using Open MPI 1.0.3 and 1.1:
> For some reason, I can't seem to get TCP to work with any number of
> nodes >1 (which is odd, because I've had it working on *this* system
> before; MPICH works fine, so there's at least *something* right about
> the ethernet config/hardware)
>
> But I do get a different error with the snapshots vs. 1.0.2:
>
> *****Open MPI 1.0.2*****
> [root_at_zartan1 1.0.2]# mpirun -v -np 6 -prefix $MPIHOME -machinefile machines -mca btl tcp,sm,self laten -o 10
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x6
> [0] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib/libopal.so.0 [0x2ab8333408ca]
> [1] func:/lib64/libpthread.so.0 [0x2ab83394a380]
> [2] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so(mca_btl_tcp_proc_remove+0xbb) [0x2ab8364299ab]
> [3] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so [0x2ab836427bec]
> [4] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so(mca_btl_tcp_add_procs+0x155) [0x2ab836425445]
> *** End of error message ***
> [5] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x26b) [0x2ab835da72db]
> [6] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xcc) [0x2ab835b8bd5c]
> [7] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib/libmpi.so.0(ompi_mpi_init+0x590) [0x2ab8330b1c90]
> [8] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib/libmpi.so.0(MPI_Init+0x83) [0x2ab83309d2d3]
> [9] func:laten(main+0x6a) [0x4015f2]
> [10] func:/lib64/libc.so.6(__libc_start_main+0xdc) [0x2ab833a6f4cc]
> [11] func:laten [0x4014f9]
>
> *****Open MPI 1.0.3*****
> [root_at_zartan1 tmp]# mpirun -v -np 4 -prefix $MPIHOME -mca btl tcp,sm,self -machinefile machines laten -o 10
> MPI Bidirectional latency test (Send/Recv)
> Processes Max Latency (us)
> ------------------------------------------
> [0,1,3][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
> [0,1,2][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>
> *****Open MPI 1.1*****
> [root_at_zartan1 1.1]# mpirun -v -np 4 -prefix $MPIHOME -mca btl tcp -machinefile machines laten -o 10
> MPI Bidirectional latency test (Send/Recv)
> Processes Max Latency (us)
> ------------------------------------------
> [0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
> [0,1,3][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>
> If I use -np 2 (i.e., the job doesn't leave the node, it being a
> dual-cpu machine), it works fine.
> --
> Troy Telford
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
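
Finally, for anyone wanting a standalone reproducer: judging by its
output header, "laten" is a Send/Recv bidirectional latency benchmark.
The sketch below is only a rough stand-in for it (the message size,
iteration count, and output format are guesses), but running something
like it across two nodes with -np >= 4 exercises the same TCP BTL
connection path that the messages above show failing.

/* pingpong.c -- rough stand-in for a Send/Recv bidirectional latency
 * test (not the actual "laten" benchmark).  Even-numbered ranks pair
 * with the next odd rank; assumes at least two ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, peer, i;
    const int iters = 1000;
    char buf = 0;                        /* 1-byte payload: latency only */
    double t0, lat_us;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    peer = (rank % 2 == 0) ? rank + 1 : rank - 1;

    t0 = MPI_Wtime();
    if (peer < size) {                   /* last rank sits out if size is odd */
        for (i = 0; i < iters; i++) {
            if (rank % 2 == 0) {
                MPI_Send(&buf, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(&buf, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(&buf, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&buf, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            }
        }
    }
    lat_us = (MPI_Wtime() - t0) / (2.0 * iters) * 1.0e6;

    if (rank == 0) {
        printf("%d processes, approx. one-way latency: %.2f us\n",
               size, lat_us);
    }

    MPI_Finalize();
    return 0;
}

Compile with mpicc and launch with the same mpirun/-mca options used
above to force the TCP BTL.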