
Open MPI User's Mailing List Archives


From: Gurhan Ozen (gurhan.ozen_at_[hidden])
Date: 2007-02-04 15:10:06


On 2/2/07, Dennis McRitchie <dmcr_at_[hidden]> wrote:
> When I submit a simple job (described below) using PBS, I always get one
> of the following two errors:
> 1) [adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking:
> recv() failed with errno=104
>
> 2) [adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect:
> connection failed (errno=111) - retrying (pid=3770)
>

Hi Dennis,

It looks like you could be blocked by a firewall. Can you make sure the
firewalls are disabled on both nodes and try again?
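
For example, something like the following on each compute node (just a
rough sketch, assuming a Red Hat-style system with iptables; adjust to
whatever firewall your distribution actually uses):

  # run as root on each compute node
  /sbin/service iptables status    # check whether the firewall is running
  /sbin/iptables -L -n             # list the rules currently loaded
  /sbin/service iptables stop      # temporarily disable it for a test run
  /sbin/service iptables start     # turn it back on after testing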

gurhan

> The program does a uname and prints the results to standard out. The
> only MPI calls it makes are MPI_Init, MPI_Comm_size, MPI_Comm_rank, and
> MPI_Finalize. I have tried it with both Open MPI v1.1.2 and v1.1.4, built
> with the Intel C compiler 9.1.045, and get the same results. But if I
> build the same versions of Open MPI using gcc, the test program always
> works fine. The app itself is built with mpicc.
>
> It runs successfully if run from the command line with "mpiexec -n X
> <test-program-name>", where X is 1 to 8, but if I wrap it in the
> following qsub command file:
> ---------------------------------------------------
> #PBS -l pmem=512mb,nodes=1:ppn=1,walltime=0:10:00
> #PBS -m abe
> # #PBS -o /home0/dmcr/my_mpi/curt/uname_test.gcc.stdout
> # #PBS -e /home0/dmcr/my_mpi/curt/uname_test.gcc.stderr
>
> cd /home/dmcr/my_mpi/openmpi
> echo "About to call mpiexec"
> module list
> mpiexec -n 1 uname_test.intel
> echo "After call to mpiexec"
> ----------------------------------------------------
>
> it fails on any number of processors from 1 to 8, and the application
> segfaults.
>
> The complete standard error of an 8-processor job follows (note that
> mpiexec ran on adroit-31, but usually there is no info about adroit-31
> in standard error):
> -------------------------
> Currently Loaded Modulefiles:
> 1) intel/9.1/32/C/9.1.045 4) intel/9.1/32/default
> 2) intel/9.1/32/Fortran/9.1.040 5) openmpi/intel/1.1.2/32
> 3) intel/9.1/32/Iidb/9.1.045
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x5
> [0] func:/usr/local/openmpi/1.1.4/intel/i386/lib/libopal.so.0 [0xb72c5b]
> *** End of error message ***
> ^@[adroit-29:03934] [0,0,2]-[0,0,0] mca_oob_tcp_peer_recv_blocking:
> recv() failed with errno=104
> [adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv()
> failed with errno=104
> [adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect:
> connection failed (errno=111) - retrying (pid=3770)
> --------------------------
>
> The complete standard error of a 1-processor job follows:
> --------------------------
> Currently Loaded Modulefiles:
> 1) intel/9.1/32/C/9.1.045 4) intel/9.1/32/default
> 2) intel/9.1/32/Fortran/9.1.040 5) openmpi/intel/1.1.2/32
> 3) intel/9.1/32/Iidb/9.1.045
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x2
> [0] func:/usr/local/openmpi/1.1.2/intel/i386/lib/libopal.so.0 [0x27d847]
> *** End of error message ***
> ^@[adroit-31:08840] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect:
> connection failed (errno=111) - retrying (pid=8840)
> ---------------------------
>
> Any thoughts as to why this might be failing?
>
> Thanks,
> Dennis
>
> Dennis McRitchie
> Computational Science and Engineering Support (CSES)
> Academic Services Department
> Office of Information Technology
> Princeton University
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>