Open MPI User's Mailing List Archives

From: Gurhan Ozen (gurhan.ozen_at_[hidden])
Date: 2007-02-04 15:10:06


On 2/2/07, Dennis McRitchie <dmcr_at_[hidden]> wrote:
> When I submit a simple job (described below) using PBS, I always get one
> of the following two errors:
> 1) [adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking:
> recv() failed with errno=104
>
> 2) [adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect:
> connection failed (errno=111) - retrying (pid=3770)
>

Hi Dennis,
It looks like you could be blocked by a firewall. Can you make sure that
firewalls are disabled on the nodes and try again?
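
For example (a sketch, assuming the nodes run an iptables-based firewall,
which is the stock setup on Red Hat-style systems; adjust to whatever your
cluster actually uses), on each of the adroit nodes:

  /sbin/service iptables status   # is a packet filter active at all?
  /sbin/iptables -L -n            # list the current rules
  /sbin/service iptables stop     # turn it off for a test run only, then re-enable it

For what it's worth, errno=104 is ECONNRESET and errno=111 is ECONNREFUSED,
which is what you would expect if something between the nodes is resetting or
rejecting the out-of-band TCP connections that Open MPI sets up between them.
If the job runs with the firewall down, allowing traffic between the cluster
nodes is the longer-term fix.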

gurhan

> The program does a uname and prints out results to standard out. The
> only MPI calls it makes are MPI_Init, MPI_Comm_size, MPI_Comm_rank, and
> MPI_Finalize. I have tried it with both openmpi v 1.1.2 and 1.1.4, built
> with Intel C compiler 9.1.045, and get the same results. But if I build
> the same versions of openmpi using gcc, the test program always works
> fine. The app itself is built with mpicc.
>
> It runs successfully if run from the command line with "mpiexec -n X
> <test-program-name>", where X is 1 to 8, but if I wrap it in the
> following qsub command file:
> ---------------------------------------------------
> #PBS -l pmem=512mb,nodes=1:ppn=1,walltime=0:10:00
> #PBS -m abe
> # #PBS -o /home0/dmcr/my_mpi/curt/uname_test.gcc.stdout
> # #PBS -e /home0/dmcr/my_mpi/curt/uname_test.gcc.stderr
>
> cd /home/dmcr/my_mpi/openmpi
> echo "About to call mpiexec"
> module list
> mpiexec -n 1 uname_test.intel
> echo "After call to mpiexec"
> ----------------------------------------------------
>
> it fails on any number of processors from 1 to 8, and the application
> segfaults.
>
> The complete standard error of an 8-processor job follows (note that
> mpiexec ran on adroit-31, but usually there is no info about adroit-31
> in standard error):
> -------------------------
> Currently Loaded Modulefiles:
> 1) intel/9.1/32/C/9.1.045 4) intel/9.1/32/default
> 2) intel/9.1/32/Fortran/9.1.040 5) openmpi/intel/1.1.2/32
> 3) intel/9.1/32/Iidb/9.1.045
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x5
> [0] func:/usr/local/openmpi/1.1.4/intel/i386/lib/libopal.so.0 [0xb72c5b]
> *** End of error message ***
> ^@[adroit-29:03934] [0,0,2]-[0,0,0] mca_oob_tcp_peer_recv_blocking:
> recv() failed with errno=104
> [adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv()
> failed with errno=104
> [adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect:
> connection failed (errno=111) - retrying (pid=3770)
> --------------------------
>
> The complete standard error of a 1-processor job follows:
> --------------------------
> Currently Loaded Modulefiles:
> 1) intel/9.1/32/C/9.1.045 4) intel/9.1/32/default
> 2) intel/9.1/32/Fortran/9.1.040 5) openmpi/intel/1.1.2/32
> 3) intel/9.1/32/Iidb/9.1.045
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x2
> [0] func:/usr/local/openmpi/1.1.2/intel/i386/lib/libopal.so.0 [0x27d847]
> *** End of error message ***
> ^@[adroit-31:08840] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect:
> connection failed (errno=111) - retrying (pid=8840)
> ---------------------------
>
> Any thoughts as to why this might be failing?
>
> Thanks,
> Dennis
>
> Dennis McRitchie
> Computational Science and Engineering Support (CSES)
> Academic Services Department
> Office of Information Technology
> Princeton University
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>