Open MPI User's Mailing List Archives

From: Brian Barrett (bbarrett_at_[hidden])
Date: 2007-02-05 10:58:04


This is very odd. The two error messages you are seeing are side
effects of the real problem, which is that Open MPI is segfaulting
when built with the Intel compiler. We've had some problems with
bugs in various versions of the Intel compiler -- just to be on the
safe side, can you make sure that the machine has the latest bug
fixes from Intel applied? From there, if possible, it would be
extremely useful to have a stack trace from a core file, or even to
know whether it's mpirun or one of our "orte daemons" that are
segfaulting. If you can get a core file, you should be able to
figure out which process is causing the segfault.
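
Something along these lines should get you that information (just a rough
sketch: I'm assuming a bash environment, that your batch system lets the
job raise its core file limit, and the path to mpirun below is only a guess
based on the library path in your output, so adjust as needed):

----------------------------------------------------
# in the job script, before mpiexec: allow core files to be written
ulimit -c unlimited
mpiexec -n 1 uname_test.intel

# after the job dies, see which executable actually dumped core
file core*

# then pull a backtrace out of the core file, e.g.:
gdb /usr/local/openmpi/1.1.2/intel/i386/bin/mpirun core.<pid>
(gdb) bt
----------------------------------------------------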

Brian

On Feb 2, 2007, at 4:07 PM, Dennis McRitchie wrote:

> When I submit a simple job (described below) using PBS, I always get
> one of the following two errors:
> 1) [adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking:
> recv() failed with errno=104
>
> 2) [adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect:
> connection failed (errno=111) - retrying (pid=3770)
>
> The program does a uname and prints out results to standard out. The
> only MPI calls it makes are MPI_Init, MPI_Comm_size, MPI_Comm_rank,
> and MPI_Finalize. I have tried it with both openmpi v 1.1.2 and 1.1.4,
> built with Intel C compiler 9.1.045, and get the same results. But if
> I build the same versions of openmpi using gcc, the test program
> always works fine. The app itself is built with mpicc.
>
> It runs successfully if run from the command line with "mpiexec -n X
> <test-program-name>", where X is 1 to 8, but if I wrap it in the
> following qsub command file:
> ---------------------------------------------------
> #PBS -l pmem=512mb,nodes=1:ppn=1,walltime=0:10:00
> #PBS -m abe
> # #PBS -o /home0/dmcr/my_mpi/curt/uname_test.gcc.stdout
> # #PBS -e /home0/dmcr/my_mpi/curt/uname_test.gcc.stderr
>
> cd /home/dmcr/my_mpi/openmpi
> echo "About to call mpiexec"
> module list
> mpiexec -n 1 uname_test.intel
> echo "After call to mpiexec"
> ----------------------------------------------------
>
> it fails on any number of processors from 1 to 8, and the application
> segfaults.
>
> The complete standard error of an 8-processor job follows (note that
> mpiexec ran on adroit-31, but usually there is no info about adroit-31
> in standard error):
> -------------------------
> Currently Loaded Modulefiles:
> 1) intel/9.1/32/C/9.1.045 4) intel/9.1/32/default
> 2) intel/9.1/32/Fortran/9.1.040 5) openmpi/intel/1.1.2/32
> 3) intel/9.1/32/Iidb/9.1.045
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x5
> [0] func:/usr/local/openmpi/1.1.4/intel/i386/lib/libopal.so.0
> [0xb72c5b]
> *** End of error message ***
> ^@[adroit-29:03934] [0,0,2]-[0,0,0] mca_oob_tcp_peer_recv_blocking:
> recv() failed with errno=104
> [adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking:
> recv() failed with errno=104
> [adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect:
> connection failed (errno=111) - retrying (pid=3770)
> --------------------------
>
> The complete standard error of a 1-processor job follows:
> --------------------------
> Currently Loaded Modulefiles:
> 1) intel/9.1/32/C/9.1.045 4) intel/9.1/32/default
> 2) intel/9.1/32/Fortran/9.1.040 5) openmpi/intel/1.1.2/32
> 3) intel/9.1/32/Iidb/9.1.045
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x2
> [0] func:/usr/local/openmpi/1.1.2/intel/i386/lib/libopal.so.0
> [0x27d847]
> *** End of error message ***
> ^@[adroit-31:08840] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect:
> connection failed (errno=111) - retrying (pid=8840)
> ---------------------------
>
> Any thoughts as to why this might be failing?
>
> Thanks,
> Dennis
>
> Dennis McRitchie
> Computational Science and Engineering Support (CSES)
> Academic Services Department
> Office of Information Technology
> Princeton University
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
   Brian Barrett
   Open MPI Team, CCS-1
   Los Alamos National Laboratory