
Open MPI User's Mailing List Archives


From: Dennis McRitchie (dmcr_at_[hidden])
Date: 2007-02-05 10:48:48


Thanks for the suggestion. I should have mentioned that there is no
firewall set up on any of the compute nodes. The only firewall
restrictions are on the head node's eth interface to the outside world.
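
For reference, a minimal sketch of what the two errno values in the
errors quoted below mean. It assumes the nodes run Linux, where errno
104 is ECONNRESET and 111 is ECONNREFUSED; describe_errno is a
hypothetical helper written for illustration, not part of Open MPI or
PBS.

```shell
#!/bin/sh
# Hypothetical helper: map the errno values seen in the quoted errors
# to their Linux names. The numbers are Linux-specific assumptions and
# may differ on other operating systems.
describe_errno() {
    case "$1" in
        104) echo "ECONNRESET: Connection reset by peer" ;;
        111) echo "ECONNREFUSED: Connection refused" ;;
        *)   echo "errno $1: not covered by this sketch" ;;
    esac
}

# The two values reported by mca_oob_tcp_peer_recv_blocking and
# mca_oob_tcp_peer_complete_connect in the quoted output.
describe_errno 104
describe_errno 111
```

A refused connection (111) means the target host answered but nothing
accepted the connection on that port, and a reset (104) means the peer
closed the connection abruptly. Both are what one would expect to see
if the process on the other end had already died, although a firewall
configured to reject connections could produce a refusal as well.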

Also, as I mentioned, I was able to build and run this test app
successfully using a gcc-built openmpi with a gcc-built application.
The failure only occurs when running an Intel-compiler-built openmpi
with an Intel-compiler-built application.

Any other thoughts?

Dennis

> -----Original Message-----
> From: Gurhan Ozen [mailto:gurhan.ozen_at_[hidden]]
> Sent: Sunday, February 04, 2007 3:10 PM
> To: Open MPI Users; Dennis McRitchie
> Subject: Re: [OMPI users] Can't run simple job with openmpi using the Intel compiler
>
> On 2/2/07, Dennis McRitchie <dmcr_at_[hidden]> wrote:
> > When I submit a simple job (described below) using PBS, I always get
> > one of the following two errors:
> > 1) [adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv() failed with errno=104
> >
> > 2) [adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=111) - retrying (pid=3770)
> >
>
> Hi Dennis,
> It looks like you could be blocked by a firewall. Can you make sure
> that firewalls are disabled on both nodes and try again?
>
> gurhan
>
> > The program does a uname and prints out results to standard out. The
> > only MPI calls it makes are MPI_Init, MPI_Comm_size, MPI_Comm_rank,
> > and MPI_Finalize. I have tried it with both openmpi v 1.1.2 and 1.1.4,
> > built with Intel C compiler 9.1.045, and get the same results. But if
> > I build the same versions of openmpi using gcc, the test program
> > always works fine. The app itself is built with mpicc.
> >
> > It runs successfully if run from the command line with "mpiexec -n X
> > <test-program-name>", where X is 1 to 8, but if I wrap it in the
> > following qsub command file:
> > ---------------------------------------------------
> > #PBS -l pmem=512mb,nodes=1:ppn=1,walltime=0:10:00
> > #PBS -m abe
> > # #PBS -o /home0/dmcr/my_mpi/curt/uname_test.gcc.stdout
> > # #PBS -e /home0/dmcr/my_mpi/curt/uname_test.gcc.stderr
> >
> > cd /home/dmcr/my_mpi/openmpi
> > echo "About to call mpiexec"
> > module list
> > mpiexec -n 1 uname_test.intel
> > echo "After call to mpiexec"
> > ----------------------------------------------------
> >
> > it fails on any number of processors from 1 to 8, and the application
> > segfaults.
> >
> > The complete standard error of an 8-processor job follows (note that
> > mpiexec ran on adroit-31, but usually there is no info about adroit-31
> > in standard error):
> > -------------------------
> > Currently Loaded Modulefiles:
> > 1) intel/9.1/32/C/9.1.045 4) intel/9.1/32/default
> > 2) intel/9.1/32/Fortran/9.1.040 5) openmpi/intel/1.1.2/32
> > 3) intel/9.1/32/Iidb/9.1.045
> > Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> > Failing at addr:0x5
> > [0] func:/usr/local/openmpi/1.1.4/intel/i386/lib/libopal.so.0 [0xb72c5b]
> > *** End of error message ***
> > [adroit-29:03934] [0,0,2]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv() failed with errno=104
> > [adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv() failed with errno=104
> > [adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=111) - retrying (pid=3770)
> > --------------------------
> >
> > The complete standard error of a 1-processor job follows:
> > --------------------------
> > Currently Loaded Modulefiles:
> > 1) intel/9.1/32/C/9.1.045 4) intel/9.1/32/default
> > 2) intel/9.1/32/Fortran/9.1.040 5) openmpi/intel/1.1.2/32
> > 3) intel/9.1/32/Iidb/9.1.045
> > Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> > Failing at addr:0x2
> > [0] func:/usr/local/openmpi/1.1.2/intel/i386/lib/libopal.so.0 [0x27d847]
> > *** End of error message ***
> > [adroit-31:08840] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=111) - retrying (pid=8840)
> > ---------------------------
> >
> > Any thoughts as to why this might be failing?
> >
> > Thanks,
> > Dennis
> >
> > Dennis McRitchie
> > Computational Science and Engineering Support (CSES)
> > Academic Services Department
> > Office of Information Technology
> > Princeton University