Open MPI User's Mailing List Archives

From: Jim Kusznir (jkusznir_at_[hidden])
Date: 2007-10-10 11:09:03


Hi:

I've added:
btl = ^openib
to /etc/openmpi-mca-params.conf on the head node, but this doesn't
seem to help. Does this need to be pushed out to all the compute
nodes as well?

The program is known to work on other clusters. I finally figured out
what was happening, though: Open MPI was compiled without torque/PBS
support (Red Hat/CentOS .rpm), so it was launching with a single
process on the node it was started on. When MPI_Send() was called, it
had nothing to send to, and crashed. Once I manually set the -np
value, it started working.
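
Signal 8 with si_code FPE_INTDIV in the trace below is an integer
division fault, which would fit a single-process run if sendSeeds
divides the work by the number of worker ranks. The source of sendSeeds
is not shown in this thread, so the following C sketch is only an
assumption about that kind of failure. seeds_per_worker, total_seeds and
world_size are hypothetical names, with world_size standing in for the
value MPI_Comm_size() would return.

```c
/* Hypothetical helper, not taken from repdig_mpi.
 * Assumes rank 0 is the master, so the number of workers is
 * world_size - 1. With a single process that is 0, and an
 * unguarded division would raise SIGFPE (FPE_INTDIV).
 * Returns -1 instead of dividing when there are no workers. */
int seeds_per_worker(int total_seeds, int world_size)
{
    int workers = world_size - 1;

    if (workers < 1) {
        /* No rank to send to: let the caller report the problem. */
        return -1;
    }
    return total_seeds / workers;
}
```

A caller could check for the -1 return right after MPI_Comm_size() and
exit with a readable message, rather than failing later inside the send
routine.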

--Jim

On 10/9/07, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> If you do not have IB hardware, you might want to permanently disable
> the IB support. You can do this by setting an MCA parameter or
> simply removing the $prefix/lib/openmpi/mca_btl_openib.* files. This
> will suppress the warning that you're seeing.
>
> As for your problem with MPI_SEND, do you know that your program is
> correct? I.e., it's a little odd that you're failing directly in
> sendSeeds, not in an MPI function. Are you getting a core dump that
> you can examine, or can you attach a debugger to see where exactly it
> is failing?
>
>
> On Oct 4, 2007, at 8:36 PM, Jim Kusznir wrote:
>
> > Hi all:
> >
> > I'm having trouble getting torque/maui working with OpenMPI.
> >
> > Currently, I am getting hard failures when an MPI_Send is called. When
> > run without qsub (no torque/maui), the MPI job runs fine, so it's
> > something that qsub/torque/maui is doing (I think). Here's the error:
> >
> > libibverbs: Fatal: couldn't open sysfs class 'infiniband_verbs'.
> > --------------------------------------------------------------------------
> > [0,1,0]: OpenIB on host localhost was unable to find any HCAs.
> > Another transport will be used instead, although this may result in
> > lower performance.
> > --------------------------------------------------------------------------
> > Signal:8 info.si_errno:0(Success) si_code:1(FPE_INTDIV)
> > Failing at addr:0x40cc2d
> > [0] func:/usr/lib64/openmpi/libopal.so.0 [0x3ecfb21dc5]
> > [1] func:/lib64/tls/libpthread.so.0 [0x3ed040c4f0]
> > [2] func:repdig_mpi(sendSeeds+0x3d) [0x40cc2d]
> > [3] func:repdig_mpi(main+0x3b6) [0x40c026]
> > [4] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3ecfd1c3fb]
> > [5] func:repdig_mpi [0x4030ea]
> > *** End of error message ***
> >
> > I don't really know where to begin looking. I know in the stack trace
> > the actual problem is occurring in #2 (sendSeeds), but that is a basic
> > MPI_Send(), and works when not using torque.
> >
> > My system (installed from Rocks 4.3) does not have infiniband; I think
> > I just figured out how to disable it; in any case, the same warning
> > shows up when not running it through torque, and the job runs
> > successfully.
> >
> > Thoughts/suggestions?
> >
> > Thanks!
> > --Jim
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> Cisco Systems
>
>