Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] seg faults with IB and RH ibverbs-1.1.1-9
From: Andrew J Caird (acaird_at_[hidden])
Date: 2008-08-27 10:48:57


As I mentioned it might be, it was a local issue.

The lesson is that one should be very careful about OFED versions and
cleanliness. :)

--andy

On Mon, 25 Aug 2008, Andrew J Caird wrote:

> Hello all,
>
> We recently applied the latest RedHat update (/etc/redhat-release says
> "Red Hat Enterprise Linux WS release 4 (Nahant Update 7)") to our cluster,
> and now codes that use IB seg fault.
>
> We have tried multiple versions of OpenMPI with both the PGI and GNU
> compilers, and we have compiled both with and without
> --memory-manager=none. None of that seems to matter.
>
> When we copy mca_btl_openib.la and mca_btl_openib.so from a version of
> OpenMPI compiled before the update into $OMPI_HOME/lib/openmpi/,
> everything works fine - no seg faults. To me this suggests something in
> the relationship between those two files and libibverbs, although I'm at a
> loss as to what that might be. Note that the old version of libibverbs is
> gone from the system, but the new version seems to imply it has both
> IBVerbs 1.0 and 1.1. That's just an assumption on my part based on
> looking at "strings /usr/lib64/libibverbs.so.1.0.0 | grep IBVER" and
> seeing IBVERBS_1.0 and IBVERBS_1.1 in the output.
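
The check described above can be wrapped in a small helper for comparing nodes or library copies. This is only a sketch: the function name ibverbs_versions is made up for illustration, and it uses "grep -a -o" in place of "strings | grep" so that only the version tags are printed. Seeing both tags shows that the library carries both symbol-version names; it does not by itself prove that a component built against the old library will work with it.

```shell
# Hypothetical helper: list the IBVERBS_* symbol-version tags found in a
# shared library, one per line, without duplicates.
ibverbs_versions() {
    # -a treats the binary as text; -o prints only the matching part.
    LC_ALL=C grep -a -o 'IBVERBS_[0-9][0-9.]*' "$1" | sort -u
}

# Example (path taken from the message above):
#   ibverbs_versions /usr/lib64/libibverbs.so.1.0.0
```

Running it against the library on a working node and on a failing node is a quick way to confirm that both have the same libibverbs installed.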
>
> The RPM Red Hat provides for ibverbs is libibverbs-1.1.1-9.el4, and the
> openib RPM is openib-1.3-5.el4.
>
> The fairly uninformative seg fault looks like:
> [me@node421 ~]$ mpirun -np 5 ./cpi127
> [node422:28808] *** Process received signal ***
> [node421:29922] *** Process received signal ***
> [node421:29922] Signal: Segmentation fault (11)
> [node421:29922] Signal code: Address not mapped (1)
> [node421:29922] Failing at address: (nil)
> [node422:28808] Signal: Segmentation fault (11)
> [node422:28808] Signal code: Address not mapped (1)
> [node422:28808] Failing at address: (nil)
> [node422:28808] *** End of error message ***
> [node421:29922] *** End of error message ***
> [node421.engin.umich.edu:29917] [0,0,0]-[0,1,2] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> mpirun noticed that job rank 0 with PID 29919 on node node421 exited on signal 15 (Terminated).
> 4 additional processes aborted (not shown)
>
> Running that same code over Ethernet ("-mca btl ^openib") works fine.
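
The with/without comparison above can be scripted so it is easy to repeat after each library change. A minimal sketch follows, assuming the standard Open MPI convention that an MCA parameter can be set through an environment variable named OMPI_MCA_<param>; the helper name btl_ab_test is hypothetical.

```shell
# Hypothetical helper: run the same command twice, first with Open MPI's
# default BTL selection and then with the openib BTL excluded.
btl_ab_test() {
    echo "== default BTL selection =="
    ( unset OMPI_MCA_btl; "$@" ); echo "exit status: $?"
    echo "== openib excluded =="
    # Same effect as passing "-mca btl ^openib" on the mpirun command line.
    ( OMPI_MCA_btl='^openib'; export OMPI_MCA_btl; "$@" ); echo "exit status: $?"
}

# Example (command taken from the message above):
#   btl_ab_test mpirun -np 5 ./cpi127
```

If only the first run fails, the problem is confined to the InfiniBand path rather than the application or the rest of the Open MPI installation.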
>
>
> The configure line for OpenMPI looks roughly like:
> ./configure --prefix=/home/software/rhel4/openmpi-1.2.7rc5/pgi-7.2 --with-tm=/usr/local/torque --with-openib=/usr CC=pgcc CXX=pgCC FC=pgf90 F77=pgf90
>
> sometimes I added: --memory-manager=none
>
> We're running the embedded subnet manager in our Topspin TS120 switch (but
> I don't think that's the problem, since codes with the old libraries do
> work fine).
>
> Has anyone else seen any oddness with RH Update 7, libibverbs 1.1.1 and
> OpenMPI, or are we looking at the wrong things?
>
> config.log and ompi_info output are in the attached zip file.
>
> Unfortunately, it's very possible that it's something local to our
> installation, but if we had confirmation that this works for someone else,
> it would greatly narrow our search space.
>
> Thanks for any insights.
>
> --andy