Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI job initializing problem
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2014-03-02 12:44:09


FWIW: /usr/include/infiniband/verbs.h is the normal location for verbs.h. Don't add --with-verbs=/usr/include/infiniband; it won't work.
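
To illustrate why pointing --with-verbs at the header directory itself does not help, here is a minimal Python sketch. It assumes (this is an assumption, not something confirmed in this thread) that configure treats the --with-verbs value as an installation prefix and looks for include/infiniband/verbs.h beneath it. The helper name verbs_header_path is hypothetical.

```python
from pathlib import PurePosixPath


def verbs_header_path(prefix: str) -> PurePosixPath:
    """Return where verbs.h would be expected under a given prefix.

    Assumption: the header is searched at <prefix>/include/infiniband/verbs.h.
    """
    return PurePosixPath(prefix) / "include" / "infiniband" / "verbs.h"


# Compare the usual prefix with the directory suggested earlier in the thread.
for prefix in ("/usr", "/usr/include/infiniband"):
    print(prefix, "->", verbs_header_path(prefix))
```

Under that assumption, a prefix of /usr resolves to the normal header location, while /usr/include/infiniband resolves to a path that would not normally exist.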

Please send all the information listed here and we can have a look at your logs:

    http://www.open-mpi.org/community/help/
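
On the interface selection discussed in the quoted messages below: the job script passes 10.148.0.0/16 to btl_tcp_if_include, and one earlier attempt excluded 192.168.0.0/24 and 127.0.0.1/8. As a rough check, the following Python sketch uses only the standard ipaddress module to show which of those CIDR blocks each address from the quoted ifconfig output falls into. The helper name in_subnet is hypothetical, and the sketch only does CIDR arithmetic; it does not claim to reproduce how Open MPI itself matches interfaces.

```python
import ipaddress


def in_subnet(addr: str, cidr: str) -> bool:
    """True if addr lies inside the CIDR block (host bits in cidr are ignored)."""
    return ipaddress.ip_address(addr) in ipaddress.ip_network(cidr, strict=False)


# Addresses taken from the ifconfig output quoted in this thread.
ADDRESSES = {"eth0": "192.168.159.205", "ib0": "10.148.0.114", "lo": "127.0.0.1"}
# CIDR blocks taken from the --mca settings quoted in this thread.
CIDRS = ["10.148.0.0/16", "192.168.0.0/24", "127.0.0.1/8"]

for name, addr in ADDRESSES.items():
    matches = [cidr for cidr in CIDRS if in_subnet(addr, cidr)]
    print(name, addr, matches)
```

By plain CIDR arithmetic, the ib0 address falls inside 10.148.0.0/16, while the eth0 address 192.168.159.205 is outside 192.168.0.0/24.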

On Mar 2, 2014, at 7:16 AM, Ralph Castain <rhc_at_[hidden]> wrote:

> It should have been looking in the same place - check to see where you installed the InfiniBand support. Is "verbs.h" under your /usr/include?
>
> In looking at the code, the 1.6 series searched for verbs.h in /usr/include/infiniband. The 1.7 series also does (though it doesn't look quite right to me), but it wouldn't hurt to add it yourself
>
> --with-verbs=/usr/include/infiniband --with-verbs-libdir=/usr/lib64/infiniband
>
> or something like that
>
>
> On Mar 1, 2014, at 11:56 PM, Beichuan Yan <beichuan.yan_at_[hidden]> wrote:
>
>> Ralph and Gus,
>>
>> 1. Thank you for your suggestion. I built Open MPI 1.6.5 with the following command:
>> ./configure --prefix=/work4/projects/openmpi/openmpi-1.6.5-gcc-compilers-4.7.3 --with-tm=/opt/pbs/default --with-openib= --with-openib-libdir=/usr/lib64
>>
>> In my job script, I need to specify the IB subnet like this:
>> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>> mpirun $TCP -np 64 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt
>>
>> Then my job can get initialized and run correctly each time!
>>
>> 2. However, when I try to build Open MPI 1.7.4 with a different command (in order to test/compare the shared-memory performance of Open MPI):
>> ./configure --prefix=/work4/projects/openmpi/openmpi-1.7.4-gcc-compilers-4.7.3 --with-tm=/opt/pbs/default --with-verbs= --with-verbs-libdir=/usr/lib64
>>
>> configure fails with the following error:
>> ============================================================================
>> == Modular Component Architecture (MCA) setup
>> ============================================================================
>> checking for subdir args... '--prefix=/work4/projects/openmpi/openmpi-1.7.4-gcc-compilers-4.7.3' '--with-tm=/opt/pbs/default' '--with-verbs=' '--with-verbs-libdir=/usr/lib64' 'CC=gcc' 'CXX=g++'
>> checking --with-verbs value... simple ok (unspecified)
>> checking --with-verbs-libdir value... sanity check ok (/usr/lib64)
>> configure: WARNING: Could not find verbs.h in the usual locations under
>> configure: error: Cannot continue
>>
>> Our system is Red Hat 6.4. Do we need to install additional InfiniBand packages? Can you please advise?
>>
>> Thanks,
>> Beichuan Yan
>>
>>
>> -----Original Message-----
>> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Gus Correa
>> Sent: Friday, February 28, 2014 15:59
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI job initializing problem
>>
>> Hi Beichuan
>>
>> To add to what Ralph said,
>> the RHEL OpenMPI package probably wasn't built with PBS Pro support either.
>> Besides, OMPI 1.5.4 (RHEL version) is old.
>>
>> **
>>
>> You will save yourself time and grief if you read the installation FAQs, before you install from the source tarball:
>>
>> http://www.open-mpi.org/faq/?category=building
>>
>> However, as Ralph said, that is your best bet, and it is quite easy to get right.
>>
>>
>> See this FAQ on how to build with PBS Pro support:
>>
>> http://www.open-mpi.org/faq/?category=building#build-rte-tm
>>
>> And this one on how to build with Infiniband support:
>>
>> http://www.open-mpi.org/faq/?category=building#build-p2p
>>
>> Here is how to select the installation directory (--prefix):
>>
>> http://www.open-mpi.org/faq/?category=building#easy-build
>>
>> Here is how to select the compilers (gcc,g++, and gfortran are fine):
>>
>> http://www.open-mpi.org/faq/?category=building#build-compilers
>>
>> I hope this helps,
>> Gus Correa
>>
>> On 02/28/2014 12:36 PM, Ralph Castain wrote:
>>> Almost certainly, the redhat package wasn't built with matching
>>> infiniband support and so we aren't picking it up. I'd suggest
>>> downloading the latest 1.7.4 or 1.7.5 nightly tarball, or even the
>>> latest 1.6 tarball if you want the stable release, and build it
>>> yourself so you *know* it was built for your system.
>>>
>>>
>>> On Feb 28, 2014, at 9:20 AM, Beichuan Yan <beichuan.yan_at_[hidden]> wrote:
>>>
>>>> Hi there,
>>>> I am running jobs on clusters with an InfiniBand interconnect. They
>>>> installed OpenMPI v1.5.4 via the Red Hat 6 yum package. My problem is
>>>> that although my jobs get queued and started by PBS Pro quickly,
>>>> most of the time they don't actually run (only occasionally do they)
>>>> and they give error output like this (even though plenty of CPU/IB
>>>> resources are available):
>>>>
>>>> [r2i6n7][[25564,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>> connect() to 192.168.159.156 failed: Connection refused (111)
>>>>
>>>> And even when a job does start and run well, it prints this warning:
>>>>
>>>> --------------------------------------------------------------------------
>>>> WARNING: There was an error initializing an OpenFabrics device.
>>>> Local host: r1i2n6
>>>> Local device: mlx4_0
>>>> --------------------------------------------------------------------------
>>>>
>>>> 1. Here is the info from one of the compute nodes:
>>>> -bash-4.1$ /sbin/ifconfig
>>>> eth0      Link encap:Ethernet  HWaddr 8C:89:A5:E3:D2:96
>>>>           inet addr:192.168.159.205  Bcast:192.168.159.255  Mask:255.255.255.0
>>>>           inet6 addr: fe80::8e89:a5ff:fee3:d296/64 Scope:Link
>>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>           RX packets:48879864 errors:0 dropped:0 overruns:17 frame:0
>>>>           TX packets:39286060 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:1000
>>>>           RX bytes:54771093645 (51.0 GiB)  TX bytes:37512462596 (34.9 GiB)
>>>>           Memory:dfc00000-dfc20000
>>>>
>>>> Ifconfig uses the ioctl access method to get the full address
>>>> information, which limits hardware addresses to 8 bytes.
>>>> Because Infiniband address has 20 bytes, only the first 8 bytes are
>>>> displayed correctly.
>>>> Ifconfig is obsolete! For replacement check ip.
>>>> ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:C0:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>>>           inet addr:10.148.0.114  Bcast:10.148.255.255  Mask:255.255.0.0
>>>>           inet6 addr: fe80::202:c903:fb:3489/64 Scope:Link
>>>>           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>>>>           RX packets:43807414 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:10534050 errors:0 dropped:24 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:256
>>>>           RX bytes:47824448125 (44.5 GiB)  TX bytes:44764010514 (41.6 GiB)
>>>>
>>>> lo        Link encap:Local Loopback
>>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>>           inet6 addr: ::1/128 Scope:Host
>>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>>           RX packets:17292 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:17292 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:0
>>>>           RX bytes:1492453 (1.4 MiB)  TX bytes:1492453 (1.4 MiB)
>>>>
>>>> -bash-4.1$ chkconfig --list iptables
>>>> iptables        0:off   1:off   2:on    3:on    4:on    5:on    6:off
>>>>
>>>> 2. I tried various parameters below but none of them can assure
>>>> my jobs get initialized and run:
>>>> #TCP="--mca btl ^tcp"
>>>> #TCP="--mca btl self,openib"
>>>> #TCP="--mca btl_tcp_if_exclude lo"
>>>> #TCP="--mca btl_tcp_if_include eth0"
>>>> #TCP="--mca btl_tcp_if_include eth0, ib0"
>>>> #TCP="--mca btl_tcp_if_exclude 192.168.0.0/24,127.0.0.1/8 --mca oob_tcp_if_exclude 192.168.0.0/24,127.0.0.1/8"
>>>> #TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>>>> mpirun $TCP -hostfile $PBS_NODEFILE -np 8 ./paraEllip3d input.txt
>>>>
>>>> 3. Then I turned to Intel MPI, which surprisingly starts and runs my job
>>>> correctly each time (though it is a little slower than OpenMPI, maybe
>>>> 15% slower, but it works each time).
>>>> Can you please advise? Many thanks.
>>>> Sincerely,
>>>> Beichuan Yan
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/