Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI job initializing problem
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-03-02 10:16:09


It should have been looking in the same place - check to see where you installed the InfiniBand support. Is "verbs.h" under your /usr/include?

Looking at the code, the 1.6 series searches for verbs.h in /usr/include/infiniband. The 1.7 series does too (though it doesn't look quite right to me), but it wouldn't hurt to add the path yourself:

--with-verbs=/usr/include/infiniband --with-verbs-libdir=/usr/lib64/infiniband

or something like that
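
For example, a quick way to check (assuming a RHEL/CentOS-style system; package names may differ on other distros):

  ls /usr/include/infiniband/verbs.h          # is the header there at all?
  rpm -qf /usr/include/infiniband/verbs.h     # which package owns it
  yum list installed | grep libibverbs        # is the -devel package installed?

If the header is missing, installing libibverbs-devel (e.g. "yum install libibverbs-devel") should put verbs.h under /usr/include/infiniband.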

On Mar 1, 2014, at 11:56 PM, Beichuan Yan <beichuan.yan_at_[hidden]> wrote:

> Ralph and Gus,
>
> 1. Thank you for your suggestion. I built Open MPI 1.6.5 with the following command:
> ./configure --prefix=/work4/projects/openmpi/openmpi-1.6.5-gcc-compilers-4.7.3 --with-tm=/opt/pbs/default --with-openib= --with-openib-libdir=/usr/lib64
>
> In my job script, I need to specify the IB subnet like this:
> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
> mpirun $TCP -np 64 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt
>
> Then my job gets initialized and runs correctly each time!
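>
> For reference, the relevant part of my PBS script looks roughly like this
> (the PBS directives and paths are simplified placeholders, not the exact script):
>
> #!/bin/bash
> #PBS -l select=8:ncpus=8:mpiprocs=8
> #PBS -l walltime=02:00:00
>
> # placeholder request: 8 chunks x 8 MPI ranks per chunk = 64 ranks
> cd $PBS_O_WORKDIR
> # pick up the Open MPI 1.6.5 build described above
> export PATH=/work4/projects/openmpi/openmpi-1.6.5-gcc-compilers-4.7.3/bin:$PATH
> export LD_LIBRARY_PATH=/work4/projects/openmpi/openmpi-1.6.5-gcc-compilers-4.7.3/lib:$LD_LIBRARY_PATH
> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
> mpirun $TCP -np 64 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt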
>
> 2. However, when I build Open MPI 1.7.4 with this other command (in order to test/compare the shared-memory performance of Open MPI):
> ./configure --prefix=/work4/projects/openmpi/openmpi-1.7.4-gcc-compilers-4.7.3 --with-tm=/opt/pbs/default --with-verbs= --with-verbs-libdir=/usr/lib64
>
> it fails with the following error:
> ============================================================================
> == Modular Component Architecture (MCA) setup
> ============================================================================
> checking for subdir args... '--prefix=/work4/projects/openmpi/openmpi-1.7.4-gcc-compilers-4.7.3' '--with-tm=/opt/pbs/default' '--with-verbs=' '--with-verbs-libdir=/usr/lib64' 'CC=gcc' 'CXX=g++'
> checking --with-verbs value... simple ok (unspecified)
> checking --with-verbs-libdir value... sanity check ok (/usr/lib64)
> configure: WARNING: Could not find verbs.h in the usual locations under
> configure: error: Cannot continue
>
> Our system is Red Hat 6.4. Do we need to install additional InfiniBand packages? Can you please advise?
>
> Thanks,
> Beichuan Yan
>
>
> -----Original Message-----
> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Gus Correa
> Sent: Friday, February 28, 2014 15:59
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI job initializing problem
>
> Hi Beichuan
>
> To add to what Ralph said,
> the RHEL OpenMPI package probably wasn't built with PBS Pro support either.
> Besides, OMPI 1.5.4 (RHEL version) is old.
>
> **
>
> You will save yourself time and grief if you read the installation FAQs, before you install from the source tarball:
>
> http://www.open-mpi.org/faq/?category=building
>
> However, as Ralph said, that is your best bet, and it is quite easy to get right.
>
>
> See this FAQ on how to build with PBS Pro support:
>
> http://www.open-mpi.org/faq/?category=building#build-rte-tm
>
> And this one on how to build with Infiniband support:
>
> http://www.open-mpi.org/faq/?category=building#build-p2p
>
> Here is how to select the installation directory (--prefix):
>
> http://www.open-mpi.org/faq/?category=building#easy-build
>
> Here is how to select the compilers (gcc, g++, and gfortran are fine):
>
> http://www.open-mpi.org/faq/?category=building#build-compilers
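>
> Putting those pieces together, a configure line along the lines of the one
> below covers PBS Pro (via the TM interface) and InfiniBand support; the
> install prefix is just a placeholder, and --with-tm should point at your
> PBS Pro installation directory:
>
> ./configure --prefix=/path/to/openmpi-1.6.5 \
>             --with-tm=/opt/pbs/default \
>             --with-openib \
>             CC=gcc CXX=g++ F77=gfortran FC=gfortran
> make -j 4 all
> make install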
>
> I hope this helps,
> Gus Correa
>
> On 02/28/2014 12:36 PM, Ralph Castain wrote:
>> Almost certainly, the Red Hat package wasn't built with matching
>> InfiniBand support and so we aren't picking it up. I'd suggest
>> downloading the latest 1.7.4 or 1.7.5 nightly tarball, or even the
>> latest 1.6 tarball if you want the stable release, and build it
>> yourself so you *know* it was built for your system.
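>>
>> For example, the build itself is just the usual sequence (the version
>> number and install prefix below are only illustrative):
>>
>> tar xjf openmpi-1.6.5.tar.bz2     # tarball downloaded from www.open-mpi.org
>> cd openmpi-1.6.5
>> ./configure --prefix=$HOME/openmpi-1.6.5
>> make -j 4 all
>> make install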
>>
>>
>> On Feb 28, 2014, at 9:20 AM, Beichuan Yan <beichuan.yan_at_[hidden]
>> <mailto:beichuan.yan_at_[hidden]>> wrote:
>>
>>> Hi there,
>>> I am running jobs on clusters with an InfiniBand interconnect. They
>>> installed OpenMPI v1.5.4 via the RHEL 6 yum package. My problem is
>>> that although my jobs get queued and started by PBS Pro quickly,
>>> most of the time they don't actually run (only occasionally do they
>>> run) and give error info like this (even though plenty of CPU/IB
>>> resources are available):
>>> [r2i6n7][[25564,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>> connect() to 192.168.159.156 failed: Connection refused (111)
>>>
>>> And even when a job does start and run well, it prints this warning:
>>> --------------------------------------------------------------------------
>>> WARNING: There was an error initializing an OpenFabrics device.
>>> Local host: r1i2n6
>>> Local device: mlx4_0
>>> --------------------------------------------------------------------------
>>>
>>> 1. Here is the info from one of the compute nodes:
>>> -bash-4.1$ /sbin/ifconfig
>>> eth0      Link encap:Ethernet  HWaddr 8C:89:A5:E3:D2:96
>>>           inet addr:192.168.159.205  Bcast:192.168.159.255  Mask:255.255.255.0
>>>           inet6 addr: fe80::8e89:a5ff:fee3:d296/64 Scope:Link
>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>           RX packets:48879864 errors:0 dropped:0 overruns:17 frame:0
>>>           TX packets:39286060 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:1000
>>>           RX bytes:54771093645 (51.0 GiB)  TX bytes:37512462596 (34.9 GiB)
>>>           Memory:dfc00000-dfc20000
>>>
>>> Ifconfig uses the ioctl access method to get the full address
>>> information, which limits hardware addresses to 8 bytes.
>>> Because Infiniband address has 20 bytes, only the first 8 bytes are
>>> displayed correctly.
>>> Ifconfig is obsolete! For replacement check ip.
>>>
>>> ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:C0:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>>           inet addr:10.148.0.114  Bcast:10.148.255.255  Mask:255.255.0.0
>>>           inet6 addr: fe80::202:c903:fb:3489/64 Scope:Link
>>>           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>>>           RX packets:43807414 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:10534050 errors:0 dropped:24 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:256
>>>           RX bytes:47824448125 (44.5 GiB)  TX bytes:44764010514 (41.6 GiB)
>>>
>>> lo        Link encap:Local Loopback
>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>           inet6 addr: ::1/128 Scope:Host
>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>           RX packets:17292 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:17292 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:0
>>>           RX bytes:1492453 (1.4 MiB)  TX bytes:1492453 (1.4 MiB)
>>>
>>> -bash-4.1$ chkconfig --list iptables
>>> iptables       0:off  1:off  2:on  3:on  4:on  5:on  6:off
>>>
>>> 2. I tried various parameters below, but none of them ensures that my
>>> jobs initialize and run:
>>> #TCP="--mca btl ^tcp"
>>> #TCP="--mca btl self,openib"
>>> #TCP="--mca btl_tcp_if_exclude lo"
>>> #TCP="--mca btl_tcp_if_include eth0"
>>> #TCP="--mca btl_tcp_if_include eth0, ib0"
>>> #TCP="--mca btl_tcp_if_exclude 192.168.0.0/24,127.0.0.1/8 --mca
>>> oob_tcp_if_exclude 192.168.0.0/24,127.0.0.1/8"
>>> #TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>>> mpirun $TCP -hostfile $PBS_NODEFILE -np 8 ./paraEllip3d input.txt
>>>
>>> 3. Then I turned to Intel MPI, which surprisingly starts and runs my
>>> job correctly each time (though it is a little slower than OpenMPI,
>>> maybe 15% slower, but it works each time).
>>> Can you please advise? Many thanks.
>>> Sincerely,
>>> Beichuan Yan
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden] <mailto:users_at_[hidden]>
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users