Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI job initializing problem
From: Gus Correa (gus_at_[hidden])
Date: 2014-02-28 17:58:57


Hi Beichuan,

To add to what Ralph said,
the RHEL OpenMPI package probably wasn't built
with PBS Pro support either.
Besides, OMPI 1.5.4 (the RHEL version) is old.


You will save yourself time and grief if you read the installation FAQs
before you install from the source tarball:

http://www.open-mpi.org/faq/?category=building

However, as Ralph said, building from source is your best bet,
and it is quite easy to get right.

See this FAQ on how to build with PBS Pro support:

http://www.open-mpi.org/faq/?category=building#build-rte-tm
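
For example, a minimal sketch (the PBS Pro install path below is only
a guess; point --with-tm at wherever the TM headers and library live
on your cluster):

   ./configure --with-tm=/opt/pbs ...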

And this one on how to build with Infiniband support:

http://www.open-mpi.org/faq/?category=building#build-p2p
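
In the 1.6 series the flag is --with-openib (renamed --with-verbs in
the 1.7 series). For example, assuming the OFED/verbs headers are
installed under /usr:

   ./configure --with-openib=/usr ...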

Here is how to select the installation directory (--prefix):

http://www.open-mpi.org/faq/?category=building#easy-build
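
For example (any directory you can write to works; this one is just
an illustration):

   ./configure --prefix=$HOME/openmpi ...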

Here is how to select the compilers (gcc, g++, and gfortran are fine):

http://www.open-mpi.org/faq/?category=building#build-compilers
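
Putting it all together, a configure line might look like this
(a sketch only; the --prefix, --with-tm, and --with-openib paths are
placeholders for your system):

   ./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran \
       --prefix=$HOME/openmpi \
       --with-tm=/opt/pbs \
       --with-openib=/usr
   make -j4 all
   make install

Then put $HOME/openmpi/bin on your PATH and $HOME/openmpi/lib on your
LD_LIBRARY_PATH before you run.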

I hope this helps,
Gus Correa

On 02/28/2014 12:36 PM, Ralph Castain wrote:
> Almost certainly, the redhat package wasn't built with matching
> infiniband support and so we aren't picking it up. I'd suggest
> downloading the latest 1.7.4 or 1.7.5 nightly tarball, or even the
> latest 1.6 tarball if you want the stable release, and build it yourself
> so you *know* it was built for your system.
>
>
> On Feb 28, 2014, at 9:20 AM, Beichuan Yan <beichuan.yan_at_[hidden]
> <mailto:beichuan.yan_at_[hidden]>> wrote:
>
>> Hi there,
>> I am running jobs on clusters with an Infiniband connection. They
>> installed OpenMPI v1.5.4 (via the RedHat 6 yum package). My problem is
>> that although my jobs get queued and started by PBS Pro quickly, most
>> of the time they don't really run (occasionally they do) and give
>> error info like this (even though there are plenty of CPU/IB resources
>> available):
>> [r2i6n7][[25564,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.159.156 failed: Connection refused (111)
>> And even when a job gets started and runs well, it prints this
>> warning:
>> --------------------------------------------------------------------------
>> WARNING: There was an error initializing an OpenFabrics device.
>> Local host: r1i2n6
>> Local device: mlx4_0
>> --------------------------------------------------------------------------
>> 1. Here is the info from one of the compute nodes:
>> -bash-4.1$ /sbin/ifconfig
>> eth0 Link encap:Ethernet HWaddr 8C:89:A5:E3:D2:96
>> inet addr:192.168.159.205 Bcast:192.168.159.255 Mask:255.255.255.0
>> inet6 addr: fe80::8e89:a5ff:fee3:d296/64 Scope:Link
>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>> RX packets:48879864 errors:0 dropped:0 overruns:17 frame:0
>> TX packets:39286060 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:1000
>> RX bytes:54771093645 (51.0 GiB) TX bytes:37512462596 (34.9 GiB)
>> Memory:dfc00000-dfc20000
>> Ifconfig uses the ioctl access method to get the full address
>> information, which limits hardware addresses to 8 bytes.
>> Because Infiniband address has 20 bytes, only the first 8 bytes are
>> displayed correctly.
>> Ifconfig is obsolete! For replacement check ip.
>> ib0 Link encap:InfiniBand HWaddr
>> 80:00:00:48:FE:C0:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>> inet addr:10.148.0.114 Bcast:10.148.255.255 Mask:255.255.0.0
>> inet6 addr: fe80::202:c903:fb:3489/64 Scope:Link
>> UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
>> RX packets:43807414 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:10534050 errors:0 dropped:24 overruns:0 carrier:0
>> collisions:0 txqueuelen:256
>> RX bytes:47824448125 (44.5 GiB) TX bytes:44764010514 (41.6 GiB)
>> lo Link encap:Local Loopback
>> inet addr:127.0.0.1 Mask:255.0.0.0
>> inet6 addr: ::1/128 Scope:Host
>> UP LOOPBACK RUNNING MTU:16436 Metric:1
>> RX packets:17292 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:17292 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:0
>> RX bytes:1492453 (1.4 MiB) TX bytes:1492453 (1.4 MiB)
>> -bash-4.1$ chkconfig --list iptables
>> iptables 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>> 2. I tried the various parameters below, but none of them ensures my
>> jobs initialize and run:
>> #TCP="--mca btl ^tcp"
>> #TCP="--mca btl self,openib"
>> #TCP="--mca btl_tcp_if_exclude lo"
>> #TCP="--mca btl_tcp_if_include eth0"
>> #TCP="--mca btl_tcp_if_include eth0, ib0"
>> #TCP="--mca btl_tcp_if_exclude 192.168.0.0/24,127.0.0.1/8 --mca
>> oob_tcp_if_exclude 192.168.0.0/24,127.0.0.1/8"
>> #TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>> mpirun $TCP -hostfile $PBS_NODEFILE -np 8 ./paraEllip3d input.txt
>> 3. Then I turned to Intel MPI, which surprisingly starts and runs my
>> job correctly every time (though it is a little slower than OpenMPI,
>> maybe 15% slower).
>> Can you please advise? Many thanks.
>> Sincerely,
>> Beichuan Yan