Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI job initializing problem
From: Gus Correa (gus_at_[hidden])
Date: 2014-02-28 17:58:57

Hi Beichuan

To add to what Ralph said,
the RHEL OpenMPI package probably wasn't built
with PBS Pro support either.
Besides, OMPI 1.5.4 (RHEL version) is old.
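
One way to check what an existing install was built with is to look at the component list printed by ompi_info, the tool that ships with Open MPI. The sketch below is only an illustration: has_component is a hypothetical helper name, and it assumes ompi_info prints one line per component in the form "MCA btl: openib (...)". It reads from stdin, so it can be tried on saved output as well.

```shell
# Sketch: report whether ompi_info output (on stdin) lists a given component.
# has_component is a hypothetical helper, not part of Open MPI.
has_component() {
    # $1 = framework (e.g. btl, ras), $2 = component (e.g. openib, tm)
    grep -Eq "MCA[[:space:]]+$1:[[:space:]]+$2([[:space:]]|\$)"
}

# Typical use on a compute node (assumes ompi_info is in your PATH):
#   ompi_info | has_component btl openib && echo "openib BTL present"
#   ompi_info | has_component ras tm     && echo "PBS/Torque (tm) support present"
```

If neither line is found, the package was most likely built without InfiniBand or PBS support, which would be consistent with the TCP fallback errors you are seeing.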


You will save yourself time and grief if you read the installation FAQs
before you install from the source tarball:

However, as Ralph said, that is your best bet, and it is quite easy
to get right.

See this FAQ on how to build with PBS Pro support:

And this one on how to build with Infiniband support:

Here is how to select the installation directory (--prefix):

Here is how to select the compilers (gcc, g++, and gfortran are fine):
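
Putting those four items together, a configure line might look like the sketch below. The paths are placeholders and not values from this thread: /opt/pbs stands for wherever PBS Pro is installed on your cluster, and $HOME/sw/openmpi is an arbitrary install prefix. --with-tm is the option for PBS Pro/Torque support and --with-openib is the InfiniBand option in the 1.6 series; I believe newer series call it --with-verbs, so check "./configure --help" in your tarball. The script only prints the command, so you can review it before running it.

```shell
# Dry-run sketch: assemble an Open MPI configure line and print it.
# All paths are placeholders (assumptions); adjust them for your cluster.
PREFIX="${PREFIX:-$HOME/sw/openmpi}"   # hypothetical install directory
PBS_DIR="${PBS_DIR:-/opt/pbs}"         # hypothetical PBS Pro install root

CONFIGURE_CMD="./configure --prefix=$PREFIX --with-tm=$PBS_DIR --with-openib CC=gcc CXX=g++ FC=gfortran"

echo "$CONFIGURE_CMD"
# After reviewing the line, run it from the unpacked source directory, then:
#   make all install
```

Installing under your own prefix keeps the new build separate from the RHEL package, as long as its bin and lib directories come first in PATH and LD_LIBRARY_PATH on all nodes.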

I hope this helps,
Gus Correa

On 02/28/2014 12:36 PM, Ralph Castain wrote:
> Almost certainly, the redhat package wasn't built with matching
> infiniband support and so we aren't picking it up. I'd suggest
> downloading the latest 1.7.4 or 1.7.5 nightly tarball, or even the
> latest 1.6 tarball if you want the stable release, and build it yourself
> so you *know* it was built for your system.
> On Feb 28, 2014, at 9:20 AM, Beichuan Yan <beichuan.yan_at_[hidden]> wrote:
>> Hi there,
>> I am running jobs on clusters with an InfiniBand interconnect. They
>> installed OpenMPI v1.5.4 (via the Red Hat 6 yum package). My problem is
>> that although my jobs get queued and started by PBS Pro quickly, most of
>> the time they don’t actually run (occasionally they do) and give
>> errors like this (even though plenty of CPU/IB resources are
>> available):
>> [r2i6n7][[25564,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to failed: Connection refused (111)
>> And even when a job does start and run well, it prints this
>> warning:
>> --------------------------------------------------------------------------
>> WARNING: There was an error initializing an OpenFabrics device.
>> Local host: r1i2n6
>> Local device: mlx4_0
>> --------------------------------------------------------------------------
>> 1. Here is the info from one of the compute nodes:
>> -bash-4.1$ /sbin/ifconfig
>> eth0 Link encap:Ethernet HWaddr 8C:89:A5:E3:D2:96
>> inet addr: Bcast: Mask:
>> inet6 addr: fe80::8e89:a5ff:fee3:d296/64 Scope:Link
>> RX packets:48879864 errors:0 dropped:0 overruns:17 frame:0
>> TX packets:39286060 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:1000
>> RX bytes:54771093645 (51.0 GiB) TX bytes:37512462596 (34.9 GiB)
>> Memory:dfc00000-dfc20000
>> Ifconfig uses the ioctl access method to get the full address
>> information, which limits hardware addresses to 8 bytes.
>> Because Infiniband address has 20 bytes, only the first 8 bytes are
>> displayed correctly.
>> Ifconfig is obsolete! For replacement check ip.
>> ib0 Link encap:InfiniBand HWaddr
>> 80:00:00:48:FE:C0:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>> inet addr: Bcast: Mask:
>> inet6 addr: fe80::202:c903:fb:3489/64 Scope:Link
>> RX packets:43807414 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:10534050 errors:0 dropped:24 overruns:0 carrier:0
>> collisions:0 txqueuelen:256
>> RX bytes:47824448125 (44.5 GiB) TX bytes:44764010514 (41.6 GiB)
>> lo Link encap:Local Loopback
>> inet addr: Mask:
>> inet6 addr: ::1/128 Scope:Host
>> RX packets:17292 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:17292 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:0
>> RX bytes:1492453 (1.4 MiB) TX bytes:1492453 (1.4 MiB)
>> -bash-4.1$ chkconfig --list iptables
>> iptables 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>> 2. I tried the various parameters below, but none of them ensures that
>> my jobs initialize and run:
>> #TCP="--mca btl ^tcp"
>> #TCP="--mca btl self,openib"
>> #TCP="--mca btl_tcp_if_exclude lo"
>> #TCP="--mca btl_tcp_if_include eth0"
>> #TCP="--mca btl_tcp_if_include eth0, ib0"
>> #TCP="--mca btl_tcp_if_exclude, --mca
>> oob_tcp_if_exclude,"
>> #TCP="--mca btl_tcp_if_include"
>> mpirun $TCP -hostfile $PBS_NODEFILE -np 8 ./paraEllip3d input.txt
>> 3. Then I turned to Intel MPI, which surprisingly starts and runs my
>> job correctly each time (though it is a little slower than OpenMPI,
>> maybe 15% slower, but it works each time).
>> Can you please advise? Many thanks.
>> Sincerely,
>> Beichuan Yan
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]