Subject: Re: [OMPI users] OpenMPI scaling > 512 cores
From: Scott Shaw (sshaw_at_[hidden])
Date: 2008-06-06 13:13:39


Thank you to all that replied regarding my questions.

I have tried all the options suggested, but unfortunately I still run
into the same problem. I am at the point where I have exhausted all of
the options available with the OpenMPI v1.2.2 release and am moving to
v1.2.6 later today. Hopefully the mca_oob_tcp_peer_complete_connect
handling is better there and resolves the "connection failed: retrying"
messages when running large core-count jobs.
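
For reference, the FAQ entry Pasha pointed to below suggests startup
options along these lines (a sketch based on my reading of that FAQ;
the retry count is only an illustrative value, and the exact parameter
names should be double-checked with "ompi_info --param oob tcp" for the
installed release):

  mpiexec -mca oob_tcp_listen_mode listen_thread \
          -mca oob_tcp_peer_retries 120 \
          -mca btl openib,sm,self \
          -machinefile mpd.hosts.$$ -np 1024 ~/bin/test_ompi < input1

As I understand it, listen_thread moves the OOB accept handling in
mpirun into a dedicated thread so it can keep up with many simultaneous
connection attempts, and oob_tcp_peer_retries raises the number of
retries before a peer connection is declared failed.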

Thanks!
Scott
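
P.S. One clarification on question 2 in the quoted message below, as
far as I understand it: even with "-mca btl openib,sm,self" (no tcp
BTL), the out-of-band layer that mpirun uses for startup and control
messages still runs over TCP, so some TCP packet counters will keep
increasing; that by itself does not mean MPI data is falling back to
TCP. The "oob_tcp_exclude" option only affects that OOB traffic, not
the BTLs. To confirm the openib BTL is actually present in this build,
something like the following can be run (path taken from the command
line quoted below):

  /usr/mpi/openmpi-1.2-2/intel/bin/ompi_info --param btl openib

If no openib parameters are listed, the openib BTL was not built and
MPI traffic cannot be going over IB verbs.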

> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
> On Behalf Of Pavel Shamis (Pasha)
> Sent: Wednesday, June 04, 2008 5:18 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI scaling > 512 cores
>
> Scott Shaw wrote:
> > Hi, I hope this is the right forum for my questions. I am running
> > into a problem when scaling >512 cores on an InfiniBand cluster
> > which has 14,336 cores. I am new to OpenMPI and trying to figure out
> > the right -mca options to pass to avoid the
> > "mca_oob_tcp_peer_complete_connect: connection failed:" messages on
> > a cluster which has InfiniBand HCAs and the OFED v1.3 GA release.
> > Other MPI implementations like Intel MPI and MVAPICH work fine using
> > the uDAPL or verbs IB layers for MPI communications.
> >
> Did you have a chance to see this FAQ?
> http://www.open-mpi.org/faq/?category=troubleshooting#large-job-tcp-oob-timeout
> > I find it difficult to understand which network interface or IB
> > layer is being used. When I explicitly state not to use the eth0,
> > lo, ib1, or ib1:0 interfaces with the command-line option "-mca
> > oob_tcp_exclude", OpenMPI will continue to probe these interfaces.
> > For all MPI traffic OpenMPI should use ib0, which is the 10.148
> > network, but with debugging enabled I see references trying the
> > 10.149 network, which is ib1. Below is the ifconfig network device
> > output for a compute node.
> >
> > Questions:
> >
> > 1. Is there a way to determine which network device is being used
> > and not have OpenMPI fall back to another device? With Intel MPI or
> > HP MPI you can state not to use a fallback device. I thought "-mca
> > oob_tcp_exclude" would be the correct option to pass, but I may be
> > wrong.
> >
> If you want to use IB verbs, you may specify:
> -mca btl sm,self,openib
> sm - shared memory
> self - self communication
> openib - IB communication (IB verbs)
>
> > 2. How can I determine whether the InfiniBand openib device is
> > actually being used? When running an MPI app I continue to see
> > counters for in/out packets at the TCP level increasing when it
> > should be using the IB RDMA device for all MPI comms over the ib0 or
> > mthca0 device. OpenMPI was bundled with OFED v1.3 so I am assuming
> > the openib interface should work. Running ompi_info shows btl_open_*
> > references.
> >
> > /usr/mpi/openmpi-1.2-2/intel/bin/mpiexec -mca
> > btl_openib_warn_default_gid_prefix 0 -mca oob_tcp_exclude
> > eth0,lo,ib1,ib1:0 -mca btl openib,sm,self -machinefile mpd.hosts.$$
> > -np 1024 ~/bin/test_ompi < input1
> >
> http://www.open-mpi.org/community/lists/users/2008/05/5583.php
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users