Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI Hangs, No Error
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-07-06 18:22:25


On Jul 6, 2010, at 5:41 PM, Robert Walters wrote:

> Thanks for your expeditious responses, Ralph.
>
> Just to confirm with you, I should change openmpi-mca-params.conf to include:
>
> oob_tcp_port_min_v4 = (My minimum port in the range)
> oob_tcp_port_range_v4 = (My port range)
> btl_tcp_port_min_v4 = (My minimum port in the range)
> btl_tcp_port_range_v4 = (My port range)
>
> correct?

That should do ya. Use the same values on all nodes. You should be able to confirm that OMPI's run-time system is working if you are able to mpirun a non-MPI program like "hostname" or somesuch. If that works, then the daemons are launching, talking to each other, launching the app, shuttling the I/O around, noticing that the app is dying, tidying everything up, and telling mpirun that everything is done. In short: lots of things are happening right if you're able to mpirun "hostname" across multiple hosts.
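For reference, a minimal openmpi-mca-params.conf along these lines might look like the following. The 10000/50 values are placeholders, not a recommendation; substitute whatever range your sysadmin actually opened, and put the identical file on every node:

```
# Restrict Open MPI's out-of-band (oob) and TCP BTL traffic to a fixed
# port range.  min_v4 is the first port; range_v4 is how many ports
# upward from there OMPI may use.  Values below are illustrative only.
oob_tcp_port_min_v4 = 10000
oob_tcp_port_range_v4 = 50
btl_tcp_port_min_v4 = 10000
btl_tcp_port_range_v4 = 50
```

Then the sanity check described above is just something like "mpirun --host node1,node2 hostname" -- if every node echoes its name back, the run-time plumbing is working before any MPI traffic enters the picture.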

> Also, for a cluster of around 32-64 processes (8 processors per node), how wide of a range will I require? I've noticed some entries in the mailing list suggesting you need a few to get started and then it opens as necessary. Will I be safe with 20 or should I go for 100?

If you have 64 hosts, each with 8 processors, meaning that the largest MPI job you would run would be 64 * 8 = 512 MPI processes, then I'd ask for at least 1024 -- 2048 would be better (you have a zillion ports; better to ask for more than you need). We recently found a bug in the TCP BTL where it *may* use 2 sockets for each pairwise connection in some cases.
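The arithmetic behind those numbers can be sketched as follows (a rough budget, assuming the 64-host, 8-processor cluster discussed above; the multipliers just reproduce the "at least 1024, 2048 is better" rule of thumb, not an exact accounting of OMPI's socket usage):

```python
# Back-of-the-envelope port budget for an Open MPI TCP cluster.
hosts = 64
procs_per_host = 8
max_procs = hosts * procs_per_host       # largest MPI job: 512 processes

# Normally one TCP connection per peer pair, but the TCP BTL bug
# mentioned above may use 2 sockets per connection, so budget for 2x.
ports_minimum = 2 * max_procs            # 1024: the "at least" figure
ports_comfortable = 4 * max_procs        # 2048: extra headroom

print(max_procs, ports_minimum, ports_comfortable)  # 512 1024 2048
```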

Additionally, your sysadmin *might* be more amenable to opening up ports *only between the cluster nodes* (vs. opening up the ports to anything). If that's the case, you might as well go for the gold and ask them if they can open up *all* the ports between all your nodes (while still rejecting everything from non-cluster nodes).

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/