I think the problem is likely due to the networking differences
between the nodes. Check out these two FAQ entries:
Specifically, I think you should try using one of these two pairs of
MCA parameters:
btl_tcp_if_include and oob_tcp_include
btl_tcp_if_exclude and oob_tcp_exclude
Basically, you need to make sure that Open MPI doesn't try to use
the public network, since one of the nodes isn't on the public network.
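As a rough sketch of what the "include" pair could look like on the
command line: the interface name eth1 below is only an assumption for
the NIC attached to the private cluster switch, and "hostfile" is a
placeholder for your real hostfile, so check the ifconfig output on
each node before using it. The snippet only builds and prints the
command so you can inspect it first.

```shell
# Assumption: eth1 is the interface on the private cluster switch on
# every node (hypothetical name; verify with ifconfig on LT, SGT and PFC).
CLUSTER_IF="eth1"
# Placeholder for the hostfile listing the three nodes.
HOSTFILE="hostfile"

# Restrict both MPI traffic (btl_tcp_if_include) and the out-of-band
# launch/control traffic (oob_tcp_include) to the private interface.
CMD="mpirun --mca btl_tcp_if_include $CLUSTER_IF --mca oob_tcp_include $CLUSTER_IF --np 6 --hostfile $HOSTFILE hostname"

# Print the command for review; paste it into the shell to run it.
echo "$CMD"
```

The "exclude" pair works the other way around: you name the public
interface instead. If you go that route, I believe you also need to
list the loopback interface (lo) in the exclude value, since setting
the parameter replaces the default value rather than adding to it.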
On Jan 17, 2008 10:08 PM, Mark Kosmowski <mark.kosmowski_at_[hidden]> wrote:
> On Jan 15, 2008 7:54 PM, Mark Kosmowski <mark.kosmowski_at_[hidden]> wrote:
> > Dear Open-MPI Community:
> > I have a 3 node cluster, each a dual opteron workstation running
> > OpenSUSE 10.1 64-bit. The node names are LT, SGT and PFC. When I
> > start an mpirun job from either SGT or PFC, things work as they are
> > supposed to. However, if I start the same job from LT, the job hangs
> > at SGT - this was confirmed by mpirun --np 6 --hostfile <correct
> > hostfile for the three nodes> hostname, which gives only LT; LT; PFC;
> > PFC (and then hangs) when started from LT (this same command started
> > from either of the other nodes gives two of each of the three hostnames
> > and terminates normally). The nfs share drive is physically located
> > on LT.
> > I have been using ssh to get to either SGT or PFC from a terminal
> > opened originally on LT to run jobs. I can ssh from any node to any
> > other node.
> > I have attached a gzipped tar archive of the three ifconfig results
> > (for each node) and the results of ompi_info --all command as
> > requested in the "Getting Help" section. I was unable to locate a
> > config.log file in the shared ompi directory.
> > Any assistance on this matter would be appreciated,
> > Mark E. Kosmowski
> >I'd posted a message earlier about intermittent hangs -- perhaps it's
> >the same issue. If you run a hundred instances or so of "mpirun --np 6
> >--hostfile hostfile uptime", from SGT or PFC, do you notice any hangs?
> >Barry Rountree
> I read your thread and I do not think that the issues are the same.
> You seem to get the correct output before the hang, I do not. My
> system either fails to give the expected output with a hang when
> started from the LT node, or works correctly giving the proper output
> and a graceful exit (i.e. no hang whatsoever) when started on one of
> the other two nodes (SGT or PFC).
> I suspect that my issue is that both LT and SGT are connected to both
> the internet and the dedicated cluster traffic gigabit switch, while
> PFC is only connected to the dedicated cluster traffic gigabit switch.
> However, this is the limit of my network diagnostic abilities,
> especially since SGT can properly launch Open MPI jobs.
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmattox_at_[hidden] || timattox_at_[hidden]
I'm a bright... http://www.the-brights.net/