Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ORTE_ERROR_LOG Timeout
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-06-04 12:51:45


James --

Sorry for the delay in replying.

Do you have any firewall software running on your nodes (e.g.,
iptables)? OMPI opens TCP connections on randomly chosen ports
between nodes for its control messages. If the nodes can't reach each
other because those ports are blocked, Bad Things will happen
(potentially even a hang, because firewalls can silently drop packets).
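
One rough way to check this from one node to the other is to try a
plain TCP connection and see how it fails. The following is only a
minimal sketch, not anything OMPI ships: the probe_tcp helper, the
host name and the port number are all placeholders of mine for
illustration. The idea is that an immediate "refused" usually means
the packet reached the other host and nothing was listening there,
while a "timeout" is what you would typically see when a firewall is
silently dropping packets.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a single TCP connection attempt to host:port.

    Hypothetical helper for illustration only. Returns one of:
      "open"      - something accepted the connection
      "refused"   - the host answered, but nothing listens on that port
      "timeout"   - no answer at all (consistent with dropped packets)
      "error: .." - any other failure (e.g. name lookup, no route to host)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


# Example use (run on node1; "node2" and 5000 are placeholder values):
#   print(probe_tcp("node2", 5000))
```

Since the ports OMPI picks are random, a single probe like this proves
nothing by itself; it is mainly useful for telling "refused" apart from
"timeout" on a few arbitrary high ports, in both directions.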

On May 20, 2008, at 12:17 PM, Rudd, James wrote:

> I have been trying to compile a molecular dynamics program with the
> Open MPI 1.2.5 included in OFED 1.3. I am running Fedora Core 6; the
> output of uname -r is 2.6.18-1.2798.fc6. I’ve traced the problems
> I’ve been having back to Open MPI because I’m unable to run the test
> programs such as glob on more than one node. I currently have 2
> nodes connected to an infiniband switch with opensm running on
> node1. The nodes can ping each other and I am able to ssh between
> them without a password. My openmpi-default-hostfile includes the
> following:
>
> node1 slots=2 max-slots=4
> node2 slots=4 max-slots=4
>
> When I run “mpirun -np 4 --debug-daemons ./glob” I get:
> Daemon [0,0,1] checking in as pid 21341 on host node1
> And the program appears to hang. Once I CTRL+C it a couple of times
> I get the contents of error.txt
>
> Per the instructions in the FAQ I’ve included the output of
> “ibv_devinfo”, “ifconfig”, and “ulimit -l” in the
> infiniband_info.txt file. The results of “ompi_info --all” are in the
> ompi_info.txt file.
>
> I’ve been tearing my hair out over this; any help would be greatly
> appreciated.
>
> James Rudd
> JLC-Biomedical/Biotechnology Research Institute
> North Carolina Central University
> 700 George Street
> Durham, NC 27707
> Phone: (919) 530-7015
> Email: jrudd_at_[hidden]
> http://ariel.acc.nccu.edu/Academics/BBRI/personnel/rudd.htm
>
> <error.txt><infiniband_info.txt><ompi_info.txt>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems