James --
Sorry for the delay in replying.
Do you have any firewall software running on your nodes (e.g.,
iptables)? OMPI uses random TCP ports to connect between nodes for
control messages. If they can't reach each other because TCP ports
are blocked, Bad Things will happen (potentially even a hang, because
firewalls can cause packets to be silently dropped).
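
One way to check this independently of MPI is a small TCP reachability test between the two nodes. The following is only a rough sketch, not anything shipped with Open MPI: the helper names `open_listener` and `can_connect` are made up for illustration, and it assumes Python is available on both nodes. It listens on an OS-chosen port on one node and attempts a connection with a timeout from the other, so a silently dropped packet shows up as a failed connect rather than a hang.

```python
import socket


def open_listener(bind_addr="0.0.0.0"):
    """Listen on an OS-chosen (effectively random) TCP port.

    Returns the listening socket and the port number it was given.
    Hypothetical helper for illustration only.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((bind_addr, 0))  # port 0 asks the OS to pick a free port
    srv.listen(1)
    return srv, srv.getsockname()[1]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time.

    A firewall that silently drops packets typically shows up here as a
    timeout; a closed port shows up as a refused connection. Both are
    reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

To try it, call `open_listener()` on node2 and note the port it returns, then call `can_connect("node2", port)` on node1 (and repeat in the other direction). If it returns False while ping and ssh work, a firewall blocking non-standard TCP ports is a likely suspect.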
On May 20, 2008, at 12:17 PM, Rudd, James wrote:
> I have been trying to compile a molecular dynamics program with the
> Openmpi 1.2.5 included in OFED 1.3. I am running Fedora Core 6; the
> output of uname -r is 2.6.18-1.2798.fc6. I've traced the problems
> I've been having back to openmpi because I'm unable to run the test
> programs such as glob on more than one node. I currently have 2
> nodes connected to an infiniband switch with opensm running on
> node1. The nodes can ping each other and I am able to ssh between
> them without a password. My openmpi-default-hostfile includes the
> following:
>
> node1 slots=2 max-slots=4
> node2 slots=4 max-slots=4
>
> When I run mpirun -np 4 --debug-daemons ./glob I get:
> Daemon [0,0,1] checking in as pid 21341 on host node1
> And the program appears to hang. Once I CTRL+C it a couple of times
> I get the contents of error.txt
>
> Per the instructions in the FAQ I've included the output of
> ibv_devinfo, ifconfig, and ulimit -l in the
> infiniband_info.txt file. The results of ompi_info --all are in the
> ompi_info.txt file.
>
> I've been tearing my hair out over this; any help would be greatly
> appreciated.
>
> James Rudd
> JLC-Biomedical/Biotechnology Research Institute
> North Carolina Central University
> 700 George Street
> Durham, NC 27707
> Phone: (919) 530-7015
> Email: jrudd_at_[hidden]
> http://ariel.acc.nccu.edu/Academics/BBRI/personnel/rudd.htm
>
> <error.txt><infiniband_info.txt><ompi_info.txt>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
Cisco Systems