I have recently installed Open MPI 1.3r1212a over TCP and gigabit Ethernet
on a Solaris 10 x86/64 system.
Some test codes compile fine using the Open MPI versions of mpicc, mpif95
and mpic++:
monte (a Monte Carlo estimate of pi),
connectivity, which tests connectivity between processes and nodes, and
prime, which calculates prime numbers.
(These test codes are examples bundled with Sun HPC.)
Sometimes the jobs run fine, but most of the time they freeze,
leaving zombie processes behind.
My run-time command is
mpirun --hostfile my-hosts -mca pls_rsh_agent rsh --mca btl tcp,self -np 14 \
monte
and I get the following output
oberon(209) > mpirun --hostfile my-hosts -mca pls_rsh_agent rsh --mca btl tcp,self -np 14 monte
Monte-Carlo estimate of pi by 14 processes is 3.141503.
after which the cursor hangs.
The process table shows
oberon# ps -eaf | grep dph0elh
dph0elh 9583 7445 7 17:45:01 pts/26 9:22 mpirun --hostfile my-hosts -mca pls_rsh_agent rsh --mca btl tcp,self -np 14 mon
dph0elh 9595 9588 0 - ? 0:02 <defunct>
dph0elh 9588 1 7 17:45:01 ?? 9:03 orted --bootproxy 1 --name 0.0.1 --num_procs 5 --vpid_start 0 --nodename oberon
dph0elh 7445 6924 0 17:01:38 pts/26 0:00 -tcsh
root 9656 4151 0 18:01:31 pts/36 0:00 grep dph0elh
dph0elh 9593 9588 0 - ? 0:02 <defunct>
One of the nodes offers 8 CPUs; the other nodes in the hostfile offer 2 each.
A total of 14 CPUs are available and, as you can see from the command line,
I use --mca btl tcp,self.
There are no other interconnects.
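For illustration only, a hostfile describing this layout might look like the
sketch below. The host names node-a to node-d are hypothetical placeholders,
not the real machine names, and the split into one 8-slot node plus three
2-slot nodes is inferred from the total of 14 CPUs.

```
# Hypothetical hostfile sketch (host names are placeholders)
# one node with 8 CPUs
node-a slots=8
# remaining nodes with 2 CPUs each: 8 + 3 * 2 = 14 slots in total
node-b slots=2
node-c slots=2
node-d slots=2
```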
I could not find any relevant entry in the FAQ, apart from the advice to use
--mca btl tcp,self.
------------------------------------------
Dr E L Heck
University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road
DURHAM, DH1 3LE
United Kingdom
e-mail: lydia.heck_at_[hidden]
Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___________________________________________