I have - again - successfully built and installed
mx and openmpi and I can run 64 and 128 cpus jobs on a 256 CPU cluster
version of openmpi is 1.2b1
compiler used: studio11
The code is a benchmark b_eff which runs usually fine - I have used extensively
it for benchmarking
When I try 192 CPUs I get
m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
...........
..............
..............
The myrinet ports have been opened and the job is running
as one of the nodes shows ....
ps -eaf | grep dph0elh
dph0elh 1068 1 0 20:40:00 ?? 0:00 /opt/ompi/bin/orted
--bootproxy 1 --name 0.0.64 --num_procs 65 --vpid_start 0 -
root 1110 1106 0 20:43:46 pts/4 0:00 grep dph0elh
dph0elh 1070 1068 0 20:40:02 ?? 0:00 ../b_eff
dph0elh 1074 1068 0 20:40:02 ?? 0:00 ../b_eff
dph0elh 1072 1068 0 20:40:02 ?? 0:00 ../b_eff
any idea ?
Lydia
------------------------------------------
Dr E L Heck
University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road
DURHAM, DH1 3LE
United Kingdom
e-mail: lydia.heck_at_[hidden]
Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___________________________________________
|