Open MPI User's Mailing List Archives

From: Lydia Heck (lydia.heck_at_[hidden])
Date: 2006-11-22 15:45:52


I have, once again, successfully built and installed MX and Open MPI,
and I can run 64- and 128-CPU jobs on a 256-CPU cluster.

Open MPI version: 1.2b1
Compiler used: Studio 11

The code is the b_eff benchmark, which usually runs fine. I have used it
extensively for benchmarking.

When I try 192 CPUs I get:
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
 ...........
..............
..............
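
For reference, the errno value in the messages above can be decoded on the node that reports it. The following is a minimal sketch, not a confirmed diagnosis: it assumes Python is available on that node and that errno 24 maps to EMFILE ("too many open files"), as it does on Linux and Solaris. The helper names describe_errno and fd_limits are made up for illustration.

```python
import errno
import os
import resource


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


def fd_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    name, message = describe_errno(24)
    soft, hard = fd_limits()
    print(f"errno 24 = {name}: {message}")
    print(f"open-file limit: soft={soft}, hard={hard}")
```

Every connection returned by accept() uses one file descriptor, so a soft limit that is low relative to the number of processes would be consistent with failures that only appear at 192 CPUs. Whether that is what happens here is an assumption until the limit has been checked in the shell that launches the job.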

The Myrinet ports have been opened and the job is running,
as the process list on one of the nodes shows:

 ps -eaf | grep dph0elh
 dph0elh 1068 1 0 20:40:00 ?? 0:00 /opt/ompi/bin/orted
--bootproxy 1 --name 0.0.64 --num_procs 65 --vpid_start 0 -
    root 1110 1106 0 20:43:46 pts/4 0:00 grep dph0elh
 dph0elh 1070 1068 0 20:40:02 ?? 0:00 ../b_eff
 dph0elh 1074 1068 0 20:40:02 ?? 0:00 ../b_eff
 dph0elh 1072 1068 0 20:40:02 ?? 0:00 ../b_eff

Any ideas?

Lydia

------------------------------------------
Dr E L Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.heck_at_[hidden]

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___________________________________________