Open MPI User's Mailing List Archives

From: Mostyn Lewis (Mostyn.Lewis_at_[hidden])
Date: 2006-11-22 16:21:16


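I believe errno 24 is EMFILE, "too many open files": the process reporting
the error has run out of file descriptors. Raise the limit on open files
before launching the job:

ulimit -n some_number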
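
If you want to confirm what errno 24 means on your system and see the current descriptor limits, a small Python sketch along these lines should do it. The function names (describe_errno, nofile_limits, raise_soft_nofile) are made up for illustration and are not part of Open MPI. It assumes a Unix-like system where Python's standard resource module is available.

```python
import errno
import os
import resource


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_nofile(wanted):
    """Raise the soft descriptor limit toward `wanted`, never above the hard limit."""
    soft, hard = nofile_limits()
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited:
        return soft, hard  # already unlimited, nothing to raise
    target = wanted if hard == unlimited else min(wanted, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return nofile_limits()


if __name__ == "__main__":
    print(describe_errno(24))  # symbolic name and message for errno 24
    print(nofile_limits())     # current (soft, hard) descriptor limits
```

Note that a limit raised this way applies only to the Python process and anything it starts afterwards, just as ulimit -n applies only to the current shell and its children. The job still has to be launched from an environment where the limit has been raised.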

Regards,
Mostyn

On Wed, 22 Nov 2006, Lydia Heck wrote:

>
> I have - again - successfully built and installed
> mx and openmpi, and I can run 64- and 128-CPU jobs on a 256-CPU cluster.
> The version of openmpi is 1.2b1.
>
> compiler used: studio11
>
> The code is the benchmark b_eff, which usually runs fine - I have used it
> extensively for benchmarking.
>
> When I try 192 CPUs I get
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> ...........
> ..............
> ..............
>
> The myrinet ports have been opened and the job is running
> as one of the nodes shows ....
>
> ps -eaf | grep dph0elh
> dph0elh 1068 1 0 20:40:00 ?? 0:00 /opt/ompi/bin/orted
> --bootproxy 1 --name 0.0.64 --num_procs 65 --vpid_start 0 -
> root 1110 1106 0 20:43:46 pts/4 0:00 grep dph0elh
> dph0elh 1070 1068 0 20:40:02 ?? 0:00 ../b_eff
> dph0elh 1074 1068 0 20:40:02 ?? 0:00 ../b_eff
> dph0elh 1072 1068 0 20:40:02 ?? 0:00 ../b_eff
>
> Any ideas?
>
> Lydia
>
>
> ------------------------------------------
> Dr E L Heck
>
> University of Durham
> Institute for Computational Cosmology
> Ogden Centre
> Department of Physics
> South Road
>
> DURHAM, DH1 3LE
> United Kingdom
>
> e-mail: lydia.heck_at_[hidden]
>
> Tel.: + 44 191 - 334 3628
> Fax.: + 44 191 - 334 3645
> ___________________________________________