Open MPI User's Mailing List Archives

From: Rolf Vandevaart (Rolf.Vandevaart_at_[hidden])
Date: 2006-11-22 16:16:23


Hi Lydia:

errno 24 is EMFILE, "Too many open files". When we have seen this, I believe
we got past it by increasing the number of file descriptors available to the
mpirun process.
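
To see how close a process is to that limit, something like the following might help. It is only a sketch: it assumes a Linux-style /proc/<pid>/fd directory (other systems expose this differently), and it defaults to the current shell's PID purely for illustration. For a real job you would pass the PID of mpirun instead. The limit it prints is the soft limit of the shell running the script, which is what processes started from that shell inherit.

```shell
#!/bin/sh
# Sketch: compare the descriptors a process has open with the soft limit.
# Assumption: /proc/<pid>/fd exists (Linux-style procfs).
# The PID defaults to this shell ($$) for illustration only.
pid=${1:-$$}

# Each entry in /proc/<pid>/fd is one open descriptor; if the directory
# is missing, ls prints nothing and the count comes out as 0.
open_fds=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')

# Soft descriptor limit of this shell (inherited by its children).
soft_limit=$(ulimit -Sn)

echo "pid $pid: $open_fds descriptors open, soft limit $soft_limit"
```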

In my case, my shell (tcsh) defaults to 256 descriptors. I increase the limit
with a call to "limit descriptors", as shown below. I think other shells may
have other commands.

 burl-ct-v40z-0 41 =>limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize 10240 kbytes
coredumpsize 0 kbytes
vmemoryuse unlimited
descriptors 256
 burl-ct-v40z-0 42 =>limit descriptors 64000
 burl-ct-v40z-0 43 =>limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize 10240 kbytes
coredumpsize 0 kbytes
vmemoryuse unlimited
descriptors 64000
 burl-ct-v40z-0 44 =>
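
For Bourne-style shells (sh, bash, ksh) the counterpart is the "ulimit" builtin with the -n option. The sketch below is an assumption-laden example rather than a tested recipe for this cluster: the target of 64000 simply mirrors the tcsh session above, and the value is clamped to the hard limit because an unprivileged user cannot raise the soft limit beyond it.

```shell
#!/bin/sh
# Sketch: Bourne-shell counterpart to tcsh's "limit descriptors 64000".
# The target value mirrors the tcsh session above and is illustrative.
target=64000

# The soft limit cannot exceed the hard limit without extra privileges,
# so clamp the target to the hard limit when that is lower.
hard=$(ulimit -Hn)
if [ "$hard" != "unlimited" ] && [ "$hard" -lt "$target" ]; then
    target=$hard
fi

# Set the soft limit for this shell; mpirun started from here inherits it.
ulimit -Sn "$target"
echo "descriptors (soft): $(ulimit -Sn)"
```

The new limit applies only to the shell that runs these commands and to the processes it starts, so it has to be set in the same shell (or startup file) that later launches mpirun.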

Lydia Heck wrote On 11/22/06 15:45,:

>I have - again - successfully built and installed
>mx and openmpi, and I can run 64- and 128-CPU jobs on a 256-CPU cluster.
>The version of openmpi is 1.2b1.
>
>compiler used: studio11
>
>The code is the b_eff benchmark, which usually runs fine - I have used it
>extensively for benchmarking
>
>When I try 192 CPUs I get
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> ...........
>..............
>..............
>
>The myrinet ports have been opened and the job is running
>as one of the nodes shows ....
>
> ps -eaf | grep dph0elh
> dph0elh 1068 1 0 20:40:00 ?? 0:00 /opt/ompi/bin/orted
>--bootproxy 1 --name 0.0.64 --num_procs 65 --vpid_start 0 -
> root 1110 1106 0 20:43:46 pts/4 0:00 grep dph0elh
> dph0elh 1070 1068 0 20:40:02 ?? 0:00 ../b_eff
> dph0elh 1074 1068 0 20:40:02 ?? 0:00 ../b_eff
> dph0elh 1072 1068 0 20:40:02 ?? 0:00 ../b_eff
>
>any idea ?
>
>Lydia
>
>
>------------------------------------------
>Dr E L Heck
>
>University of Durham
>Institute for Computational Cosmology
>Ogden Centre
>Department of Physics
>South Road
>
>DURHAM, DH1 3LE
>United Kingdom
>
>e-mail: lydia.heck_at_[hidden]
>
>Tel.: + 44 191 - 334 3628
>Fax.: + 44 191 - 334 3645
>___________________________________________

-- 
=========================
rolf.vandevaart_at_[hidden]
781-442-3043
=========================