Open MPI User's Mailing List Archives

From: Ralph Castain (rhc_at_[hidden])
Date: 2006-11-22 16:24:35


One of our users/friends has also sent us some example code to do this
internally - I hope to find the time to include that capability in the code
base shortly. I'll advise when we do.
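
For illustration only, and not the contributed code mentioned above: a
minimal sketch of how a process on a POSIX system could raise its own
file descriptor limit with getrlimit()/setrlimit(). The function name
raise_fd_limit is hypothetical, and the sketch assumes the hard limit is
high enough to be useful; an unprivileged process cannot raise its soft
limit past the hard limit.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: raise the soft limit on open file descriptors
 * toward `wanted`, capped at the hard limit. The limit is never lowered.
 * Returns 0 on success and -1 if getrlimit() or setrlimit() fails. */
int raise_fd_limit(rlim_t wanted)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }

    /* Already unlimited or high enough: leave the limit alone. */
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= wanted)
        return 0;

    /* Cap the request at the hard limit, which an unprivileged
     * process is not allowed to exceed. */
    if (rl.rlim_max != RLIM_INFINITY && wanted > rl.rlim_max)
        wanted = rl.rlim_max;

    rl.rlim_cur = wanted;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

A launcher would call something like raise_fd_limit(64000) early in
startup, before it begins accepting connections. If the hard limit itself
is too low, the shell or system settings still have to be changed.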

On 11/22/06 2:16 PM, "Rolf Vandevaart" <Rolf.Vandevaart_at_[hidden]> wrote:

>
> Hi Lydia:
>
> errno 24 means "Too many open files". When we have seen this, I believe
> we increased the number of file descriptors available to the mpirun process
> to get past this.
>
> In my case, my shell (tcsh) defaults to 256. I increase it with a call
> to "limit descriptors", as shown below. I think other shells may have
> other commands.
>
> burl-ct-v40z-0 41 =>limit
> cputime unlimited
> filesize unlimited
> datasize unlimited
> stacksize 10240 kbytes
> coredumpsize 0 kbytes
> vmemoryuse unlimited
> descriptors 256
> burl-ct-v40z-0 42 =>limit descriptors 64000
> burl-ct-v40z-0 43 =>limit
> cputime unlimited
> filesize unlimited
> datasize unlimited
> stacksize 10240 kbytes
> coredumpsize 0 kbytes
> vmemoryuse unlimited
> descriptors 64000
> burl-ct-v40z-0 44 =>
>
>
> Lydia Heck wrote on 11/22/06 15:45:
>
>> I have - again - successfully built and installed mx and openmpi, and I
>> can run 64- and 128-CPU jobs on a 256-CPU cluster. The version of
>> openmpi is 1.2b1.
>>
>> Compiler used: studio11
>>
>> The code is the benchmark b_eff, which usually runs fine - I have used
>> it extensively for benchmarking.
>>
>> When I try 192 CPUs I get
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> ...........
>> ..............
>> ..............
>>
>> The Myrinet ports have been opened and the job is running, as one of
>> the nodes shows:
>>
>> ps -eaf | grep dph0elh
>> dph0elh 1068 1 0 20:40:00 ?? 0:00 /opt/ompi/bin/orted
>> --bootproxy 1 --name 0.0.64 --num_procs 65 --vpid_start 0 -
>> root 1110 1106 0 20:43:46 pts/4 0:00 grep dph0elh
>> dph0elh 1070 1068 0 20:40:02 ?? 0:00 ../b_eff
>> dph0elh 1074 1068 0 20:40:02 ?? 0:00 ../b_eff
>> dph0elh 1072 1068 0 20:40:02 ?? 0:00 ../b_eff
>>
>> Any ideas?
>>
>> Lydia
>>
>>
>> ------------------------------------------
>> Dr E L Heck
>>
>> University of Durham
>> Institute for Computational Cosmology
>> Ogden Centre
>> Department of Physics
>> South Road
>>
>> DURHAM, DH1 3LE
>> United Kingdom
>>
>> e-mail: lydia.heck_at_[hidden]
>>
>> Tel.: + 44 191 - 334 3628
>> Fax.: + 44 191 - 334 3645
>> ___________________________________________
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>