
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Maximum number of MPI processes on a node + discovering faulty nodes
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-11-27 11:58:32


Just glancing at the code, I don't see anything tied to 2**12 that pops out at me. I suspect the issue is that you are hitting a system limit on the number of child processes a process can spawn - this is different from the total number of processes allowed on the node - or the number of file descriptors a process can have open (we need several per process for I/O forwarding).

On Nov 27, 2012, at 8:24 AM, George Markomanolis <george_at_[hidden]> wrote:

> Dear Ralph,
>
> Thanks for the answer, I am using OMPI v1.4.1.
>
> Best regards,
> George Markomanolis
>
> On 11/26/2012 05:07 PM, Ralph Castain wrote:
>> What version of OMPI are you using?
>>
>> On Nov 26, 2012, at 1:02 AM, George Markomanolis <george_at_[hidden]> wrote:
>>
>>> Dear all,
>>>
>>> First, I would like advice on how to identify the maximum number of MPI processes that can be executed on a node with oversubscription. When I try to execute an application with 4096 MPI processes on a 24-core node with 48GB of memory, I get the error "Unknown error: 1", even though memory usage is not even at half. I can execute the same application with 2048 MPI processes in under one minute. I have checked the Linux settings for the maximum number of processes, and the limit is much larger than 4096.
>>>
>>> Another, more generic, question is about discovering nodes with faulty memory. Is there any way to identify such nodes? I found by accident that one node could not execute an MPI application when it used more than 12GB of RAM, while a second node with exactly the same hardware could use all 48GB. With 500+ nodes it is difficult to check them all, and I am not aware of any efficient solution. Initially I thought about memtester, but it takes a lot of time. I know this does not apply exactly to this mailing list, but I thought an Open MPI user might know something about it.
>>>
>>>
>>> Best regards,
>>> George Markomanolis
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>