Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Limit to number of processes on one node?
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-03-04 09:56:37


On Mar 4, 2010, at 7:51 AM, Prentice Bisbal wrote:

>
>
> Ralph Castain wrote:
>> On Mar 4, 2010, at 7:27 AM, Prentice Bisbal wrote:
>>
>>>
>>> Ralph Castain wrote:
>>>> On Mar 3, 2010, at 12:16 PM, Prentice Bisbal wrote:
>>>>
>>>>> Eugene Loh wrote:
>>>>>> Prentice Bisbal wrote:
>>>>>>> Eugene Loh wrote:
>>>>>>>
>>>>>>>> Prentice Bisbal wrote:
>>>>>>>>
>>>>>>>>> Is there a limit on how many MPI processes can run on a single host?
>>>>>>>>>
>>>>>> Depending on which OMPI release you're using, I think you need something
>>>>>> like 4*np up to 7*np (plus a few) descriptors. So, with 256, you need
>>>>>> 1000+ descriptors. You're quite possibly up against your limit, though
>>>>>> I don't know for sure that that's the problem here.
>>>>>>
>>>>>> You say you're running 1.2.8. That's "a while ago", so would you
>>>>>> consider updating as a first step? Among other things, newer OMPIs will
>>>>>> generate a much clearer error message if the descriptor limit is the
>>>>>> problem.
>>>>> While 1.2.8 might be "a while ago", upgrading software just because it's
>>>>> "old" is not a valid argument.
>>>>>
>>>>> I can install the latest version of OpenMPI, but it will take a little
>>>>> while.
>>>> Maybe not because it is "old", but Eugene is correct. The old versions of OMPI required more file descriptors than the newer versions.
>>>>
>>>> That said, you'll still need a minimum of 4x the number of procs on the node even with the latest release. I suggest talking to your sys admin about getting the limit increased. It sounds like it has been set unrealistically low.
>>>>
>>>>
>>> I *am* the system admin! ;)
>>>
>>> The file descriptor limit is the default for RHEL, 1024, so I would not
>>> characterize it as "unrealistically low". I assume someone with much
>>> more knowledge of OS design and administration than me came up with this
>>> default, so I'm hesitant to change it without good reason. If there was
>>> good reason, I'd have no problem changing it. I have read that setting
>>> it to more than 8192 can lead to system instability.
>>
>> Never heard that, and most HPC systems have it set a great deal higher without trouble.
>
> I just read that the other day. Not sure where, though. Probably a forum
> posting somewhere. I'll take your word for it that it's safe to increase
> if necessary.
>>
>> However, the choice is yours. If you have a large SMP system, you'll eventually be forced to change it or severely limit its usefulness for MPI. RHEL sets it that low arbitrarily as a way of saving memory by keeping the fd table small, not because the OS can't handle it.
>>
>> Anyway, that is the problem. Nothing we (or any MPI) can do about it as the fd's are required for socket-based communications and to forward I/O.
>
> Thanks, Ralph, that's exactly the answer I was looking for - where this
> limit was coming from.
>
> I can see how on a large SMP system the fd limit would have to be
> increased. In normal circumstances, my cluster nodes should never have
> more than 8 MPI processes running at once (per node), so I shouldn't be
> hitting that limit on my cluster.

Ah, okay! That helps a great deal in figuring out what to advise you. In your earlier note, it sounded like you were running all 512 procs on one node, so I assumed you had a large single-node SMP.

In this case, though, the problem is solely that you are using the 1.2 series. In that series, mpirun and each process opened many more sockets to all processes in the job. That's why you are overrunning your limit.
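
One way to see where a node stands against that limit is to read the per-process descriptor limits directly. The sketch below uses Python's standard `resource` module (Unix only); the helper name `fd_limit_report` and the sample figure of 1100 descriptors are illustrative assumptions, not anything taken from Open MPI.

```python
import resource


def fd_limit_report(needed):
    """Compare a descriptor estimate against this process's limits.

    `needed` is a hypothetical estimate supplied by the caller.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    return {
        "soft": soft,
        "hard": hard,
        # Fits under the limit currently in force.
        "fits_soft": soft == unlimited or needed <= soft,
        # Could fit if the soft limit were raised as far as the hard limit.
        "fits_hard": hard == unlimited or needed <= hard,
    }


if __name__ == "__main__":
    # 1100 is an assumed sample value, not a measured figure.
    print(fd_limit_report(1100))
```

An unprivileged user can raise the soft limit only as far as the hard limit, which would explain the "Operation not permitted" response to `ulimit -n 2048` quoted further down if both limits are 1024.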

Starting with 1.3, the number of sockets opened on each node is only 3 times the number of procs on that node, plus a couple for the daemon. If you are using TCP for MPI communications, then each MPI connection will open another socket, since these messages are sent directly rather than routed.
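
The arithmetic above can be sketched as follows. This is only a rough estimate built from the figures in this message: the function name, the default of 2 for "a couple for the daemon", and the `tcp_connections` parameter are assumptions for illustration, and the real count depends on the release and the transport in use.

```python
def estimate_descriptors(procs_on_node, per_proc=3, daemon_extra=2,
                         tcp_connections=0):
    """Rough descriptor estimate for one node.

    per_proc=3 follows the "3 times the number of procs" figure for 1.3+;
    daemon_extra=2 is an assumed value for "a couple for the daemon";
    tcp_connections adds one socket per direct TCP connection.
    """
    return per_proc * procs_on_node + daemon_extra + tcp_connections


if __name__ == "__main__":
    for procs in (8, 256, 512):
        print(procs, estimate_descriptors(procs))
```

Setting `per_proc=4` with `daemon_extra=0` for 256 processes gives 1024, which already reaches the default limit of 1024 mentioned earlier in the thread.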

Upgrading to the 1.4 series should resolve the problem you saw.

HTH
Ralph

>
>>
>>
>>> This is admittedly an unusual situation - in normal use, no one would ever
>>> want to run that many processes on a single system - so I don't see any
>>> justification for modifying that setting.
>>>
>>> Yesterday I spoke to the researcher who originally asked me about this limit -
>>> he just wanted to know what the limit was, and doesn't actually plan to
>>> do any "real" work with that many processes on a single node, rendering
>>> this whole discussion academic.
>>>
>>> I did install OpenMPI 1.4.1 yesterday, but I haven't had a chance to
>>> test it yet. I'll post the results of testing here.
>>>
>>>>>>>>> I have a user trying to test his code on the command-line on a single
>>>>>>>>> host before running it on our cluster like so:
>>>>>>>>>
>>>>>>>>> mpirun -np X foo
>>>>>>>>>
>>>>>>>>> When he tries to run it on a large number of processes (X = 256, 512), the
>>>>>>>>> program fails, and I can reproduce this with a simple "Hello, World"
>>>>>>>>> program:
>>>>>>>>>
>>>>>>>>> $ mpirun -np 256 mpihello
>>>>>>>>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>>>>>>>>> exited on signal 15 (Terminated).
>>>>>>>>> 252 additional processes aborted (not shown)
>>>>>>>>>
>>>>>>>>> I've done some testing and found that X must be < 155 for this program to work.
>>>>>>>>> Is this a bug, part of the standard, or design/implementation decision?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> One possible issue is the limit on the number of descriptors. The error
>>>>>>>> message should be pretty helpful and descriptive, but perhaps you're
>>>>>>>> using an older version of OMPI. If this is your problem, one workaround
>>>>>>>> is something like this:
>>>>>>>>
>>>>>>>> unlimit descriptors
>>>>>>>> mpirun -np 256 mpihello
>>>>>>>>
>>>>>>> Looks like I'm not allowed to set that as a regular user:
>>>>>>>
>>>>>>> $ ulimit -n 2048
>>>>>>> -bash: ulimit: open files: cannot modify limit: Operation not permitted
>>>>>>>
>>>>>>> Since I am the admin, I could change that elsewhere, but I'd rather not
>>>>>>> do that system-wide unless absolutely necessary.
>>>>>>>
>>>>>>>> though I guess the syntax depends on what shell you're running. Another
>>>>>>>> is to set the MCA parameter opal_set_max_sys_limits to 1.
>>>>>>>>
>>>>>>> That didn't work either:
>>>>>>>
>>>>>>> $ mpirun -mca opal_set_max_sys_limits 1 -np 256 mpihello
>>>>>>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>>>>>>> exited on signal 15 (Terminated).
>>>>>>> 252 additional processes aborted (not shown)
>>>
>>> --
>>> Prentice Bisbal
>>> Linux Software Support Specialist/System Administrator
>>> School of Natural Sciences
>>> Institute for Advanced Study
>>> Princeton, NJ
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users