Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Limit to number of processes on one node?
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-03-03 17:04:18


On Mar 3, 2010, at 12:16 PM, Prentice Bisbal wrote:

> Eugene Loh wrote:
>> Prentice Bisbal wrote:
>>> Eugene Loh wrote:
>>>
>>>> Prentice Bisbal wrote:
>>>>
>>>>> Is there a limit on how many MPI processes can run on a single host?
>>>>>
>> Depending on which OMPI release you're using, I think you need something
>> like 4*np up to 7*np (plus a few) descriptors. So, with 256, you need
>> 1000+ descriptors. You're quite possibly up against your limit, though
>> I don't know for sure that that's the problem here.
>>
>> You say you're running 1.2.8. That's "a while ago", so would you
>> consider updating as a first step? Among other things, newer OMPIs will
>> generate a much clearer error message if the descriptor limit is the
>> problem.
>
> While 1.2.8 might be "a while ago", upgrading software just because it's
> "old" is not a valid argument.
>
> I can install the latest version of OpenMPI, but it will take a little
> while.

Age alone may not be a reason to upgrade, but Eugene is correct: the old versions of OMPI required more file descriptors than the newer versions.

That said, even with the latest release you'll still need a minimum of 4x as many file descriptors as there are procs on the node. I suggest talking to your sys admin about getting the limit increased. It sounds like it has been set unrealistically low.
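
As a rough sanity check, the arithmetic above can be compared against the limits of the current shell. This is only a sketch under stated assumptions: it assumes bash, the helper names estimate_fds and limit_ok are made up for illustration, and the 4x and 7x multipliers are simply the figures quoted in this thread. The "plus a few" extra descriptors are not quantified anywhere here, so they are left out.

```shell
#!/bin/bash
# Sketch only: estimate descriptor needs for np local ranks and compare
# them with this shell's open-file limits.
# estimate_fds and limit_ok are hypothetical helper names.

# Print procs * per_proc.
# $1 = number of processes, $2 = assumed descriptors per process
estimate_fds() {
    echo $(( $1 * $2 ))
}

# Succeed if the limit covers the need; "unlimited" always does.
# $1 = limit as printed by ulimit -n, $2 = descriptors needed
limit_ok() {
    [ "$1" = "unlimited" ] || [ "$1" -ge "$2" ]
}

np=${1:-256}                      # default matches the failing run
low=$(estimate_fds "$np" 4)       # 4x figure from this thread
high=$(estimate_fds "$np" 7)      # 7x figure from this thread
soft=$(ulimit -Sn)                # current soft limit on open files
hard=$(ulimit -Hn)                # ceiling a regular user can raise to

echo "np=$np needs roughly $low to $high descriptors (plus a few)"
echo "soft limit: $soft, hard limit: $hard"

if limit_ok "$soft" "$low"; then
    echo "soft limit covers the 4x estimate"
elif limit_ok "$hard" "$low"; then
    echo "soft limit is too low, but 'ulimit -n $low' should be permitted"
else
    echo "hard limit is too low; only an admin can raise it"
fi
```

A regular user can raise the soft limit only as far as the hard limit, which would be consistent with the "Operation not permitted" error quoted below if the hard limit on that host is under 2048. Running ulimit -n in the same shell just before mpirun affects only that shell and its children, not the whole system.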

>
>
>>>>> I have a user trying to test his code on the command-line on a single
>>>>> host before running it on our cluster like so:
>>>>>
>>>>> mpirun -np X foo
>>>>>
>>>>> When he tries to run it on a large number of processes (X = 256, 512),
>>>>> the program fails, and I can reproduce this with a simple "Hello, World"
>>>>> program:
>>>>>
>>>>> $ mpirun -np 256 mpihello
>>>>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>>>>> exited on signal 15 (Terminated).
>>>>> 252 additional processes aborted (not shown)
>>>>>
>>>>> I've done some testing and found that X must be less than 155 for this
>>>>> program to work. Is this a bug, part of the standard, or a
>>>>> design/implementation decision?
>>>>>
>>>>>
>>>>>
>>>> One possible issue is the limit on the number of descriptors. The error
>>>> message should be pretty helpful and descriptive, but perhaps you're
>>>> using an older version of OMPI. If this is your problem, one workaround
>>>> is something like this:
>>>>
>>>> unlimit descriptors
>>>> mpirun -np 256 mpihello
>>>>
>>>
>>> Looks like I'm not allowed to set that as a regular user:
>>>
>>> $ ulimit -n 2048
>>> -bash: ulimit: open files: cannot modify limit: Operation not permitted
>>>
>>> Since I am the admin, I could change that elsewhere, but I'd rather not
>>> do that system-wide unless absolutely necessary.
>>>
>>>> though I guess the syntax depends on what shell you're running. Another
>>>> is to set the MCA parameter opal_set_max_sys_limits to 1.
>>>>
>>> That didn't work either:
>>>
>>> $ mpirun -mca opal_set_max_sys_limits 1 -np 256 mpihello
>>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>>> exited on signal 15 (Terminated).
>>> 252 additional processes aborted (not shown)
>>>
>>>
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Prentice Bisbal
> Linux Software Support Specialist/System Administrator
> School of Natural Sciences
> Institute for Advanced Study
> Princeton, NJ