Eugene Loh wrote:
> Prentice Bisbal wrote:
>> Eugene Loh wrote:
>>> Prentice Bisbal wrote:
>>>> Is there a limit on how many MPI processes can run on a single host?
> Depending on which OMPI release you're using, I think you need something
> like 4*np up to 7*np (plus a few) descriptors. So, with 256, you need
> 1000+ descriptors. You're quite possibly up against your limit, though
> I don't know for sure that that's the problem here.
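
To put numbers on that estimate, here is a minimal sketch that compares the quoted 4*np to 7*np range against the current soft limit. The function name estimate_fds and the slack of 5 standing in for "plus a few" are assumptions made for illustration, not values taken from Open MPI.

```shell
# Rough descriptor estimate for np ranks on one host.
# per_rank (4 to 7) comes from the range quoted above; the default
# slack of 5 is an assumed stand-in for "plus a few".
estimate_fds() {
    local np=$1 per_rank=$2 slack=${3:-5}
    echo $(( np * per_rank + slack ))
}

np=256
soft=$(ulimit -Sn)              # current soft limit on open files
low=$(estimate_fds "$np" 4)
high=$(estimate_fds "$np" 7)
echo "np=$np needs roughly $low to $high descriptors; soft limit is $soft"

if [ "$soft" != "unlimited" ] && [ "$soft" -lt "$low" ]; then
    echo "soft limit is below even the low estimate"
fi
```

If the soft limit on that host happens to be 1024 (an assumption; the thread does not state it), the observed ceiling of about 155 ranks would work out to roughly 6.6 descriptors per rank, which falls inside the quoted range.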
> You say you're running 1.2.8. That's "a while ago", so would you
> consider updating as a first step? Among other things, newer OMPIs will
> generate a much clearer error message if the descriptor limit is the
While 1.2.8 might be "a while ago", upgrading software just because it's
"old" is not a valid argument.
I can install the latest version of OpenMPI, but it will take a little
>>>> I have a user trying to test his code on the command-line on a single
>>>> host before running it on our cluster like so:
>>>> mpirun -np X foo
>>>> When he tries to run it with a large number of processes (X = 256, 512), the
>>>> program fails, and I can reproduce this with a simple "Hello, World"
>>>> $ mpirun -np 256 mpihello
>>>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>>>> exited on signal 15 (Terminated).
>>>> 252 additional processes aborted (not shown)
>>>> I've done some testing and found that X must be < 155 for this program to work.
>>>> Is this a bug, part of the standard, or a design/implementation decision?
>>> One possible issue is the limit on the number of descriptors. The error
>>> message should be pretty helpful and descriptive, but perhaps you're
>>> using an older version of OMPI. If this is your problem, one workaround
>>> is something like this:
>>> unlimit descriptors
>>> mpirun -np 256 mpihello
>> Looks like I'm not allowed to set that as a regular user:
>> $ ulimit -n 2048
>> -bash: ulimit: open files: cannot modify limit: Operation not permitted
>> Since I am the admin, I could change that elsewhere, but I'd rather not
>> do that system-wide unless absolutely necessary.
>>> though I guess the syntax depends on what shell you're running. Another
>>> is to set the MCA parameter opal_set_max_sys_limits to 1.
>> That didn't work either:
>> $ mpirun -mca opal_set_max_sys_limits 1 -np 256 mpihello
>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>> exited on signal 15 (Terminated).
>> 252 additional processes aborted (not shown)
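
The "Operation not permitted" error from `ulimit -n 2048` quoted above is what bash reports when the requested value exceeds the hard limit: an unprivileged user can raise the soft limit only up to the hard limit. A minimal sketch for checking this, where the helper name can_raise_to and the target of 2048 are illustrative assumptions:

```shell
# Show the soft and hard open-file limits for the current shell.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "soft=$soft hard=$hard"

# Succeeds if the hard limit allows a soft limit of at least $1.
can_raise_to() {
    local h
    h=$(ulimit -Hn)
    [ "$h" = "unlimited" ] || [ "$h" -ge "$1" ]
}

want=2048
if can_raise_to "$want"; then
    # The subshell keeps the change local to this one run.
    (
        ulimit -Sn "$want"
        echo "soft limit is now $(ulimit -Sn) in this subshell"
        # mpirun -np 256 mpihello   # the job would be launched here
    )
else
    echo "hard limit $hard is below $want; raising it requires root"
fi
```

If the hard limit turns out to be the obstacle, on Linux systems that use pam_limits a per-user entry in /etc/security/limits.conf is one way to raise it without a system-wide change.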
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study