Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpi problems/many cpus per node
From: Daniel Davidson (danield_at_[hidden])
Date: 2012-12-14 16:52:56


Oddly enough, adding this debugging info lowered the number of
processes that can be used from 46 down to 42. When I run the MPI job, it
fails, giving only the output that follows (a possible follow-up limit
check is sketched after the output):

[root_at_compute-2-1 ssh]# /home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -v -np 44 --leave-session-attached -mca
odls_base_verbose 5 hostname
[compute-2-1.local:44374] mca:base:select:( odls) Querying component
[default]
[compute-2-1.local:44374] mca:base:select:( odls) Query of component
[default] set priority to 1
[compute-2-1.local:44374] mca:base:select:( odls) Selected component
[default]
[compute-2-0.local:28950] mca:base:select:( odls) Querying component
[default]
[compute-2-0.local:28950] mca:base:select:( odls) Query of component
[default] set priority to 1
[compute-2-0.local:28950] mca:base:select:( odls) Selected component
[default]
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
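
A quick follow-up check, offered here as a sketch rather than something
from the thread: have mpirun start a shell that prints the limits each
launched process actually inherits, since a non-interactive ssh shell can
end up with lower limits than the interactive ulimit -a quoted below.
This assumes the nodes run Linux and that mpirun still works at small
process counts:

# sketch: print the limits the launched processes actually inherit
/home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 2 \
    sh -c 'echo "$(hostname): open files=$(ulimit -n), max procs=$(ulimit -u)"'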

On 12/14/2012 03:18 PM, Ralph Castain wrote:
> It wouldn't be ssh - in both cases, only one ssh is being done to each node (to start the local daemon). The only difference is the number of fork/exec's being done on each node, and the number of file descriptors being opened to support those fork/exec's.
>
> It certainly looks like your limits are high enough. When you say it "fails", what do you mean - what error does it report? Try adding:
>
> --leave-session-attached -mca odls_base_verbose 5
>
> to your cmd line - this will report all of the local process-launch debug output and hopefully show you a more detailed error report.
>
>
> On Dec 14, 2012, at 12:29 PM, Daniel Davidson <danield_at_[hidden]> wrote:
>
>> I have had to cobble together two machines in our Rocks cluster without using the standard installation: they have EFI-only BIOS, which Rocks doesn't like, so this was the only workaround.
>>
>> Everything works great now, except for one thing. MPI jobs (Open MPI or MPICH) fail when started from one of these nodes (via qsub, or by logging in and running the command) if 24 or more processes are needed on another node. However, if the originator of the MPI job is the head node or any of the preexisting compute nodes, it works fine. Right now I am guessing ssh client or ulimit problems, but I cannot find any difference. Any help would be greatly appreciated.
>>
>> compute-2-1 and compute-2-0 are the new nodes
>>
>> Examples:
>>
>> This works and prints 23 hostnames from each machine:
>> [root_at_compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 46 hostname
>>
>> This does not work; it prints only 24 hostnames, all from compute-2-1:
>> [root_at_compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 48 hostname
>>
>> These both work and print 64 hostnames from each node:
>> [root_at_biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 128 hostname
>> [root_at_compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 128 hostname
>>
>> [root_at_compute-2-1 ~]# ulimit -a
>> core file size (blocks, -c) 0
>> data seg size (kbytes, -d) unlimited
>> scheduling priority (-e) 0
>> file size (blocks, -f) unlimited
>> pending signals (-i) 16410016
>> max locked memory (kbytes, -l) unlimited
>> max memory size (kbytes, -m) unlimited
>> open files (-n) 4096
>> pipe size (512 bytes, -p) 8
>> POSIX message queues (bytes, -q) 819200
>> real-time priority (-r) 0
>> stack size (kbytes, -s) unlimited
>> cpu time (seconds, -t) unlimited
>> max user processes (-u) 1024
>> virtual memory (kbytes, -v) unlimited
>> file locks (-x) unlimited
>>
>> [root_at_compute-2-1 ~]# more /etc/ssh/ssh_config
>> Host *
>>     CheckHostIP no
>>     ForwardX11 yes
>>     ForwardAgent yes
>>     StrictHostKeyChecking no
>>     UsePrivilegedPort no
>>     Protocol 2,1
>>
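
Since, as the quoted reply notes, the fork/exec's and the file
descriptors that support them happen inside the local daemon, the limits
that matter are the daemon's own. On Linux they can be read from /proc
while a job is running; a minimal sketch, assuming the Open MPI daemon
appears in the process list as orted and that a job is alive when you
look:

# sketch: show the effective limits of the running Open MPI daemon
cat /proc/$(pgrep -f orted | head -n 1)/limits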
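
If the inherited limits really are the 4096 open files and 1024 max user
processes shown in the quoted ulimit -a, one standard way to raise them
on the compute nodes is /etc/security/limits.conf. The values below are
illustrative only, not a confirmed fix; note that on RHEL/CentOS 6-era
systems /etc/security/limits.d/90-nproc.conf may also cap nproc at 1024
and would need the same change:

# /etc/security/limits.conf -- example values, not a confirmed fix
*    soft    nofile    65536
*    hard    nofile    65536
*    soft    nproc     65536
*    hard    nproc     65536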