Open MPI Development Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-06-22 14:45:21


On Jun 22, 2007, at 10:44 AM, sadfub_at_[hidden] wrote:

>> Can you send more information on this? See
>> http://www.open-mpi.org/community/help/
>
> -sh-3.00$ ompi/bin/mpirun -d -np 2 -H node03,node06 hostname
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] [0,0,0] setting up session dir with
> [headnode:23178] universe default-universe-23178
> [headnode:23178] user me
> [headnode:23178] host headnode
> [headnode:23178] jobid 0
> [headnode:23178] procid 0
> [headnode:23178] procdir:
> /tmp/openmpi-sessions-me_at_headnode_0/default-universe-23178/0/0
> [headnode:23178] jobdir:
> /tmp/openmpi-sessions-me_at_headnode_0/default-universe-23178/0
> [headnode:23178] unidir:
> /tmp/openmpi-sessions-me_at_headnode_0/default-universe-23178
> [headnode:23178] top: openmpi-sessions-me_at_headnode_0
> [headnode:23178] tmp: /tmp
> [headnode:23178] [0,0,0] contact_file
> /tmp/openmpi-sessions-me_at_headnode_0/default-universe-23178/universe-setup.txt
> [headnode:23178] [0,0,0] wrote setup file
> [headnode:23178] *** Process received signal ***
> [headnode:23178] Signal: Segmentation fault (11)
> [headnode:23178] Signal code: Address not mapped (1)
> [headnode:23178] Failing at address: 0x1
> [headnode:23178] [ 0] /lib64/tls/libpthread.so.0 [0x39ed80c430]
> [headnode:23178] [ 1] /lib64/tls/libc.so.6(strcmp+0) [0x39ecf6ff00]
> [headnode:23178] [ 2]
> /home/me/ompi/lib/openmpi/mca_pls_rsh.so(orte_pls_rsh_launch+0x24f)
> [0x2a9723cc7f]
> [headnode:23178] [ 3] /home/me/ompi/lib/openmpi/mca_rmgr_urm.so
> [0x2a9764fa90]
> [headnode:23178] [ 4] /home/me/ompi/bin/mpirun(orterun+0x35b)
> [0x402ca3]
> [headnode:23178] [ 5] /home/me/ompi/bin/mpirun(main+0x1b) [0x402943]
> [headnode:23178] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb)
> [0x39ecf1c3fb]
> [headnode:23178] [ 7] /home/me/ompi/bin/mpirun [0x40289a]
> [headnode:23178] *** End of error message ***
> Segmentation fault

This should not happen -- this is [obviously] even before any MPI
processing starts. Are you inside an SGE job here?

Pak/Ralph: any ideas?

>> Launch an SGE job that calls the shell command "limit" (if you run C-
>> shell variants) or "ulimit -l" (if you run Bourne shell variants).
>> Ensure that the output is "unlimited".
>
> I've done that already, but how do I distinguish between tightly coupled
> job ulimits and loosely coupled job ulimits? I tested passing
> $TMPDIR/machines to a shell script which in turn runs "ulimit -a",
> *assuming* this counts as a tightly coupled job, and each node
> returned unlimited. The same happens without $TMPDIR/machines. Even the
> headnode is set to unlimited.

I don't really know what this means. People have explained "loose"
vs. "tight" integration to me before, but since I'm not an SGE user,
the definitions always fall away.

Based on your prior e-mail, it looks like you are always invoking
"ulimit" via "pdsh", even under SGE jobs. This is incorrect. Can't
you just submit an SGE job script that runs "ulimit"?
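
For illustration, here is a minimal sketch of such a job script. The
file name check_limits.sh and the report_limits helper are made up for
this example, and the "#$" directives are ordinary qsub options that
you may need to adapt to your site. The point is only that "ulimit"
runs inside the environment SGE itself sets up, rather than in a pdsh
login:

```shell
#!/bin/sh
# check_limits.sh -- hypothetical SGE job script (name is an assumption).
# Lines starting with "#$" are read by qsub as embedded options; to the
# shell they are plain comments.
#$ -cwd
#$ -j y
#$ -S /bin/sh

# Print where we are running and the limits this shell actually inherited.
report_limits() {
    echo "host:           $(uname -n)"
    # JOB_ID is exported by SGE inside a job; "unset" means we are
    # probably not running under SGE at all.
    echo "SGE job id:     ${JOB_ID:-unset}"
    echo "max locked mem: $(ulimit -l)"
}

report_limits
```

Submitting it with "qsub check_limits.sh" and reading the job's output
file shows the limits on the node where the job script runs. It does
not start anything on the other nodes of a parallel job; checking those
would need a task launched there through SGE, which this sketch does
not attempt.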

>> What are the limits of the user that launches the SGE daemons? I.e.,
>> did the SGE daemons get started with proper "unlimited" limits? If
>> not, that could hamper SGE's ability to set the limits that you told
>
> The limits in /etc/security/limits.conf apply to all users (using a
> '*'), hence the SGE processes and daemons shouldn't have any limits.

Not really. limits.conf is not universally applied; it is read by PAM
(the pam_limits module) when a login session is set up. So for daemons
that start via /etc/init.d scripts (or whatever the equivalent is on
your system), PAM limits are not necessarily applied. For example, I
had to manually insert a "ulimit -Hl unlimited" in the startup script
for my SLURM daemons.
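
As a rough sketch of what that kind of insertion can look like (the
raise_memlock_limit helper is invented for this example and is not
taken from any real SLURM or SGE startup script), a fragment near the
top of an init script might be:

```shell
#!/bin/sh
# Hypothetical fragment for the top of a daemon startup script.
# Init scripts do not go through a PAM login session, so limits.conf is
# not consulted; the limit has to be raised explicitly before the daemon
# is started so that the daemon and its children inherit it.
raise_memlock_limit() {
    # Raise the hard limit first, then the soft limit. Raising the hard
    # limit normally requires root privileges.
    if ulimit -Hl unlimited 2>/dev/null && ulimit -Sl unlimited 2>/dev/null; then
        echo "locked-memory limit is now: $(ulimit -l)"
    else
        echo "warning: could not raise locked-memory limit (still $(ulimit -l))" >&2
        return 1
    fi
}

# Do not abort the startup script if the limit cannot be raised.
raise_memlock_limit || true

# ... the real script would start the daemon here ...
```

The fallback branch is there because the call fails when the script is
not run as root; in that case the daemon would start with whatever
limit it inherited.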

-- 
Jeff Squyres
Cisco Systems