Open MPI Development Mailing List Archives

From: Pak Lui (Pak.Lui_at_[hidden])
Date: 2007-06-25 11:10:56


sadfub_at_[hidden] wrote:
> Pak Lui schrieb:
>> sadfub_at_[hidden] wrote:
>>> Sorry for the late reply, but I haven't had access to the machine over the weekend.
>>>
>>>> I don't really know what this means. People have explained "loose"
>>>> vs. "tight" integration to me before, but since I'm not an SGE user,
>>>> the definitions always fall away.
>>> I *assume* loosely coupled jobs are just jobs where SGE finds some
>>> nodes to process them and, from then on, doesn't care about anything
>>> else related to the jobs. In contrast, with tightly coupled jobs SGE
>>> takes care of the sub-processes that the job may spawn, terminates
>>> them too in case of a failure, and keeps track of the specified
>>> resources.
>>>
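(For reference, the SGE-side difference usually shows up in the parallel
environment configuration. A minimal sketch, assuming a PE named "mpi" as
in the job script below; the field values are illustrative, not a
recommended setup:)

# Show the current PE definition (assumes the PE is named "mpi")
qconf -sp mpi

# For tight integration the PE typically has, among other fields:
#   control_slaves     TRUE    # lets sge_execd track/kill slave tasks
#   job_is_first_task  FALSE
# With loose integration control_slaves is FALSE and SGE only ever
# sees the master task it started.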
>>>> Based on your prior e-mail, it looks like you are always invoking
>>>> "ulimit" via "pdsh", even under SGE jobs. This is incorrect.
>>> why?
>>>
>>>> Can't you just submit an SGE job script that runs "ulimit"?
>>> #!/bin/csh -f
>>> #$ -N MPI_Job
>>> #$ -pe mpi 4
>>> hostname && ulimit -a
>>>
>>> ATM I'm quite confused: I want to use the C shell, but ulimit is a
>>> bash builtin; the C shell uses limit instead... hmm... and SGE
>>> obviously uses bash despite my request for csh in the first line.
>>> But even if I just use #!/bin/bash I get the same limits:
>>>
>>> -sh-3.00$ cat MPI_Job.o112116
>>> node02
>>> core file size (blocks, -c) unlimited
>>> data seg size (kbytes, -d) unlimited
>>> file size (blocks, -f) unlimited
>>> pending signals (-i) 1024
>>> max locked memory (kbytes, -l) 32
>>> max memory size (kbytes, -m) unlimited
>>> open files (-n) 1024
>>> pipe size (512 bytes, -p) 8
>>> POSIX message queues (bytes, -q) 819200
>>> stack size (kbytes, -s) unlimited
>>> cpu time (seconds, -t) unlimited
>>> max user processes (-u) 139264
>>> virtual memory (kbytes, -v) unlimited
>>> file locks (-x) unlimited
>>>
>>> oops => 32 kbytes... So this isn't OMPI's fault.
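(Regarding the csh vs. bash point above: a minimal sketch of the same
check for either shell, assuming the same "mpi" PE; the -S directive
asks SGE to run the job script with a specific shell:)

#!/bin/bash
#$ -N MPI_Job
#$ -S /bin/bash          # run the job script with bash explicitly
#$ -pe mpi 4
hostname && ulimit -a    # sh/bash builtin

# csh variant: use '#!/bin/csh -f' plus '#$ -S /bin/csh' and replace
# the last line with:  hostname && limit     (csh builtin)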
>> this looks like sge_execd isn't able to source the correct system
>> defaults from the limits.conf file after you applied the change.
>> Maybe you will need to restart the daemon?
>
> Yes, I posted the same question to the Sun Grid Engine mailing list,
> and as Jeff initially suspected it was the improper limits for the
> daemons (sge_execd). So I had to edit each node's init script
> (/etc/init.d/sgeexecd) and put "ulimit -l unlimited" before the line
> that starts sge_execd, then kill all the sge_execd's (running jobs
> won't be affected if you use "qconf -ke all") and restart sge_execd
> on every node. After that everything with SGE and OMPI 1.1.1 was
> fine.
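(For anyone hitting the same problem, a minimal sketch of that fix; the
service name and paths follow the description above, but the exact init
script layout and its arguments vary by installation:)

# On every execution node, edit /etc/init.d/sgeexecd and raise the
# locked-memory limit just before the line that launches the daemon:
#   ulimit -l unlimited
#   ... existing line that starts sge_execd ...

# From the qmaster, shut down the execds without touching running jobs:
qconf -ke all

# Then restart the daemon on each node so it inherits the new limit:
/etc/init.d/sgeexecd start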
>
> But for the whole question just read the small thread at:
> http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=20390
>
> At this point, big thanks to Jeff and all the others who helped me!
>
> Are there any suggestions regarding the compilation error?

Are you referring to the SEGV error below? I am assuming this is OMPI
1.1.1, so you are using the rsh PLS to launch your executables (i.e.
loose integration).

> -sh-3.00$ ompi/bin/mpirun -d -np 2 -H node03,node06 hostname
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] [0,0,0] setting up session dir with
> [headnode:23178] universe default-universe-23178
> [headnode:23178] user me
> [headnode:23178] host headnode
> [headnode:23178] jobid 0
> [headnode:23178] procid 0
> [headnode:23178] procdir:
> /tmp/openmpi-sessions-me_at_headnode_0/default-universe-23178/0/0
> [headnode:23178] jobdir:
> /tmp/openmpi-sessions-me_at_headnode_0/default-universe-23178/0
> [headnode:23178] unidir:
> /tmp/openmpi-sessions-me_at_headnode_0/default-universe-23178
> [headnode:23178] top: openmpi-sessions-me_at_headnode_0
> [headnode:23178] tmp: /tmp
> [headnode:23178] [0,0,0] contact_file
> /tmp/openmpi-sessions-me_at_headnode_0/default-universe-23178/universe-
> setup.txt
> [headnode:23178] [0,0,0] wrote setup file
> [headnode:23178] *** Process received signal ***
> [headnode:23178] Signal: Segmentation fault (11)
> [headnode:23178] Signal code: Address not mapped (1)
> [headnode:23178] Failing at address: 0x1
> [headnode:23178] [ 0] /lib64/tls/libpthread.so.0 [0x39ed80c430]
> [headnode:23178] [ 1] /lib64/tls/libc.so.6(strcmp+0) [0x39ecf6ff00]
> [headnode:23178] [ 2]
> /home/me/ompi/lib/openmpi/mca_pls_rsh.so(orte_pls_rsh_launch+0x24f)
> [0x2a9723cc7f]
> [headnode:23178] [ 3] /home/me/ompi/lib/openmpi/mca_rmgr_urm.so
> [0x2a9764fa90]
> [headnode:23178] [ 4] /home/me/ompi/bin/mpirun(orterun+0x35b)
> [0x402ca3]
> [headnode:23178] [ 5] /home/me/ompi/bin/mpirun(main+0x1b) [0x402943]
> [headnode:23178] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb)
> [0x39ecf1c3fb]
> [headnode:23178] [ 7] /home/me/ompi/bin/mpirun [0x40289a]
> [headnode:23178] *** End of error message ***
> Segmentation fault

So is it true that the SEGV only occurred under the SGE environment
and not in a normal environment? If so, I am baffled, because starting
the rsh pls under the SGE environment in 1.1.1 should be no different
from starting the rsh pls without SGE.

There seems to be only one strcmp that can fail in
orte_pls_rsh_launch(). I can only assume there is some memory
corruption when getting either ras_node->node_name or
orte_system_info.nodename for the strcmp.

https://svn.open-mpi.org/trac/ompi/browser/tags/v1.1-series/v1.1.1/orte/mca/pls/rsh/pls_rsh_module.c
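(If it is reproducible, one way to confirm which comparison is blowing
up would be to run mpirun under a debugger. A quick sketch, assuming the
same command line as above and a build with debugging symbols; the
variable name in the last step is taken from the discussion above and
may differ in the actual frame:)

gdb --args ompi/bin/mpirun -d -np 2 -H node03,node06 hostname
(gdb) run
... wait for the SIGSEGV ...
(gdb) bt                         # should show the strcmp call site in pls_rsh_module.c
(gdb) frame 1                    # or whichever frame is orte_pls_rsh_launch
(gdb) print ras_node->node_name  # inspect the suspect string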

Maybe a way to work around it is to use a more recent OMPI version. A
lot of things in ORTE have been revamped since 1.1, so I would
encourage you to try a more recent OMPI, since there may be some fixes
that didn't get brought back to 1.1. Plus, with 1.2 you should be able
to use tight integration through the gridengine module there.
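(For example, once a newer build is installed, a quick way to check
whether gridengine support was built in; the exact component list
depends on how the build was configured:)

# List the MCA components of the new installation and look for gridengine
ompi_info | grep gridengine
# On a 1.2 build with SGE support you would expect gridengine entries
# (e.g. for the ras framework); if nothing shows up, the build did not
# detect SGE at configure time.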

>
> many many thousand thanks for the great help here in the forum!

-- 
- Pak Lui
pak.lui_at_[hidden]