
Subject: Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
From: Joshua Baker-LePain (jlb17_at_[hidden])
Date: 2012-03-14 18:48:51


On Wed, 14 Mar 2012 at 6:31pm, Reuti wrote:

> I just tested with two different queues on two machines and a small
> mpihello and it is working as expected.

At this point the narrative is getting very confusing, even for me. So I
tried to find a clear-cut case where I can change one thing to flip
between "it works" and "it doesn't":

Case "it works":
  o Set up 2 queues -- lab.q and test.q. Both run at priority 0. lab.q has
    slots=cores on each host, test.q has 1 slot per host. (A rough sketch of
    the queue/PE config is below, after the ps output.)

  o Submit job via:
    qsub -q "lab.q|test.q" -l mem_free=150M -pe ompi 64 jobscript.sh

  o Job runs just fine. Running 'ps aufx' on one of the nodes shows 2 orted
    processes, one with 4 children (the processes running in the lab.q
    slots) and one with 1 child (the process running in the test.q slot),
    all happily running (caution: very long lines ahead):

sge 9673 0.0 0.0 14224 1204 ? S 14:31 0:00 \_ sge_shepherd-6997934 -bg
root 9674 0.0 0.0 11272 892 ? Ss 14:31 0:00 | \_ /ccpr1/sge6/utilbin/lx24-amd64/rshd -l
jlb 9677 0.0 0.0 8988 700 ? S 14:31 0:00 | \_ /ccpr1/sge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/opt95/active_jobs/6997934.1/1.opt95
jlb 9679 0.1 0.0 47932 2008 ? S 14:31 0:00 | \_ orted -mca ess env -mca orte_ess_jobid 1517355008 -mca orte_ess_vpid 5 -mca orte_ess_num_procs 24 --hnp-uri 1517355008.0;tcp://172.19.12.104:47527
jlb 9690 53.6 0.0 157376 3832 ? R 14:31 0:02 | \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
jlb 9691 50.8 0.0 157376 3832 ? R 14:31 0:02 | \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
jlb 9692 37.0 0.0 157376 3828 ? R 14:31 0:01 | \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
jlb 9693 49.2 0.0 157376 3824 ? R 14:31 0:02 | \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
sge 9675 0.0 0.0 14228 1208 ? S 14:31 0:00 \_ sge_shepherd-6997934 -bg
root 9676 0.0 0.0 11268 888 ? Ss 14:31 0:00 \_ /ccpr1/sge6/utilbin/lx24-amd64/rshd -l
jlb 9678 0.0 0.0 8992 708 ? S 14:31 0:00 \_ /ccpr1/sge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/opt95/active_jobs/6997934.1/2.opt95
jlb 9680 0.0 0.0 47932 2000 ? S 14:31 0:00 \_ orted -mca ess env -mca orte_ess_jobid 1517355008 -mca orte_ess_vpid 6 -mca orte_ess_num_procs 24 --hnp-uri 1517355008.0;tcp://172.19.12.104:47527
jlb 9689 36.8 0.0 89776 3672 ? R 14:31 0:01 \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
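
For reference, the queue/PE setup behind that qsub line looks roughly like
the following. This is hand-abridged 'qconf -sq'/'qconf -sp' output, so
treat the attribute values (other than the slots settings described above --
4 cores, hence 4 slots, on the lab.q host shown in the ps output) as
illustrative rather than a paste from my cluster:

  $ qconf -sq lab.q | egrep 'qname|pe_list|priority|slots'
  qname       lab.q
  pe_list     ompi
  priority    0
  slots       4

  $ qconf -sq test.q | egrep 'qname|pe_list|priority|slots'
  qname       test.q
  pe_list     ompi
  priority    0
  slots       1

  $ qconf -sp ompi
  pe_name            ompi
  slots              9999
  allocation_rule    $fill_up
  control_slaves     TRUE
  job_is_first_task  FALSE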

Case "it doesn't":
  o Take the above queue setup and simply change test.q to have 2 slots per
    host. (The equivalent qconf one-liner is below, after the ps output.)

  o Submit job with the same qsub line.

  o Job crashes. I had 'ps aufx' running in a continuous loop on one of the
    nodes. This was the last output that showed the job processes. Note
    that the actual mpihello processes never got into the "R" state:

sge 12423 0.0 0.0 14224 1196 ? S 14:41 0:00 \_ sge_shepherd-6997938 -bg
root 12425 0.0 0.0 11272 896 ? Ss 14:41 0:00 | \_ /ccpr1/sge6/utilbin/lx24-amd64/rshd -l
jlb 12428 0.0 0.0 8988 700 ? S 14:41 0:00 | \_ /ccpr1/sge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/opt65/active_jobs/6997938.1/1.opt65
jlb 12430 0.0 0.0 47932 2016 ? S 14:41 0:00 | \_ orted -mca ess env -mca orte_ess_jobid 1468006400 -mca orte_ess_vpid 7 -mca orte_ess_num_procs 20 --hnp-uri 1468006400.0;tcp://172.19.12.104:39940
jlb 12798 1.0 0.0 153244 3752 ? S 14:41 0:00 | \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
jlb 12799 2.0 0.0 153244 3752 ? S 14:41 0:00 | \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
jlb 12800 1.0 0.0 153244 3752 ? S 14:41 0:00 | \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
sge 12436 0.0 0.0 14228 1208 ? S 14:41 0:00 \_ sge_shepherd-6997938 -bg
root 12437 0.0 0.0 11268 884 ? Ss 14:41 0:00 \_ /ccpr1/sge6/utilbin/lx24-amd64/rshd -l
jlb 12439 0.0 0.0 8992 712 ? S 14:41 0:00 \_ /ccpr1/sge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/opt65/active_jobs/6997938.1/2.opt65
jlb 12441 0.1 0.0 47932 2012 ? S 14:41 0:00 \_ orted -mca ess env -mca orte_ess_jobid 1468006400 -mca orte_ess_vpid 8 -mca orte_ess_num_procs 20 --hnp-uri 1468006400.0;tcp://172.19.12.104:39940
jlb 12795 1.0 0.0 153100 3128 ? S 14:41 0:00 \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
jlb 12796 2.0 0.0 153232 3752 ? S 14:41 0:00 \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
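
To be precise, the only change from the working setup is equivalent to this
single qconf command (assuming stock SGE syntax for modifying a queue
attribute):

  # bump test.q from 1 to 2 slots per host; lab.q and the PE stay untouched
  $ qconf -mattr queue slots 2 test.q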

> Joshua: the CentOS 6 install is the same on all nodes, and you recompiled
> the application against the library version you're actually running? By
> "threads" you refer to "processes"?

All the nodes are installed from the same kickstart file and kept fully
up to date. And, yes, the application is compiled against the exact
library I'm running it with.
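
Concretely, the check amounts to something like this (the install prefix
and source file name below are placeholders, not my real paths):

  # rebuild the test program with the mpicc from the same Open MPI install
  # the job uses ("/path/to/..." is a placeholder)
  $ /path/to/openmpi-1.4.3-debug/bin/mpicc -g -o mpihello-long.ompi-1.4.3-debug mpihello.c

  # confirm which libmpi the binary resolves to at run time
  $ ldd mpihello-long.ompi-1.4.3-debug | grep libmpi

  # and that the runtime on the nodes reports the matching version
  $ ompi_info | grep "Open MPI:"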

Thanks again to all for looking at this.

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF