Open MPI Development Mailing List Archives

Subject: [OMPI devel] problem running mpirun and orted on same machine
From: Maurice Feskanich (maurice.feskanich_at_[hidden])
Date: 2012-02-03 13:27:48


Hi Folks,

I'm having a problem with running mpirun when one of the tasks winds up
running on the same machine as mpirun.

A little background: our system uses a plugin to send tasks to Grid
Engine. We are currently using version 1.3.4 (we are not able to move
to a newer version because of the requirements of the tools that use our
system). Our code runs on Solaris (both SPARC and x86) and Linux.

What we are seeing is that mpirun sometimes gets a segmentation violation
at line 342 of plm_base_launch_support.c:

     pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;

Our investigation found that mev->sender.vpid is one greater than the
number of non-NULL elements in the pdatorted array.

Here is the dbx stacktrace:

t@1 (l@1) program terminated by signal SEGV (no mapping at the fault
address)
Current function is process_orted_launch_report (optimized)
   342 pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
(dbx) where
current thread: t@1
=>[1] process_orted_launch_report(fd = ???, opal_event = ???, data =
???) (optimized), at 0xffffffff7f491e60 (line ~342) in
"plm_base_launch_support.c"
   [2] event_process_active(base = ???) (optimized), at
0xffffffff7f241d04 (line ~651) in "event.c"
   [3] opal_event_base_loop(base = ???, flags = ???) (optimized), at
0xffffffff7f242178 (line ~823) in "event.c"
   [4] opal_event_loop(flags = ???) (optimized), at 0xffffffff7f241f98
(line ~730) in "event.c"
   [5] opal_progress() (optimized), at 0xffffffff7f21d484 (line ~189) in
"opal_progress.c"
   [6] orte_plm_base_daemon_callback(num_daemons = ???) (optimized), at
0xffffffff7f492388 (line ~459) in "plm_base_launch_support.c"
   [7] orte_plm_dream_spawn(0x8f0ac, 0x478560, 0x47868c, 0x12c,
0xffffffff7d305198, 0x8a8c0000), at 0xffffffff7d304a5c
   [8] orterun(argc = 11, argv = 0xffffffff7fffede8), line 748 in
"orterun.c"
   [9] main(argc = 11, argv = 0xffffffff7fffede8), line 13 in "main.c"

The vpids we use when we start the orteds are 1-based, but the pdatorted
array is zero-based.
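
To make the mismatch concrete, here is a minimal sketch of the indexing
problem. It is not the Open MPI code: vpid_t, struct proc, the state
constants, and mark_running() are hypothetical stand-ins for the real ORTE
types and for the assignment at line 342. It shows how a range and NULL
check would turn the crash into a reportable error when a 1-based vpid
indexes past the end of a zero-based array.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; not the actual ORTE types. */
typedef unsigned int vpid_t;

enum proc_state { PROC_STATE_UNDEF = 0, PROC_STATE_RUNNING = 1 };

struct proc {
    enum proc_state state;
};

/*
 * Mark the daemon identified by `vpid` as running.
 *
 * `procs` is a zero-based array with `num_slots` entries, so the only
 * valid indexes are 0 .. num_slots - 1. A vpid taken from a 1-based
 * numbering can equal num_slots (or more), which is out of range.
 *
 * Returns 0 on success, -1 if the slot does not exist or is empty.
 */
static int mark_running(struct proc **procs, size_t num_slots, vpid_t vpid)
{
    if ((size_t)vpid >= num_slots || procs[vpid] == NULL) {
        fprintf(stderr, "no daemon entry for vpid %u (array has %zu slots)\n",
                vpid, num_slots);
        return -1;
    }
    procs[vpid]->state = PROC_STATE_RUNNING;
    return 0;
}
```

With a two-slot array, mark_running(procs, 2, 1) updates the second entry,
while mark_running(procs, 2, 2) returns -1 instead of dereferencing memory
past the end of the array. Whether such a check belongs in the launch
report handler, or whether the vpids we assign should simply be renumbered,
is exactly the question I am asking.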

Any help would be much appreciated. Please don't hesitate to ask
questions.

Thanks,

Maury Feskanich
   Oracle, Inc.