Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Trouble running OpenMPI compiled for x86_64 (either m32 or m64)
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-07-29 14:30:02


I'm afraid we were unable to support xgrid after the 1.2 series as no
developer had access to an xgrid server. I recently received a complimentary
copy of OSX-server from Apple, and I expect to restore xgrid support at some
point in the 1.5 series.

It looks like you are hitting some issue with 1.2 relating to a change in
xgrid between OSX versions. I personally won't be going back that far to
deal with xgrid issues, so I would suggest sticking with 10.5 if xgrid
support is required.

Alternatively, you can just use OMPI's rsh support to do the launch. Get an
xgrid allocation (I don't know enough about xgrid yet to tell you all the
details), create a hostfile listing the allocated hosts, and then run
mpirun -hostfile <file> -mca plm rsh ... (assuming you use OMPI 1.4.x).
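
A rough sketch of that rsh-based launch follows. The host names and slot
counts are made up for illustration and would need to be replaced with
whatever your xgrid allocation actually gives you; "-np 4 hello" is taken from
the report quoted below. The script only writes the hostfile and prints the
command, so nothing is launched until you run the printed line yourself.

```shell
#!/bin/sh
# Sketch only: node01/node02 and slots=2 are hypothetical placeholders,
# not values from a real xgrid allocation.
hostfile="${TMPDIR:-/tmp}/xgrid_hosts.txt"

# One line per host; "slots" is the number of processes to place there.
cat > "$hostfile" <<'EOF'
node01.example.com slots=2
node02.example.com slots=2
EOF

# "-mca plm rsh" restricts the launcher selection to the rsh component
# (OMPI 1.4.x naming, as assumed in this reply).
launch_cmd="mpirun -hostfile $hostfile -mca plm rsh -np 4 hello"

# Print the command rather than running it.
echo "$launch_cmd"
```

Because the launcher is named explicitly on the command line, mpirun should
not pick a different one on its own. rsh/ssh access to the listed hosts
without a password prompt is assumed.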

On Thu, Jul 29, 2010 at 12:20 PM, Beatty, Daniel D CIV NAVAIR, 474300D <
daniel.beatty_at_[hidden]> wrote:

> Greetings all,
> I am running into some trouble using OpenMPI with OSX 10.6.4 in a
> Kerberized XGrid environment. Note, I did not have this trouble before in
> the OSX 10.5.8 Kerberized XGrid environment.
>
> The pattern of this trouble is as follows:
> 1. User submits an MPI job by entering "mpirun -np 4 hello", using a simple
> hello world MPI example.
> 2. mpirun will submit the job to XGrid.
> 3. A set of orted jobs get distributed to the machines, under the
> kerberized user's name.
> 4. In the case of the OpenMPI 1.2.8, 1.2.3 compiled for gfortran, 1.2.8
> compiled for gfortran, and 1.2.9 that comes with OSX 10.6.4, it will
> actually spawn the processes on the machine.
>
> It comes back with the following exception:
> 2010-07-29 10:25:49.063 mpirun[949:903] *** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '*** -[NSKVONotifying_XGConnection<0x100130f30> finalize]: called when collecting not enabled'
> *** Call stack at first throw:
> (
>  0  CoreFoundation       0x00007fff811f2cc4 __exceptionPreprocess + 180
>  1  libobjc.A.dylib      0x00007fff851820f3 objc_exception_throw + 45
>  2  CoreFoundation       0x00007fff8120d9f1 -[NSObject(NSObject) finalize] + 129
>  3  mca_pls_xgrid.so     0x0000000100297ce3 -[PlsXGridClient dealloc] + 419
>  4  mca_pls_xgrid.so     0x0000000100297837 orte_pls_xgrid_finalize + 40
>  5  libopen-rte.0.dylib  0x000000010002d0f9 orte_pls_base_close + 249
>  6  libopen-rte.0.dylib  0x0000000100012027 orte_system_finalize + 119
>  7  libopen-rte.0.dylib  0x000000010000e968 orte_finalize + 40
>  8  mpirun               0x00000001000011ff orterun + 2042
>  9  mpirun               0x0000000100000a03 main + 27
> 10  mpirun               0x00000001000009e0 start + 52
> 11  ???                  0x0000000000000004 0x0 + 4
> )
> terminate called after throwing an instance of 'NSException'
> [bigmac:00949] *** Process received signal ***
> [bigmac:00949] Signal: Abort trap (6)
> [bigmac:00949] Signal code: (0)
> [bigmac:00949] [ 0] 2 libSystem.B.dylib 0x00007fff833e435a _sigtramp + 26
> [bigmac:00949] [ 1] 3 ??? 0x00007fff5fbff500 0x0 + 140734799803648
> [bigmac:00949] [ 2] 4 libstdc++.6.dylib 0x00007fff80e525d2 __tcf_0 + 0
> [bigmac:00949] [ 3] 5 libobjc.A.dylib 0x00007fff85185d29 _objc_terminate + 100
> [bigmac:00949] [ 4] 6 libstdc++.6.dylib 0x00007fff80e50ae1 _ZN10__cxxabiv111__terminateEPFvvE + 11
> [bigmac:00949] [ 5] 7 libstdc++.6.dylib 0x00007fff80e50b16 _ZN10__cxxabiv112__unexpectedEPFvvE + 0
> [bigmac:00949] [ 6] 8 libstdc++.6.dylib 0x00007fff80e50bfc _ZL23__gxx_exception_cleanup19_Unwind_Reason_CodeP17_Unwind_Exception + 0
> [bigmac:00949] [ 7] 9 libobjc.A.dylib 0x00007fff85182192 object_getIvar + 0
> [bigmac:00949] [ 8] 10 CoreFoundation 0x00007fff8120d9f1 -[NSObject(NSObject) finalize] + 129
> [bigmac:00949] [ 9] 11 mca_pls_xgrid.so 0x0000000100297ce3 -[PlsXGridClient dealloc] + 419
> [bigmac:00949] [10] 12 mca_pls_xgrid.so 0x0000000100297837 orte_pls_xgrid_finalize + 40
> [bigmac:00949] [11] 13 libopen-rte.0.dylib 0x000000010002d0f9 orte_pls_base_close + 249
> [bigmac:00949] [12] 14 libopen-rte.0.dylib 0x0000000100012027 orte_system_finalize + 119
> [bigmac:00949] [13] 15 libopen-rte.0.dylib 0x000000010000e968 orte_finalize + 40
> [bigmac:00949] [14] 16 mpirun 0x00000001000011ff orterun + 2042
> [bigmac:00949] [15] 17 mpirun 0x0000000100000a03 main + 27
> [bigmac:00949] [16] 18 mpirun 0x00000001000009e0 start + 52
> [bigmac:00949] [17] 19 ??? 0x0000000000000004 0x0 + 4
> [bigmac:00949] *** End of error message ***
> Abort trap
>
>
> In the case of OpenMPI 1.4.2, I get even worse errors.
>
> I do not know whether this is an XGrid problem or an OMPI problem, but it is
> definitely causing trouble.
>
> Some have suggested having XGrid drive OpenMPI, but if
> XGRID_CONTROLLER_HOSTNAME is set, how do I keep OpenMPI from trying to use
> XGrid as the launcher?
>
> Any ideas as to how to fix this?
>
>
>
>
> Daniel Beatty
> Computer Scientist, Detonation Sciences Branch
> Code 4743000
> 2400 E. Pilot Plant Rd.
> China Lake, CA 93555-6107
> daniel.beatty_at_[hidden]
> (760) 939-7097
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>