
Subject: Re: [OMPI users] torque pbs behaviour...
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-08-11 05:03:47


If it isn't already there, try putting a print statement right at
program start, another just prior to MPI_Init, and another just after
MPI_Init. It could be that something is hanging somewhere during
program startup, since it sounds like everything is launching just fine.
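Something along these lines is what I mean -- just a minimal sketch in
C (your model is Fortran, so adapt accordingly), with unbuffered prints
bracketing MPI_Init so you can see how far startup actually gets:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        /* stderr is normally unbuffered, but force it so nothing is held back */
        setvbuf(stderr, NULL, _IONBF, 0);
        fprintf(stderr, "entered main\n");

        fprintf(stderr, "about to call MPI_Init\n");
        MPI_Init(&argc, &argv);
        fprintf(stderr, "returned from MPI_Init\n");

        /* ... rest of the program ... */

        MPI_Finalize();
        return 0;
    }

If the last thing you see is "about to call MPI_Init", the hang is in
startup; if you never see "entered main", the problem is before your
code runs at all.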

On Aug 10, 2009, at 9:48 PM, Klymak Jody wrote:

>
> On 10-Aug-09, at 8:03 PM, Ralph Castain wrote:
>
>> Interesting! Well, I always make sure I have my personal OMPI build
>> before any system stuff, and I work exclusively on Mac OS-X:
>
>> Note that I always configure with --prefix=somewhere-in-my-own-dir,
>> never to a system directory. Avoids this kind of confusion.
>
> Yeah, I did configure --prefix=/usr/local/openmpi
>
>> What the errors are saying is that we are picking up components
>> from a very old version of OMPI that is distributed by Apple. It
>> may or may not be causing confusion for the system - hard to tell.
>> However, the fact that it is the IO forwarding subsystem that is
>> picking them up, and the fact that you aren't seeing any output
>> from your job, makes me a tad suspicious.
>
> Me too!
>
>> Can you run other jobs? In other words, do you get stdout/stderr
>> from other programs you run, or does every MPI program hang (even
>> simple ones)? If it is just your program, then it could just be
>> that your application is hanging before any output is generated.
>> Can you have it print something to stderr right when it starts?
>
> No, simple ones like the examples I gave before run fine, just with
> the suspicious warnings.
>
> I'm running a big general circulation model (MITgcm). Under normal
> conditions it spits something out almost right away, and that is not
> happening here. STDOUT.0001 etc. are all opened, but nothing is
> written to them.
>
> I'm pretty sure I'm compiling the gcm properly:
>
> otool -L mitgcmuv
> mitgcmuv:
>   /usr/local/openmpi/lib/libmpi_f77.0.dylib (compatibility version 1.0.0, current version 1.0.0)
>   /usr/local/openmpi/lib/libmpi.0.dylib (compatibility version 1.0.0, current version 1.0.0)
>   /usr/local/openmpi/lib/libopen-rte.0.dylib (compatibility version 1.0.0, current version 1.0.0)
>   /usr/local/openmpi/lib/libopen-pal.0.dylib (compatibility version 1.0.0, current version 1.0.0)
>   /usr/lib/libutil.dylib (compatibility version 1.0.0, current version 1.0.0)
>   /usr/local/lib/libgfortran.3.dylib (compatibility version 4.0.0, current version 4.0.0)
>   /usr/local/lib/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
>   /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 111.1.3)
>
> Thanks, Jody
>
>
>>
>> On Aug 10, 2009, at 8:53 PM, Klymak Jody wrote:
>>
>>>
>>> On 10-Aug-09, at 6:44 PM, Ralph Castain wrote:
>>>
>>>> Check your LD_LIBRARY_PATH - there is an earlier version of OMPI
>>>> in your path that is interfering with operation (i.e., it comes
>>>> before your 1.3.3 installation).
>>>
>>> Hmmmm, The OS X faq says not to do this:
>>>
>>> "Note that there is no need to add Open MPI's libdir to
>>> LD_LIBRARY_PATH; Open MPI's shared library build process
>>> automatically uses the "rpath" mechanism to automatically find the
>>> correct shared libraries (i.e., the ones associated with this
>>> build, vs., for example, the OS X-shipped OMPI shared libraries).
>>> Also note that we specifically do not recommend adding Open MPI's
>>> libdir to DYLD_LIBRARY_PATH."
>>>
>>> http://www.open-mpi.org/faq/?category=osx
>>>
>>> Regardless, if I set either, and run ompi_info I still get:
>>>
>>> [saturna.cluster:94981] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
>>> [saturna.cluster:94981] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
>>>
>>> echo $DYLD_LIBRARY_PATH $LD_LIBRARY_PATH
>>> /usr/local/openmpi/lib: /usr/local/openmpi/lib:
>>>
>>> So I'm afraid I'm stumped again. I suppose I could go clean out
>>> all the libraries in /usr/lib/...
>>>
>>> Thanks again, sorry to be a pain...
>>>
>>> Cheers, Jody
>>>
>>>
>>>
>>>
>>>>
>>>> On Aug 10, 2009, at 7:38 PM, Klymak Jody wrote:
>>>>
>>>>> So,
>>>>>
>>>>> mpirun --display-allocation -pernode --display-map hostname
>>>>>
>>>>> gives me the output below. Simple jobs seem to run, but the
>>>>> MITgcm does not, either under ssh or Torque. It hangs at some
>>>>> early point in execution, before anything is written, so it's hard
>>>>> for me to tell what the error is. Could these MCA warnings have
>>>>> anything to do with it?
>>>>>
>>>>> I've recompiled the gcm with -L /usr/local/openmpi/lib, so
>>>>> hopefully that catches the right library.
>>>>>
>>>>> Thanks, Jody
>>>>>
>>>>>
>>>>> [xserve02.local:38126] mca: base: component_find: ras "mca_ras_dash_host" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
>>>>> [xserve02.local:38126] mca: base: component_find: ras "mca_ras_hostfile" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
>>>>> [xserve02.local:38126] mca: base: component_find: ras "mca_ras_localhost" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
>>>>> [xserve02.local:38126] mca: base: component_find: ras "mca_ras_xgrid" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
>>>>> [xserve02.local:38126] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
>>>>> [xserve02.local:38126] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
>>>>>
>>>>> ====================== ALLOCATED NODES ======================
>>>>>
>>>>> Data for node: Name: xserve02.local   Num slots: 8   Max slots: 0
>>>>> Data for node: Name: xserve01.local   Num slots: 8   Max slots: 0
>>>>>
>>>>> =================================================================
>>>>>
>>>>> ======================== JOB MAP ========================
>>>>>
>>>>> Data for node: Name: xserve02.local Num procs: 1
>>>>> Process OMPI jobid: [20967,1] Process rank: 0
>>>>>
>>>>> Data for node: Name: xserve01.local Num procs: 1
>>>>> Process OMPI jobid: [20967,1] Process rank: 1
>>>>>
>>>>> =============================================================
>>>>> [xserve01.cluster:38518] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
>>>>> [xserve01.cluster:38518] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
>>>>> xserve02.local
>>>>> xserve01.cluster
>>>>>
>>>>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users