Open MPI Development Mailing List Archives

From: Josh Hursey (jjhursey_at_[hidden])
Date: 2005-08-19 08:56:55


On Aug 19, 2005, at 8:15 AM, Tim S. Woodall wrote:

> Josh,
>
> I believe that although the prior code called ras routines,
> they were simple library routines in ras base, that didn't
> require ras to be initialized (they just accessed the registry).

Yeah. That's right.

>
> So, w/ the new code, both ras/rds components must be
> initialized/selected.

We probably should be calling those _base_ routines directly instead of
through the interface, since we really just want the GPR calls
contained in those functions. That way we don't have to worry about the
ras/rds components being initialized/selected.
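
For example, something like this (rough, untested sketch; I am assuming
the base routine takes the same node-list argument as the interface
version does):

    opal_list_t nodes;   /* single-host node entries, as in stage1 today */
    int rc;

    OBJ_CONSTRUCT(&nodes, opal_list_t);
    /* ... append the entry describing this host ... */

    /* base library routine: just the GPR updates; no ras component
       needs to have been selected for this to work */
    if (ORTE_SUCCESS != (rc = orte_ras_base_node_insert(&nodes))) {
        ORTE_ERROR_LOG(rc);
        return rc;
    }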

>
> My opinion would be to add the appropriate interface to the rmgr,
> move the code to rmgr/urm, and have rmgr/proxy simply forward the
> request to the seed.

What would an appropriate interface to the rmgr look like? Something
like the singleton functionality that I suggested below, or are we
thinking of something slightly different?
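
To make that concrete, here is the kind of thing I was picturing (all
of the names below are placeholders, not a worked-out proposal):

    /* hypothetical new entry in the rmgr module's function table */
    typedef int (*orte_rmgr_setup_singleton_fn_t)(orte_jobid_t jobid);

    struct orte_rmgr_base_module_t {
        /* ... existing create/allocate/map/launch/etc. entries ... */
        orte_rmgr_setup_singleton_fn_t setup_singleton;
    };

rmgr/urm would implement setup_singleton locally (the RDS/RAS updates),
and rmgr/proxy would just pack the request and ship it to the seed, as
you describe.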

>
> Note that the intent of the rmgr was to abstract the services provided
> by rds/ras/pls - such that you could potentially drop in a new rmgr
> that didn't use any of these.

Awesome.

I can likely take a look at this later today and work up a fix.

Cheers,
Josh

>
>
> Thanks,
> Tim
>
>
>
> Josh Hursey wrote:
>> Hey all,
>>
>> Sorry for my lag on this thread; I'm still settling back into
>> Bloomington and catching up on email traffic.
>>
>> This is certainly my fault WRT the addition of the RDS call to
>> orte_init_stage1(). I never tested the case where a process is a
>> singleton and not the seed. :(
>>
>> Since the RAS (or functionality represented by this subsystem) was
>> exposed at this level, it was assumed that the RDS is also active at
>> this time. The addition in orte_init_stage1 was to add host entries to
>> both the RAS and RDS (instead of just the RAS) when we start a
>> singleton process.
>>
>> A quick repair would be to protect the RDS section from all non-seed
>> processes. E.g.:
>>
>>     if (orte_process_info.seed) {
>>         ret = orte_rds.store_resource(&rds_single_host);
>>         if (ORTE_SUCCESS != ret) {
>>             ORTE_ERROR_LOG(ret);
>>             return ret;
>>         }
>>     }
>>
>> An additional fix would be to add a call to the rmgr to set up
>> singleton processes, thus pulling the 'singleton process only' chunk
>> of code out of orte_init_stage1() and into the rmgr. Something like:
>>
>>     if (orte_process_info.singleton) {
>>         if (ORTE_SUCCESS !=
>>             (ret = orte_rmgr_base_setup_singleton(my_jobid, ...))) {
>>             ORTE_ERROR_LOG(ret);
>>             return ret;
>>         }
>>     }
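>>
>> A first cut of that routine might look something like this (sketch
>> only; it just collects what stage1 does today and reuses the
>> seed-only guard from above):
>>
>>     int orte_rmgr_base_setup_singleton(orte_jobid_t jobid)
>>     {
>>         int rc;
>>
>>         /* describe this host in the resource data store (seed only) */
>>         if (orte_process_info.seed) {
>>             rc = orte_rds.store_resource(&rds_single_host);
>>             if (ORTE_SUCCESS != rc) {
>>                 ORTE_ERROR_LOG(rc);
>>                 return rc;
>>             }
>>         }
>>
>>         /* add this host to the allocation for the singleton's job,
>>            as stage1 already does for the RAS */
>>         /* ... */
>>
>>         return ORTE_SUCCESS;
>>     }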
>>
>> Currently this would only contain the addition of the singleton
>> process to the RDS and RAS, but Ralph mentioned last week that he ran
>> across some other 'singleton only' stuff that might be needed.
>>
>> Is there a design issue in adding this functionality to the rmgr, with
>> the proper protection around access to the RDS?
>>
>> I guess my overall argument is that the RDS should be called in the
>> singleton+seed case: since we are adding resources to the allocation
>> [RAS], we are also adding to the resources globally available [RDS].
>> Do we assume that if the process is a singleton and not the seed, then
>> it has already been placed in the RDS and only needs to confirm its
>> allocation in the RAS? Shouldn't that registry handling happen only at
>> the seed level, if we assume the seed has launched the singleton
>> process?
>>
>> It is likely that I have things a bit confused with how we define a
>> singleton process, and how one is created in relation to the seed.
>>
>> As a general bug notice in ORTE: there is an outstanding bug in the
>> proxy/replica NS components when creating new cellids, which I ran
>> across last Friday before I had to stop. Something is getting mangled
>> in the packing of the command sent to the seed. I had to wrap up
>> before I could work out a good fix; there was just enough time to
>> characterize the problem.
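>>
>> (For flavor: the classic way a packed command gets mangled is a size
>> or type mismatch between the pack side and the unpack side. A generic
>> illustration only, not the actual NS code:
>>
>>     #include <stdint.h>
>>     #include <string.h>
>>
>>     static uint32_t mismatched_unpack(void)
>>     {
>>         unsigned char buf[8] = {0};
>>
>>         uint16_t cmd = 0x0102;            /* sender packs a 16-bit code */
>>         memcpy(buf, &cmd, sizeof(cmd));
>>
>>         uint32_t rcvd;                    /* receiver unpacks 32 bits;  */
>>         memcpy(&rcvd, buf, sizeof(rcvd)); /* two bytes were never packed */
>>         return rcvd;
>>     }
>>
>> That is the sort of thing I would look for first.)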
>>
>> Thoughts?
>>
>> Sorry for causing trouble,
>>
>> Josh
>>
>> On Aug 18, 2005, at 3:33 PM, Tim S. Woodall wrote:
>>
>>
>>> I'm seeing a problem in orte_init_stage1 when running w/ a persistent
>>> daemon. The problem is that the orte_init call attempts to call the
>>> rds subsystem directly, which is not supposed to be exposed at that
>>> level. rds is used internally by the rmgr - and is only initialized on
>>> the seed. The proxy rmgr is loaded when a persistent daemon is
>>> available - and therefore the rds is not loaded.
>>>
>>> So... orte_init_stage1 shouldn't be calling rds directly...
>>>
>>> Tim
>>>
>>>
>>> Brian Barrett wrote:
>>>
>>>
>>>> Yeah, although there really shouldn't be a way for the pointer to be
>>>> NULL. Was this a static build? I was seeing some weird memory
>>>> issues on static builds last night... I'll take a look on odin and
>>>> see what I can find.
>>>>
>>>> Brian
>>>>
>>>> On Aug 18, 2005, at 11:18 AM, Tim S. Woodall wrote:
>>>>
>>>>> Brian,
>>>>>
>>>>> Wasn't the introduction of sds part of your changes for redstorm?
>>>>> Any ideas
>>>>> why it would be NULL here?
>>>>>
>>>>> Thanks,
>>>>> Tim
>>>>>
>>>>> Rainer Keller wrote:
>>>>>
>>>>>> Hello,
>>>>>> see the "same" (well, probably not exactly the same) thing here on
>>>>>> an Opteron in 64-bit mode (-g and so on); I get:
>>>>>>
>>>>>> #0  0x0000000040085160 in orte_sds_base_contact_universe ()
>>>>>>     at ../../../../../orte/mca/sds/base/sds_base_interface.c:29
>>>>>> 29          return orte_sds_base_module->contact_universe();
>>>>>> (gdb) where
>>>>>> #0  0x0000000040085160 in orte_sds_base_contact_universe ()
>>>>>>     at ../../../../../orte/mca/sds/base/sds_base_interface.c:29
>>>>>> #1  0x0000000040063e95 in orte_init_stage1 ()
>>>>>>     at ../../../orte/runtime/orte_init_stage1.c:185
>>>>>> #2  0x0000000040017e7d in orte_system_init ()
>>>>>>     at ../../../orte/runtime/orte_system_init.c:38
>>>>>> #3  0x00000000400148f5 in orte_init () at ../../../orte/runtime/orte_init.c:46
>>>>>> #4  0x000000004000dfc7 in main (argc=4, argv=0x7fbfffe8a8)
>>>>>>     at ../../../../orte/tools/orterun/orterun.c:291
>>>>>> #5  0x0000002a95c0c017 in __libc_start_main () from /lib64/libc.so.6
>>>>>> #6  0x000000004000bf2a in _start ()
>>>>>> (gdb)
>>>>>> within mpirun
>>>>>>
>>>>>> orte_sds_base_module here is NULL...
>>>>>> This is without a persistent orted; just mpirun...
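>>>>>>
>>>>>> A NULL check in the base wrapper would at least turn this into a
>>>>>> clean error instead of a segfault; something like this (sketch; the
>>>>>> error code is only illustrative):
>>>>>>
>>>>>>     int orte_sds_base_contact_universe(void)
>>>>>>     {
>>>>>>         if (NULL == orte_sds_base_module) {
>>>>>>             ORTE_ERROR_LOG(ORTE_ERROR);
>>>>>>             return ORTE_ERROR;
>>>>>>         }
>>>>>>         return orte_sds_base_module->contact_universe();
>>>>>>     }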
>>>>>>
>>>>>> CU,
>>>>>> ray
>>>>>>
>>>>>>
>>>>>> On Thursday 18 August 2005 16:57, Nathan DeBardeleben wrote:
>>>>>>
>>>>>>> FYI, this only happens when I let OMPI compile 64-bit on Linux.
>>>>>>> When I throw in CFLAGS=FFLAGS=CXXFLAGS=-m32, then orted, my myriad
>>>>>>> of test codes, mpirun, registry subscription codes, and JNI all
>>>>>>> work like a champ. Something's wrong with the 64-bit build, it
>>>>>>> appears to me.
>>>>>>>
>>>>>>> -- Nathan
>>>>>>> Correspondence
>>>>>>> ---------------------------------------------------------------------
>>>>>>> Nathan DeBardeleben, Ph.D.
>>>>>>> Los Alamos National Laboratory
>>>>>>> Parallel Tools Team
>>>>>>> High Performance Computing Environments
>>>>>>> phone: 505-667-3428
>>>>>>> email: ndebard_at_[hidden]
>>>>>>> ---------------------------------------------------------------------
>>>>>>>
>>>>>>> Tim S. Woodall wrote:
>>>>>>>
>>>>>>>> Nathan,
>>>>>>>>
>>>>>>>> I'll try to reproduce this sometime this week - but I'm pretty
>>>>>>>> swamped.
>>>>>>>> Is Greg also seeing the same behavior?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Tim
>>>>>>>>
>>>>>>>> Nathan DeBardeleben wrote:
>>>>>>>>
>>>>>>>>> To expand on this further, orte_init() seg faults on both
>>>>>>>>> bluesteel
>>>>>>>>> (32bit linux) and sparkplug (64bit linux) equally. The
>>>>>>>>> required
>>>>>>>>> condition is that orted must be running first (which of course
>>>>>>>>> we
>>>>>>>>> require for our work - a persistent orte daemon and registry).
>>>>>>>>>
>>>>>>>>>> [bluesteel]~/ptp > ./dump_info
>>>>>>>>>> Segmentation fault
>>>>>>>>>> [bluesteel]~/ptp > gdb dump_info
>>>>>>>>>> GNU gdb 6.1
>>>>>>>>>> Copyright 2004 Free Software Foundation, Inc.
>>>>>>>>>> GDB is free software, covered by the GNU General Public License,
>>>>>>>>>> and you are welcome to change it and/or distribute copies of it
>>>>>>>>>> under certain conditions.
>>>>>>>>>> Type "show copying" to see the conditions.
>>>>>>>>>> There is absolutely no warranty for GDB. Type "show warranty" for
>>>>>>>>>> details.
>>>>>>>>>> This GDB was configured as "x86_64-suse-linux"...Using host
>>>>>>>>>> libthread_db library "/lib64/tls/libthread_db.so.1".
>>>>>>>>>>
>>>>>>>>>> (gdb) run
>>>>>>>>>> Starting program: /home/ndebard/ptp/dump_info
>>>>>>>>>>
>>>>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>>>>> 0x0000000000000000 in ?? ()
>>>>>>>>>> (gdb) where
>>>>>>>>>> #0  0x0000000000000000 in ?? ()
>>>>>>>>>> #1  0x000000000045997d in orte_init_stage1 () at orte_init_stage1.c:419
>>>>>>>>>> #2  0x00000000004156a7 in orte_system_init () at orte_system_init.c:38
>>>>>>>>>> #3  0x00000000004151c7 in orte_init () at orte_init.c:46
>>>>>>>>>> #4  0x0000000000414cbb in main (argc=1, argv=0x7fbffff298) at dump_info.c:185
>>>>>>>>>> (gdb)
>>>>>>>>>
>>>>>>>>> -- Nathan
>>>>>>>>> Correspondence
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> Nathan DeBardeleben, Ph.D.
>>>>>>>>> Los Alamos National Laboratory
>>>>>>>>> Parallel Tools Team
>>>>>>>>> High Performance Computing Environments
>>>>>>>>> phone: 505-667-3428
>>>>>>>>> email: ndebard_at_[hidden]
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> Nathan DeBardeleben wrote:
>>>>>>>>>
>>>>>>>>>> Just to clarify:
>>>>>>>>>> 1: no orted started (meaning the mpirun or registry programs
>>>>>>>>>> will start one by themselves) causes those programs to lock up.
>>>>>>>>>> 2: starting orted by hand (trying to get these programs to
>>>>>>>>>> connect to a centralized one) causes the connecting programs to
>>>>>>>>>> segfault.
>>>>>>>>>>
>>>>>>>>>> -- Nathan
>>>>>>>>>> Correspondence
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> Nathan DeBardeleben, Ph.D.
>>>>>>>>>> Los Alamos National Laboratory
>>>>>>>>>> Parallel Tools Team
>>>>>>>>>> High Performance Computing Environments
>>>>>>>>>> phone: 505-667-3428
>>>>>>>>>> email: ndebard_at_[hidden]
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> Nathan DeBardeleben wrote:
>>>>>>>>>>
>>>>>>>>>>> So I dropped an .ompi_ignore into that directory,
>>>>>>>>>>> reconfigured, and the compile worked (yay!). However, not a lot
>>>>>>>>>>> of progress: mpirun locks up, and all my registry test programs
>>>>>>>>>>> lock up as well. If I start the orted by hand, then any of my
>>>>>>>>>>> registry-calling programs segfault:
>>>>>>>>>>>
>>>>>>>>>>>> [sparkplug]~/ptp > gdb sub_test
>>>>>>>>>>>> GNU gdb 6.1
>>>>>>>>>>>> Copyright 2004 Free Software Foundation, Inc.
>>>>>>>>>>>> GDB is free software, covered by the GNU General Public
>>>>>>>>>>>> License, and you are welcome to change it and/or distribute
>>>>>>>>>>>> copies of it under certain conditions.
>>>>>>>>>>>> Type "show copying" to see the conditions.
>>>>>>>>>>>> There is absolutely no warranty for GDB. Type "show warranty"
>>>>>>>>>>>> for details.
>>>>>>>>>>>> This GDB was configured as "x86_64-suse-linux"...Using host
>>>>>>>>>>>> libthread_db library "/lib64/tls/libthread_db.so.1".
>>>>>>>>>>>>
>>>>>>>>>>>> (gdb) run
>>>>>>>>>>>> Starting program: /home/ndebard/ptp/sub_test
>>>>>>>>>>>>
>>>>>>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>>>>>>> 0x0000000000000000 in ?? ()
>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>> #0  0x0000000000000000 in ?? ()
>>>>>>>>>>>> #1  0x00000000004598a5 in orte_init_stage1 () at orte_init_stage1.c:419
>>>>>>>>>>>> #2  0x00000000004155cf in orte_system_init () at orte_system_init.c:38
>>>>>>>>>>>> #3  0x00000000004150ef in orte_init () at orte_init.c:46
>>>>>>>>>>>> #4  0x00000000004148a1 in main (argc=1, argv=0x7fbffff178) at sub_test.c:60
>>>>>>>>>>>> (gdb)
>>>>>>>>>>>
>>>>>>>>>>> Yes, I recompiled everything.
>>>>>>>>>>>
>>>>>>>>>>> Here's an example of me trying something a little more
>>>>>>>>>>> complicated
>>>>>>>>>>> (which I believe locks up for the same reason - something
>>>>>>>>>>> borked with
>>>>>>>>>>> the registry interaction).
>>>>>>>>>>>
>>>>>>>>>>>>> [sparkplug]~/ompi-test > bjssub -s 10000 -n 10 -i bash
>>>>>>>>>>>>> Waiting for interactive job nodes.
>>>>>>>>>>>>> (nodes 18 16 17 18 19 20 21 22 23 24 25)
>>>>>>>>>>>>> Starting interactive job.
>>>>>>>>>>>>> NODES=16,17,18,19,20,21,22,23,24,25
>>>>>>>>>>>>> JOBID=18
>>>>>>>>>>>>
>>>>>>>>>>>> so i got my nodes
>>>>>>>>>>>>
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test> export
>>>>>>>>>>>>> OMPI_MCA_ptl_base_exclude=sm
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test> export
>>>>>>>>>>>>> OMPI_MCA_pls_bproc_seed_priority=101
>>>>>>>>>>>>
>>>>>>>>>>>> and set these envvars as we need to use Greg's bproc; without
>>>>>>>>>>>> the 2nd export, the machine's load maxes out and it locks up.
>>>>>>>>>>>>
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test> bpstat
>>>>>>>>>>>>> Node(s)  Status  Mode        User     Group
>>>>>>>>>>>>> 100-128  down    ----------  root     root
>>>>>>>>>>>>> 0-15     up      ---x------  vchandu  vchandu
>>>>>>>>>>>>> 16-25    up      ---x------  ndebard  ndebard
>>>>>>>>>>>>> 26-27    up      ---x------  root     root
>>>>>>>>>>>>> 28-30    up      ---x--x--x  root     root
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test> env | grep NODES
>>>>>>>>>>>>> NODES=16,17,18,19,20,21,22,23,24,25
>>>>>>>>>>>>
>>>>>>>>>>>> yes, i really have the nodes
>>>>>>>>>>>>
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test> mpicc -o test-mpi test-mpi.c
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test>
>>>>>>>>>>>>
>>>>>>>>>>>> recompile for good measure
>>>>>>>>>>>>
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test> ls /tmp/openmpi-sessions-ndebard*
>>>>>>>>>>>>> /bin/ls: /tmp/openmpi-sessions-ndebard*: No such file or directory
>>>>>>>>>>>>
>>>>>>>>>>>> proof that there's no leftover old directory
>>>>>>>>>>>>
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test> mpirun -np 1 test-mpi
>>>>>>>>>>>>
>>>>>>>>>>>> it never responds at this point - but I can kill it with ^C.
>>>>>>>>>>>>
>>>>>>>>>>>>> mpirun: killing job...
>>>>>>>>>>>>> Killed
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test>
>>>>>>>>>>>
>>>>>>>>>>> -- Nathan
>>>>>>>>>>> Correspondence
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> Nathan DeBardeleben, Ph.D.
>>>>>>>>>>> Los Alamos National Laboratory
>>>>>>>>>>> Parallel Tools Team
>>>>>>>>>>> High Performance Computing Environments
>>>>>>>>>>> phone: 505-667-3428
>>>>>>>>>>> email: ndebard_at_[hidden]
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> Jeff Squyres wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Is this what Tim Prins was working on?
>>>>>>>>>>>>
>>>>>>>>>>>> On Aug 16, 2005, at 5:21 PM, Tim S. Woodall wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure why this is even building... Is someone
>>>>>>>>>>>>> working on this?
>>>>>>>>>>>>> I thought we had .ompi_ignore files in this directory.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tim
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nathan DeBardeleben wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So I'm seeing all these nice emails about people developing
>>>>>>>>>>>>>> on OMPI today, yet I can't get it to compile. Am I out here
>>>>>>>>>>>>>> in limbo on this, or are others in the same boat? The errors
>>>>>>>>>>>>>> I'm seeing are about some bproc code calling undefined
>>>>>>>>>>>>>> functions; they are included again below.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- Nathan
>>>>>>>>>>>>>> Correspondence
>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>> Nathan DeBardeleben, Ph.D.
>>>>>>>>>>>>>> Los Alamos National Laboratory
>>>>>>>>>>>>>> Parallel Tools Team
>>>>>>>>>>>>>> High Performance Computing Environments
>>>>>>>>>>>>>> phone: 505-667-3428
>>>>>>>>>>>>>> email: ndebard_at_[hidden]
>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nathan DeBardeleben wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Back from training and trying to test this, but now OMPI
>>>>>>>>>>>>>>> doesn't compile at all:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include
>>>>>>>>>>>>>>>> -I../../../../include -I../../../.. -I../../../..
>>>>>>>>>>>>>>>> -I../../../../include -I../../../../opal -I../../../../orte
>>>>>>>>>>>>>>>> -I../../../../ompi -g -Wall -Wundef -Wno-long-long
>>>>>>>>>>>>>>>> -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes
>>>>>>>>>>>>>>>> -Wcomment -pedantic -Werror-implicit-function-declaration
>>>>>>>>>>>>>>>> -fno-strict-aliasing -MT ras_lsf_bproc.lo -MD -MP
>>>>>>>>>>>>>>>> -MF .deps/ras_lsf_bproc.Tpo -c ras_lsf_bproc.c -o ras_lsf_bproc.o
>>>>>>>>>>>>>>>> ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_insert':
>>>>>>>>>>>>>>>> ras_lsf_bproc.c:32: error: implicit declaration of function
>>>>>>>>>>>>>>>> `orte_ras_base_node_insert'
>>>>>>>>>>>>>>>> ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_query':
>>>>>>>>>>>>>>>> ras_lsf_bproc.c:37: error: implicit declaration of function
>>>>>>>>>>>>>>>> `orte_ras_base_node_query'
>>>>>>>>>>>>>>>> make[4]: *** [ras_lsf_bproc.lo] Error 1
>>>>>>>>>>>>>>>> make[4]: Leaving directory `/home/ndebard/ompi/orte/mca/ras/lsf_bproc'
>>>>>>>>>>>>>>>> make[3]: *** [all-recursive] Error 1
>>>>>>>>>>>>>>>> make[3]: Leaving directory `/home/ndebard/ompi/orte/mca/ras'
>>>>>>>>>>>>>>>> make[2]: *** [all-recursive] Error 1
>>>>>>>>>>>>>>>> make[2]: Leaving directory `/home/ndebard/ompi/orte/mca'
>>>>>>>>>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>>>>>>>>>> make[1]: Leaving directory `/home/ndebard/ompi/orte'
>>>>>>>>>>>>>>>> make: *** [all-recursive] Error 1
>>>>>>>>>>>>>>>> [sparkplug]~/ompi >
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Clean SVN checkout this morning with configure:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [sparkplug]~/ompi > ./configure --enable-static
>>>>>>>>>>>>>>>> --disable-shared --without-threads
>>>>>>>>>>>>>>>> --prefix=/home/ndebard/local/ompi --with-devel-headers
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- Nathan
>>>>>>>>>>>>>>> Correspondence
>>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>> Nathan DeBardeleben, Ph.D.
>>>>>>>>>>>>>>> Los Alamos National Laboratory
>>>>>>>>>>>>>>> Parallel Tools Team
>>>>>>>>>>>>>>> High Performance Computing Environments
>>>>>>>>>>>>>>> phone: 505-667-3428
>>>>>>>>>>>>>>> email: ndebard_at_[hidden]
>>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Brian Barrett wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is now fixed in SVN. You should no longer need the
>>>>>>>>>>>>>>>> --build=i586... hack to compile 32 bit code on
>>>>>>>>>>>>>>>> Opterons.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Aug 12, 2005, at 3:17 PM, Brian Barrett wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Aug 12, 2005, at 3:13 PM, Nathan DeBardeleben wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We've got a 64-bit Linux (SUSE) box here. For a variety
>>>>>>>>>>>>>>>>>> of reasons (Java, JNI, linking in with OMPI libraries,
>>>>>>>>>>>>>>>>>> etc., which I won't get into) I need to compile OMPI
>>>>>>>>>>>>>>>>>> 32-bit (or else get 64-bit versions of a lot of other
>>>>>>>>>>>>>>>>>> libraries). I get various compile errors when I try
>>>>>>>>>>>>>>>>>> different things, but first let me explain the system we
>>>>>>>>>>>>>>>>>> have:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This goes on and on, actually. And the 'is incompatible
>>>>>>>>>>>>>>>>>> with i386:x86-64 output' message looks to be repeated for
>>>>>>>>>>>>>>>>>> every line before the error that actually caused the make
>>>>>>>>>>>>>>>>>> to bomb.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Any suggestions at all? Surely someone must have tried to
>>>>>>>>>>>>>>>>>> force OMPI to build in 32-bit mode on a 64-bit machine.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't think anyone has tried to build 32 bit on an
>>>>>>>>>>>>>>>>> Opteron,
>>>>>>>>>>>>>>>>> which is the cause of the problems...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think I know how to fix this, but it won't happen until
>>>>>>>>>>>>>>>>> later in the weekend, and I can't think of a good
>>>>>>>>>>>>>>>>> workaround until then. Well, one possibility is to set the
>>>>>>>>>>>>>>>>> target like you were doing and disable ROMIO. Actually,
>>>>>>>>>>>>>>>>> you'll also need to disable Fortran 77. So something like:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ./configure [usual options] --build=i586-suse-linux \
>>>>>>>>>>>>>>>>>     --disable-io-romio --disable-f77
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> might just do the trick.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Brian Barrett
>>>>>>>>>>>>>>>>> Open MPI developer
>>>>>>>>>>>>>>>>> http://www.open-mpi.org/
>>
>> ----
>> Josh Hursey
>> jjhursey_at_[hidden]
>> http://www.open-mpi.org/

----
Josh Hursey
jjhursey_at_[hidden]
http://www.open-mpi.org/