Open MPI Development Mailing List Archives

From: Josh Hursey (jjhursey_at_[hidden])
Date: 2005-08-19 08:56:55


On Aug 19, 2005, at 8:15 AM, Tim S. Woodall wrote:

> Josh,
>
> I believe that although the prior code called ras routines,
> they were simple library routines in ras base, that didn't
> require ras to be initialized (they just accessed the registry).

Yeah. That's right.

>
> So, w/ the new code, both ras/rds components must be
> initialized/selected.

We probably should be calling those _base_ routines directly instead of
through the interface, since we really just want the GPR calls
contained in those functions. That way we don't have to worry about the
ras/rds components being initialized/selected.
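
Roughly what I have in mind, as a sketch only (the list handling and
node type are from memory, so treat everything except
orte_ras_base_node_insert itself as an assumption):

    /* Sketch: record a host by calling the ras *base* library routine
     * directly -- it just wraps the GPR calls, so no ras component
     * needs to have been selected.  Type and field names here are
     * illustrative, not checked against the tree. */
    #include "opal/class/opal_list.h"
    #include "orte/mca/ras/base/base.h"

    static int insert_node_via_base(orte_ras_node_t *node)
    {
        int rc;
        opal_list_t nodes;

        OBJ_CONSTRUCT(&nodes, opal_list_t);
        opal_list_append(&nodes, &node->super);

        /* straight to the registry; no orte_ras.* indirection */
        rc = orte_ras_base_node_insert(&nodes);
        if (ORTE_SUCCESS != rc) {
            ORTE_ERROR_LOG(rc);
        }

        OBJ_DESTRUCT(&nodes);
        return rc;
    }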

>
> My opinion would be to add the appropriate interface to the rmgr,
> move the code to rmgr/urm, and have rmgr/proxy simply forward the
> request to the seed.

What would be an appropriate interface to the rmgr? Something like the
singleton functionality that I suggested below, or are we thinking of
something slightly different?
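
As a straw man (the names here are hypothetical, just to make the
question concrete):

    /* Hypothetical rmgr interface addition: set up a singleton
     * process.  urm would implement it against rds/ras locally, and
     * proxy would simply forward the request to the seed, per Tim's
     * suggestion above. */
    typedef int (*orte_rmgr_base_module_setup_singleton_fn_t)(
        orte_jobid_t jobid);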

>
> Note that the intent of the rmgr was to abstract the services provided
> by rds/ras/pls - such that you could potentially drop in a new rmgr
> that didn't use any of these.

Awesome.

I can likely take a look at this later today, and work up a fix.

Cheers,
Josh

>
>
> Thanks,
> Tim
>
>
>
> Josh Hursey wrote:
>> Hey all,
>>
>> Sorry for my lag on this thread; I'm still settling back into
>> Bloomington and catching up on email traffic.
>>
>> This is certainly my fault WRT the addition of the RDS call to
>> orte_init_stage1(). I never tested the case where a process is a
>> singleton and not the seed. :(
>>
>> Since the RAS (or functionality represented by this subsystem) was
>> exposed at this level, it was assumed that the RDS is also active at
>> this time. The addition in orte_init_stage1 was to add host entries to
>> both the RAS and RDS (instead of just the RAS) when we start a
>> singleton process.
>>
>> A quick repair would be to protect the RDS section from all non-seed
>> processes, e.g.:
>>
>>     if (orte_process_info.seed) {
>>         ret = orte_rds.store_resource(&rds_single_host);
>>         if (ORTE_SUCCESS != ret) {
>>             ORTE_ERROR_LOG(ret);
>>             return ret;
>>         }
>>     }
>>
>> An additional fix would be to add a call to the rmgr to set up
>> singleton processes, thus pulling the 'singleton process only' chunk
>> of code out of orte_init_stage1() and into the rmgr. Something like:
>>
>>     if (orte_process_info.singleton) {
>>         if (ORTE_SUCCESS !=
>>             (ret = orte_rmgr_base_setup_singleton(my_jobid, ...))) {
>>             ORTE_ERROR_LOG(ret);
>>             return ret;
>>         }
>>     }
>>
>> Currently this would only contain the addition of the singleton
>> process
>> to the RDS and RAS, but Ralph mentioned last week that he ran across
>> some other 'singleton only' stuff that might be needed.
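>>
>> To make that concrete, the base function could look roughly like this
>> (a sketch only: 'rds_single_host' and 'singleton_nodes' stand in for
>> whatever orte_init_stage1() builds today, and the jobid/cellid
>> bookkeeping is omitted):
>>
>>     /* Hypothetical body for orte_rmgr_base_setup_singleton(): store
>>      * to the RDS only on the seed (the only process with the RDS
>>      * loaded), and always record the host in the RAS via the base
>>      * routine. */
>>     int orte_rmgr_base_setup_singleton(orte_jobid_t jobid)
>>     {
>>         int rc;
>>
>>         if (orte_process_info.seed) {
>>             rc = orte_rds.store_resource(&rds_single_host);
>>             if (ORTE_SUCCESS != rc) {
>>                 ORTE_ERROR_LOG(rc);
>>                 return rc;
>>             }
>>         }
>>
>>         rc = orte_ras_base_node_insert(&singleton_nodes);
>>         if (ORTE_SUCCESS != rc) {
>>             ORTE_ERROR_LOG(rc);
>>         }
>>         return rc;
>>     }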
>>
>> Is there a design issue in adding this functionality to the rmgr, with
>> the proper protection around access to the RDS?
>>
>> I guess my overall argument is that the RDS should be called in the
>> singleton+seed case, since we are adding resources to the allocation
>> [RAS], and thus to the resources globally available [RDS]. Do we
>> assume that if the process is a singleton and not the seed, then it
>> has already been placed in the RDS and only needs to confirm its
>> allocation in the RAS? Shouldn't that registry handling only happen at
>> the seed level, if we assume the seed has launched the singleton
>> process?
>>
>> It is likely that I have things a bit confused with how we define a
>> singleton process, and how one is created in relation to the seed.
>>
>> As a general bug notice in ORTE: there is an outstanding bug in the
>> proxy/replica NS components when creating new cellids, which I ran
>> across last Friday before I had to stop. Something is getting mangled
>> in the packing of the command sent to the seed. I had to wrap up
>> before I could find a good fix; I had just enough time to characterize
>> the problem.
>>
>> Thoughts?
>>
>> Sorry for causing trouble,
>>
>> Josh
>>
>> On Aug 18, 2005, at 3:33 PM, Tim S. Woodall wrote:
>>
>>
>>> I'm seeing a problem in orte_init_stage1 when running w/ a persistent
>>> daemon. The problem is that the orte_init call attempts to call the
>>> rds subsystem directly, which is not supposed to be exposed at that
>>> level. rds is used internally by the rmgr - and is only initialized
>>> on the seed. The proxy rmgr is loaded when a persistent daemon is
>>> available - and therefore the rds is not loaded.
>>>
>>> So... orte_init_stage1 shouldn't be calling rds directly...
>>>
>>> Tim
>>>
>>>
>>> Brian Barrett wrote:
>>>
>>>
>>>> Yeah, although there really shouldn't be a way for the pointer to be
>>>> NULL. Was this a static build? I was seeing some weird memory
>>>> issues on static builds last night... I'll take a look on odin and
>>>> see what I can find.
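>>>>
>>>> In the meantime, a defensive check in the base stub would at least
>>>> fail cleanly instead of jumping through NULL (a sketch only; the
>>>> error constant is just a placeholder):
>>>>
>>>>     /* sds_base_interface.c sketch: guard against an unselected
>>>>      * module instead of dereferencing NULL.  ORTE_ERROR is a
>>>>      * placeholder for whatever constant we standardize on. */
>>>>     int orte_sds_base_contact_universe(void)
>>>>     {
>>>>         if (NULL == orte_sds_base_module) {
>>>>             ORTE_ERROR_LOG(ORTE_ERROR);
>>>>             return ORTE_ERROR;
>>>>         }
>>>>         return orte_sds_base_module->contact_universe();
>>>>     }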
>>>>
>>>> Brian
>>>>
>>>> On Aug 18, 2005, at 11:18 AM, Tim S. Woodall wrote:
>>>>>
>>>>> Brian,
>>>>>
>>>>> Wasn't the introduction of sds part of your changes for redstorm?
>>>>> Any ideas why it would be NULL here?
>>>>>
>>>>> Thanks,
>>>>> Tim
>>>>>
>>>>> Rainer Keller wrote:
>>>>>>
>>>>>> Hello,
>>>>>> I see the "same" (well, probably not exactly the same) thing here
>>>>>> on Opteron with 64bit (-g and so on); I get:
>>>>>>
>>>>>> #0  0x0000000040085160 in orte_sds_base_contact_universe ()
>>>>>>     at ../../../../../orte/mca/sds/base/sds_base_interface.c:29
>>>>>> 29        return orte_sds_base_module->contact_universe();
>>>>>> (gdb) where
>>>>>> #0  0x0000000040085160 in orte_sds_base_contact_universe ()
>>>>>>     at ../../../../../orte/mca/sds/base/sds_base_interface.c:29
>>>>>> #1  0x0000000040063e95 in orte_init_stage1 ()
>>>>>>     at ../../../orte/runtime/orte_init_stage1.c:185
>>>>>> #2  0x0000000040017e7d in orte_system_init ()
>>>>>>     at ../../../orte/runtime/orte_system_init.c:38
>>>>>> #3  0x00000000400148f5 in orte_init () at ../../../orte/runtime/orte_init.c:46
>>>>>> #4  0x000000004000dfc7 in main (argc=4, argv=0x7fbfffe8a8)
>>>>>>     at ../../../../orte/tools/orterun/orterun.c:291
>>>>>> #5  0x0000002a95c0c017 in __libc_start_main () from /lib64/libc.so.6
>>>>>> #6  0x000000004000bf2a in _start ()
>>>>>> (gdb)
>>>>>>
>>>>>> This is within mpirun; orte_sds_base_module here is NULL...
>>>>>> This is without a persistent orted; just mpirun...
>>>>>>
>>>>>> CU,
>>>>>> ray
>>>>>>
>>>>>>
>>>>>> On Thursday 18 August 2005 16:57, Nathan DeBardeleben wrote:
>>>>>>>
>>>>>>> FYI, this only happens when I let OMPI compile 64bit on Linux.
>>>>>>> When I throw in CFLAGS=FFLAGS=CXXFLAGS=-m32, orted, my myriad of
>>>>>>> test codes, mpirun, registry subscription codes, and JNI all work
>>>>>>> like a champ. Something's wrong with the 64bit, it appears to me.
>>>>>>>
>>>>>>> -- Nathan
>>>>>>> Correspondence
>>>>>>> -------------------------------------------------------------------
>>>>>>> Nathan DeBardeleben, Ph.D.
>>>>>>> Los Alamos National Laboratory
>>>>>>> Parallel Tools Team
>>>>>>> High Performance Computing Environments
>>>>>>> phone: 505-667-3428
>>>>>>> email: ndebard_at_[hidden]
>>>>>>> -------------------------------------------------------------------
>>>>>>>
>>>>>>> Tim S. Woodall wrote:
>>>>>>>>
>>>>>>>> Nathan,
>>>>>>>>
>>>>>>>> I'll try to reproduce this sometime this week - but I'm pretty
>>>>>>>> swamped.
>>>>>>>> Is Greg also seeing the same behavior?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Tim
>>>>>>>>
>>>>>>>> Nathan DeBardeleben wrote:
>>>>>>>>>
>>>>>>>>> To expand on this further, orte_init() seg faults on both
>>>>>>>>> bluesteel (32bit linux) and sparkplug (64bit linux) equally. The
>>>>>>>>> required condition is that orted must be running first (which of
>>>>>>>>> course we require for our work - a persistent orte daemon and
>>>>>>>>> registry).
>>>>>>>>>>
>>>>>>>>>> [bluesteel]~/ptp > ./dump_info
>>>>>>>>>> Segmentation fault
>>>>>>>>>> [bluesteel]~/ptp > gdb dump_info
>>>>>>>>>> GNU gdb 6.1
>>>>>>>>>> Copyright 2004 Free Software Foundation, Inc.
>>>>>>>>>> GDB is free software, covered by the GNU General Public License,
>>>>>>>>>> and you are welcome to change it and/or distribute copies of it
>>>>>>>>>> under certain conditions. Type "show copying" to see the
>>>>>>>>>> conditions. There is absolutely no warranty for GDB. Type "show
>>>>>>>>>> warranty" for details.
>>>>>>>>>> This GDB was configured as "x86_64-suse-linux"...Using host
>>>>>>>>>> libthread_db library "/lib64/tls/libthread_db.so.1".
>>>>>>>>>>
>>>>>>>>>> (gdb) run
>>>>>>>>>> Starting program: /home/ndebard/ptp/dump_info
>>>>>>>>>>
>>>>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>>>>> 0x0000000000000000 in ?? ()
>>>>>>>>>> (gdb) where
>>>>>>>>>> #0  0x0000000000000000 in ?? ()
>>>>>>>>>> #1  0x000000000045997d in orte_init_stage1 () at orte_init_stage1.c:419
>>>>>>>>>> #2  0x00000000004156a7 in orte_system_init () at orte_system_init.c:38
>>>>>>>>>> #3  0x00000000004151c7 in orte_init () at orte_init.c:46
>>>>>>>>>> #4  0x0000000000414cbb in main (argc=1, argv=0x7fbffff298) at dump_info.c:185
>>>>>>>>>> (gdb)
>>>>>>>>>
>>>>>>>>> -- Nathan
>>>>>>>>> Correspondence
>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>> Nathan DeBardeleben, Ph.D.
>>>>>>>>> Los Alamos National Laboratory
>>>>>>>>> Parallel Tools Team
>>>>>>>>> High Performance Computing Environments
>>>>>>>>> phone: 505-667-3428
>>>>>>>>> email: ndebard_at_[hidden]
>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> Nathan DeBardeleben wrote:
>>>>>>>>>>
>>>>>>>>>> Just to clarify:
>>>>>>>>>> 1: no orted started (meaning the mpirun or registry programs
>>>>>>>>>> will start one by themselves) causes those programs to lock up.
>>>>>>>>>> 2: starting orted by hand (trying to get these programs to
>>>>>>>>>> connect to a centralized one) causes the connecting programs to
>>>>>>>>>> seg fault.
>>>>>>>>>>
>>>>>>>>>> -- Nathan
>>>>>>>>>> Correspondence
>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>> Nathan DeBardeleben, Ph.D.
>>>>>>>>>> Los Alamos National Laboratory
>>>>>>>>>> Parallel Tools Team
>>>>>>>>>> High Performance Computing Environments
>>>>>>>>>> phone: 505-667-3428
>>>>>>>>>> email: ndebard_at_[hidden]
>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> Nathan DeBardeleben wrote:
>>>>>>>>>>>
>>>>>>>>>>> So I dropped an .ompi_ignore into that directory,
>>>>>>>>>>> reconfigured, and the compile worked (yay!).
>>>>>>>>>>> However, not a lot of progress: mpirun locks up, and all my
>>>>>>>>>>> registry test programs lock up as well. If I start the orted by
>>>>>>>>>>> hand, then any of my registry-calling programs segfault:
>>>>>>>>>>>>
>>>>>>>>>>>> [sparkplug]~/ptp > gdb sub_test
>>>>>>>>>>>> GNU gdb 6.1
>>>>>>>>>>>> Copyright 2004 Free Software Foundation, Inc.
>>>>>>>>>>>> GDB is free software, covered by the GNU General Public
>>>>>>>>>>>> License, and you are welcome to change it and/or distribute
>>>>>>>>>>>> copies of it under certain conditions. Type "show copying" to
>>>>>>>>>>>> see the conditions. There is absolutely no warranty for GDB.
>>>>>>>>>>>> Type "show warranty" for details.
>>>>>>>>>>>> This GDB was configured as "x86_64-suse-linux"...Using host
>>>>>>>>>>>> libthread_db library "/lib64/tls/libthread_db.so.1".
>>>>>>>>>>>>
>>>>>>>>>>>> (gdb) run
>>>>>>>>>>>> Starting program: /home/ndebard/ptp/sub_test
>>>>>>>>>>>>
>>>>>>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>>>>>>> 0x0000000000000000 in ?? ()
>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>> #0  0x0000000000000000 in ?? ()
>>>>>>>>>>>> #1  0x00000000004598a5 in orte_init_stage1 () at orte_init_stage1.c:419
>>>>>>>>>>>> #2  0x00000000004155cf in orte_system_init () at orte_system_init.c:38
>>>>>>>>>>>> #3  0x00000000004150ef in orte_init () at orte_init.c:46
>>>>>>>>>>>> #4  0x00000000004148a1 in main (argc=1, argv=0x7fbffff178) at sub_test.c:60
>>>>>>>>>>>> (gdb)
>>>>>>>>>>>
>>>>>>>>>>> Yes, I recompiled everything.
>>>>>>>>>>>
>>>>>>>>>>> Here's an example of me trying something a little more
>>>>>>>>>>> complicated (which I believe locks up for the same reason:
>>>>>>>>>>> something borked with the registry interaction).
>>>>>>>>>>>>>
>>>>>>>>>>>>> [sparkplug]~/ompi-test > bjssub -s 10000 -n 10 -i bash
>>>>>>>>>>>>> Waiting for interactive job nodes.
>>>>>>>>>>>>> (nodes 18 16 17 18 19 20 21 22 23 24 25)
>>>>>>>>>>>>> Starting interactive job.
>>>>>>>>>>>>> NODES=16,17,18,19,20,21,22,23,24,25
>>>>>>>>>>>>> JOBID=18
>>>>>>>>>>>>
>>>>>>>>>>>> so i got my nodes
>>>>>>>>>>>>>
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test> export OMPI_MCA_ptl_base_exclude=sm
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test> export OMPI_MCA_pls_bproc_seed_priority=101
>>>>>>>>>>>>
>>>>>>>>>>>> and set these envvars as we need to use Greg's bproc; without
>>>>>>>>>>>> the 2nd export the machine's load maxes out and it locks up.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test> bpstat
>>>>>>>>>>>>> Node(s)   Status   Mode         User      Group
>>>>>>>>>>>>> 100-128   down     ----------   root      root
>>>>>>>>>>>>> 0-15      up       ---x------   vchandu   vchandu
>>>>>>>>>>>>> 16-25     up       ---x------   ndebard   ndebard
>>>>>>>>>>>>> 26-27     up       ---x------   root      root
>>>>>>>>>>>>> 28-30     up       ---x--x--x   root      root
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test> env | grep NODES
>>>>>>>>>>>>> NODES=16,17,18,19,20,21,22,23,24,25
>>>>>>>>>>>>
>>>>>>>>>>>> yes, i really have the nodes
>>>>>>>>>>>>>
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test> mpicc -o test-mpi test-mpi.c
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test>
>>>>>>>>>>>>
>>>>>>>>>>>> recompile for good measure
>>>>>>>>>>>>>
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test> ls /tmp/openmpi-sessions-ndebard*
>>>>>>>>>>>>> /bin/ls: /tmp/openmpi-sessions-ndebard*: No such file or directory
>>>>>>>>>>>>
>>>>>>>>>>>> proof that there's no left over old directory
>>>>>>>>>>>>>
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test> mpirun -np 1 test-mpi
>>>>>>>>>>>>
>>>>>>>>>>>> it never responds at this point - but I can kill it with ^C.
>>>>>>>>>>>>>
>>>>>>>>>>>>> mpirun: killing job...
>>>>>>>>>>>>> Killed
>>>>>>>>>>>>> ndebard_at_sparkplug:~/ompi-test>
>>>>>>>>>>>
>>>>>>>>>>> -- Nathan
>>>>>>>>>>> Correspondence
>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>> Nathan DeBardeleben, Ph.D.
>>>>>>>>>>> Los Alamos National Laboratory
>>>>>>>>>>> Parallel Tools Team
>>>>>>>>>>> High Performance Computing Environments
>>>>>>>>>>> phone: 505-667-3428
>>>>>>>>>>> email: ndebard_at_[hidden]
>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> Jeff Squyres wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Is this what Tim Prins was working on?
>>>>>>>>>>>>
>>>>>>>>>>>> On Aug 16, 2005, at 5:21 PM, Tim S. Woodall wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure why this is even building... Is someone
>>>>>>>>>>>>> working on this?
>>>>>>>>>>>>> I thought we had .ompi_ignore files in this directory.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tim
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nathan DeBardeleben wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So I'm seeing all these nice emails about people developing
>>>>>>>>>>>>>> on OMPI today, yet I can't get it to compile. Am I out here
>>>>>>>>>>>>>> in limbo on this, or are others in the same boat? The errors
>>>>>>>>>>>>>> I'm seeing are about some bproc code calling undefined
>>>>>>>>>>>>>> functions, and they are linked again below.
>>>>>>>>>>>>>> -- Nathan
>>>>>>>>>>>>>> Correspondence
>>>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>>>> Nathan DeBardeleben, Ph.D.
>>>>>>>>>>>>>> Los Alamos National Laboratory
>>>>>>>>>>>>>> Parallel Tools Team
>>>>>>>>>>>>>> High Performance Computing Environments
>>>>>>>>>>>>>> phone: 505-667-3428
>>>>>>>>>>>>>> email: ndebard_at_[hidden]
>>>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nathan DeBardeleben wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Back from training and trying to test this, but now OMPI
>>>>>>>>>>>>>>> doesn't compile at all:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include
>>>>>>>>>>>>>>>> -I../../../../include -I../../../.. -I../../../..
>>>>>>>>>>>>>>>> -I../../../../include -I../../../../opal -I../../../../orte
>>>>>>>>>>>>>>>> -I../../../../ompi -g -Wall -Wundef -Wno-long-long
>>>>>>>>>>>>>>>> -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes
>>>>>>>>>>>>>>>> -Wcomment -pedantic -Werror-implicit-function-declaration
>>>>>>>>>>>>>>>> -fno-strict-aliasing -MT ras_lsf_bproc.lo -MD -MP
>>>>>>>>>>>>>>>> -MF .deps/ras_lsf_bproc.Tpo -c ras_lsf_bproc.c -o ras_lsf_bproc.o
>>>>>>>>>>>>>>>> ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_insert':
>>>>>>>>>>>>>>>> ras_lsf_bproc.c:32: error: implicit declaration of function
>>>>>>>>>>>>>>>> `orte_ras_base_node_insert'
>>>>>>>>>>>>>>>> ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_query':
>>>>>>>>>>>>>>>> ras_lsf_bproc.c:37: error: implicit declaration of function
>>>>>>>>>>>>>>>> `orte_ras_base_node_query'
>>>>>>>>>>>>>>>> make[4]: *** [ras_lsf_bproc.lo] Error 1
>>>>>>>>>>>>>>>> make[4]: Leaving directory `/home/ndebard/ompi/orte/mca/ras/lsf_bproc'
>>>>>>>>>>>>>>>> make[3]: *** [all-recursive] Error 1
>>>>>>>>>>>>>>>> make[3]: Leaving directory `/home/ndebard/ompi/orte/mca/ras'
>>>>>>>>>>>>>>>> make[2]: *** [all-recursive] Error 1
>>>>>>>>>>>>>>>> make[2]: Leaving directory `/home/ndebard/ompi/orte/mca'
>>>>>>>>>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>>>>>>>>>> make[1]: Leaving directory `/home/ndebard/ompi/orte'
>>>>>>>>>>>>>>>> make: *** [all-recursive] Error 1
>>>>>>>>>>>>>>>> [sparkplug]~/ompi >
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Clean SVN checkout this morning with configure:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [sparkplug]~/ompi > ./configure --enable-static --disable-shared
>>>>>>>>>>>>>>>> --without-threads --prefix=/home/ndebard/local/ompi
>>>>>>>>>>>>>>>> --with-devel-headers
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- Nathan
>>>>>>>>>>>>>>> Correspondence
>>>>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>>>>> Nathan DeBardeleben, Ph.D.
>>>>>>>>>>>>>>> Los Alamos National Laboratory
>>>>>>>>>>>>>>> Parallel Tools Team
>>>>>>>>>>>>>>> High Performance Computing Environments
>>>>>>>>>>>>>>> phone: 505-667-3428
>>>>>>>>>>>>>>> email: ndebard_at_[hidden]
>>>>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is now fixed in SVN. You should no longer need the
>>>>>>>>>>>>>>>> --build=i586... hack to compile 32 bit code on
>>>>>>>>>>>>>>>> Opterons.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Aug 12, 2005, at 3:17 PM, Brian Barrett wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Aug 12, 2005, at 3:13 PM, Nathan DeBardeleben wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We've got a 64bit Linux (SUSE) box here. For a variety
>>>>>>>>>>>>>>>>>> of reasons (Java, JNI, linking in with OMPI libraries,
>>>>>>>>>>>>>>>>>> etc., which I won't get into) I need to compile OMPI
>>>>>>>>>>>>>>>>>> 32 bit (or get 64bit versions of a lot of other
>>>>>>>>>>>>>>>>>> libraries).
>>>>>>>>>>>>>>>>>> I get various compile errors when I try different
>>>>>>>>>>>>>>>>>> things, but first let me explain the system we have:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This goes on and on and on actually. And the 'is
>>>>>>>>>>>>>>>>>> incompatible with i386:x86-64 output' looks to be
>>>>>>>>>>>>>>>>>> repeated for every line before this error, which
>>>>>>>>>>>>>>>>>> actually caused the Make to bomb.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Any suggestions at all? Surely someone must have tried
>>>>>>>>>>>>>>>>>> to force OMPI to build in 32bit mode on a 64bit machine.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't think anyone has tried to build 32 bit on an
>>>>>>>>>>>>>>>>> Opteron, which is the cause of the problems...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think I know how to fix this, but it won't happen until
>>>>>>>>>>>>>>>>> later in the weekend. I can't think of a good workaround
>>>>>>>>>>>>>>>>> until then. Well, one possibility is to set the target
>>>>>>>>>>>>>>>>> like you were doing and disable ROMIO. Actually, you'll
>>>>>>>>>>>>>>>>> also need to disable Fortran 77. So something like:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ./configure [usual options] --build=i586-suse-linux \
>>>>>>>>>>>>>>>>>     --disable-io-romio --disable-f77
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> might just do the trick.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Brian Barrett
>>>>>>>>>>>>>>>>> Open MPI developer
>>>>>>>>>>>>>>>>> http://www.open-mpi.org/
>>
>> ----
>> Josh Hursey
>> jjhursey_at_[hidden]
>> http://www.open-mpi.org/
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

----
Josh Hursey
jjhursey_at_[hidden]
http://www.open-mpi.org/