Open MPI Development Mailing List Archives


From: Brian Barrett (brbarret_at_[hidden])
Date: 2005-08-18 13:05:30


Just to double check, can you run ompi_info and send me the results?

Thanks,

Brian

On Aug 18, 2005, at 10:45 AM, Rainer Keller wrote:

> Hello,
> I see the "same" (well, probably not exactly the same) thing here on an
> Opteron in 64-bit mode (with -g and so on); I get:
>
> #0 0x0000000040085160 in orte_sds_base_contact_universe ()
> at ../../../../../orte/mca/sds/base/sds_base_interface.c:29
> 29 return orte_sds_base_module->contact_universe();
> (gdb) where
> #0 0x0000000040085160 in orte_sds_base_contact_universe ()
> at ../../../../../orte/mca/sds/base/sds_base_interface.c:29
> #1 0x0000000040063e95 in orte_init_stage1 ()
> at ../../../orte/runtime/orte_init_stage1.c:185
> #2 0x0000000040017e7d in orte_system_init ()
> at ../../../orte/runtime/orte_system_init.c:38
> #3 0x00000000400148f5 in orte_init () at ../../../orte/runtime/orte_init.c:46
> #4 0x000000004000dfc7 in main (argc=4, argv=0x7fbfffe8a8)
> at ../../../../orte/tools/orterun/orterun.c:291
> #5 0x0000002a95c0c017 in __libc_start_main () from /lib64/libc.so.6
> #6 0x000000004000bf2a in _start ()
> (gdb)
> within mpirun
>
> orte_sds_base_module is NULL here...
> This is without a persistent orted; just mpirun...
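>
> To make the failure mode concrete (a minimal, self-contained C sketch;
> hypothetical code with made-up names, not the actual OMPI sources or
> fix): dispatching through a NULL module pointer is exactly what
> segfaults here, and a guard in the dispatch wrapper would turn it into
> a clean error instead:
>
> #include <stdio.h>
>
> /* simplified stand-in for the sds module structure */
> typedef struct {
>     int (*contact_universe)(void);
> } sds_module_t;
>
> /* stays NULL when no sds component has been selected */
> static sds_module_t *sds_module = NULL;
>
> static int sds_contact_universe(void)
> {
>     /* guard the dispatch instead of crashing on a NULL module */
>     if (NULL == sds_module || NULL == sds_module->contact_universe) {
>         fprintf(stderr, "sds: no module selected\n");
>         return -1;
>     }
>     return sds_module->contact_universe();
> }
>
> int main(void)
> {
>     return sds_contact_universe();  /* error message instead of SIGSEGV */
> }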
>
> CU,
> ray
>
>
> On Thursday 18 August 2005 16:57, Nathan DeBardeleben wrote:
>
>> FYI, this only happens when I let OMPI compile 64-bit on Linux. When I
>> throw CFLAGS=FFLAGS=CXXFLAGS=-m32 in there, orted, my myriad of test
>> codes, mpirun, registry subscription codes, and JNI all work like a
>> champ. Something's wrong with the 64-bit build, it appears to me.
>>
>> -- Nathan
>> Correspondence
>> ---------------------------------------------------------------------
>> Nathan DeBardeleben, Ph.D.
>> Los Alamos National Laboratory
>> Parallel Tools Team
>> High Performance Computing Environments
>> phone: 505-667-3428
>> email: ndebard_at_[hidden]
>> ---------------------------------------------------------------------
>>
>> Tim S. Woodall wrote:
>>
>>> Nathan,
>>>
>>> I'll try to reproduce this sometime this week - but I'm pretty
>>> swamped.
>>> Is Greg also seeing the same behavior?
>>>
>>> Thanks,
>>> Tim
>>>
>>> Nathan DeBardeleben wrote:
>>>
>>>> To expand on this further, orte_init() seg faults on both bluesteel
>>>> (32bit linux) and sparkplug (64bit linux) equally. The required
>>>> condition is that orted must be running first (which of course we
>>>> require for our work - a persistent orte daemon and registry).
>>>>
>>>>
>>>>> [bluesteel]~/ptp > ./dump_info
>>>>> Segmentation fault
>>>>> [bluesteel]~/ptp > gdb dump_info
>>>>> GNU gdb 6.1
>>>>> Copyright 2004 Free Software Foundation, Inc.
>>>>> GDB is free software, covered by the GNU General Public License, and
>>>>> you are welcome to change it and/or distribute copies of it under
>>>>> certain conditions.
>>>>> Type "show copying" to see the conditions.
>>>>> There is absolutely no warranty for GDB. Type "show warranty" for
>>>>> details.
>>>>> This GDB was configured as "x86_64-suse-linux"...Using host
>>>>> libthread_db library "/lib64/tls/libthread_db.so.1".
>>>>>
>>>>> (gdb) run
>>>>> Starting program: /home/ndebard/ptp/dump_info
>>>>>
>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>> 0x0000000000000000 in ?? ()
>>>>> (gdb) where
>>>>> #0 0x0000000000000000 in ?? ()
>>>>> #1 0x000000000045997d in orte_init_stage1 () at orte_init_stage1.c:419
>>>>> #2 0x00000000004156a7 in orte_system_init () at orte_system_init.c:38
>>>>> #3 0x00000000004151c7 in orte_init () at orte_init.c:46
>>>>> #4 0x0000000000414cbb in main (argc=1, argv=0x7fbffff298) at dump_info.c:185
>>>>> (gdb)
>>>>>
>>>>
>>>> -- Nathan
>>>> Correspondence
>>>> ---------------------------------------------------------------------
>>>> Nathan DeBardeleben, Ph.D.
>>>> Los Alamos National Laboratory
>>>> Parallel Tools Team
>>>> High Performance Computing Environments
>>>> phone: 505-667-3428
>>>> email: ndebard_at_[hidden]
>>>> ---------------------------------------------------------------------
>>>>
>>>> Nathan DeBardeleben wrote:
>>>>
>>>>> Just to clarify:
>>>>> 1: no orted started (meaning the mpirun or registry programs will
>>>>> start one by themselves) causes those programs to lock up.
>>>>> 2: starting orted by hand (trying to get these programs to connect
>>>>> to a centralized one) causes the connecting programs to seg fault.
>>>>>
>>>>> -- Nathan
>>>>> Correspondence
>>>>> ---------------------------------------------------------------------
>>>>> Nathan DeBardeleben, Ph.D.
>>>>> Los Alamos National Laboratory
>>>>> Parallel Tools Team
>>>>> High Performance Computing Environments
>>>>> phone: 505-667-3428
>>>>> email: ndebard_at_[hidden]
>>>>> ---------------------------------------------------------------------
>>>>>
>>>>> Nathan DeBardeleben wrote:
>>>>>
>>>>>> So I dropped an .ompi_ignore into that directory, reconfigured, and
>>>>>> the compile worked (yay!).
>>>>>> However, not a lot of progress: mpirun locks up, and all my registry
>>>>>> test programs lock up as well. If I start the orted by hand, then any
>>>>>> of my registry-calling programs segfault:
>>>>>>
>>>>>>> [sparkplug]~/ptp > gdb sub_test
>>>>>>> GNU gdb 6.1
>>>>>>> Copyright 2004 Free Software Foundation, Inc.
>>>>>>> GDB is free software, covered by the GNU General Public License, and
>>>>>>> you are welcome to change it and/or distribute copies of it under
>>>>>>> certain conditions.
>>>>>>> Type "show copying" to see the conditions.
>>>>>>> There is absolutely no warranty for GDB. Type "show warranty" for
>>>>>>> details.
>>>>>>> This GDB was configured as "x86_64-suse-linux"...Using host
>>>>>>> libthread_db library "/lib64/tls/libthread_db.so.1".
>>>>>>>
>>>>>>> (gdb) run
>>>>>>> Starting program: /home/ndebard/ptp/sub_test
>>>>>>>
>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>> 0x0000000000000000 in ?? ()
>>>>>>> (gdb) where
>>>>>>> #0 0x0000000000000000 in ?? ()
>>>>>>> #1 0x00000000004598a5 in orte_init_stage1 () at orte_init_stage1.c:419
>>>>>>> #2 0x00000000004155cf in orte_system_init () at orte_system_init.c:38
>>>>>>> #3 0x00000000004150ef in orte_init () at orte_init.c:46
>>>>>>> #4 0x00000000004148a1 in main (argc=1, argv=0x7fbffff178) at sub_test.c:60
>>>>>>> (gdb)
>>>>>>>
>>>>>>
>>>>>> Yes, I recompiled everything.
>>>>>>
>>>>>> Here's an example of me trying something a little more complicated
>>>>>> (which I believe locks up for the same reason - something borked
>>>>>> with the registry interaction).
>>>>>>
>>>>>>
>>>>>>>> [sparkplug]~/ompi-test > bjssub -s 10000 -n 10 -i bash
>>>>>>>> Waiting for interactive job nodes.
>>>>>>>> (nodes 18 16 17 18 19 20 21 22 23 24 25)
>>>>>>>> Starting interactive job.
>>>>>>>> NODES=16,17,18,19,20,21,22,23,24,25
>>>>>>>> JOBID=18
>>>>>>>>
>>>>>>>
>>>>>>> so i got my nodes
>>>>>>>
>>>>>>>
>>>>>>>> ndebard_at_sparkplug:~/ompi-test> export OMPI_MCA_ptl_base_exclude=sm
>>>>>>>> ndebard_at_sparkplug:~/ompi-test> export OMPI_MCA_pls_bproc_seed_priority=101
>>>>>>>>
>>>>>>>
>>>>>>> and set these env vars as we need for Greg's bproc; without the
>>>>>>> 2nd export, the machine's load maxes out and it locks up.
>>>>>>>
>>>>>>>
>>>>>>>> ndebard_at_sparkplug:~/ompi-test> bpstat
>>>>>>>> Node(s)   Status  Mode        User     Group
>>>>>>>> 100-128   down    ----------  root     root
>>>>>>>> 0-15      up      ---x------  vchandu  vchandu
>>>>>>>> 16-25     up      ---x------  ndebard  ndebard
>>>>>>>> 26-27     up      ---x------  root     root
>>>>>>>> 28-30     up      ---x--x--x  root     root
>>>>>>>> ndebard_at_sparkplug:~/ompi-test> env | grep NODES
>>>>>>>> NODES=16,17,18,19,20,21,22,23,24,25
>>>>>>>>
>>>>>>>
>>>>>>> yes, i really have the nodes
>>>>>>>
>>>>>>>
>>>>>>>> ndebard_at_sparkplug:~/ompi-test> mpicc -o test-mpi test-mpi.c
>>>>>>>> ndebard_at_sparkplug:~/ompi-test>
>>>>>>>>
>>>>>>>
>>>>>>> recompile for good measure
>>>>>>>
>>>>>>>
>>>>>>>> ndebard_at_sparkplug:~/ompi-test> ls /tmp/openmpi-sessions-ndebard*
>>>>>>>> /bin/ls: /tmp/openmpi-sessions-ndebard*: No such file or directory
>>>>>>>>
>>>>>>>
>>>>>>> proof that there's no left over old directory
>>>>>>>
>>>>>>>
>>>>>>>> ndebard_at_sparkplug:~/ompi-test> mpirun -np 1 test-mpi
>>>>>>>>
>>>>>>>
>>>>>>> it never responds at this point - but I can kill it with ^C.
>>>>>>>
>>>>>>>
>>>>>>>> mpirun: killing job...
>>>>>>>> Killed
>>>>>>>> ndebard_at_sparkplug:~/ompi-test>
>>>>>>>>
>>>>>>
>>>>>> -- Nathan
>>>>>> Correspondence
>>>>>> ---------------------------------------------------------------------
>>>>>> Nathan DeBardeleben, Ph.D.
>>>>>> Los Alamos National Laboratory
>>>>>> Parallel Tools Team
>>>>>> High Performance Computing Environments
>>>>>> phone: 505-667-3428
>>>>>> email: ndebard_at_[hidden]
>>>>>> ---------------------------------------------------------------------
>>>>>>
>>>>>> Jeff Squyres wrote:
>>>>>>
>>>>>>> Is this what Tim Prins was working on?
>>>>>>>
>>>>>>> On Aug 16, 2005, at 5:21 PM, Tim S. Woodall wrote:
>>>>>>>
>>>>>>>> I'm not sure why this is even building... Is someone working
>>>>>>>> on this?
>>>>>>>> I thought we had .ompi_ignore files in this directory.
>>>>>>>>
>>>>>>>> Tim
>>>>>>>>
>>>>>>>> Nathan DeBardeleben wrote:
>>>>>>>>
>>>>>>>>> So I'm seeing all these nice emails about people developing on
>>>>>>>>> OMPI today, yet I can't get it to compile. Am I out here in limbo
>>>>>>>>> on this, or are others in the same boat? The errors I'm seeing are
>>>>>>>>> about some bproc code calling undefined functions, and they are
>>>>>>>>> linked again below.
>>>>>>>>>
>>>>>>>>> -- Nathan
>>>>>>>>> Correspondence
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> Nathan DeBardeleben, Ph.D.
>>>>>>>>> Los Alamos National Laboratory
>>>>>>>>> Parallel Tools Team
>>>>>>>>> High Performance Computing Environments
>>>>>>>>> phone: 505-667-3428
>>>>>>>>> email: ndebard_at_[hidden]
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> Nathan DeBardeleben wrote:
>>>>>>>>>
>>>>>>>>>> Back from training and trying to test this, but now OMPI doesn't
>>>>>>>>>> compile at all:
>>>>>>>>>>
>>>>>>>>>>> gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include -I../../../../include -I../../../.. -I../../../.. -I../../../../include -I../../../../opal -I../../../../orte -I../../../../ompi -g -Wall -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Werror-implicit-function-declaration -fno-strict-aliasing -MT ras_lsf_bproc.lo -MD -MP -MF .deps/ras_lsf_bproc.Tpo -c ras_lsf_bproc.c -o ras_lsf_bproc.o
>>>>>>>>>>> ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_insert':
>>>>>>>>>>> ras_lsf_bproc.c:32: error: implicit declaration of function `orte_ras_base_node_insert'
>>>>>>>>>>> ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_query':
>>>>>>>>>>> ras_lsf_bproc.c:37: error: implicit declaration of function `orte_ras_base_node_query'
>>>>>>>>>>> make[4]: *** [ras_lsf_bproc.lo] Error 1
>>>>>>>>>>> make[4]: Leaving directory `/home/ndebard/ompi/orte/mca/ras/lsf_bproc'
>>>>>>>>>>> make[3]: *** [all-recursive] Error 1
>>>>>>>>>>> make[3]: Leaving directory `/home/ndebard/ompi/orte/mca/ras'
>>>>>>>>>>> make[2]: *** [all-recursive] Error 1
>>>>>>>>>>> make[2]: Leaving directory `/home/ndebard/ompi/orte/mca'
>>>>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>>>>> make[1]: Leaving directory `/home/ndebard/ompi/orte'
>>>>>>>>>>> make: *** [all-recursive] Error 1
>>>>>>>>>>> [sparkplug]~/ompi >
>>>>>>>>>>>
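>>>>>>>>>> As a side note on what the error means (a minimal, self-contained
>>>>>>>>>> C sketch with a made-up signature - not the OMPI sources): with
>>>>>>>>>> -Werror-implicit-function-declaration, gcc turns any call to an
>>>>>>>>>> undeclared function into a hard error, so the component needs a
>>>>>>>>>> prototype in scope (i.e., it must include whatever header declares
>>>>>>>>>> the orte_ras_base_node_* functions) before calling them:
>>>>>>>>>>
>>>>>>>>>> /* prototype in scope: compiles cleanly; remove it and gcc with
>>>>>>>>>>    -Werror-implicit-function-declaration fails exactly as above */
>>>>>>>>>> int orte_ras_base_node_insert(void *node_list);
>>>>>>>>>>
>>>>>>>>>> static int node_insert(void *node_list)
>>>>>>>>>> {
>>>>>>>>>>     return orte_ras_base_node_insert(node_list);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> /* stub definition so the sketch links and runs standalone */
>>>>>>>>>> int orte_ras_base_node_insert(void *node_list)
>>>>>>>>>> {
>>>>>>>>>>     (void)node_list;
>>>>>>>>>>     return 0;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> int main(void)
>>>>>>>>>> {
>>>>>>>>>>     return node_insert(0);
>>>>>>>>>> }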
>>>>>>>>>>
>>>>>>>>>> Clean SVN checkout this morning with configure:
>>>>>>>>>>
>>>>>>>>>>> [sparkplug]~/ompi > ./configure --enable-static --disable-shared --without-threads --prefix=/home/ndebard/local/ompi --with-devel-headers
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -- Nathan
>>>>>>>>>> Correspondence
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> Nathan DeBardeleben, Ph.D.
>>>>>>>>>> Los Alamos National Laboratory
>>>>>>>>>> Parallel Tools Team
>>>>>>>>>> High Performance Computing Environments
>>>>>>>>>> phone: 505-667-3428
>>>>>>>>>> email: ndebard_at_[hidden]
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> Brian Barrett wrote:
>>>>>>>>>>
>>>>>>>>>>> This is now fixed in SVN. You should no longer need the
>>>>>>>>>>> --build=i586... hack to compile 32 bit code on Opterons.
>>>>>>>>>>>
>>>>>>>>>>> Brian
>>>>>>>>>>>
>>>>>>>>>>> On Aug 12, 2005, at 3:17 PM, Brian Barrett wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Aug 12, 2005, at 3:13 PM, Nathan DeBardeleben wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We've got a 64-bit Linux (SUSE) box here. For a variety of
>>>>>>>>>>>>> reasons (Java, JNI, linking in with OMPI libraries, etc., which
>>>>>>>>>>>>> I won't get into) I need to compile OMPI 32-bit (or get 64-bit
>>>>>>>>>>>>> versions of a lot of other libraries).
>>>>>>>>>>>>> I get various compile errors when I try different things, but
>>>>>>>>>>>>> first let me explain the system we have:
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> <snip>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> This goes on and on and on, actually. And the 'is incompatible
>>>>>>>>>>>>> with i386:x86-64 output' message looks to be repeated for every
>>>>>>>>>>>>> line before this error, which actually caused the make to bomb.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any suggestions at all? Surely someone must have tried to force
>>>>>>>>>>>>> OMPI to build in 32-bit mode on a 64-bit machine.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think anyone has tried to build 32-bit on an Opteron,
>>>>>>>>>>>> which is the cause of the problems...
>>>>>>>>>>>>
>>>>>>>>>>>> I think I know how to fix this, but it won't happen until later
>>>>>>>>>>>> in the weekend, and I can't think of a good workaround until
>>>>>>>>>>>> then. Well, one possibility is to set the target like you were
>>>>>>>>>>>> doing and disable ROMIO. Actually, you'll also need to disable
>>>>>>>>>>>> Fortran 77. So something like:
>>>>>>>>>>>>
>>>>>>>>>>>> ./configure [usual options] --build=i586-suse-linux --disable-io-romio --disable-f77
>>>>>>>>>>>>
>>>>>>>>>>>> might just do the trick.
>>>>>>>>>>>>
>>>>>>>>>>>> Brian
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Brian Barrett
>>>>>>>>>>>> Open MPI developer
>>>>>>>>>>>> http://www.open-mpi.org/
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
> --
> ---------------------------------------------------------------------
> Dipl.-Inf. Rainer Keller      email: keller_at_[hidden]
> High Performance Computing    Tel:   ++49 (0)711-685 5858
> Center Stuttgart (HLRS)       Fax:   ++49 (0)711-678 7626
> Nobelstrasse 19, R. O0.030    http://www.hlrs.de/people/keller
> 70550 Stuttgart