Open MPI Development Mailing List Archives

From: Tim S. Woodall (twoodall_at_[hidden])
Date: 2005-08-17 12:18:39


Nathan,

I'll try to reproduce this sometime this week - but I'm pretty swamped.
Is Greg also seeing the same behavior?

Thanks,
Tim

Nathan DeBardeleben wrote:
> To expand on this further, orte_init() seg faults on both bluesteel
> (32-bit Linux) and sparkplug (64-bit Linux) equally. The required
> condition is that orted must be running first (which of course we
> require for our work - a persistent orte daemon and registry).
>
>
>>[bluesteel]~/ptp > ./dump_info
>>Segmentation fault
>>[bluesteel]~/ptp > gdb dump_info
>>GNU gdb 6.1
>>Copyright 2004 Free Software Foundation, Inc.
>>GDB is free software, covered by the GNU General Public License, and
>>you are
>>welcome to change it and/or distribute copies of it under certain
>>conditions.
>>Type "show copying" to see the conditions.
>>There is absolutely no warranty for GDB. Type "show warranty" for
>>details.
>>This GDB was configured as "x86_64-suse-linux"...Using host
>>libthread_db library "/lib64/tls/libthread_db.so.1".
>>
>>(gdb) run
>>Starting program: /home/ndebard/ptp/dump_info
>>
>>Program received signal SIGSEGV, Segmentation fault.
>>0x0000000000000000 in ?? ()
>>(gdb) where
>>#0 0x0000000000000000 in ?? ()
>>#1 0x000000000045997d in orte_init_stage1 () at orte_init_stage1.c:419
>>#2 0x00000000004156a7 in orte_system_init () at orte_system_init.c:38
>>#3 0x00000000004151c7 in orte_init () at orte_init.c:46
>>#4 0x0000000000414cbb in main (argc=1, argv=0x7fbffff298) at
>>dump_info.c:185
>>(gdb)
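Neither dump_info.c nor sub_test.c is included in the thread, so the following is only a sketch of the kind of registry test client the backtrace implies; the no-argument orte_init() call is inferred from the gdb frames, and the header layout and the ORTE_SUCCESS value are assumptions. Note that frame #0 sitting at address 0x0 is the classic signature of a call through a NULL function pointer, which points at whatever orte_init_stage1() invokes at orte_init_stage1.c:419 when a persistent orted is already running.

    /* Hypothetical sketch only -- the real dump_info.c / sub_test.c are not
     * shown in this thread.  Prototypes are declared by hand so the sketch
     * stays self-contained; the real programs would include the ORTE
     * headers instead. */
    #include <stdio.h>

    extern int orte_init(void);     /* no-argument form inferred from the trace */
    extern int orte_finalize(void);

    int main(int argc, char **argv)
    {
        int rc;

        (void)argc; (void)argv;

        /* Segfaults inside orte_init_stage1() when a persistent orted is
         * already running. */
        rc = orte_init();
        if (0 != rc) {              /* ORTE_SUCCESS is assumed to be 0 */
            fprintf(stderr, "orte_init() failed: %d\n", rc);
            return 1;
        }

        /* ... dump registry / name service info here ... */

        orte_finalize();
        return 0;
    }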
>
>
> -- Nathan
> Correspondence
> ---------------------------------------------------------------------
> Nathan DeBardeleben, Ph.D.
> Los Alamos National Laboratory
> Parallel Tools Team
> High Performance Computing Environments
> phone: 505-667-3428
> email: ndebard_at_[hidden]
> ---------------------------------------------------------------------
>
>
>
> Nathan DeBardeleben wrote:
>
>
>>Just to clarify:
>> 1: With no orted started (meaning mpirun or the registry programs
>>start one by themselves), those programs lock up.
>> 2: Starting orted by hand (trying to get these programs to connect to
>>a centralized one) causes the connecting programs to seg fault.
>>
>>Nathan DeBardeleben wrote:
>>
>>
>>
>>
>>>So I dropped an .ompi_ignore into that directory, reconfigured, and the
>>>compile worked (yay!).
>>>However, not a lot of progress: mpirun locks up, and all my registry test
>>>programs lock up as well. If I start the orted by hand, then any of my
>>>registry-calling programs segfaults:
>>>
>>>>[sparkplug]~/ptp > gdb sub_test
>>>>GNU gdb 6.1
>>>>Copyright 2004 Free Software Foundation, Inc.
>>>>GDB is free software, covered by the GNU General Public License, and
>>>>you are
>>>>welcome to change it and/or distribute copies of it under certain
>>>>conditions.
>>>>Type "show copying" to see the conditions.
>>>>There is absolutely no warranty for GDB. Type "show warranty" for
>>>>details.
>>>>This GDB was configured as "x86_64-suse-linux"...Using host
>>>>libthread_db library "/lib64/tls/libthread_db.so.1".
>>>>
>>>>(gdb) run
>>>>Starting program: /home/ndebard/ptp/sub_test
>>>>
>>>>Program received signal SIGSEGV, Segmentation fault.
>>>>0x0000000000000000 in ?? ()
>>>>(gdb) where
>>>>#0 0x0000000000000000 in ?? ()
>>>>#1 0x00000000004598a5 in orte_init_stage1 () at orte_init_stage1.c:419
>>>>#2 0x00000000004155cf in orte_system_init () at orte_system_init.c:38
>>>>#3 0x00000000004150ef in orte_init () at orte_init.c:46
>>>>#4 0x00000000004148a1 in main (argc=1, argv=0x7fbffff178) at
>>>>sub_test.c:60
>>>>(gdb)
>>>
>>>Yes, I recompiled everything.
>>>
>>>Here's an example of me trying something a little more complicated
>>>(which I believe locks up for the same reason - something borked with
>>>the registry interaction).
>>>
>>>>>[sparkplug]~/ompi-test > bjssub -s 10000 -n 10 -i bash
>>>>>Waiting for interactive job nodes.
>>>>>(nodes 18 16 17 18 19 20 21 22 23 24 25)
>>>>>Starting interactive job.
>>>>>NODES=16,17,18,19,20,21,22,23,24,25
>>>>>JOBID=18
>>>>
>>>>So I got my nodes.
>>>>
>>>>>ndebard_at_sparkplug:~/ompi-test> export OMPI_MCA_ptl_base_exclude=sm
>>>>>ndebard_at_sparkplug:~/ompi-test> export OMPI_MCA_pls_bproc_seed_priority=101
>>>>
>>>>And I set these env vars as we need for Greg's bproc; without the
>>>>2nd export the machine's load maxes out and it locks up.
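For context: Open MPI treats any OMPI_MCA_<name> environment variable as an MCA parameter override, so these exports are equivalent to passing the same parameters with mpirun's -mca option. Purely as an illustrative aside (this program is not part of the thread), the overrides can be double-checked from the shell that will launch mpirun:

    /* check_mca_env.c -- illustrative only; prints the two overrides used
     * above so it is obvious whether a child mpirun would inherit them. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *names[] = {
            "OMPI_MCA_ptl_base_exclude",
            "OMPI_MCA_pls_bproc_seed_priority",
        };
        size_t i;

        for (i = 0; i < sizeof(names) / sizeof(names[0]); ++i) {
            const char *val = getenv(names[i]);
            printf("%s = %s\n", names[i], val ? val : "(unset)");
        }
        return 0;
    }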
>>>>
>>>>>ndebard_at_sparkplug:~/ompi-test> bpstat
>>>>>Node(s)   Status   Mode         User      Group
>>>>>100-128   down     ----------   root      root
>>>>>0-15      up       ---x------   vchandu   vchandu
>>>>>16-25     up       ---x------   ndebard   ndebard
>>>>>26-27     up       ---x------   root      root
>>>>>28-30     up       ---x--x--x   root      root
>>>>>ndebard_at_sparkplug:~/ompi-test> env | grep NODES
>>>>>NODES=16,17,18,19,20,21,22,23,24,25
>>>>
>>>>Yes, I really have the nodes.
>>>>
>>>>>ndebard_at_sparkplug:~/ompi-test> mpicc -o test-mpi test-mpi.c
>>>>>ndebard_at_sparkplug:~/ompi-test>
>>>>
>>>>Recompile for good measure.
>>>>
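The source of test-mpi.c is not shown anywhere in the thread; a minimal MPI smoke test of the kind typically used here (this is a stand-in, not Nathan's actual file) would look roughly like the following. Since the hang below happens before any output appears, the problem is presumably in the launch / orte_init path rather than in anything the test program does.

    /* Hypothetical stand-in for test-mpi.c, which is not shown in the thread. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }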
>>>>>ndebard_at_sparkplug:~/ompi-test> ls /tmp/openmpi-sessions-ndebard*
>>>>>/bin/ls: /tmp/openmpi-sessions-ndebard*: No such file or directory
>>>>
>>>>Proof that there's no leftover old directory.
>>>>
>>>>>ndebard_at_sparkplug:~/ompi-test> mpirun -np 1 test-mpi
>>>>
>>>>It never responds at this point, but I can kill it with ^C.
>>>>
>>>>>mpirun: killing job...
>>>>>Killed
>>>>>ndebard_at_sparkplug:~/ompi-test>
>>>
>>>Jeff Squyres wrote:
>>>
>>>>Is this what Tim Prins was working on?
>>>>
>>>>
>>>>On Aug 16, 2005, at 5:21 PM, Tim S. Woodall wrote:
>>>>
>>>>>I'm not sure why this is even building... Is someone working on this?
>>>>>I thought we had .ompi_ignore files in this directory.
>>>>>
>>>>>Tim
>>>>>
>>>>>
>>>>>Nathan DeBardeleben wrote:
>>>>>
>>>>>>So I'm seeing all these nice emails today about people developing on
>>>>>>OMPI, yet I can't get it to compile. Am I out here in limbo on this, or
>>>>>>are others in the same boat? The errors I'm seeing are about some bproc
>>>>>>code calling undefined functions, and they are quoted again below.
>>>>>>
>>>>>>Nathan DeBardeleben wrote:
>>>>>>
>>>>>>>Back from training and trying to test this, but now OMPI doesn't
>>>>>>>compile at all:
>>>>>>>
>>>>>>>>gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include
>>>>>>>>-I../../../../include -I../../../.. -I../../../..
>>>>>>>>-I../../../../include -I../../../../opal -I../../../../orte
>>>>>>>>-I../../../../ompi -g -Wall -Wundef -Wno-long-long -Wsign-compare
>>>>>>>>-Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic
>>>>>>>>-Werror-implicit-function-declaration -fno-strict-aliasing -MT
>>>>>>>>ras_lsf_bproc.lo -MD -MP -MF .deps/ras_lsf_bproc.Tpo -c
>>>>>>>>ras_lsf_bproc.c -o ras_lsf_bproc.o
>>>>>>>>ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_insert':
>>>>>>>>ras_lsf_bproc.c:32: error: implicit declaration of function
>>>>>>>>`orte_ras_base_node_insert'
>>>>>>>>ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_query':
>>>>>>>>ras_lsf_bproc.c:37: error: implicit declaration of function
>>>>>>>>`orte_ras_base_node_query'
>>>>>>>>make[4]: *** [ras_lsf_bproc.lo] Error 1
>>>>>>>>make[4]: Leaving directory
>>>>>>>>`/home/ndebard/ompi/orte/mca/ras/lsf_bproc'
>>>>>>>>make[3]: *** [all-recursive] Error 1
>>>>>>>>make[3]: Leaving directory `/home/ndebard/ompi/orte/mca/ras'
>>>>>>>>make[2]: *** [all-recursive] Error 1
>>>>>>>>make[2]: Leaving directory `/home/ndebard/ompi/orte/mca'
>>>>>>>>make[1]: *** [all-recursive] Error 1
>>>>>>>>make[1]: Leaving directory `/home/ndebard/ompi/orte'
>>>>>>>>make: *** [all-recursive] Error 1
>>>>>>>>[sparkplug]~/ompi >
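A note on why these stop the build outright: the gcc line above includes -Werror-implicit-function-declaration, so calling a function with no prototype in scope (here orte_ras_base_node_insert() and orte_ras_base_node_query()) is a hard error rather than a warning. A minimal standalone illustration, deliberately not OMPI code:

    /* demo.c -- not OMPI code; compile with
     *   gcc -Werror-implicit-function-declaration -c demo.c  */

    int do_work(int x);     /* with this prototype in scope the file compiles;  */
                            /* remove it and gcc stops with                     */
                            /* "error: implicit declaration of function         */
                            /* `do_work'" -- the same class of error reported   */
                            /* for ras_lsf_bproc.c above.                       */
    int caller(void)
    {
        return do_work(42);
    }

So the component presumably either needs the header that declares the orte_ras_base_node_* functions pulled in, or (as ended up happening here) an .ompi_ignore so the lsf_bproc directory is skipped entirely.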
>>>>>>>
>>>>>>>Clean SVN checkout this morning with configure:
>>>>>>>
>>>>>>>>[sparkplug]~/ompi > ./configure --enable-static --disable-shared
>>>>>>>>--without-threads --prefix=/home/ndebard/local/ompi
>>>>>>>>--with-devel-headers
>>>>>>>
>>>>>>>Brian Barrett wrote:
>>>>>>>
>>>>>>>>This is now fixed in SVN. You should no longer need the
>>>>>>>>--build=i586... hack to compile 32 bit code on Opterons.
>>>>>>>>
>>>>>>>>Brian
>>>>>>>>
>>>>>>>>On Aug 12, 2005, at 3:17 PM, Brian Barrett wrote:
>>>>>>>>
>>>>>>>>>On Aug 12, 2005, at 3:13 PM, Nathan DeBardeleben wrote:
>>>>>>>>>
>>>>>>>>>>We've got a 64-bit Linux (SUSE) box here. For a variety of reasons
>>>>>>>>>>(Java, JNI, linking in with the OMPI libraries, etc., which I won't
>>>>>>>>>>get into) I need to compile OMPI 32-bit (or get 64-bit versions of a
>>>>>>>>>>lot of other libraries).
>>>>>>>>>>I get various compile errors when I try different things, but first
>>>>>>>>>>let me explain the system we have:
>>>>>>>>>
>>>>>>>>><snip>
>>>>>>>>>
>>>>>>>>>>This goes on and on and on, actually. And the 'is incompatible with
>>>>>>>>>>i386:x86-64 output' message looks to be repeated for every line before
>>>>>>>>>>the one that actually caused the make to bomb.
>>>>>>>>>>
>>>>>>>>>>Any suggestions at all? Surely someone must have tried to force OMPI
>>>>>>>>>>to build in 32-bit mode on a 64-bit machine.
>>>>>>>>>
>>>>>>>>>I don't think anyone has tried to build 32-bit on an Opteron, which
>>>>>>>>>is the cause of the problems...
>>>>>>>>>
>>>>>>>>>I think I know how to fix this, but it won't happen until later in
>>>>>>>>>the weekend. I can't think of a good workaround until then. Well, one
>>>>>>>>>possibility is to set the target like you were doing and disable
>>>>>>>>>ROMIO. Actually, you'll also need to disable Fortran 77. So
>>>>>>>>>something like:
>>>>>>>>>
>>>>>>>>>./configure [usual options] --build=i586-suse-linux --disable-io-romio --disable-f77
>>>>>>>>>
>>>>>>>>>might just do the trick.
>>>>>>>>>
>>>>>>>>>Brian
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>--
>>>>>>>>>Brian Barrett
>>>>>>>>>Open MPI developer
>>>>>>>>>http://www.open-mpi.org/
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>