
Open MPI User's Mailing List Archives


Subject: [OMPI users] help
From: sri pramoda (sri_pramoda_at_[hidden])
Date: 2013-09-26 00:28:21


  Help : unsubscribe

--------------------------------------------
On Fri, 20/9/13, users-request_at_[hidden] <users-request_at_[hidden]> wrote:

 Subject: users Digest, Vol 2685, Issue 2
 To: users_at_[hidden]
 Date: Friday, 20 September, 2013, 11:00 AM
 
 Send users mailing list submissions to
     users_at_[hidden]
 
 To subscribe or unsubscribe via the World Wide Web, visit
     http://www.open-mpi.org/mailman/listinfo.cgi/users
 or, via email, send a message with subject or body 'help' to
     users-request_at_[hidden]
 
 You can reach the person managing the list at
     users-owner_at_[hidden]
 
 When replying, please edit your Subject line so it is more specific
 than "Re: Contents of users digest..."
 
 
 Today's Topics:
 
    1. error building openmpi-1.7.3a1r29213 on Solaris (Siegmar Gross)
    2. intermittent node file error running with torque/maui integration (Noam Bernstein)
    3. Re: intermittent node file error running with torque/maui integration (Noam Bernstein)
    4. Re: intermittent node file error running with torque/maui integration (Noam Bernstein)
    5. Re: intermittent node file error running with torque/maui integration (Reuti)
    6. Re: intermittent node file error running with torque/maui integration (Noam Bernstein)
    7. Re: intermittent node file error running with torque/maui integration (Noam Bernstein)
    8. Debugging Runtime/Ethernet Problems (Lloyd Brown)
    9. Re: Debugging Runtime/Ethernet Problems (Elken, Tom)
   10. Re: Debugging Runtime/Ethernet Problems (Ralph Castain)
   11. Re: compilation aborted for Handler.cpp (code 2) (Jeff Squyres (jsquyres))
   12. Re: Debugging Runtime/Ethernet Problems (Jeff Squyres (jsquyres))
   13. Re: compilation aborted for Handler.cpp (code 2) (Jeff Squyres (jsquyres))
   14. Re: intermittent node file error running with torque/maui integration (Gus Correa)
   15. Re: error building openmpi-1.7.3a1r29213 on Solaris (Jeff Squyres (jsquyres))
 
 
 ----------------------------------------------------------------------
 
 Message: 1
 Date: Fri, 20 Sep 2013 13:00:41 +0200 (CEST)
 From: Siegmar Gross <Siegmar.Gross_at_[hidden]>
 To: users_at_[hidden]
 Subject: [OMPI users] error building openmpi-1.7.3a1r29213 on Solaris
 Message-ID: <201309201100.r8KB0fTr022555_at_[hidden]>
 Content-Type: TEXT/plain; charset=us-ascii
 
 Hi,
 
 I tried to install openmpi-1.7.3a1r29213 on "openSuSE Linux 12.1",
 "Solaris 10 x86_64", and "Solaris 10 sparc" with "Sun C 5.12" and
 gcc-4.8.0 in 64-bit mode. Unfortunately "make" breaks with the same
 error for both compilers on both Solaris platforms.
 
 
 tyr openmpi-1.7.3a1r29213-SunOS.sparc.64_cc 126 tail -10 \
   log.make.SunOS.sparc.64_cc
 Making all in mca/if/posix_ipv4
 make[2]: Entering directory `.../opal/mca/if/posix_ipv4'
   CC       if_posix.lo
 "../../../../../openmpi-1.7.3a1r29213/opal/mca/if/posix_ipv4/if_posix.c", line 277: undefined struct/union member: ifr_mtu
 cc: acomp failed for ../../../../../openmpi-1.7.3a1r29213/opal/mca/if/posix_ipv4/if_posix.c
 make[2]: *** [if_posix.lo] Error 1
 make[2]: Leaving directory `.../opal/mca/if/posix_ipv4'
 make[1]: *** [all-recursive] Error 1
 make[1]: Leaving directory `.../opal'
 make: *** [all-recursive] Error 1
 
 
 tyr openmpi-1.7.3a1r29213-SunOS.sparc.64_gcc 131 tail -12 \
   log.make.SunOS.sparc.64_gcc
 Making all in mca/if/posix_ipv4
 make[2]: Entering directory `.../opal/mca/if/posix_ipv4'
   CC       if_posix.lo
 ../../../../../openmpi-1.7.3a1r29213/opal/mca/if/posix_ipv4/if_posix.c: In function 'if_posix_open':
 ../../../../../openmpi-1.7.3a1r29213/opal/mca/if/posix_ipv4/if_posix.c:277:31: error: 'struct ifreq' has no member named 'ifr_mtu'
      intf->if_mtu = ifr->ifr_mtu;
                          ^
 make[2]: *** [if_posix.lo] Error 1
 make[2]: Leaving directory `.../opal/mca/if/posix_ipv4'
 make[1]: *** [all-recursive] Error 1
 make[1]: Leaving directory `.../opal'
 make: *** [all-recursive] Error 1
 
 
 
 I have had this problem before and Jeff solved it. Here is my
 old e-mail.
 
 Date: Tue, 7 May 2013 19:38:11 +0200 (CEST)
 From: Siegmar Gross <Siegmar.Gross_at_[hidden]>
 Subject: Re: commit/ompi-java: jsquyres: Up to SVN r28392
 To: jsquyres_at_[hidden]
 Cc: Siegmar.Gross_at_[hidden]
 MIME-Version: 1.0
 Content-MD5: O1pjPK/1JiMXXZ/EHyMU0Q==
 X-HRZ-JLUG-MailScanner-Information: Passed JLUG virus check
 X-HRZ-JLUG-MailScanner: No virus found
 X-Envelope-From: fd1026_at_[hidden]
 X-Spam-Status: No
 
 Hello Jeff
 
> Ok, I made a change in the OMPI trunk that should fix this:
>
>     https://svn.open-mpi.org/trac/ompi/changeset/28460
>
> And I pulled it into the ompi-java hg repo.  Could you give
> it a whirl and let me know if this works for you?
 
 Perfect :-)))).  Now I can build Open MPI on Solaris without
 "#if 0" :-). Thank you very much for your help.
 
 
 "make check" still produces the old bus error on Solaris Sparc.
 All checks are fine on Linux and Solaris x86_64.
 
 ...
 PASS: ddt_test
 /bin/bash: line 5: 12453 Bus Error     ${dir}$tst
 FAIL: ddt_raw
 ========================================================
 1 of 5 tests failed
 Please report to http://www.open-mpi.org/community/help/
 ========================================================
 make[3]: *** [check-TESTS] Error 1
 ...
 
 
 Kind regards
 
 Siegmar
 
 
> On May 6, 2013, at 7:20 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
>
> > Hello Jeff
> >
> >>>> "../../../../../ompi-java/opal/mca/if/posix_ipv4/if_posix.c",
> >>>> line 279: undefined struct/union member: ifr_mtu
> >>>>
> >>>> Sigh.  Solaris kills me.  :-\
> >>>>
> >>>> Just so I understand -- Solaris has SIOCGIFMTU, but doesn't
> >>>> have struct ifreq.ifr_mtu?
> >>>
> >>> I found SIOCGIFMTU in sys/sockio.h with the following comment.
> >>
> >> Is there a Solaris-defined constant we can use here to know
> >> that we're on Solaris?  If so, I can effectively make that code
> >> only be there if SIOCGIFMTU exists and we're not on Solaris.
> >
> > I searched our header files for "sunos" and "solaris" with
> > "-ignore-case", but didn't find anything useful.  You have a very
> > minimal environment if you use "sh", and you would have a useful
> > environment variable if you use "tcsh".
> >
> > tyr java 321 su -
> > ...
> > # env
> > HOME=/root
> > HZ=
> > LANG=C
> > LC_ALL=C
> > LOGNAME=root
> > MAIL=/var/mail/root
> > PATH=/usr/sbin:/usr/bin
> > SHELL=/sbin/sh
> > TERM=dtterm
> > TZ=Europe/Berlin
> > # tcsh
> > # env | grep TYPE
> > HOSTTYPE=sun4
> > OSTYPE=solaris
> > MACHTYPE=sparc
> > #
> >
> > The best solution would be "uname -s", if that is possible.
> >
> > # /usr/bin/uname -s
> > SunOS
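
The exchange above boils down to a compile-time OS check driven by `uname -s`. A minimal shell sketch of that idea (the guard macro name `OPAL_OS_IS_SOLARIS` is invented here purely for illustration; the actual fix lived in Open MPI's configure machinery, see changeset 28460):

```shell
# Sketch of "uname -s" detection: on Solaris, define a guard macro so
# code like `intf->if_mtu = ifr->ifr_mtu;` can be compiled out where
# struct ifreq has no ifr_mtu member.  OPAL_OS_IS_SOLARIS is a made-up
# name for illustration, not the real Open MPI symbol.
os=$(uname -s)
if [ "$os" = "SunOS" ]; then
    CPPFLAGS="$CPPFLAGS -DOPAL_OS_IS_SOLARIS=1"
fi
echo "detected OS: $os"
```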
 
 
 I would be grateful if somebody could solve the problem once more.
 
 
 Kind regards
 
 Siegmar
 
 
 
 ------------------------------
 
 Message: 2
 Date: Fri, 20 Sep 2013 09:55:44 -0400
 From: Noam Bernstein <noam.bernstein_at_[hidden]>
 To: Open MPI Users <users_at_[hidden]>
 Subject: [OMPI users] intermittent node file error running with
     torque/maui integration
 Message-ID: <B695C61E-461C-47E1-8634-FB492CA04947_at_[hidden]>
 Content-Type: text/plain; charset=us-ascii
 
 Hi - we've been using openmpi for a while, but only for the last few
 months with torque/maui.  Intermittently (maybe 1/10 jobs), we get
 mpi jobs that fail with the error:
 
 [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 142
 [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 82
 [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 149
 [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 99
 [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 194
 
 This is completely unrepeatable - resubmitting the same job almost
 always works the second time around.  The line appears to be
 associated with looking for the torque/maui generated node file,
 and when I do something like
   echo $PBS_NODEFILE
   cat $PBS_NODEFILE
 it appears that the file is present and correct.
 
 We're running OpenMPI 1.6.4, configured with
 ./configure \
         --prefix=${DEST} \
         --with-tm=/usr/local/torque \
         --enable-mpirun-prefix-by-default \
         --with-openib=/usr \
         --with-openib-libdir=/usr/lib64
 
 Has anyone seen anything like this before, or has any ideas of what
 might be happening?  It appears to be a line where openmpi looks for
 the PBS node file, which is on a local filesystem (e.g.
 PBS_NODEFILE=/var/spool/torque/aux//4600.tin).
 
 thanks,
 Noam
 
 
 
 Noam Bernstein
 Center for Computational Materials Science
 NRL Code 6390
 noam.bernstein_at_[hidden]
 
 
 
 
 
 
 ------------------------------
 
 Message: 3
 Date: Fri, 20 Sep 2013 10:04:43 -0400
 From: Noam Bernstein <noam.bernstein_at_[hidden]>
 To: Open MPI Users <users_at_[hidden]>
 Subject: Re: [OMPI users] intermittent node file error running with
     torque/maui integration
 Message-ID: <75E58DCB-47A5-45AC-A9FB-35C0478C22CC_at_[hidden]>
 Content-Type: text/plain; charset=us-ascii
 
 
 On Sep 20, 2013, at 9:55 AM, Noam Bernstein <noam.bernstein_at_[hidden]> wrote:
>
> This is completely unrepeatable - resubmitting the same job almost
> always works the second time around.  The line appears to be
> associated with looking for the torque/maui generated node file,
> and when I do something like
>  echo $PBS_NODEFILE
>  cat $PBS_NODEFILE
> it appears that the file is present and correct.
 
 
 Never mind - I was sure that my earlier tests showed that the
 $PBS_NODEFILE was there, but now it seems like every time the job
 fails it's because this file really is missing.  Time to check why
 torque isn't always creating the nodefile.
 
 Noam
 
 ------------------------------
 
 Message: 4
 Date: Fri, 20 Sep 2013 10:12:39 -0400
 From: Noam Bernstein <noam.bernstein_at_[hidden]>
 To: Open MPI Users <users_at_[hidden]>
 Subject: Re: [OMPI users] intermittent node file error running with
     torque/maui integration
 Message-ID: <A3A2843B-6AF1-4E0D-AAC2-DF0B55A6A005_at_[hidden]>
 Content-Type: text/plain; charset=us-ascii
 
 On Sep 20, 2013, at 10:04 AM, Noam Bernstein <noam.bernstein_at_[hidden]> wrote:
 
>
> Never mind - I was sure that my earlier tests showed that the
> $PBS_NODEFILE was there, but now it seems like every time the job
> fails it's because this file really is missing.  Time to check why
> torque isn't always creating the nodefile.
 
 Even weirder now - most of the time jobs fail it's because the
 PBS_NODEFILE is really missing.  But a small fraction of the time
 (< 1%) the PBS_NODEFILE is there, but mpirun still fails in the way
 my original message specified.
 
 Has anyone ever seen anything like this before?
 
 Noam
 
 ------------------------------
 
 Message: 5
 Date: Fri, 20 Sep 2013 16:22:31 +0200
 From: Reuti <reuti_at_[hidden]>
 To: Open MPI Users <users_at_[hidden]>
 Subject: Re: [OMPI users] intermittent node file error running with
     torque/maui integration
 Message-ID:
     <FE881348-7073-4A81-86AB-DE1968A010D4_at_[hidden]>
 Content-Type: text/plain; charset=us-ascii
 
 Hi,
 
 On 20.09.2013, at 16:12, Noam Bernstein wrote:
 
> On Sep 20, 2013, at 10:04 AM, Noam Bernstein <noam.bernstein_at_[hidden]> wrote:
>
>> Never mind - I was sure that my earlier tests showed that the
>> $PBS_NODEFILE was there, but now it seems like every time the job
>> fails it's because this file really is missing.  Time to check why
>> torque isn't always creating the nodefile.
>
> Even weirder now - most of the time jobs fail it's because the
> PBS_NODEFILE is really missing.  But a small fraction of the time
> (< 1%) the PBS_NODEFILE is there, but mpirun still fails in the way
> my original message specified.
>
> Has anyone ever seen anything like this before?
 
 Is the location for the spool directory local or shared by NFS?
 Disk full?
 
 -- Reuti
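
Reuti's two questions can be checked directly on a compute node; a small sketch (the spool path is the Torque default mentioned later in this thread, an assumption to adjust for your installation):

```shell
# Is the disk holding the Torque spool full, and is it NFS or local?
# The path below is an assumption based on the defaults in this thread.
spool=/var/spool/torque
df -h "$spool" 2>/dev/null || df -h /              # free space on the relevant filesystem
stat -f -c %T "$spool" 2>/dev/null || stat -f -c %T /   # fs type: "nfs" vs a local type like ext4
```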
 
 ------------------------------
 
 Message: 6
 Date: Fri, 20 Sep 2013 10:36:21 -0400
 From: Noam Bernstein <noam.bernstein_at_[hidden]>
 To: Open MPI Users <users_at_[hidden]>
 Subject: Re: [OMPI users] intermittent node file error running with
     torque/maui integration
 Message-ID: <AB5DDE1F-887A-45CC-B0FE-0F8BF4D110EF_at_[hidden]>
 Content-Type: text/plain; charset=us-ascii
 
 
 On Sep 20, 2013, at 10:22 AM, Reuti <reuti_at_[hidden]> wrote:
 
>
> Is the location for the spool directory local or shared by NFS? Disk full?
 
 No - locally mounted, and far from full on all the nodes.
 
 Noam
 
 
 
 ------------------------------
 
 Message: 7
 Date: Fri, 20 Sep 2013 10:40:58 -0400
 From: Noam Bernstein <noam.bernstein_at_[hidden]>
 To: Open MPI Users <users_at_[hidden]>
 Subject: Re: [OMPI users] intermittent node file error running with
     torque/maui integration
 Message-ID: <CBE710D4-2D84-4392-BD1E-85E8DE5D5398_at_[hidden]>
 Content-Type: text/plain; charset=us-ascii
 
 
 On Sep 20, 2013, at 10:36 AM, Noam Bernstein <noam.bernstein_at_[hidden]> wrote:
 
>
> On Sep 20, 2013, at 10:22 AM, Reuti <reuti_at_[hidden]> wrote:
>
>>
>> Is the location for the spool directory local or shared by NFS? Disk full?
>
> No - locally mounted, and far from full on all the nodes.
 
 Another new observation, which may shift the focus to torque.  I
 just rebooted some of the nodes that were showing this behavior.
 So far, none of them have shown it in a few hundred test jobs,
 while before at least 1-5 of each set of 100 had failures.
 
 Noam
 
 ------------------------------
 
 Message: 8
 Date: Fri, 20 Sep 2013 08:49:20 -0600
 From: Lloyd Brown <lloyd_brown_at_[hidden]>
 To: Open MPI Users <users_at_[hidden]>
 Subject: [OMPI users] Debugging Runtime/Ethernet Problems
 Message-ID: <523C6070.7000704_at_[hidden]>
 Content-Type: text/plain; charset=ISO-8859-1
 
 Hi, all.
 
 We've got a couple of clusters running RHEL 6.2, and have several
 centrally-installed versions/compilations of OpenMPI.  Some of the
 nodes have 4xQDR Infiniband, and all the nodes have 1 gigabit
 ethernet.  I was gathering some bandwidth and latency numbers using
 the OSU/OMB tests, and noticed some weird behavior.
 
 When I run a simple "mpirun ./osu_bw" on a couple of IB-enabled
 nodes, I get numbers consistent with our IB speed (up to about 3800
 MB/s), and when I run the same thing on two nodes with only
 Ethernet, I get speeds consistent with that (up to about 120 MB/s).
 So far, so good.
 
 The trouble is when I try to add some "--mca" parameters to force it
 to use TCP/Ethernet, the program seems to hang.  I get the headers
 of the "osu_bw" output, but no results, even on the first case (1
 byte payload per packet).  This is occurring on both the IB-enabled
 nodes, and on the Ethernet-only nodes.  The specific syntax I was
 using was:  "mpirun --mca btl ^openib --mca btl_tcp_if_exclude ib0 ./osu_bw"
 
 The problem occurs at least with OpenMPI 1.6.3 compiled with GNU 4.4
 compilers, with 1.6.3 compiled with Intel 13.0.1 compilers, and with
 1.6.5 compiled with Intel 13.0.1 compilers.  I haven't tested any
 other combinations yet.
 
 Any ideas here?  It's very possible this is a system configuration
 problem, but I don't know where to look.  At this point, any ideas
 would be welcome, either about the specific situation, or general
 pointers on mpirun debugging flags to use.  I can't find much in the
 docs yet on run-time debugging for OpenMPI, as opposed to debugging
 the application.  Maybe I'm just looking in the wrong place.
 
 
 Thanks,
 
 --
 Lloyd Brown
 Systems Administrator
 Fulton Supercomputing Lab
 Brigham Young University
 http://marylou.byu.edu
 
 
 ------------------------------
 
 Message: 9
 Date: Fri, 20 Sep 2013 15:05:14 +0000
 From: "Elken, Tom" <tom.elken_at_[hidden]>
 To: Open MPI Users <users_at_[hidden]>
 Subject: Re: [OMPI users] Debugging Runtime/Ethernet Problems
 Message-ID:
     <1182FB2B5679CE4B8BAD97725F014BB73284E992_at_[hidden]>
 Content-Type: text/plain; charset="us-ascii"
 
> The trouble is when I try to add some "--mca" parameters to force it to
> use TCP/Ethernet, the program seems to hang.  I get the headers of the
> "osu_bw" output, but no results, even on the first case (1 byte payload
> per packet).  This is occurring on both the IB-enabled nodes, and on the
> Ethernet-only nodes.  The specific syntax I was using was:  "mpirun
> --mca btl ^openib --mca btl_tcp_if_exclude ib0 ./osu_bw"
  
 When we want to run over TCP and IPoIB on an IB/PSM equipped
 cluster, we use:
 --mca btl sm --mca btl tcp,self --mca btl_tcp_if_exclude eth0 --mca btl_tcp_if_include ib0 --mca mtl ^psm
 
 based on this, it looks like the following might work for you:
 --mca btl sm,tcp,self --mca btl_tcp_if_exclude ib0 --mca btl_tcp_if_include eth0 --mca btl ^openib
 
 If you don't have ib0 ports configured on the IB nodes, probably you
 don't need the "--mca btl_tcp_if_exclude ib0".
 
 -Tom
 
>
> The problem occurs at least with OpenMPI 1.6.3 compiled
 with GNU 4.4
> compilers, with 1.6.3 compiled with Intel 13.0.1
 compilers, and with
> 1.6.5 compiled with Intel 13.0.1 compilers.  I
 haven't tested any other
> combinations yet.
>
> Any ideas here?  It's very possible this is a
 system configuration
> problem, but I don't know where to look.  At this
 point, any ideas would
> be welcome, either about the specific situation, or
 general pointers on
> mpirun debugging flags to use.  I can't find much
 in the docs yet on
> run-time debugging for OpenMPI, as opposed to debugging
 the application.
>  Maybe I'm just looking in the wrong place.
>
>
> Thanks,
>
> --
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 
 ------------------------------
 
 Message: 10
 Date: Fri, 20 Sep 2013 08:17:43 -0700
 From: Ralph Castain <rhc_at_[hidden]>
 To: Open MPI Users <users_at_[hidden]>
 Subject: Re: [OMPI users] Debugging Runtime/Ethernet Problems
 Message-ID: <917B367F-1687-4A91-B173-DE1BBA7C7866_at_[hidden]>
 Content-Type: text/plain; charset=us-ascii
 
 I don't think you are allowed to specify both include and exclude
 options at the same time as they conflict - you should either
 exclude ib0 or include eth0 (or whatever).
 
 My guess is that the various nodes are trying to communicate across
 disjoint networks. We've seen that before when, for example, eth0 on
 one node is on one subnet, and eth0 on another node is on a
 different subnet. You might look for that kind of arrangement.
 
 
 On Sep 20, 2013, at 8:05 AM, "Elken, Tom" <tom.elken_at_[hidden]> wrote:
 
>> The trouble is when I try to add some "--mca" parameters to force it to
>> use TCP/Ethernet, the program seems to hang.  I get the headers of the
>> "osu_bw" output, but no results, even on the first case (1 byte payload
>> per packet).  This is occurring on both the IB-enabled nodes, and on the
>> Ethernet-only nodes.  The specific syntax I was using was:  "mpirun
>> --mca btl ^openib --mca btl_tcp_if_exclude ib0 ./osu_bw"
>
> When we want to run over TCP and IPoIB on an IB/PSM equipped cluster, we use:
> --mca btl sm --mca btl tcp,self --mca btl_tcp_if_exclude eth0 --mca btl_tcp_if_include ib0 --mca mtl ^psm
>
> based on this, it looks like the following might work for you:
> --mca btl sm,tcp,self --mca btl_tcp_if_exclude ib0 --mca btl_tcp_if_include eth0 --mca btl ^openib
>
> If you don't have ib0 ports configured on the IB nodes, probably you don't need the "--mca btl_tcp_if_exclude ib0".
>
> -Tom
>
>>
>> The problem occurs at least with OpenMPI 1.6.3 compiled with GNU 4.4
>> compilers, with 1.6.3 compiled with Intel 13.0.1 compilers, and with
>> 1.6.5 compiled with Intel 13.0.1 compilers.  I haven't tested any other
>> combinations yet.
>>
>> Any ideas here?  It's very possible this is a system configuration
>> problem, but I don't know where to look.  At this point, any ideas would
>> be welcome, either about the specific situation, or general pointers on
>> mpirun debugging flags to use.  I can't find much in the docs yet on
>> run-time debugging for OpenMPI, as opposed to debugging the application.
>> Maybe I'm just looking in the wrong place.
>>
>>
>> Thanks,
>>
>> --
>> Lloyd Brown
>> Systems Administrator
>> Fulton Supercomputing Lab
>> Brigham Young University
>> http://marylou.byu.edu
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 
 
 ------------------------------
 
 Message: 11
 Date: Fri, 20 Sep 2013 15:28:48 +0000
 From: "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]>
 To: Open MPI Users <users_at_[hidden]>
 Subject: Re: [OMPI users] compilation aborted for Handler.cpp (code 2)
 Message-ID:
     <EF66BBEB19BADC41AC8CCF5F684F07FC4F8BC7EC_at_[hidden]>
 Content-Type: text/plain; charset="iso-8859-1"
 
 I can't tell if this is a busted compiler installation or not.  The
 first error is:
 
 -----
 /usr/include/c++/4.6.3/bits/stl_algobase.h(573): error: type name is not allowed
         const bool __simple = (__is_trivial(_ValueType1)
                                ^
           detected during:
             instantiation of "_BI2 std::__copy_move_backward_a2<_IsMove,_BI1,_BI2>(_BI1, _BI1, _BI2) [with _IsMove=false, _BI1=uint32_t={unsigned int} *, _BI2=uint32_t={unsigned int} *]" at line 625
             instantiation of "_BI2 std::copy_backward(_BI1, _BI1, _BI2) [with _BI1=uint32_t={unsigned int} *, _BI2=uint32_t={unsigned int} *]" at line 315 of "/usr/include/c++/4.6.3/bits/vector.tcc"
             instantiation of "void std::vector<_Tp, _Alloc>::_M_insert_aux(__gnu_cxx::__normal_iterator<std::_Vector_base<_Tp, _Alloc>::_Tp_alloc_type::pointer, std::vector<_Tp, _Alloc>>, const _Tp &) [with _Tp=uint32_t={unsigned int}, _Alloc=std::allocator<uint32_t={unsigned int}>]" at line 834 of "/usr/include/c++/4.6.3/bits/stl_vector.h"
             instantiation of "void std::vector<_Tp, _Alloc>::push_back(const _Tp &) [with _Tp=uint32_t={unsigned int}, _Alloc=std::allocator<uint32_t={unsigned int}>]" at line 42 of "Handler.cpp"
 -----
 
 I verified that OMPI 1.6.5 builds fine for me for icpc/13.1.0.146
 Build 20130121 on RHEL 6.
 
 Perhaps you have some kind of bad interaction between your icpc
 installation and your local g++ installation...?
 
 
 
 On Sep 18, 2013, at 12:58 PM, Syed Ahsan Ali <ahsanshah01_at_[hidden]> wrote:
 
> Please find attached again.
>
> On Tue, Sep 17, 2013 at 11:35 AM, Jeff Squyres (jsquyres)
> <jsquyres_at_[hidden]> wrote:
>> On Sep 16, 2013, at 9:00 AM, Syed Ahsan Ali <ahsanshah01_at_[hidden]> wrote:
>>
>>> I am trying to compile openmpi-1.6.5 on fc16.x86_64 with icc and ifort
>>> but getting the subject error. config.out and make.out is attached.
>>> Following command was used for configure
>>>
>>> ./configure CC=icc CXX=icpc FC=ifort F77=ifort F90=ifort
>>> --prefix=/home/openmpi_gfortran -enable-mpi-f90 --enable-mpi-f77 |&
>>> tee config.out
>>
>> I'm sorry; I can't open a .rar file.  Can you send the logs compressed
>> with a conventional compression program like gzip, bzip2, or zip?
>>
>> Thanks.
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> --
> Syed Ahsan Ali Bokhari
> Electronic Engineer (EE)
>
> Research & Development Division
> Pakistan Meteorological Department H-8/4, Islamabad.
> Phone # off  +92518358714
> Cell # +923155145014
>
> <logs.zip>_______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 
 --
 Jeff Squyres
 jsquyres_at_[hidden]
 For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
 
 
 
 ------------------------------
 
 Message: 12
 Date: Fri, 20 Sep 2013 15:33:43 +0000
 From: "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]>
 To: Open MPI Users <users_at_[hidden]>
 Subject: Re: [OMPI users] Debugging Runtime/Ethernet Problems
 Message-ID:
     <EF66BBEB19BADC41AC8CCF5F684F07FC4F8BC892_at_[hidden]>
 Content-Type: text/plain; charset="us-ascii"
 
 Correct -- it doesn't make sense to specify both include *and*
 exclude: by specifying one, you're implicitly (but exactly/precisely)
 specifying the other.
 
 My suggestion would be to use positive notation, not negative
 notation.  For example:
 
 mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 ...
 
 That way, you *know* you're only getting the TCP and self BTLs, and
 you *know* you're only getting eth0.  If that works, then spread out
 from there, e.g.:
 
 mpirun --mca btl tcp,sm,self --mca btl_tcp_if_include eth0,eth1 ...
 
 E.g., also include the "sm" BTL (which is only used for shared memory
 communications between 2 procs on the same server, and is therefore
 useless for a 2-proc-across-2-server run of osu_bw, but you get the
 idea), but also use eth0 and eth1.
 
 And so on.
 
 The problem with using ^openib and/or btl_tcp_if_exclude is that you
 might end up using some BTLs and/or TCP interfaces that you don't
 expect, and therefore can run into problems.
 
 Make sense?
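
To see which BTLs and interfaces actually get selected while narrowing this down, a verbosity parameter can be added to the same positive-notation command line. A sketch (the verbosity level, host names, and process count here are arbitrary placeholders):

```shell
# Build the debug invocation as a string so the pieces are visible;
# btl_base_verbose makes Open MPI report BTL selection at startup.
# node1/node2 and -np 2 are placeholders for your own setup.
cmd="mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 \
 --mca btl_base_verbose 30 -np 2 -host node1,node2 ./osu_bw"
echo "$cmd"
```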
 
 
 
 On Sep 20, 2013, at 11:17 AM, Ralph Castain <rhc_at_[hidden]> wrote:
 
> I don't think you are allowed to specify both include and exclude
> options at the same time as they conflict - you should either exclude
> ib0 or include eth0 (or whatever).
>
> My guess is that the various nodes are trying to communicate across
> disjoint networks. We've seen that before when, for example, eth0 on
> one node is on one subnet, and eth0 on another node is on a different
> subnet. You might look for that kind of arrangement.
>
>
> On Sep 20, 2013, at 8:05 AM, "Elken, Tom" <tom.elken_at_[hidden]> wrote:
>
>>> The trouble is when I try to add some "--mca" parameters to force it to
>>> use TCP/Ethernet, the program seems to hang.  I get the headers of the
>>> "osu_bw" output, but no results, even on the first case (1 byte payload
>>> per packet).  This is occurring on both the IB-enabled nodes, and on the
>>> Ethernet-only nodes.  The specific syntax I was using was:  "mpirun
>>> --mca btl ^openib --mca btl_tcp_if_exclude ib0 ./osu_bw"
>>
>> When we want to run over TCP and IPoIB on an IB/PSM equipped cluster, we use:
>> --mca btl sm --mca btl tcp,self --mca btl_tcp_if_exclude eth0 --mca btl_tcp_if_include ib0 --mca mtl ^psm
>>
>> based on this, it looks like the following might work for you:
>> --mca btl sm,tcp,self --mca btl_tcp_if_exclude ib0 --mca btl_tcp_if_include eth0 --mca btl ^openib
>>
>> If you don't have ib0 ports configured on the IB nodes, probably you don't need the "--mca btl_tcp_if_exclude ib0".
>>
>> -Tom
>>
>>>
>>> The problem occurs at least with OpenMPI 1.6.3 compiled with GNU 4.4
>>> compilers, with 1.6.3 compiled with Intel 13.0.1 compilers, and with
>>> 1.6.5 compiled with Intel 13.0.1 compilers.  I haven't tested any other
>>> combinations yet.
>>>
>>> Any ideas here?  It's very possible this is a system configuration
>>> problem, but I don't know where to look.  At this point, any ideas would
>>> be welcome, either about the specific situation, or general pointers on
>>> mpirun debugging flags to use.  I can't find much in the docs yet on
>>> run-time debugging for OpenMPI, as opposed to debugging the application.
>>> Maybe I'm just looking in the wrong place.
>>>
>>>
>>> Thanks,
>>>
>>> --
>>> Lloyd Brown
>>> Systems Administrator
>>> Fulton Supercomputing Lab
>>> Brigham Young University
>>> http://marylou.byu.edu
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 
 --
 Jeff Squyres
 jsquyres_at_[hidden]
 For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
 
 
 
 ------------------------------
 
 Message: 13
 Date: Fri, 20 Sep 2013 15:35:59 +0000
 From: "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]>
 To: Open MPI Users <users_at_[hidden]>
 Subject: Re: [OMPI users] compilation aborted for Handler.cpp (code 2)
 Message-ID:
     <EF66BBEB19BADC41AC8CCF5F684F07FC4F8BC914_at_[hidden]>
 Content-Type: text/plain; charset="iso-8859-1"
 
 Sorry for the delay replying -- I actually replied on the original
 thread yesterday, but it got hung up in my outbox and I didn't
 notice that it didn't actually go out until a few moments ago.  :-(
 
 I'm *guessing* that this is a problem with your local icpc
 installation.
 
 Can you compile / run other C++ codes that use the STL with icpc?
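
A minimal way to run that check is a tiny program exercising the same `std::vector<uint32_t>::push_back` instantiation that fails in Handler.cpp. A sketch (the file name and the final compile line are illustrative; substitute your own icpc path):

```shell
# Write a small STL test case mirroring the failing instantiation,
# then suggest the icpc compile to try.  stl_test.cpp is a made-up name.
cat > stl_test.cpp <<'EOF'
#include <stdint.h>
#include <vector>
int main() {
    std::vector<uint32_t> v;
    v.push_back(42u);   // same push_back that triggers the icpc error
    return v.size() == 1 ? 0 : 1;
}
EOF
echo "now try: icpc stl_test.cpp -o stl_test && ./stl_test"
```

If this trivial case also fails under icpc, the problem is the compiler/headers interaction rather than anything in Open MPI.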
 
 
 On Sep 20, 2013, at 6:59 AM, Syed Ahsan Ali <ahsanshah01_at_[hidden]> wrote:
 
> Output of make V=1 is attached. Again same error. If intel compiler is
> using C++ headers from gfortran then how can we avoid this.
>
> On Fri, Sep 20, 2013 at 11:07 AM, Bert Wesarg
> <bert.wesarg_at_[hidden]> wrote:
>> Hi,
>>
>> On Fri, Sep 20, 2013 at 4:49 AM, Syed Ahsan Ali <ahsanshah01_at_[hidden]> wrote:
>>> I am trying to compile openmpi-1.6.5 on fc16.x86_64 with icc and ifort
>>> but getting the subject error. config.out and make.out is attached.
>>> Following command was used for configure
>>>
>>> ./configure CC=icc CXX=icpc FC=ifort F77=ifort F90=ifort
>>> --prefix=/home/openmpi_gfortran -enable-mpi-f90 --enable-mpi-f77 |&
>>> tee config.out
>>
>> could you also run make with 'make V=1' and send the output. Anyway it
>> looks like the intel compiler uses the C++ headers from GCC 4.6.3 and
>> I don't know if this is supported.
>>
>> Bert
>>
>>> Please help/advise.
>>> Thank you and best regards
>>> Ahsan
>>>
>
>
>
> --
> Syed Ahsan Ali Bokhari
> Electronic Engineer (EE)
>
> Research & Development Division
> Pakistan Meteorological Department H-8/4, Islamabad.
> Phone # off  +92518358714
> Cell # +923155145014
>
> <makeV.zip>_______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 
 --
 Jeff Squyres
 jsquyres_at_[hidden]
 For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
 
 
 
 ------------------------------
 
 Message: 14
 Date: Fri, 20 Sep 2013 11:52:11 -0400
 From: Gus Correa <gus_at_[hidden]>
 To: Open MPI Users <users_at_[hidden]>
 Subject: Re: [OMPI users] intermittent node file error running with
     torque/maui integration
 Message-ID: <523C6F2B.60401_at_[hidden]>
 Content-Type: text/plain; charset=ISO-8859-1; format=flowed
 
 Hi Noam
 
Could it be that Torque, or probably more likely NFS,
is too slow to create the PBS_NODEFILE and make it available?

What if you insert a "sleep 2",
or whatever number of seconds you want,
before the mpiexec command line?
Or, maybe better, an "ls -l $PBS_NODEFILE; cat $PBS_NODEFILE",
just to make sure the file is available and
filled with the node list before mpiexec takes over?
 
 My two cents,
 Gus Correa
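In job-script form, that suggestion might look something like the sketch
below. The "wait_for_nodefile" helper name and the retry count are made
up here for illustration; neither is anything Torque or Open MPI
provides.

```shell
#!/bin/sh
# Sketch of a PBS job-script preamble: poll briefly for the Torque
# node file before handing control to mpiexec, and log what we saw.
# "wait_for_nodefile" is a hypothetical helper, not a Torque feature.
wait_for_nodefile() {
    nodefile=$1
    tries=0
    # Wait until the file exists and is non-empty, up to ~10 seconds.
    while [ ! -s "$nodefile" ] && [ "$tries" -lt 5 ]; do
        sleep 2
        tries=$((tries + 1))
    done
    # Leave a trace in the job's output either way.
    ls -l "$nodefile" && cat "$nodefile"
}

# In the real job script, just before the MPI launch:
# wait_for_nodefile "$PBS_NODEFILE" && mpiexec ./my_app
```

Even if the sleep turns out to be unnecessary, the ls/cat output in the
job log tells you whether the file was there when mpiexec started.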
 
 On 09/20/2013 09:55 AM, Noam Bernstein wrote:
> Hi - we've been using Open MPI for a while, but only for the last few
> months with torque/maui.  Intermittently (maybe 1 in 10 jobs), we get
> MPI jobs that fail with the error:
>
> [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 142
> [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 82
> [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 149
> [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 99
> [compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 194
>
> This is completely unrepeatable - resubmitting the same job almost
> always works the second time around.  The line appears to be
> associated with looking for the torque/maui-generated node file,
> and when I do something like
>    echo $PBS_NODEFILE
>    cat $PBS_NODEFILE
> it appears that the file is present and correct.
>
> We're running Open MPI 1.6.4, configured with
> ./configure \
>          --prefix=${DEST} \
>          --with-tm=/usr/local/torque \
>          --enable-mpirun-prefix-by-default \
>          --with-openib=/usr \
>          --with-openib-libdir=/usr/lib64
>
> Has anyone seen anything like this before, or does anyone have any
> ideas about what might be happening?  It appears to be a line where
> Open MPI looks for the PBS node file, which is on a local filesystem
> (e.g. PBS_NODEFILE=/var/spool/torque/aux//4600.tin).
>
>          thanks,
>          Noam
>
>
>
> Noam Bernstein
> Center for Computational Materials Science
> NRL Code 6390
> noam.bernstein_at_[hidden]
>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 
 
 ------------------------------
 
 Message: 15
 Date: Fri, 20 Sep 2013 15:52:43 +0000
 From: "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]>
To: Siegmar Gross <Siegmar.Gross_at_[hidden]>, Open MPI Users
    <users_at_[hidden]>
Subject: Re: [OMPI users] error building openmpi-1.7.3a1r29213 on
    Solaris
 Message-ID:
     <EF66BBEB19BADC41AC8CCF5F684F07FC4F8BCB02_at_[hidden]>
 Content-Type: text/plain; charset="us-ascii"
 
 Looks like Ralph noticed that we fixed this on the trunk and
 forgot to bring it over to v1.7.  I just committed it
 on v1.7 in r29215.  Give it a whirl in tonight's v1.7
 nightly tarball.
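As an aside, for anyone carrying a local workaround until the nightly
lands: the "uname -s" idea Siegmar mentions further down the thread can
be turned into a small configure-time probe, roughly like the sketch
below. The "is_solaris" helper and the OPAL_ASSUME_NO_IFR_MTU define are
invented here for illustration and are not part of Open MPI's build
system.

```shell
#!/bin/sh
# Sketch of a configure-time probe: detect whether the build host is
# Solaris, so code that touches ifreq.ifr_mtu can be compiled out there
# even though SIOCGIFMTU itself is defined.
# "is_solaris" and the emitted define are hypothetical names.
is_solaris() {
    case "$(uname -s)" in
        SunOS) return 0 ;;   # Solaris reports itself as SunOS
        *)     return 1 ;;
    esac
}

if is_solaris; then
    echo "#define OPAL_ASSUME_NO_IFR_MTU 1"
else
    echo "#define OPAL_ASSUME_NO_IFR_MTU 0"
fi
```

The emitted define could then guard the ifr_mtu access in C, in the same
spirit as the actual fix on the trunk.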
 
 
On Sep 20, 2013, at 7:00 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
 
> Hi,
>
> I tried to install openmpi-1.7.3a1r29213 on "openSuSE Linux 12.1",
> "Solaris 10 x86_64", and "Solaris 10 sparc" with "Sun C 5.12" and
> gcc-4.8.0 in 64-bit mode. Unfortunately "make" breaks with the same
> error for both compilers on both Solaris platforms.
>
>
> tyr openmpi-1.7.3a1r29213-SunOS.sparc.64_cc 126 tail -10 \
>   log.make.SunOS.sparc.64_cc
> Making all in mca/if/posix_ipv4
> make[2]: Entering directory `.../opal/mca/if/posix_ipv4'
>   CC       if_posix.lo
> "../../../../../openmpi-1.7.3a1r29213/opal/mca/if/posix_ipv4/if_posix.c", line 277: undefined struct/union member: ifr_mtu
> cc: acomp failed for ../../../../../openmpi-1.7.3a1r29213/opal/mca/if/posix_ipv4/if_posix.c
> make[2]: *** [if_posix.lo] Error 1
> make[2]: Leaving directory `.../opal/mca/if/posix_ipv4'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `.../opal'
> make: *** [all-recursive] Error 1
>
>
> tyr openmpi-1.7.3a1r29213-SunOS.sparc.64_gcc 131 tail -12 \
>   log.make.SunOS.sparc.64_gcc
> Making all in mca/if/posix_ipv4
> make[2]: Entering directory `.../opal/mca/if/posix_ipv4'
>   CC       if_posix.lo
> ../../../../../openmpi-1.7.3a1r29213/opal/mca/if/posix_ipv4/if_posix.c: In function 'if_posix_open':
> ../../../../../openmpi-1.7.3a1r29213/opal/mca/if/posix_ipv4/if_posix.c:277:31: error: 'struct ifreq' has no member named 'ifr_mtu'
>          intf->if_mtu = ifr->ifr_mtu;
>                               ^
> make[2]: *** [if_posix.lo] Error 1
> make[2]: Leaving directory `.../opal/mca/if/posix_ipv4'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `.../opal'
> make: *** [all-recursive] Error 1
>
>
>
> I have had this problem before and Jeff solved it. Here is my
> old e-mail.
>
> Date: Tue, 7 May 2013 19:38:11 +0200 (CEST)
> From: Siegmar Gross <Siegmar.Gross_at_[hidden]>
> Subject: Re: commit/ompi-java: jsquyres: Up to SVN r28392
> To: jsquyres_at_[hidden]
> Cc: Siegmar.Gross_at_[hidden]
> MIME-Version: 1.0
> Content-MD5: O1pjPK/1JiMXXZ/EHyMU0Q==
> X-HRZ-JLUG-MailScanner-Information: Passed JLUG virus check
> X-HRZ-JLUG-MailScanner: No virus found
> X-Envelope-From: fd1026_at_[hidden]
> X-Spam-Status: No
>
> Hello Jeff
>
>> Ok, I made a change in the OMPI trunk that should fix this:
>>
>>    https://svn.open-mpi.org/trac/ompi/changeset/28460
>>
>> And I pulled it into the ompi-java hg repo.  Could you give
>> it a whirl and let me know if this works for you?
>
> Perfect :-)))).  Now I can build Open MPI on Solaris without
> "#if 0" :-). Thank you very much for your help.
>
>
> "make check"  still produces the old bus error on
 Solaris Sparc.
> All checks are fine on Linux and Solaris x86_64.
>
> ...
> PASS: ddt_test
> /bin/bash: line 5: 12453 Bus Error               ${dir}$tst
> FAIL: ddt_raw
> ========================================================
> 1 of 5 tests failed
> Please report to http://www.open-mpi.org/community/help/
> ========================================================
> make[3]: *** [check-TESTS] Error 1
> ...
>
>
> Kind regards
>
> Siegmar
>
>
>> On May 6, 2013, at 7:20 AM, Siegmar Gross
>> <Siegmar.Gross_at_[hidden]> wrote:
>>
>>> Hello Jeff
>>>
>>>>>> "../../../../../ompi-java/opal/mca/if/posix_ipv4/if_posix.c",
>>>>>> line 279: undefined struct/union member: ifr_mtu
>>>>>>
>>>>>> Sigh.  Solaris kills me.  :-\
>>>>>>
>>>>>> Just so I understand -- Solaris has SIOCGIFMTU, but doesn't
>>>>>> have struct ifreq.ifr_mtu?
>>>>>
>>>>> I found SIOCGIFMTU in sys/sockio.h with the following comment.
>>>>
>>>> Is there a Solaris-defined constant we can use here to know
>>>> that we're on Solaris?  If so, I can effectively make that code
>>>> only be there if SIOCGIFMTU exists and we're not on Solaris.
>>>
>>> I searched our header files for "sunos" and "solaris" with
>>> "-ignore-case", but didn't find anything useful. You have a very
>>> minimal environment if you use "sh", but you would have a useful
>>> environment variable if you use "tcsh".
>>>
>>> tyr java 321 su -
>>> ...
>>> # env
>>> HOME=/root
>>> HZ=
>>> LANG=C
>>> LC_ALL=C
>>> LOGNAME=root
>>> MAIL=/var/mail/root
>>> PATH=/usr/sbin:/usr/bin
>>> SHELL=/sbin/sh
>>> TERM=dtterm
>>> TZ=Europe/Berlin
>>> # tcsh
>>> # env | grep TYPE
>>> HOSTTYPE=sun4
>>> OSTYPE=solaris
>>> MACHTYPE=sparc
>>> #
>>>
>>> The best solution would be "uname -s", if that is possible.
>>>
>>> # /usr/bin/uname -s
>>> SunOS
>
>
> I would be grateful if somebody could solve the problem once more.
>
>
> Kind regards
>
> Siegmar
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 
 --
 Jeff Squyres
 jsquyres_at_[hidden]
 For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
 
 
 
 ------------------------------
 
 Subject: Digest Footer
 
 _______________________________________________
 users mailing list
 users_at_[hidden]
 http://www.open-mpi.org/mailman/listinfo.cgi/users
 
 ------------------------------
 
 End of users Digest, Vol 2685, Issue 2
 **************************************