
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi query
From: Nisha Dhankher -M.Tech(CSE) (nishadhankher-coaeseeit_at_[hidden])
Date: 2014-04-08 05:37:27


The latest Rocks 6.2 carries only this version.

On Tue, Apr 8, 2014 at 3:49 AM, Jeff Squyres (jsquyres)
<jsquyres_at_[hidden]>wrote:

> Open MPI 1.4.3 is *ancient*. Please upgrade -- we just released Open MPI
> 1.8 last week.
>
> Also, please look at this FAQ entry -- it steps you through a lot of basic
> troubleshooting steps about getting basic MPI programs working.
>
> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>
> Once you get basic MPI programs working, then try with MPI Blast.
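The stepwise checks Jeff recommends can be sketched as follows; a minimal sketch, where the machinefile IPs are the two hosts mentioned later in this thread (treat them as placeholders for your own nodes):

```shell
# A machinefile simply lists one host (or IP) per line; these two
# IPs appear later in this thread and stand in for real hosts here.
cat > mf <<'EOF'
10.1.255.236
10.1.255.244
EOF

# Step 1: a non-MPI program -- verifies that mpirun can launch
# processes on every host at all (ssh setup, PATH, firewalls).
mpirun -np 2 -machinefile mf hostname

# Step 2: a trivial MPI program (e.g. the ring_c example shipped in
# Open MPI's examples/ directory) -- verifies MPI communication.
mpirun -np 2 -machinefile mf ./ring_c

# Only after both of those succeed, try the real application:
mpirun -np 16 -machinefile mf mpiblast -d all.fas -p blastn -i query.fas -o out.txt
```

If step 1 fails, the problem is launch/network setup, not MPI; if step 2 fails, it is MPI communication; only failures past that point implicate mpiBlast itself.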
>
>
>
> On Apr 5, 2014, at 3:11 AM, Nisha Dhankher -M.Tech(CSE) <
> nishadhankher-coaeseeit_at_[hidden]> wrote:
>
> > mpirun --mca btl ^openib --mca btl_tcp_if_include eth0 -np 16 -machinefile mf mpiblast -d all.fas -p blastn -i query.fas -o out.txt
> >
> > was the command i executed on cluster...
> >
> >
> >
> > On Sat, Apr 5, 2014 at 12:34 PM, Nisha Dhankher -M.Tech(CSE) <
> nishadhankher-coaeseeit_at_[hidden]> wrote:
> > Sorry Ralph, my mistake, it's not "names"... it is "it does not happen on
> the same nodes."
> >
> >
> > On Sat, Apr 5, 2014 at 12:33 PM, Nisha Dhankher -M.Tech(CSE) <
> nishadhankher-coaeseeit_at_[hidden]> wrote:
> > The same VM is on all machines, that is, virt-manager.
> >
> >
> > On Sat, Apr 5, 2014 at 12:32 PM, Nisha Dhankher -M.Tech(CSE) <
> nishadhankher-coaeseeit_at_[hidden]> wrote:
> > Open MPI version 1.4.3
> >
> >
> > On Fri, Apr 4, 2014 at 8:13 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> > Okay, so if you run mpiBlast on all the non-name nodes, everything is
> okay? What do you mean by "names nodes"?
> >
> >
> > On Apr 4, 2014, at 7:32 AM, Nisha Dhankher -M.Tech(CSE) <
> nishadhankher-coaeseeit_at_[hidden]> wrote:
> >
> >> no it does not happen on names nodes
> >>
> >>
> >> On Fri, Apr 4, 2014 at 7:51 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> >> Hi Nisha
> >>
> >> I'm sorry if my questions appear abrasive - I'm just a little
> frustrated at the communication bottleneck as I can't seem to get a clear
> picture of your situation. So you really don't need to keep calling me
> "sir" :-)
> >>
> >> The error you are hitting is very unusual - it means that the processes
> are able to make a connection, but are failing to correctly complete a
> simple handshake exchange of their process identifications. There are only
> a few ways that can happen, and I'm trying to get you to test for them.
> >>
> >> So let's try and see if we can narrow this down. You mention that it
> works on some machines, but not all. Is this consistent - i.e., is it
> always the same machines that work, and the same ones that generate the
> error? If you exclude the ones that show the error, does it work? If so,
> what is different about those nodes? Are they a different architecture?
> >>
> >>
> >> On Apr 3, 2014, at 11:09 PM, Nisha Dhankher -M.Tech(CSE) <
> nishadhankher-coaeseeit_at_[hidden]> wrote:
> >>
> >>> Sir,
> >>> the same virt-manager is being used by all PCs. No, I didn't enable
> heterogeneous support. Yes, the Open MPI version is the same on all nodes,
> through the same kickstart file.
> >>> OK... actually, sir, Rocks itself installed and configured Open MPI and
> MPICH on its own through the HPC roll.
> >>>
> >>>
> >>> On Fri, Apr 4, 2014 at 9:25 AM, Ralph Castain <rhc_at_[hidden]>
> wrote:
> >>>
> >>> On Apr 3, 2014, at 8:03 PM, Nisha Dhankher -M.Tech(CSE) <
> nishadhankher-coaeseeit_at_[hidden]> wrote:
> >>>
> >>>> thankyou Ralph.
> >>>> Yes cluster is heterogenous...
> >>>
> >>> And did you configure OMPI with --enable-heterogeneous? And are you
> running it with --hetero-nodes? What version of OMPI are you using anyway?
> >>>
> >>> Note that we don't care if the host pc's are hetero - what we care
> about is the VM. If all the VMs are the same, then it shouldn't matter.
> However, most VM technologies don't handle hetero hardware very well -
> i.e., you can't emulate an x86 architecture on top of a Sparc or Power chip
> or vice versa.
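For reference, the two things Ralph asks about would look roughly like this; a hedged sketch assuming a standard source build (the prefix and application name are illustrative, not from this thread):

```shell
# Heterogeneous support is a build-time option (only needed if the
# VMs really differ in architecture or endianness):
./configure --prefix=/opt/openmpi --enable-heterogeneous
make all install

# --hetero-nodes is a run-time mpirun flag telling Open MPI not to
# assume all nodes present identical topology:
mpirun --hetero-nodes -np 16 -machinefile mf ./my_mpi_app
```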
> >>>
> >>>
> >>>> And I haven't made compute nodes directly on the physical nodes (PCs),
> because in college it is not possible to take the whole lab of 32 PCs for
> your work, so I ran on VMs.
> >>>
> >>> Yes, but at least it would let you test the setup to run MPI across
> even a couple of pc's - this is simple debugging practice.
> >>>
> >>>> In Rocks cluster, frontend give the same kickstart to all the pc's so
> openmpi version should be same i guess.
> >>>
> >>> Guess? or know? Makes a difference - might be worth testing.
> >>>
> >>>> Sir,
> >>>> mpiformatdb is a command to distribute database fragments to
> different compute nodes after partitioning of the database.
> >>>> And sir, have you used mpiBlast?
> >>>
> >>> Nope - but that isn't the issue, is it? The issue is with the MPI
> setup.
> >>>
> >>>>
> >>>>
> >>>> On Fri, Apr 4, 2014 at 4:48 AM, Ralph Castain <rhc_at_[hidden]>
> wrote:
> >>>> What is "mpiformatdb"? We don't have an MPI database in our system,
> and I have no idea what that command means.
> >>>>
> >>>> As for that error - it means that the identifier we exchange between
> processes is failing to be recognized. This could mean a couple of things:
> >>>>
> >>>> 1. the OMPI version on the two ends is different - could be you
> aren't getting the right paths set on the various machines
> >>>>
> >>>> 2. the cluster is heterogeneous
> >>>>
> >>>> You say you have "virtual nodes" running on various PC's? That would
> be an unusual setup - VM's can be problematic given the way they handle TCP
> connections, so that might be another source of the problem if my
> understanding of your setup is correct. Have you tried running this across
> the PCs directly - i.e., without any VMs?
> >>>>
> >>>>
> >>>> On Apr 3, 2014, at 10:13 AM, Nisha Dhankher -M.Tech(CSE) <
> nishadhankher-coaeseeit_at_[hidden]> wrote:
> >>>>
> >>>>> I first formatted my database with the mpiformatdb command, then I ran:
> >>>>> mpirun -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i
> query.fas -o output.txt
> >>>>> But then it gave error 113 from some hosts and continued to run for
> the others, with no results even after 2 hours had elapsed... on a Rocks 6.0
> cluster with 12 virtual nodes on PCs, 2 on each using virt-manager, 1 GB of
> RAM each.
> >>>>>
> >>>>>
> >>>>> On Thu, Apr 3, 2014 at 10:41 PM, Nisha Dhankher -M.Tech(CSE) <
> nishadhankher-coaeseeit_at_[hidden]> wrote:
> >>>>> i also made machine file which contain ip adresses of all compute
> nodes + .ncbirc file for path to mpiblast and shared ,local storage path....
> >>>>> Sir
> >>>>> I ran the same command of mpirun on my college supercomputer 8 nodes
> each having 24 processors but it just running....gave no result uptill 3
> hours...
> >>>>>
> >>>>>
> >>>>> On Thu, Apr 3, 2014 at 10:39 PM, Nisha Dhankher -M.Tech(CSE) <
> nishadhankher-coaeseeit_at_[hidden]> wrote:
> >>>>> I first formatted my database with the mpiformatdb command, then I ran:
> >>>>> mpirun -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i
> query.fas -o output.txt
> >>>>> But then it gave error 113 from some hosts and continued to run for
> the others, with no results even after 2 hours had elapsed... on a Rocks 6.0
> cluster with 12 virtual nodes on PCs, 2 on each using virt-manager, 1 GB of
> RAM each.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Apr 3, 2014 at 8:37 PM, Ralph Castain <rhc_at_[hidden]>
> wrote:
> >>>>> I'm having trouble understanding your note, so perhaps I am getting
> this wrong. Let's see if I can figure out what you said:
> >>>>>
> >>>>> * your perl command fails with "no route to host" - but I don't see
> any host in your cmd. Maybe I'm just missing something.
> >>>>>
> >>>>> * you tried running a couple of "mpirun", but the mpirun command
> wasn't recognized? Is that correct?
> >>>>>
> >>>>> * you then ran mpiblast and it sounds like it successfully started
> the processes, but then one aborted? Was there an error message beyond just
> the -1 return status?
> >>>>>
> >>>>>
> >>>>> On Apr 2, 2014, at 11:17 PM, Nisha Dhankher -M.Tech(CSE) <
> nishadhankher-coaeseeit_at_[hidden]> wrote:
> >>>>>
> >>>>>> Error in btl_tcp_endpoint.c:638: connection failed due to error 113.
> >>>>>>
> >>>>>> In Open MPI, this error came when I ran my mpiBlast program on the
> Rocks cluster. Connecting to the hosts 10.1.255.236 and 10.1.255.244 failed.
> And when I ran the following command:
> >>>>>> linux_shell$ perl -e 'die$!=113'
> >>>>>> this message came: "No route to host at -e line 1."
> >>>>>> shell$ mpirun --mca btl ^tcp
> >>>>>> shell$ mpirun --mca btl_tcp_if_include eth1,eth2
> >>>>>> shell$ mpirun --mca btl_tcp_if_include 10.1.255.244
> >>>>>> were also executed, but it did not recognize these commands... and
> aborted... What should I do? When I ran my mpiBlast program for the first
> time, it gave an MPI_ABORT error... bailing out of signal -1 on rank 2...
> then I removed my public Ethernet cable... and then it gave the
> btl_tcp_endpoint error 113...
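For reference, error 113 is the Linux errno EHOSTUNREACH ("No route to host"), which is exactly what the perl one-liner above decodes. Note also that --mca options are arguments to an mpirun invocation, not standalone commands, and btl_tcp_if_include takes interface names (or, in newer Open MPI releases, CIDR subnets), not a bare IP address. A sketch:

```shell
# Decode errno 113 -- both print "No route to host" on Linux:
perl -e '$! = 113; print "$!\n"'
python3 -c 'import os; print(os.strerror(113))'

# --mca flags modify an mpirun command line; they do nothing by
# themselves. E.g., restricting the TCP BTL to one interface:
mpirun --mca btl_tcp_if_include eth0 -np 16 -machinefile mf \
    mpiblast -d all.fas -p blastn -i query.fas -o out.txt
```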
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> users mailing list
> >>>>>> users_at_[hidden]
> >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >
> >
> >
> >
> >
> >
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>