
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] qsub - mpirun problem
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-09-29 03:27:56


Hi Zhiliang

This has nothing to do with how you configured Open MPI. The issue is
that your Torque queue manager isn't setting the expected environment
variables to tell us the allocation. I'm not sure why it wouldn't be
doing so, and I'm afraid I'm not enough of a Torque person to know how
to guide you.

What is happening, though, is that we are actually launching via ssh
instead of Torque, since we don't see the Torque system. Your system
appears happy to let us do that, so this may not be a real problem for
you beyond the annoyance of having to specify the machinefile
every time.

I'm curious as to how you found the machinefile - what is the file
named? In a typical Torque install, the file is located in some
default tmp directory and is given a name that includes the PBS job id.
Since you didn't find that environment variable, how did you know what
filename to pass to mpirun?
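For reference, here is a sketch of a check you could drop into a job
script to see what a stock Torque install should be exporting (the
exact nodefile path varies by install; PBS_JOBID and PBS_NODEFILE are
the variables we look for):

```shell
# Sketch: run inside a Torque job script. A stock Torque install
# exports PBS_JOBID and PBS_NODEFILE (the per-job machinefile) into
# the job environment; if they come up "<not set>" here, Open MPI
# cannot see the allocation and falls back to ssh.
echo "PBS_JOBID:    ${PBS_JOBID:-<not set>}"
echo "PBS_NODEFILE: ${PBS_NODEFILE:-<not set>}"
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "Allocated nodes:"
    cat "$PBS_NODEFILE"
fi
```

If the variables print as set and the nodefile lists your nodes, then
mpirun should pick up the allocation without any -machinefile at all.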

Thanks
Ralph

On Sep 28, 2008, at 8:07 PM, Zhiliang Hu wrote:

> Ralph,
>
> Thank you for your quick response.
>
> Indeed as you expected, "printenv | grep PBS" produced nothing.
>
> BTW, I have:
>
>> qmgr -c 'p s'
>
> # Create queues and set their attributes.
> #
> #
> # Create and define queue default
> #
> create queue default
> set queue default queue_type = Execution
> set queue default resources_default.nodes = 7
> set queue default enabled = True
> set queue default started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = nagrp2
> set server default_queue = default
> set server log_events = 511
> set server mail_from = adm
> set server query_other_jobs = True
> set server resources_available.nodect = 6
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server next_job_number = 793
>
> - I am not sure what is missing from my configuration (do you mean
> the installation "configure" step with optional directives), or
> something else?
>
> Thank you,
>
> Zhiliang
>
> At 07:16 PM 9/28/2008 -0600, you wrote:
>> Hi Zhiliang
>>
>> First thing to check is that your Torque system is defining and
>> setting the environmental variables we are expecting in a Torque
>> system. It is quite possible that your Torque system isn't configured
>> as we expect.
>>
>> Can you run a job and send us the output from "printenv | grep PBS"?
>> We should see a PBS jobid, the name of the file containing the names
>> of the allocated nodes, etc.
>>
>> Since you are able to run with -machinefile, my guess is that your
>> system isn't setting those environmental variables as we expect. In
>> that case, you will have to keep specifying the machinefile by hand.
>>
>> Thanks
>> Ralph
>>
>> On Sep 28, 2008, at 7:02 PM, Zhiliang Hu wrote:
>>
>>> I have asked this question on the TorqueUsers list. Responses from
>>> that list suggest that the question be asked on this list:
>>>
>>> The situation is:
>>>
>>> I can submit my jobs as in:
>>>> qsub -l nodes=6:ppn=2 /path/to/mpi_program
>>>
>>> where "mpi_program" is:
>>> /path/to/mpirun -np 12 /path/to/my_program
>>>
>>> -- however, everything ran on the head node (and once on the
>>> first compute node). The jobs do complete anyway.
>>>
>>> While mpirun does work on its own when given a "-machinefile",
>>> Glen among others pointed out, as does this web site http://wiki.hpc.ufl.edu/index.php/Common_Problems
>>> (I got the same error as the last example on that web page), that
>>> it's not a good idea to provide a machinefile since it's "already
>>> handled by OpenMPI and Torque".
>>>
>>> My question is: why aren't Open MPI and Torque distributing the
>>> jobs across all the nodes?
>>>
>>> ps 1:
>>> Open MPI was configured and installed with the "--with-tm" option,
>>> and "ompi_info" does show the lines:
>>>
>>> MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.7)
>>> MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.7)
>>>
>>> ps 2:
>>> "/path/to/mpirun -np 12 -machinefile /path/to/machinefile /path/to/my_program"
>>> works normally (sends jobs to all nodes).
>>>
>>> Thanks,
>>>
>>> Zhiliang
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users