mpiexec GMPI_SLAVE env.t problem
Pete Wyckoff
pw at osc.edu
Fri Aug 15 20:25:40 EDT 2003
brh at unimelb.edu.au said on Fri, 15 Aug 2003 15:19 +1000:
> I tried "mpi -v -v -v -np1 env", etc. as per below
> and the only relevant difference in the shell & mpiexec env.t seems to be
>
> env 65 OLDPWD=/home/brh
> env 66 GMPI_ID=0
> env 67 GMPI_SLAVE=node040
> env 68 MPIEXEC_STDIN_PORT=36757
>
> see below for the outputs (except minor edits "to protect the innocent")
>
> I know I have long paths, etc. but somehow I dont think it's shell env.t
> limitations.
Agreed. But something odd is definitely happening on your system.
First, comparing the output from the "env" in the bash script to
the environment displayed in mpiexec's debugging output shows these
differences:
- added GMPI_* and MPIEXEC_*
- added LOGD=/usr/var/adm/syslog.dated
- changed PWD and added OLDPWD
- changed _
The last two seem like reasonable shell things precipitated by the "cd"
you do before running mpiexec. The second one, LOGD, I have no clue
where that comes from. So we're mostly good here, as expected.
Then the more interesting comparison is looking at what changes from
mpiexec's debugging output to what shows up in the "env" run as a
sub-process as spawned by the pbs_mom:
- removed three critical ones: GMPI_ID GMPI_SLAVE MPIEXEC_STDIN_PORT
- removed PBS_NODEFILE
- removed the three you had marked "[removed -bh]" (or maybe
  you removed those yourself too)
- removed _
- removed INTEL_LICENSE_FILE
- removed LOGD
- changed PATH, MANPATH, LD_LIBRARY_PATH
I ran your script locally and got the same results for the first step,
except for the addition of LOGD you have, and saw only the removal of
three shell-ish variables for the second step: OLDPWD, _, PS1.
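The two comparisons above can be done mechanically rather than by eye. What follows is only a rough sketch: the function names and the file names in the usage comment are made up for illustration, and it assumes each dump has one variable per line (multi-line values would confuse it). It strips the "env NN " prefix seen in the quoted mpiexec output and reports which variable names were removed or added; it does not report changed values.

```shell
# Print the sorted variable names found in an environment dump.
env_names() {
    # Drop an "env NN " prefix if present, then keep the text before the first "=".
    sed 's/^env [0-9][0-9]* //' "$1" | cut -d= -f1 | LC_ALL=C sort -u
}

# Compare two dumps: $1 is the earlier environment, $2 the later one.
env_diff() {
    before=$(mktemp) after=$(mktemp)
    env_names "$1" > "$before"
    env_names "$2" > "$after"
    echo "removed:"
    LC_ALL=C comm -23 "$before" "$after"   # names only in the earlier dump
    echo "added:"
    LC_ALL=C comm -13 "$before" "$after"   # names only in the later dump
    rm -f "$before" "$after"
}

# Hypothetical usage, with file names invented for this example:
#   env_diff env.mpiexec env.child
```

For the case described here, the "removed" section would be expected to list GMPI_ID, GMPI_SLAVE and MPIEXEC_STDIN_PORT among others.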
The only odd thing I notice is that your SHELL seems to be tcsh but your
PBS script is executed in bash. Could there be some tcsh system files
like /etc/csh.{cshrc,login} and /etc/profile.d/*.csh that are mucking
with the environment?
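A quick way to check that suspicion is to grep those startup files for the csh builtins that change the environment. This is a minimal sketch, not a complete audit: the function name is invented here, and it only catches plain setenv/unsetenv lines, so anything done through sourced files or aliases would be missed.

```shell
# List setenv/unsetenv lines in the given csh startup files.
scan_csh_files() {
    for f in "$@"; do
        [ -f "$f" ] || continue   # skip missing files and unmatched globs
        # /dev/null as a second file makes grep print the file name.
        grep -nE '^[[:space:]]*(setenv|unsetenv)[[:space:]]' "$f" /dev/null
    done
    return 0
}

# Usage on the node, with the paths mentioned above:
#   scan_csh_files /etc/csh.cshrc /etc/csh.login /etc/profile.d/*.csh
```

Each hit is printed as file:line:text, which makes it easy to see whether PATH, MANPATH or LD_LIBRARY_PATH are being rewritten there.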
Once again I'm at a loss. If you want to, you can attach strace to the
pbs_mom that will run your script and mail me offline the huge output
that results. Make sure the node has no jobs when you start this:

    node040:root# strace -vFf -s 5000 -o /tmp/strace.out -p $(pgrep pbs_mom)

Run the job, make sure it completes, ctrl-c the strace, and gzip
/tmp/strace.out. The big -s arg keeps the environment strings from being
truncated, so we can see the environment all the way from pbs down to
the code itself. Hopefully we'll be able to see the important variables
disappear somewhere that can be fixed.
-- Pete