stdin question
Pete Wyckoff
pw at osc.edu
Mon Aug 22 16:40:30 EDT 2005
ckirby3 at colsa.com wrote on Mon, 22 Aug 2005 12:34 -0500:
> I am running Torque 1.2.0p4 with mpiexec 0.79 both compiled using gcc 3.3.
> We are trying to run on 512 nodes using Myrinet with the gm 2.0.21 driver.
> I've applied the OS X patch for gm.c mentioned in the mailing list.
>
> What is needed to run a program as follows,
>
> #mpiexec -n 10 myprogram < inputdata >& output
>
> How can I redirect the "inputdata" file as input to "myprogram" and also
> redirect the stdout to the "output" file? I've played around
> with -nostdin, -nostdout and -allstdin arguments but nothing seems to work.
> The program does random reads to "inputdata" and only reads portions of
> "inputdata" at a time. All mpi ranks need to read "inputdata" as stdin.
You can't do that, at least not the random reads. Stdin is connected
via a socket from the compute process back to mpiexec, and seeking on
sockets doesn't work.
To make sure all nodes see the input, add "-allstdin" on the mpiexec
command line. (They can all read() and get the same data, but none
can lseek().)
You might change the code to read the entire file first, then "seek"
in memory instead.
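A minimal sketch of the read-everything-then-seek-in-memory idea, in
Python rather than whatever language your program is written in. The
helper name load_seekable is made up for illustration. It assumes the
input fits in memory on every node.

```python
import io


def load_seekable(stream):
    """Drain a non-seekable byte stream and return a seekable in-memory copy.

    Hypothetical helper: `stream` stands in for the stdin that mpiexec
    forwards over a socket, where read() works but seeking does not.
    """
    data = stream.read()      # blocks until the sender closes the stream
    return io.BytesIO(data)   # BytesIO supports seek(), tell() and read()
```

Each rank would call load_seekable(sys.stdin.buffer) once at startup and
then do its random reads against the returned buffer instead of stdin.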
You might also want to change the code around to accept an input
file on the command line, then predistribute the file to the compute
nodes that need it with a tool such as pbsdcp or trust NFS to
present the input file to all.
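Roughly what the command-line variant could look like, again as a
Python sketch. The argument name input_path and the helper read_record
are assumptions for illustration, not anything mpiexec provides. The
point is only that a regular file opened by path supports seeking, which
a socket-backed stdin does not.

```python
import argparse


def parse_args(argv):
    """Parse a single positional argument naming the input file."""
    parser = argparse.ArgumentParser(description="hypothetical myprogram")
    parser.add_argument("input_path", help="path to the input data file")
    return parser.parse_args(argv)


def read_record(path, offset, length):
    """Return up to `length` bytes starting at byte `offset` of the file."""
    with open(path, "rb") as f:
        f.seek(offset)         # works because this is a regular file
        return f.read(length)
```

The job would then be started as "mpiexec -n 10 myprogram inputdata",
with inputdata at the same path on every node.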
> Another problem shows up in the mom logs of torque with the following
> message being repeated several times and the job will fail. However I can
> get one job to run successfully after rebooting but subsequent jobs fail.
>
> 08/20/2005 14:22:29;0001; pbs_mom;Svr;pbs_mom;im_eof, Premature end of
> message from addr xxx.xxx.xxx.xxx:15003
> 08/20/2005 14:22:29;0001; pbs_mom;Svr;pbs_mom;task_check, cannot tm_reply
> to 17606.mach5c.mach5.roc task 1
> 08/20/2005 14:22:29;0001; pbs_mom;Svr;pbs_mom;task_check, cannot tm_reply
> to 17606.mach5c.mach5.roc task 1
>
> and the stderr file from torque shows this,
>
> [196] Error: write to socket failed !
> [200] Error: write to socket failed !
> mpiexec: Error: read_gm_startup_ports: eof in gmpi_port#1 iter 195.
Means something died unexpectedly. The torque guys might recognize
the mom log entries, but it looks like some other task died.
The lines prefixed with [196] and [200] are probably from mpich/gm.
The last line is mpiexec, during the startup protocol, trying to
read from the 195th task that checked in (not necessarily MPI task
number 195), but this task disconnected the socket. mpiexec exits
at this point, perhaps causing the other two [] lines from tasks
that notice there is no mpiexec to talk to anymore.
Check all the mom_logs and see if you can figure out which one was
the first to die. The other errors cascade from that first one.
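If you have collected the mom_logs from all the nodes in one place,
something like the following Python sketch could order the matching
entries by time. It assumes each entry starts with a timestamp like
"08/20/2005 14:22:29" followed by ";", as in the lines you quoted, and
that the node clocks are reasonably in sync. The function names and the
dict-of-lines input are made up for illustration.

```python
from datetime import datetime

STAMP_FORMAT = "%m/%d/%Y %H:%M:%S"   # matches "08/20/2005 14:22:29"


def parse_stamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    stamp = line.split(";", 1)[0].strip()
    try:
        return datetime.strptime(stamp, STAMP_FORMAT)
    except ValueError:
        return None   # wrapped continuation lines carry no timestamp


def matches_in_time_order(logs, needle):
    """Collect lines containing `needle` from every node, earliest first.

    `logs` maps a node name to an iterable of that node's mom_log lines.
    """
    hits = []
    for node, lines in logs.items():
        for line in lines:
            when = parse_stamp(line)
            if when is not None and needle in line:
                hits.append((when, node, line.rstrip("\n")))
    hits.sort(key=lambda hit: (hit[0], hit[1]))   # by time, then node name
    return hits
```

The first element of matches_in_time_order(logs, "im_eof") would be the
earliest such entry. The timestamps only have one-second resolution, so
treat ties as unordered.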
-- Pete