Pbs_server post job file processing error

Is waiting between job submissions the best solution? Can I do something to improve this? Also, suppose job with job ID

From 'qstat -f' outputs like the ones above, it seems that the problem has something to do with a 'Stale NFS file handle' error in the post job file processing.
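To see which nodes a given job ran on, the `exec_host` attribute in the 'qstat -f' output can be reduced to a list of host names. The following is only a sketch: it assumes the value has the usual `host/slot+host/slot` shape (for example `node01/0+node01/1+node02/0`), and the function name `hosts_from_exec_host` is made up for illustration.

```python
def hosts_from_exec_host(value):
    """Return the unique host names found in an exec_host value.

    Assumes entries are joined by '+' and that each entry looks like
    'host/slot' (e.g. 'node01/0+node01/1+node02/0').
    """
    hosts = []
    for chunk in value.strip().split("+"):
        # Keep only the part before the first '/', i.e. the host name.
        host = chunk.split("/", 1)[0].strip()
        if host and host not in hosts:
            hosts.append(host)
    return hosts


if __name__ == "__main__":
    print(hosts_from_exec_host("node01/0+node01/1+node02/0"))
```

Collecting this list for every failed job makes it easier to see which nodes keep showing up.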

By submitting hundreds of test jobs, I have been able to identify a number of nodes that produce failed jobs. One strange thing about these problematic nodes is that if one of them is the only node assigned to a job, the job runs fine on it.

The problem only arises when they are mixed with other nodes. Since I do not know how to fix the 'Stale NFS file handle' error in the post job file processing, I avoid these nodes by submitting 'dummy' jobs to them.
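One way to narrow down the problematic nodes from many test jobs is to tally, per node, how often it took part in a failed job. This is a rough sketch rather than the procedure actually used here: the record format (a list of host names plus a success flag per job) and the name `failure_rates` are assumptions for illustration.

```python
from collections import Counter


def failure_rates(jobs):
    """Return {host: fraction of its jobs that failed}.

    `jobs` is an iterable of (hosts, succeeded) pairs, where `hosts` is
    the list of node names a job ran on (hypothetical record format).
    """
    failed = Counter()
    total = Counter()
    for hosts, succeeded in jobs:
        for host in set(hosts):  # count each host once per job
            total[host] += 1
            if not succeeded:
                failed[host] += 1
    return {host: failed[host] / total[host] for host in total}


if __name__ == "__main__":
    # Made-up example records, not real results.
    test_jobs = [
        (["n1", "n2"], True),
        (["n1", "n3"], False),
        (["n2", "n3"], False),
        (["n3"], True),
    ]
    for host, rate in sorted(failure_rates(test_jobs).items()):
        print(host, round(rate, 2))
```

Since the failures here only show up when nodes are mixed, it may be worth running the tally separately for single-node and multi-node jobs.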

Now all jobs produce output as expected, and there is no need to wait before consecutive submissions.

Python does not have a convention on the use of sys.

The second issue is the failure of post job file processing. This might indicate a misconfigured node. From now on I am guessing. Since a successful job takes about , it is probably true that with a delay of 50 seconds all your jobs land on the first available node. With a smaller delay you get more nodes involved, and eventually you either hit a misconfigured node or get two scripts executing concurrently on a single node.

I would modify the submit script by adding a line that writes the names of the execution hosts to the debug output. Then you or the Torque admin might want to look for the unprocessed output files in the MOM spool directory on the failing node to get some info for further diagnosis.

The exiting state still takes abnormally long (about a minute or so), but it does finish, so the nodes get cleared for further jobs.
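The exact line suggested for the submit script is not preserved here. As a sketch of the same idea, the job itself could log its execution hosts by reading the file named in the `PBS_NODEFILE` environment variable, which Torque sets for a running job and which lists one host name per allocated slot. The helper name `execution_hosts` is an assumption for illustration.

```python
import os


def execution_hosts(nodefile_path=None):
    """Return the unique host names listed in the PBS node file, in order.

    Falls back to the PBS_NODEFILE environment variable when no path is
    given, and returns an empty list when not running under PBS/Torque.
    """
    path = nodefile_path or os.environ.get("PBS_NODEFILE")
    if not path:
        return []
    with open(path) as fh:
        names = [line.strip() for line in fh if line.strip()]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(names))


if __name__ == "__main__":
    hosts = execution_hosts()
    print("execution hosts:", ", ".join(hosts) if hosts else "(unknown)")
```

Printing this at the start of the job puts the host names into the job's output, so a failed job can be matched to the node whose MOM spool directory should be checked.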

The 'Post job file processing error' might be related to stage-out issues, which cause mail to be sent. If your SMTP is not configured correctly, this seems to be unavoidable.

Dear Wizards, I have encountered what must be a pretty normal problem.
