Job Scheduler/Resource Manager Evaluation
Schedulers Evaluated
Condor, PbsPro, Torque, Sun Grid Engine (SGE), and the Maui meta-scheduler
Analysis
All of the schedulers that were investigated have overlapping feature sets. Each of the products described below should be able to handle job scheduling, queueing, and system resource management for z-cluster and the IBG.
Condor is very feature-rich but heavily targeted at high-throughput computing (HTC) rather than grid environments. However, it does seem to be moving in that direction with the release of Condor-G. It does support MPI jobs, but only if the resources are reserved for the duration of the task, which removes one of its stronger features (checkpointing and migration). The source is no longer downloadable from Wisconsin's website, but according to the documentation it is available upon request.
PbsPro and Torque are both based upon the original OpenPBS codebase and are thus quite similar in implementation and API. Although PbsPro is a commercial product, Altair indicates that a free educational license is available. Torque seems to have significant contributions from the grid community which improve upon the limitations of OpenPBS. However, the documentation on these improvements is fairly light. Both products support MPI well and have stable APIs, which lends itself to good integration with Globus and other grid software (e.g. grid portals and containers).
Sun Grid Engine (SGE), although relatively new in contrast to Condor and the OpenPBS-based products, seems to have a strong community following with excellent documentation. The GridEngine API is also supported by Globus and the major grid portals. SGE has a similar feature set to Torque and PbsPro and supports cluster implementations very well. GridEngine code is open source as part of a Sun Community Project.
The Maui meta-scheduler was also briefly evaluated for possible use on the bio-grid. Maui can provide true grid job-scheduling by coordinating job-submission across clusters of machines managed via SGE or the PBS-based schedulers.
Recommendation
As most of the schedulers have similar feature sets, no one product stands out in particular against the others for use on the bio-grid. Since each of the tools could serve the required functions, it is my recommendation that we choose one based upon licensing, API support, and the availability of source code and documentation. In this regard, Sun Grid Engine seems to be a good choice, with Torque as a close second. Additional feedback on this analysis or alternative recommendations are welcome.
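The selection criteria above can be tallied in a small sketch. The True/False/None values below are my reading of the prose in the Analysis section, not measured data; None means the analysis does not say. The names `CRITERIA`, `FINDINGS`, and `rank` are made up for this illustration.

```python
# Criteria named in the recommendation: licensing, source, API support, docs.
CRITERIA = ("free_license", "source_available", "globus_api_support", "good_documentation")

# Qualitative findings transcribed from the Analysis section (assumption:
# "source on request only" is counted as not readily available).
FINDINGS = {
    "Condor": {"free_license": None, "source_available": False,
               "globus_api_support": True, "good_documentation": None},
    "PbsPro": {"free_license": True, "source_available": False,
               "globus_api_support": True, "good_documentation": None},
    "Torque": {"free_license": True, "source_available": True,
               "globus_api_support": True, "good_documentation": False},
    "Sun Grid Engine": {"free_license": True, "source_available": True,
                        "globus_api_support": True, "good_documentation": True},
}


def rank(findings):
    """Return (product, score) pairs, best first; score = criteria clearly met."""
    scores = {
        name: sum(1 for criterion in CRITERIA if answers.get(criterion) is True)
        for name, answers in findings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank(FINDINGS):
        print(f"{name}: {score}/{len(CRITERIA)}")
```

An unweighted count is the simplest possible choice here; weighting the criteria differently could change the order of the lower-ranked products.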
Additional Notes
Massimo suggested that Condor may have some interoperability issues with our openMosix-based cluster. The following email from the Condor mailing lists indicates that the two can work together, with some limitations:
From https://lists.cs.wisc.edu/archive/condor-users/2005-May/msg00113.shtml :
Re: [Condor-users] Joining an OpenMosix cluster to a Condor Pool
* Date: Wed, 11 May 2005 15:24:01 -0400
* From: Scott B <scottb99@xxxxxxxxx>
* Subject: Re: [Condor-users] Joining an OpenMosix cluster to a Condor Pool
Thanks for the info. I'm fairly new at using CONDOR and as of right
now it's in an experimental test mode while we think about ways we
want to use it.
I find it interesting how the different groups are approaching
distributed processing. It would be interesting to know if kernel
level distributed processing researchers communicate and/or
collaborate with effors at distributed systems like CONDOR or GLOBUS.
I would imagine that both groups at some point will be implementing
software that overlapps somewhat.
On 5/10/05, Mark Silberstein <marks@xxxxxxxxxxxxxxxxxxxxxxx> wrote:
> I have it working flawlessly, for already 2-3 years.
> The idea is that you should make "Condor know about mosix, and mosix -
> know about Condor"
> The only thing you should do is to start condor_master daemon with
> runhome. Since MOSIX-related environment is inherited to all forked
> processes, all condor-originated jobs will not be moved around, so that
> eventually hosts having Condor jobs running will not be used by MOSIX.
> To cause Condor not to run things on machines, occupied by MOSIX, you
> have to edit STARTD policy not to run things if the load average is
> higher than, say, 0.6-0.7
> There's one subtlety, you should pay attention to.
> The only problem is when you parallelize your applications using MOSIX,
> i.e. start with one process, which forks many processes simultaneously,
> each one working on different part of the input. After all forked
> processes start on single node, MOSIX will usually move them to other
> nodes right away, since its algorithms will prefer less loaded nodes.
> This will not be true if your cluster is fully occupied by Condor jobs.
> MOSIX will NOT move the processes immediately, it will take a while
> until it realizes that it's still worth moving them to the busy nodes
> ( remember, all these are running Condor jobs at this time). This will
> cause an increase of the "owner load average", so Condor will evict the
> running Condor jobs. This also takes a bit of a time. So ONLY now your
> forked processes are running as you expected.
> If you have any more questions - you might want to take it off the list
> Mark
> On Mon, 2005-05-09 at 14:45 -0400, Scott B wrote:
> > Has anybody attempted to join an OpenMosix cluster of systems to a
> > CONDOR pool? I can think of several problems that might occur.
> >
> > Or would it be folly to even consider it?
> >
> > _______________________________________________
> > Condor-users mailing list
> > Condor-users@xxxxxxxxxxx
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
>
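The interplay described in the email can be modelled roughly as follows. This is only a toy model of the two decisions Mark describes (do not start Condor jobs on a loaded node, evict them when the owner load rises); it is not Condor configuration syntax, and `Node`, `may_start_condor_job`, `should_evict`, and `LOAD_THRESHOLD` are hypothetical names.

```python
from dataclasses import dataclass

# Lower end of the 0.6-0.7 load-average range suggested in the email.
LOAD_THRESHOLD = 0.6


@dataclass
class Node:
    name: str
    owner_load: float  # load average from non-Condor (e.g. MOSIX-migrated) work
    running_condor_job: bool = False


def may_start_condor_job(node, threshold=LOAD_THRESHOLD):
    """Condor should only start work on a node that MOSIX is not already using."""
    return not node.running_condor_job and node.owner_load < threshold


def should_evict(node, threshold=LOAD_THRESHOLD):
    """Once MOSIX moves processes onto a busy node, the Condor job is evicted."""
    return node.running_condor_job and node.owner_load >= threshold


if __name__ == "__main__":
    node = Node("node01", owner_load=0.1)
    print(may_start_condor_job(node))   # idle node: Condor may use it
    node.running_condor_job = True
    node.owner_load = 0.9               # MOSIX has migrated forked processes here
    print(should_evict(node))           # Condor job gives way
```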
--
RobVogelbacher - 05 Jul 2005
Condor
Features:
- a job queuing mechanism
- scheduling policy
- priority scheme
- resource monitoring
- resource management.
- users submit serial or parallel jobs to Condor; Condor places them into a queue, chooses when and where to run them based upon a policy, and carefully monitors their progress
- supports job checkpointing and migration
- uses "ClassAds" to match resource consumers/requesters with resource owners
- can be used to harness idle computing power from PCs
- supports job ordering using directed acyclic graphs
- works well with PVM via add-on module
- Condor-G can be used in conjunction with Globus to provide external users the ability to submit jobs into a job queue
- Condor-G requires Globus installation and resources similar to globus-job-submit or job submission portal (from FAQ)
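The idea behind the ClassAd matching mentioned in the list above is that both sides advertise attributes and requirements, and a match needs both sets of requirements to hold. The sketch below illustrates that two-sided check with plain Python dictionaries and a greedy first-fit loop; it is not the ClassAd language or Condor's actual matchmaking algorithm, and all names and attribute values are invented for the example.

```python
def matches(job, machine):
    """A match requires the job to accept the machine AND the machine to accept the job."""
    return job["requirements"](machine) and machine["requirements"](job)


def matchmake(jobs, machines):
    """Greedy first-fit pairing of jobs to free machines (illustrative only)."""
    free = list(machines)
    pairs = []
    for job in jobs:
        for machine in free:
            if matches(job, machine):
                pairs.append((job["name"], machine["name"]))
                free.remove(machine)  # safe: we stop iterating immediately
                break
    return pairs


# Hypothetical advertisements: node1's owner refuses jobs from "guest".
MACHINES = [
    {"name": "node1", "memory_mb": 512,
     "requirements": lambda job: job["owner"] != "guest"},
    {"name": "node2", "memory_mb": 2048,
     "requirements": lambda job: True},
]
JOBS = [
    {"name": "blast", "owner": "guest",
     "requirements": lambda machine: machine["memory_mb"] >= 256},
    {"name": "align", "owner": "staff",
     "requirements": lambda machine: machine["memory_mb"] >= 256},
]

if __name__ == "__main__":
    print(matchmake(JOBS, MACHINES))
```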
Limitations (for jobs that rely on checkpointing and migration):
- Multi-process jobs are not allowed. This includes system calls such as fork(), exec(), and system().
- Inter-process communication is not allowed. This includes pipes, semaphores, and shared memory.
- Network communication must be brief. A job may make network connections using system calls such as socket(), but a network connection left open for long periods will delay checkpointing and migration.
- Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. Condor reserves these signals for its own use. Sending or receiving all other signals is allowed.
- Alarms, timers, and sleeping are not allowed. This includes system calls such as alarm(), getitimer(), and sleep().
- Multiple kernel-level threads are not allowed. However, multiple user-level threads are allowed.
- Memory mapped files are not allowed. This includes system calls such as mmap() and munmap().
- File locks are allowed, but not retained between checkpoints.
- All files must be opened read-only or write-only. A file opened for both reading and writing will cause trouble if a job must be rolled back to an old checkpoint image. For compatibility reasons, a file opened for both reading and writing will result in a warning but not an error.
- A fair amount of disk space must be available on the submitting machine for storing a job's checkpoint images. A checkpoint image is approximately equal to the virtual memory consumed by a job while it runs. If disk space is short, a special checkpoint server can be designated for storing all the checkpoint images for a pool.
- On Digital Unix (OSF/1), HP-UX, and Linux, your job must be statically linked. Dynamic linking is allowed on all other platforms.
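The disk-space limitation above lends itself to a quick estimate. Assuming, as the list states, that a checkpoint image is roughly the size of the job's virtual memory, the space needed on the submitting machine is about the sum over its checkpointed jobs. The function names are hypothetical and the figures in the example are made up.

```python
def checkpoint_space_mb(job_vm_sizes_mb):
    """Approximate disk space needed: one image per job, each about its VM size."""
    return sum(job_vm_sizes_mb)


def needs_checkpoint_server(job_vm_sizes_mb, free_disk_mb):
    """True if the submitting machine cannot hold all images locally."""
    return checkpoint_space_mb(job_vm_sizes_mb) > free_disk_mb


if __name__ == "__main__":
    jobs = [200, 350, 150]  # virtual memory per job in MB (example values)
    print(checkpoint_space_mb(jobs))
    print(needs_checkpoint_server(jobs, free_disk_mb=500))
```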
PbsPro
Features:
- xPBS provides a graphical interface for submitting both batch and interactive jobs, querying job, queue, and system status, and tracking the progress of jobs. Also available is the PBS command line interface (CLI) providing the same functionality as xPBS.
- Users can specify the priority of their jobs, and defaults can be provided at both the queue and system level.
- Enables the user to define a wide range of interdependencies between batch jobs. Such dependencies include execution order, synchronization, and execution conditioned on the success or failure of another specified job.
- Provides transparent job scheduling on any system by any authorized user. Jobs can be submitted from any client system or any compute server.
- Configuration options in PBS permit the administrator to allow or deny access on a per-system, per-group, and/or per-user basis.
- For chargeback or usage analysis, PBS maintains detailed logs of system activity. Custom charges can be tracked per-user, per-group, and/or per-system.
- Take advantage of otherwise idle workstations by configuring them to be used for computation when the user is away.
- API
- Automatic Load-leveling based on hardware configuration, resource availability, keyboard activity (similar to condor pooling)
- Does not require that jobs be targeted to a specific computer system. This allows users to submit their job and have it run on the first available system that meets their resource requirements. This also prevents waiting on a busy system when other computers are idle.
- Maps user account names on one system to the appropriate name on remote server systems. This allows PBS to function fully in environments where users do not have a consistent username across all the resources they have access to.
- Supports parallel programming libraries such as MPI, MPL, PVM, and HPF. Such applications can be scheduled to run within a single multiprocessor system or across multiple systems.
- Highly configurable job-scheduler
- Automatic file staging
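The job interdependencies listed above (execution order, and execution conditioned on another job's success or failure) can be sketched as follows. This only illustrates the concept with the standard-library `graphlib` module (Python 3.9+); it does not call PBS. The condition labels are borrowed from the PBS dependency types `afterok`, `afternotok`, and `afterany`, and the job names are invented.

```python
from graphlib import TopologicalSorter


def run_order(predecessors):
    """Return one valid execution order for a {job: {jobs it depends on}} mapping."""
    return list(TopologicalSorter(predecessors).static_order())


def is_runnable(conditions, exit_codes):
    """Check a job's (predecessor, condition) pairs against finished jobs' exit codes."""
    for predecessor, condition in conditions:
        if predecessor not in exit_codes:
            return False  # predecessor has not finished yet
        succeeded = exit_codes[predecessor] == 0
        if condition == "afterok" and not succeeded:
            return False
        if condition == "afternotok" and succeeded:
            return False
        # "afterany" only requires that the predecessor has finished
    return True


if __name__ == "__main__":
    deps = {"fetch": set(), "analyze": {"fetch"}, "report": {"analyze"}}
    print(run_order(deps))
    print(is_runnable([("analyze", "afterok")], {"fetch": 0, "analyze": 1}))
```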
Limitations
- Commercial product; requires an educational institution license
- Source code not available
- OpenPBS lacks many of the listed features, but its source code is available
Sun Grid Engine
Overview
- Grid Engine is a Distributed Resource Management (DRM) software. DRM software aggregates compute power and delivers it as a network service. Grid Engine software is used with a set of computers to create powerful compute farms, which are used in a wide range of technical computing applications such as the development of semiconductors, bioinformatics, mechanical design, software development, oil/gas exploration, and financial analysis; additionally, massively scaling supercomputers use Grid Engine software in a variety of academic and research pursuits. Users enjoy access to large computing capability and organizations enjoy effective utilization of their computing resource investments approaching 100%. Essentially Grid Engine presents users a seamless, integrated computing capability. Grid Engine is used to support a wide variety of requirements; for instance, where users start many interactive and batch tasks as in product design or financial simulations; where sets of repetitious tasks are run as in software QA; where large numbers of users are placing jobs on limited resources as in education environments; and where users are launching parallel applications across massive numbers of processors for applications such as weather simulation.
Features
- can optimally place computing tasks and balance the load on a set of networked computers
- allow users to generate and queue more computing tasks than can be run at the moment
- ensure that tasks are executed with respect to priority and to providing all users with a fair share of access over time
- master daemon, scheduler and shadow master (failover) functionality
- execution daemon functionality for most available Unix flavors
- batch and interactive job support
- parallel make
- administrative, monitoring and submission command line interface clients
- GUI
- C language and Java language DRMAA (Distributed Resource Management Application API) binding
- man pages
- support of certificate based security
- command line interface that is aligned with the POSIX standard for batch queuing systems 1003.2d
- Motif based GUI representing the full Grid Engine client functionality
- library interface (GDI). The GDI (Grid Engine Database Interface) is used by most Grid Engine clients to implement their functionality. See the GDI module description for more information. Note that this interface is evolving.
- interactive jobs through the qrsh command or qtcsh. Standard output will be redirected back to the terminal.
- shadowing of master host for fault tolerance
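The "priority plus fair share" goal in the list above can be illustrated with a toy selection rule: run next the job whose owner has used the least so far, and break ties by priority and then submission order. This is not Grid Engine's actual scheduling algorithm, only a minimal sketch of the idea; `next_job` and the field names are hypothetical.

```python
def next_job(queue, usage):
    """Pick the queued job whose owner has consumed the least so far.

    queue: list of dicts with "user", "priority" (higher is more urgent)
           and "submitted" (lower is earlier).
    usage: accumulated usage per user, e.g. CPU hours; missing users count as 0.
    """
    return min(
        queue,
        key=lambda job: (usage.get(job["user"], 0.0), -job["priority"], job["submitted"]),
    )


if __name__ == "__main__":
    queue = [
        {"id": 1, "user": "alice", "priority": 5, "submitted": 1},
        {"id": 2, "user": "bob", "priority": 1, "submitted": 2},
        {"id": 3, "user": "bob", "priority": 3, "submitted": 3},
    ]
    usage = {"alice": 120.0, "bob": 10.0}
    print(next_job(queue, usage)["id"])
```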
Supported Platforms
- Sun Solaris
- Linux
- SGI IRIX
- Compaq/HP Tru64
- IBM AIX
- HP HP-UX
- Apple Mac OS X
- others (maybe as part of commercial app?)
Security
- can be configured to accept only messages that come from designated hosts and a reserved port. This mechanism provides security comparable to that of rsh.
- for installations that need improved security, an integration with Kerberos 5 and DCE exists. This integration has not been used frequently and does not come with an out-of-the-box installation procedure. It should still be possible to get it working with a little effort if you are particularly concerned about security.
- an SSL-based security framework prototype exists, which represents an opportunity for an interesting and relatively small development project toward a Grid Engine embedded security solution without a need for Kerberos or DCE.
Minimum System Requirements
- Master Host: 100 MB of free memory (minimum); 500 MB of free disk space (minimum)
- Execution Host: 20 MB of free memory (minimum); 50 MB of free disk space (minimum)*
- Fileserver: 20 MB disk space plus approx. 20MB per architecture
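The minimums above are easy to turn into a pre-installation check. The sketch below encodes the listed figures; the function and constant names are hypothetical, and the fileserver formula simply reads "20 MB plus approx. 20 MB per architecture" literally.

```python
# Minimum free memory and disk (MB) per host role, as listed above.
MINIMUM_MB = {
    "master": {"memory": 100, "disk": 500},
    "execution": {"memory": 20, "disk": 50},
}


def meets_minimum(role, free_memory_mb, free_disk_mb):
    """True if a host has at least the listed free memory and disk for its role."""
    required = MINIMUM_MB[role]
    return free_memory_mb >= required["memory"] and free_disk_mb >= required["disk"]


def fileserver_disk_mb(architectures):
    """Approximate fileserver disk need: 20 MB base plus about 20 MB per architecture."""
    return 20 + 20 * architectures


if __name__ == "__main__":
    print(meets_minimum("master", free_memory_mb=256, free_disk_mb=400))
    print(fileserver_disk_mb(3))
```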
Limitations
- Multi-cluster support may be limited (would need to look at JOSH)
- Fast-evolving code base
Torque
Overview
- TORQUE (Tera-scale Open-source Resource and Queue manager) is a resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original *PBS project and has incorporated significant advances in the areas of scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the U.S. Dept of Energy, Sandia, PNNL, U of Buffalo, TeraGrid, and many other leading edge HPC organizations. This version may be freely modified and redistributed subject to the constraints of the included license.
Fault Tolerance
- Additional failure conditions checked/handled
- Many, many bugs fixed
- Node health check script support
Scheduling Interface
- Extended query interface providing the scheduler with additional and more accurate information
- Extended control interface allowing the scheduler increased control over job behavior and attributes
- Allows the collection of statistics for completed jobs
Scalability
- Significantly improved server to MOM communication model
- Ability to handle larger clusters (over 15 TF/2,500 processors)
- Ability to handle larger jobs (over 2000 processors)
- Ability to support larger server messages
Usability
- Extensive logging additions
- More human-readable logging (i.e. no more 'error 15038 on command 42')
Additional Notes:
- Maui can be used with most resource managers
--
Rob Vogelbacher - 05 Jul 2005