While many hardware and software manufacturers are working on
improving the running time of deep learning jobs, EDL optimizes
For more about the EDL project, please refer to the invited post
on the official Kubernetes blog.
EDL includes two parts:
a Kubernetes controller for the elastic scheduling of distributed
deep learning jobs, and
changes that make PaddlePaddle a fault-tolerant deep learning framework.
This directory contains the Kubernetes controller. For more
information about fault-tolerance, please refer to the
We deployed EDL on a real Kubernetes cluster, dlnel.com, which is open to
graduate students of Tsinghua University. The performance test report
of EDL on this cluster is
PaddlePaddle EDL is provided under the Apache-2.0 license.