The intent of this blog post is to share how I got the XGBoost layer and our code under the AWS Lambda size limit so that others can benefit from it. In the process I also want to open it up for feedback so that I can learn whether things can be done differently, so please feel free to leave comments.
Download the required packages
To start, download XGBoost and scikit-learn along with their dependencies. The steps are listed below. (Note: sklearn is required for XGBClassifier. Some publicly available XGBoost layers do not package sklearn, and XGBClassifier does not work with those layers.)
- Create a requirements.txt file with the following content:

xgboost==0.90
scikit-learn==0.22.1

- Create a get_layer_packages.sh script to download the packages along with their dependencies:

#!/bin/bash

export PKG_DIR="python"

rm -rf ${PKG_DIR} && mkdir -p ${PKG_DIR}

docker run --rm -v $(pwd):/foo1 -w /foo1 lambci/lambda:build-python3.6 \
    pip install -r requirements.txt -t ${PKG_DIR}
You can replace the content of requirements.txt with any other packages for your own custom layers. Also, if a dependency is already available in another layer and you don't want to include it again, you can pass the --no-deps flag to pip install. Please note that this approach requires Docker to be installed on your system.
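For example, if numpy and scipy were already provided by another layer, the pip step in the script could look something like this (you then have to make sure every skipped dependency is actually available at runtime):

# same build step, but skip dependency resolution entirely
docker run --rm -v $(pwd):/foo1 -w /foo1 lambci/lambda:build-python3.6 \
    pip install --no-deps -r requirements.txt -t ${PKG_DIR}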
- Total package size at this point is approximately 347 MB:

$ du -c -m -d 1 python
1    python/joblib-0.14.1.dist-info
1    python/scikit_learn-0.22.1.dist-info
1    python/bin
148  python/xgboost
1    python/xgboost-0.90.dist-info
79   python/numpy
29   python/sklearn
2    python/joblib
1    python/scipy-1.4.1.dist-info
1    python/numpy-1.18.1.dist-info
90   python/scipy
347  python
347  total
Cleanup by removing unnecessary folders
- Remove the dist-info, __pycache__, test, and tests folders, and strip the .so files:

$ rm -rf python/scipy-1.4.1.dist-info/
$ rm -rf python/xgboost-0.90.dist-info/
$ rm -rf python/joblib-0.14.1.dist-info/
$ rm -rf python/bin/
$ rm -rf python/scikit_learn-0.22.1.dist-info/
$ rm -rf python/numpy-1.18.1.dist-info/
$ find python/ -name __pycache__ | xargs rm -rf
$ find python/ -name "*.so" | xargs strip
$ find python/ -name test | xargs rm -rf
$ find python/ -name tests | xargs rm -rf
Size after this step is around 265 MB:
$ du -c -m -d 1 python
18   python/sklearn
66   python/scipy
1    python/joblib
41   python/numpy
141  python/xgboost
265  python/
265  total
Typically this step would be enough for most layers. But in this case, as you can see, we are still over the 250 MB unzipped limit by around 15 MB before we can deploy.
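If you want to keep track of exactly how much headroom is left while trimming, a small check like the following compares the unzipped size in bytes against the documented 262,144,000-byte (250 MB) cap:

# 250 MB unzipped limit for function code plus layers, in bytes
LIMIT=262144000
# total unzipped size of the layer contents, in bytes
SIZE=$(du -s -b python | cut -f1)
echo "unzipped size: ${SIZE} bytes, headroom: $((LIMIT - SIZE)) bytes"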
Selective cleanup depending on the use-case
Now I was in unknown territory, and I still had to get rid of a little more than 15 MB. This is where the insanity started: first by randomly deleting things that looked unnecessary and realizing the modules wouldn't even load, and then, as a last resort, by thinking logically about what I actually needed. I think all of us go through that phase of going crazy over something, and when we are about to lose hope, a calm mind works better because there is nothing left to lose. I am glad I gave it my last shot in that state of mind.
Disclaimer: This is specific to our use-case. Use your discretion while deleting or modifying anything mentioned from here on.
- XGBoost cleanup
$ ls python/xgboost/
build-python.sh  compat.py  dmlc-core  __init__.py  libpath.py  plotting.py  rabit.py    src          VERSION
callback.py      core.py    include    lib          make        rabit        sklearn.py  training.py
Of these, dmlc-core, include, lib, make, rabit, and src are directories. You'll have to read and explore what you do and don't need for your use-case. For running XGBClassifier on AWS Lambda for predictions, we don't need the interface to allreduce and broadcast for distributed XGBoost, so I decided to remove the rabit and dmlc-core directories. I also deleted the src and include directories, which weren't required either.
$ rm -rf python/xgboost/rabit
$ rm -rf python/xgboost/dmlc-core/
$ rm -rf python/xgboost/src/
$ rm -rf python/xgboost/include/
This cleaned up around 3 MB.
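After deleting directories from a package it is easy to break imports, so it is worth a quick sanity check before moving on. Something like this should do, assuming python inside the lambci build image resolves to Python 3.6:

# run an import check against the trimmed package inside the Lambda build image
docker run --rm -v $(pwd):/foo1 -w /foo1 lambci/lambda:build-python3.6 \
    python -c "import sys; sys.path.insert(0, 'python'); import xgboost; print(xgboost.__version__)"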
- sklearn cleanup
Next, I repeated the process for sklearn. I chose sklearn since it is not a true dependency of xgboost, and only a part of its functionality is required by XGBClassifier.
$ rm -rf python/sklearn/gaussian_process/
$ rm -rf python/sklearn/model_selection/
$ rm -rf python/sklearn/tree/
$ rm -rf python/sklearn/impute/
$ rm -rf python/sklearn/feature_extraction/
$ rm -rf python/sklearn/cluster/
$ rm -rf python/sklearn/cross_decomposition/
$ rm -rf python/sklearn/neighbors/
$ rm -rf python/sklearn/experimental/
$ rm -rf python/sklearn/neural_network/
$ rm -rf python/sklearn/covariance/
$ rm -rf python/sklearn/datasets/
$ rm -rf python/sklearn/decomposition/
$ rm -rf python/sklearn/inspection/
$ rm -rf python/sklearn/metrics/
$ rm -rf python/sklearn/feature_selection/
$ rm -rf python/sklearn/svm/
$ rm -rf python/sklearn/manifold/
$ rm -rf python/sklearn/linear_model/
$ rm -rf python/sklearn/mixture/
$ rm python/sklearn/isotonic.py
$ rm python/sklearn/_isotonic.cpython-36m-x86_64-linux-gnu.so
This cleaned up around 11 MB.
Also, these are the changes I had to make to the xgboost/compat.py file:

@@ -67,10 +67,6 @@
     from sklearn.base import BaseEstimator
     from sklearn.base import RegressorMixin, ClassifierMixin
     from sklearn.preprocessing import LabelEncoder
-    try:
-        from sklearn.model_selection import KFold, StratifiedKFold
-    except ImportError:
-        from sklearn.cross_validation import KFold, StratifiedKFold

     SKLEARN_INSTALLED = True

@@ -78,8 +74,8 @@
     XGBRegressorBase = RegressorMixin
     XGBClassifierBase = ClassifierMixin

-    XGBKFold = KFold
-    XGBStratifiedKFold = StratifiedKFold
+    XGBKFold = None
+    XGBStratifiedKFold = None
     XGBLabelEncoder = LabelEncoder
 except ImportError:
     SKLEARN_INSTALLED = False
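With large parts of sklearn removed and compat.py patched, it is worth verifying that XGBClassifier can still be imported before going further; a check along the same lines as the earlier one should work:

# confirm XGBClassifier still loads with the pruned sklearn and patched compat.py
docker run --rm -v $(pwd):/foo1 -w /foo1 lambci/lambda:build-python3.6 \
    python -c "import sys; sys.path.insert(0, 'python'); from xgboost import XGBClassifier; print('XGBClassifier OK')"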
At this point, the layer is under the limit. However, there is barely any space left for code, so I tried to free up an extra couple of MBs for our code and its package requirements.
- scipy cleanup
$ rm -rf python/scipy/ndimage/

Change in python/scipy/setup.py:

@@ -23,7 +23,6 @@
     config.add_subpackage('spatial')
     config.add_subpackage('special')
     config.add_subpackage('stats')
-    config.add_subpackage('ndimage')
     config.add_subpackage('_build_utils')
     config.add_subpackage('_lib')
     config.make_config_py()

Change in python/scipy/stats/stats.py:

@@ -177,7 +177,7 @@
 from scipy._lib.six import callable, string_types
 from scipy.spatial.distance import cdist
-from scipy.ndimage import measurements
+#from scipy.ndimage import measurements
 from scipy._lib._version import NumpyVersion
 from scipy._lib._util import _lazywhere, check_random_state, MapWrapper
 import scipy.special as special
- numpy cleanup
$ rm -rf python/numpy/random/_examples/
$ rm -rf python/numpy/distutils/fcompiler/
$ rm -rf python/numpy/f2py/
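Since scipy and numpy were also trimmed, a similar import check (same assumptions as before) confirms nothing essential went missing:

# make sure the trimmed numpy and scipy still import cleanly
docker run --rm -v $(pwd):/foo1 -w /foo1 lambci/lambda:build-python3.6 \
    python -c "import sys; sys.path.insert(0, 'python'); import numpy, scipy.stats; print('numpy/scipy OK')"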
After all these cleanups, 2274588 bytes (262144000 - 259869412), or about 2.17 MB, are available for your code and package deployment.
$ du -m -c -d 1 python/
7    python/sklearn
66   python/scipy
1    python/joblib
40   python/numpy
138  python/xgboost
249  python/
249  total

$ du -b -c -d 1 python/
6164864    python/sklearn
67757024   python/scipy
580588     python/joblib
41463817   python/numpy
143899023  python/xgboost
259869412  python/
259869412  total
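To actually use the trimmed python/ directory, it still has to be zipped and published as a layer. A rough sketch with the AWS CLI could look like this (the bucket and layer names below are placeholders, and a zip of this size typically exceeds the direct-upload limit, so it goes through S3):

# zip the trimmed layer so that it unpacks under /opt/python in the Lambda runtime
zip -r -q xgboost-layer.zip python/

# stage the zip in S3 (bucket name is a placeholder)
aws s3 cp xgboost-layer.zip s3://my-layer-bucket/xgboost-layer.zip

# publish the layer from the S3 object (layer name is a placeholder)
aws lambda publish-layer-version \
    --layer-name xgboost-sklearn-trimmed \
    --content S3Bucket=my-layer-bucket,S3Key=xgboost-layer.zip \
    --compatible-runtimes python3.6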
Though I did have to mess around with the code and delete some things that were not required for our use case, I still think it was worth the effort to get this working on Lambda rather than having to deploy it on some external machine. With this, I can avail myself of the many benefits of serverless computing without having to worry about scalability, server management, and code deployments, to name a few.