XGBClassifier on AWS Lambda

  • Swati Agarwal
  • March 11, 2020

The intent of this blog post is to share how I got the XGBoost layer and our code under the AWS Lambda size limit, so that others can benefit from it. In the process I also want to open it up for feedback, so that I can learn whether things could be done differently. Please feel free to leave comments.

Download the required packages

To start, download XGBoost and scikit-learn along with their dependencies. The steps are listed below. (Note: sklearn is required for XGBClassifier. I have seen some XGBoost layers available that do not have sklearn packaged with them; XGBClassifier doesn’t work with those layers.)

  • Create a requirements.txt file with the following content:
xgboost==0.90
scikit-learn==0.22.1
  • Create a get_layer_packages.sh script to download the packages along with their dependencies:
#!/bin/bash

export PKG_DIR="python"

rm -rf ${PKG_DIR} && mkdir -p ${PKG_DIR}

# Run pip inside the Lambda build image so the native wheels match the Lambda runtime
docker run --rm -v $(pwd):/foo1 -w /foo1 lambci/lambda:build-python3.6 \
    pip install -r requirements.txt -t ${PKG_DIR}

You can replace the content of requirements.txt with any other packages for your custom layers. Also, if the dependencies are already available in another layer and you don’t want to include them, you can pass the --no-deps flag to pip install, as sketched below. Please note that this requires Docker to be installed on your system.
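
For illustration, here is a minimal sketch of that variant. It assumes every transitive dependency really is provided elsewhere (by another layer or by the runtime), which you would need to verify for your own stack:

#!/bin/bash
# Hypothetical variant of get_layer_packages.sh: install only the packages
# listed in requirements.txt, skipping all of their transitive dependencies.
export PKG_DIR="python"
docker run --rm -v $(pwd):/foo1 -w /foo1 lambci/lambda:build-python3.6 \
    pip install --no-deps -r requirements.txt -t ${PKG_DIR}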

  • Total package size at this point is approximately 347 MB:
$ du -c -m -d 1 python

1 python/joblib-0.14.1.dist-info
1 python/scikit_learn-0.22.1.dist-info
1 python/bin
148 python/xgboost
1 python/xgboost-0.90.dist-info
79 python/numpy
29 python/sklearn
2 python/joblib
1 python/scipy-1.4.1.dist-info
1 python/numpy-1.18.1.dist-info
90 python/scipy
347 python
347 total

Clean up by removing unnecessary folders

  • Remove the dist-info, __pycache__, test, and tests folders, and strip the .so files:
$ rm -rf python/scipy-1.4.1.dist-info/
$ rm -rf python/xgboost-0.90.dist-info/
$ rm -rf python/joblib-0.14.1.dist-info/
$ rm -rf python/bin/
$ rm -rf python/scikit_learn-0.22.1.dist-info/
$ rm -rf python/numpy-1.18.1.dist-info/
$ find python/ -name __pycache__ | xargs rm -rf
$ find python/ -name "*.so" | xargs strip
$ find python/ -name test | xargs rm -rf
$ find python/ -name tests | xargs rm -rf

Size after this step is around 265 MB:

$ du -c -m -d 1 python

18      python/sklearn
66      python/scipy
1       python/joblib
41      python/numpy
141     python/xgboost
265     python/
265     total

Typically this step would be enough for most layers. But in this case, as you can see, we are still roughly 15 MB over the 250 MB (unzipped) Lambda limit before we can deploy it.

Selective cleanup depending on the use-case

Now I was in unknown territory and still had to get rid of a little more than 15 MB. This is where the insanity started: first by randomly deleting things that looked unnecessary, only to find that the modules wouldn’t even load, and then, as a last resort, by thinking logically about what I actually needed. I think all of us go through that phase of going crazy over something, and when we are about to lose hope, a calmer mind takes over and works better because there is nothing left to lose. I am glad I gave it my last shot in that state of mind.

Disclaimer: This is specific to the use-case that we needed. Use your discretion while deleting or modifying anything mentioned from here on.

  • XGBoost cleanup
$ ls python/xgboost/

build-python.sh  callback.py  compat.py  core.py  dmlc-core  include  __init__.py  lib  libpath.py
make  plotting.py  rabit  rabit.py  sklearn.py  src  training.py  VERSION

Entries such as dmlc-core, include, lib, rabit, and src are directories. You’ll have to read and explore what you do and don’t need for your use-case. For running XGBClassifier on AWS Lambda for predictions, we don’t need the allreduce and broadcast interface used by distributed XGBoost, so I decided to remove the rabit and dmlc-core directories. I also deleted the src and include directories, which weren’t required either.

$ rm -rf python/xgboost/rabit
$ rm -rf python/xgboost/dmlc-core/
$ rm -rf python/xgboost/src/
$ rm -rf python/xgboost/include/

This cleaned up around 3 MB.
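
Before moving on, it is worth confirming that the compiled library still loads with those directories gone. A minimal sanity check, assuming the downloaded packages are still in ./python and the lambci build image is available locally:

# Sanity check: import xgboost from the trimmed layer contents.
docker run --rm -v $(pwd):/foo1 -w /foo1 -e PYTHONPATH=/foo1/python \
    lambci/lambda:build-python3.6 \
    python3.6 -c "import xgboost; print(xgboost.__version__)"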

  • sklearn cleanup

Next, I repeated the process for sklearn. I chose sklearn since it was not a true dependency of xgboost, and only a part of its functionality was required by XGBClassifier.

$ rm -rf python/sklearn/gaussian_process/
$ rm -rf python/sklearn/model_selection/
$ rm -rf python/sklearn/tree/
$ rm -rf python/sklearn/impute/
$ rm -rf python/sklearn/feature_extraction/
$ rm -rf python/sklearn/cluster/
$ rm -rf python/sklearn/cross_decomposition/
$ rm -rf python/sklearn/neighbors/
$ rm -rf python/sklearn/experimental/
$ rm -rf python/sklearn/neural_network/
$ rm -rf python/sklearn/covariance/
$ rm -rf python/sklearn/datasets/
$ rm -rf python/sklearn/decomposition/
$ rm -rf python/sklearn/inspection/
$ rm -rf python/sklearn/metrics/
$ rm -rf python/sklearn/feature_selection/
$ rm -rf python/sklearn/svm/
$ rm -rf python/sklearn/manifold/
$ rm -rf python/sklearn/linear_model/
$ rm -rf python/sklearn/mixture/
$ rm python/sklearn/isotonic.py
$ rm python/sklearn/_isotonic.cpython-36m-x86_64-linux-gnu.so

This cleaned up around 11 MB.

Also, these are the changes I had to make to the xgboost/compat.py file:

@@ -67,10 +67,6 @@
     from sklearn.base import BaseEstimator
     from sklearn.base import RegressorMixin, ClassifierMixin
     from sklearn.preprocessing import LabelEncoder
-    try:
-        from sklearn.model_selection import KFold, StratifiedKFold
-    except ImportError:
-        from sklearn.cross_validation import KFold, StratifiedKFold
     SKLEARN_INSTALLED = True
@@ -78,8 +74,8 @@
     XGBRegressorBase = RegressorMixin
     XGBClassifierBase = ClassifierMixin
-    XGBKFold = KFold
-    XGBStratifiedKFold = StratifiedKFold
+    XGBKFold = None
+    XGBStratifiedKFold = None
     XGBLabelEncoder = LabelEncoder
 except ImportError:
     SKLEARN_INSTALLED = False

At this point, the layer is under the limit. However, there is barely any space left for code, so I tried to free up an extra couple of MB for the code and its package requirements.
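
Since compat.py was modified by hand, a quick import check inside the build image helps confirm that the trimmed sklearn and the patched file still work together. A minimal sketch, assuming the layer contents are in ./python:

# Sanity check: XGBClassifier should import with the trimmed sklearn in place.
docker run --rm -v $(pwd):/foo1 -w /foo1 -e PYTHONPATH=/foo1/python \
    lambci/lambda:build-python3.6 \
    python3.6 -c "from xgboost import XGBClassifier; print('XGBClassifier OK')"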

  • scipy cleanup
$ rm -rf python/scipy/ndimage/

Change in python/scipy/setup.py:
@@ -23,7 +23,6 @@
     config.add_subpackage('spatial')
     config.add_subpackage('special')
     config.add_subpackage('stats')
-    config.add_subpackage('ndimage')
     config.add_subpackage('_build_utils')
     config.add_subpackage('_lib')
     config.make_config_py()

Change in python/scipy/stats/stats.py:
@@ -177,7 +177,7 @@
 from scipy._lib.six import callable, string_types
 from scipy.spatial.distance import cdist
-from scipy.ndimage import measurements
+#from scipy.ndimage import measurements
 from scipy._lib._version import NumpyVersion
 from scipy._lib._util import _lazywhere, check_random_state, MapWrapper
 import scipy.special as special

  • numpy cleanup
$ rm -rf python/numpy/random/_examples/
$ rm -rf python/numpy/distutils/fcompiler/
$ rm -rf python/numpy/f2py/

After all these cleanups, 2274588 bytes (262144000 - 259869412, i.e. the 250 MB limit minus the layer size in bytes), or about 2.17 MB, is available for your code and package deployment.

$ du -m -c -d 1 python/

7       python/sklearn
66      python/scipy
1       python/joblib
40      python/numpy
138     python/xgboost
249     python/
249     total

$ du -b -c -d 1 python/

6164864         python/sklearn
67757024        python/scipy
580588          python/joblib
41463817        python/numpy
143899023       python/xgboost
259869412       python/
259869412       total
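
To turn the trimmed directory into a Lambda layer, zip it with python/ at the root of the archive and publish it. Since the zipped artifact will likely exceed the 50 MB limit for direct uploads, the sketch below goes through S3; the bucket name and layer name are placeholders to adapt to your own setup:

# One last import check, then package the layer with python/ at the zip root.
docker run --rm -v $(pwd):/foo1 -w /foo1 -e PYTHONPATH=/foo1/python \
    lambci/lambda:build-python3.6 \
    python3.6 -c "from xgboost import XGBClassifier; import scipy.stats, numpy"
zip -r9 xgboost-sklearn-layer.zip python

# Publish via S3 (placeholder bucket and layer names).
aws s3 cp xgboost-sklearn-layer.zip s3://my-layer-bucket/xgboost-sklearn-layer.zip
aws lambda publish-layer-version \
    --layer-name xgboost-sklearn \
    --content S3Bucket=my-layer-bucket,S3Key=xgboost-sklearn-layer.zip \
    --compatible-runtimes python3.6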

Though I did have to mess around with the code and delete some things that were not required for our use-case, I still think it was worth the effort to get this working on Lambda rather than having to deploy it on some external machine. With this, I can enjoy the many benefits of serverless computing without having to worry about scalability, server management, or code deployments, to name a few.
