CIRI Application - Kubernetes (GKE) + TensorFlow

This project was produced as part of the final project for Harvard University’s AC295 Fall 2021 course.

waste-picture

Context

Waste management is one of the most challenging problems to solve in the 21st century.

The average American produces ~1,700 pounds of garbage per year, roughly three times the global average, according to a 2019 report by the research firm Verisk Maplecroft.

Poor waste management is linked to significant environmental risks, such as climate change and pollution, which will likely have long-term effects on our environment that impact future generations.

The introduction of a “single-stream” approach to recycling, where materials are not pre-sorted into common recycling classes like paper, aluminum, metal, and glass, has greatly increased participation in recycling programs, but it has also led to substantial increases in contamination.

Recycling contamination occurs when non-recyclable materials are mixed with recyclable materials. Current estimates are that 1 in 4 items placed into a recycling bin are inappropriate for recycling. Contamination can lead to:

  • Increased recycling costs, as more effort is required for waste sorting, which can make local recycling programs economically non-viable.
  • Reduced overall quantity of recycled material, as contamination can render otherwise recyclable items unsuitable for recycling.
  • Potential damage to recycling equipment and danger to recycling plant employees.

In many cases consumers are unaware of the negative impacts of recycling contamination and are well intentioned when adding items to recycling bins. Additionally, the introduction of varied plastics and packaging has made it increasingly difficult to identify recyclable vs. non-recyclable items.

Our goal is to develop a prototype application that allows users to easily classify “recyclable” vs. “non-recyclable” materials via a Deep Learning model. The application will be composed of a multi-tier architecture hosted on Kubernetes (GKE). Application deployment will be via CI/CD integration into the project GitHub repositories.

Data

recyclable-vs-non

The dataset used for training included 2,467 images from the TrashNet challenge, segmented into six human-annotated categories: cardboard (393), glass (491), metal (400), paper (584), plastic (472), and trash (127).

The training dataset was expanded with additional curated images from the Waste Classification v2 dataset, which included images labeled as Recyclable, Non-Recyclable, or Organic. As the Waste Classification dataset was primarily gathered via web crawling, we curated a subset of the images and placed them within the (existing) trash category or in the (newly created) organic category.

Given the limited availability of annotated recyclables datasets, we will also look to provide ongoing enhancement of the training data by incorporating a “user upload” capability into the application that allows end-users to directly annotate and submit images.

| Dataset | Labels | Quality |
| --- | --- | --- |
| TrashNet | cardboard, glass, metal, paper, plastic, trash | Good |
| Waste Classification v2 | recyclable, non-recyclable, organic | Average/Poor |
| User Provided | cardboard, glass, metal, paper, plastic, trash, organic & other | Unknown |

Model Selection

Several different transfer models were assessed for both size and performance. The best-performing models are listed below:

experiment results

Additionally, several hyper-parameters were tuned throughout experimentation, including:

| Parameter | Optimized Value |
| --- | --- |
| Training/Validation Split | 80/20 |
| Decay Rate | 0.5 |
| Learning Rate | 0.01 |
| Max Epochs (w/ Early Stopping) | 15 |
| Kernel Weight | 0.0001 |
| Drop Out Weight | 0.2 |

The model build pipeline utilizes both EarlyStopping and a LearningRateScheduler to achieve maximum accuracy without overfitting:

from tensorflow import keras
from tensorflow.keras.callbacks import EarlyStopping

learning_rate, decay_rate = 0.01, 0.5  # optimized values from the table above

optimizer = keras.optimizers.SGD(learning_rate=learning_rate)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
es = EarlyStopping(monitor="val_accuracy", verbose=1, patience=3)
# Inverse time decay: lr(epoch) = lr_0 / (1 + decay_rate * epoch)
lr = keras.callbacks.LearningRateScheduler(
    lambda epoch: learning_rate / (1 + decay_rate * epoch)
)
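
These callbacks are then passed to training. A minimal usage sketch, assuming train_ds and val_ds are prepared tf.data datasets and model is the network defined in the next section:

model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
training_results = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,  # max epochs from the hyper-parameter table
    callbacks=[es, lr],
)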

Top layers added to the downloaded transfer model were defined as follows (a Keras sketch appears after the list):

model layers

  1. Transfer Layer Base (non-trainable)
  2. Dense Layer (124, relu)
  3. Dropout Layer
  4. Dense Layer (64, relu)
  5. Dropout Layer
  6. Dense Layer (# classes, none)
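
A minimal Keras sketch of this stack follows; the 240x240 input resolution of the b1 variant, the class count, and the regularizer placement are assumptions, while the kernel weight (0.0001) and dropout (0.2) values come from the hyper-parameter table above:

import tensorflow_hub as hub
from tensorflow import keras

num_classes = 7  # illustrative: the six TrashNet classes plus organic

model = keras.Sequential([
    keras.layers.InputLayer(input_shape=(240, 240, 3)),
    hub.KerasLayer(
        "https://tfhub.dev/google/imagenet/efficientnet_v2_imagenet21k_ft1k_b1/classification/2",
        trainable=False,  # 1. transfer layer base (non-trainable)
    ),
    keras.layers.Dense(124, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.0001)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(64, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.0001)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(num_classes),  # 6. logits output, no activation
])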

The current production model is https://tfhub.dev/google/imagenet/efficientnet_v2_imagenet21k_ft1k_b1/classification/2, selected based on its performance and size characteristics. In particular, the smaller size of the EfficientNet models allows for a more cost-effective architecture where model inference can be hosted on small resources due to lower memory requirements. This not only saves platform cost but also improves inference speed for a better user experience.

Model Pipeline

An end-to-end training pipeline was implemented on the Luigi framework with the following DAG structure:

pipeline
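
As a rough sketch of how such a DAG can be wired in Luigi (task names and targets here are hypothetical; the actual tasks are those shown in the figure above):

import luigi

class BuildTFRecords(luigi.Task):
    """Hypothetical task: serialize raw Cloud Storage images to TFRecords."""

    def output(self):
        return luigi.LocalTarget("data/images.tfrecord")

    def run(self):
        ...  # download raw images and write TFRecord files

class TrainModel(luigi.Task):
    """Hypothetical task: fit the model and log results to MLFlow."""

    def requires(self):
        return BuildTFRecords()  # Luigi derives the DAG from requires()

    def output(self):
        return luigi.LocalTarget("models/run_complete.flag")

    def run(self):
        ...  # build, fit, and register the model

if __name__ == "__main__":
    luigi.build([TrainModel()], local_scheduler=True)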

A TensorFlow-based pipeline is used to process the TFRecords, augment and normalize the images, and build and fit the model. Model build results are stored in MLFlow via the experiment tracking API and model registry API in order to allow a centralized approach to model lifecycle management:

mlflow.log_param("model_origin", "efficientnet_v2_imagenet21k_ft1k_b1")
mlflow.log_param("decay_rate", decay_rate)
mlflow.log_param("learning_rate", learning_rate)
mlflow.log_param("num_classes", num_classes)

history = training_results.history
mlflow.log_metric("accuracy", history["accuracy"][-1])
mlflow.log_metric("val_loss", history["val_loss"][-1])
mlflow.log_param("epochs", len(history["accuracy"]))

# Log label mapping for retrieval with model:
mlflow.log_artifact(label_mapping)

mlflow.keras.log_model(
    model, "model", registered_model_name="ciri_trashnet_model"
)
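
Upstream of this logging, the TFRecord processing and augmentation stage might look like the following sketch; the feature keys, the specific augmentations, and the 240x240 input size are assumptions:

import tensorflow as tf

def parse_example(serialized):
    # Assumed TFRecord schema: a JPEG-encoded image and an integer label.
    features = tf.io.parse_single_example(
        serialized,
        {
            "image": tf.io.FixedLenFeature([], tf.string),
            "label": tf.io.FixedLenFeature([], tf.int64),
        },
    )
    image = tf.io.decode_jpeg(features["image"], channels=3)
    image = tf.image.resize(image, [240, 240]) / 255.0  # normalize to [0, 1]
    return image, features["label"]

def augment(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    return image, label

train_ds = (
    tf.data.TFRecordDataset("data/images.tfrecord")
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(1024)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)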

The Model Pipeline is deployed as a Kubernetes CronJob scheduled to run weekly. Model runs are logged as successive model versions, which application admins can optionally promote to “production” as desired:

cronjob

Deployment Architecture

ciri-architecture

The CIRI Application is composed of multiple Services deployed on Kubernetes. By deploying on Kubernetes, the CIRI application has significant resiliency and scalability built in, with automated monitoring of Deployment health and the ability to auto-scale Deployment replicas as needed.

| Service | Type | Description |
| --- | --- | --- |
| mlflow | LoadBalancer | Provides intra-cluster and external access to the MLFlow experiment tracking and model repository services. |
| api | NodePort | Provides backend APIs that enable IO operations on the training data-set as well as execution of predictions (see the inference sketch below the tables). |
| ui | NodePort | Provides an HTML-based front-end for user interaction with the application. |

| Deployment | Description |
| --- | --- |
| mlflow | Deployment of the ciri_mlflow:latest Docker container. |
| api | Deployment of the ciri_apis:latest Docker container. |
| ui | Deployment of the ciri_frontend:latest Docker container. |
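
A minimal sketch of how the api component might serve predictions by pulling the current production model from the MLFlow registry (the helper function is illustrative):

import mlflow.keras
import numpy as np

# Load whichever registered version is currently in the "Production" stage.
model = mlflow.keras.load_model("models:/ciri_trashnet_model/Production")

def predict_classes(image_batch: np.ndarray) -> np.ndarray:
    """Return the predicted class index for each image in the batch."""
    logits = model.predict(image_batch)
    return np.argmax(logits, axis=1)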

Additionally, a CronJob is deployed to Kubernetes to run the model-pipeline weekly. Alternatively, a model-pipeline “run once” Pod can be launched if an ad hoc training run is desired:

| Pod | Description |
| --- | --- |
| model-pipeline | Deployment of the model training pipeline via the ciri_model_pipeline:latest Docker container. The Pod runs once, executing download of raw training images from Cloud Storage, processing/transformation of the image files, training of the model, and registration of the model in the MLFlow model registry. |

Kubernetes Infrastructure

Ingress to the Kubernetes-hosted application is provided via the Kubernetes ingress-nginx controller. Controller mappings are provided via the following definitions:

| Rule (Path) | Service |
| --- | --- |
| /* | ui:8080 |
| /api/* | api:8080 |

Access to mlflow services is provided via external_ip:5000 as a hosted LoadBalancer component.

The application is currently designed with two node pools, reflecting the differing compute requirements of “always-on” application hosting vs. more intensive training:

| Node Pool | Description |
| --- | --- |
| default-pool | Always-on pool that runs the application services, including mlflow, ui and api, on default e2-medium instances. |
| training-pool | Auto-scaling pool of higher-memory machines (e2-highmem-4) to execute the training pipeline. |

The application also leverages two cloud storage resources:

| Cloud Storage | Name | Description |
| --- | --- | --- |
| Image Store | canirecycleit-data | Provides storage of raw images used as input for training, where “folders” reflect the classification. |
| Artifact Store | canirecycleit-artifactstore | Stores serialized metadata and model files from execution of the model-pipeline. |
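
As a sketch of the upload path (assuming the google-cloud-storage Python client and the folder-per-class layout described above), a user-submitted image could be persisted to the Image Store as follows:

from google.cloud import storage

def store_training_image(image_bytes: bytes, label: str, filename: str) -> None:
    """Persist a user-uploaded image under its classification "folder"."""
    bucket = storage.Client().bucket("canirecycleit-data")
    blob = bucket.blob(f"{label}/{filename}")  # e.g. "plastic/bottle_001.jpg"
    blob.upload_from_string(image_bytes, content_type="image/jpeg")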

Initial Kubernetes (GKE) cluster provisioning can be executed via Ansible or shell scripts to enable an automated approach to infrastructure provisioning.

Application Deployment & Updating

Builds of the individual deployment containers, as well as deployment of the Kubernetes application, are automatically executed via GitHub Actions. The K8s deployment Action can be found here.

For individual Dockerized components (e.g. ciri_apis:latest), the Docker container is built on every merge of new code to the master branch and made available via the GitHub Container Registry as a public image (no secrets or confidential information is stored within the image).

For deployment of the Kubernetes application (ciri_app), a deployment request is generated whenever changes are merged to the master branch.

deployment

Requests must be approved by a repository team member who is authorized to approve the production environment. When approved, the Deployment Action will either deploy all components or patch existing components, depending on whether the components already exist within the cluster.

deployment

Deployments are set to pull new images, so the latest component Docker containers will be used on Deploy or Patch. Service Account (SA) secrets for deployment operations are stored within GitHub Environment Secrets for security purposes.

Application Usage & Screenshots

Application Home Page

A simple MVP homepage provides access to the core functionality: taking a picture or uploading an image for classification. Additional navigation links are available to Upload a new image (particularly helpful if an image has been miscategorized) and to learn more About the application, including links to the application GitHub repository.

homepage

Example of an Organic item that is not recyclable:

organic-no

Example of a Cardboard item that is recyclable:

cardboard-yes

Application Upload Page

Given that annotated recyclables image data is relatively limited and of varied quality, we determined it was necessary to provide an ongoing way for users to enrich the annotated data-set, so a basic upload form is provided to add new images to the CIRI image data-store. The form provides a drop-down to select from all defined classification categories as well as a catch-all “other” category.

upload

success

Models are retrained weekly to take advantage of any newly uploaded images that have been added to the data-store.

Model Management (MLFlow)

Weekly (or manually triggered) model builds are stored using MLFlow, which provides both experiment tracking for model builds and a model repository for model serialization. This allows continual training and capture of model performance metrics, along with an operationally controllable approach for “promoting” experimental models into production. CIRI administrators can log into the MLFlow interface, view models that are available for use, and then move a model through lifecycle stages (staging, production, archival) as part of the application management process, all without having to redeploy any application code.
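
The same stage promotion can also be scripted against the MLFlow registry API; a minimal sketch (the version number is illustrative):

from mlflow.tracking import MlflowClient

client = MlflowClient()  # assumes MLFLOW_TRACKING_URI points at the mlflow service

# Promote a specific registered version of the CIRI model to Production.
client.transition_model_version_stage(
    name="ciri_trashnet_model",
    version=3,
    stage="Production",
)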

History of model run experiments:

mlflow

Current model registry versions in model lifecycle stages:

model registry