Building a Dataset: Image Classifier

Image Classifier

Once you have determined that an Image Classifier is the right option for your use case, the next step is to collect and prepare the data to train it. The steps below use the example of classifying art supplies from the Use Cases & AI Architectures document.

Define the Classes

Start by identifying the classes you want your model to predict.

The first goal when training your model is to prove it out with a smaller set of data, then get more granular as you iterate on the model with additional training runs. Each training run will inform what additional data you need.

  • Broad classes, such as Item Use - Drawing supply, Painting supply, Sculpting supply ...

  • Granular classes, such as Item Type - Paint Brush, Pencil, Paint, Canvas, Marker, Clay ...

  • More Granular classes, such as Paint Brush Type - Flat, Round, Filbert, Fan, Mottler, Oval ...

Collect the Data

Now you will need to collect your data. Before you start capturing images, make sure the dataset you are curating represents real-world scenarios for your use case. This is extremely important for model performance. If you already have a dataset of images, use this time to review it and confirm that it represents real-world scenarios and is ready to use.

When collecting data, you will need to account for a number of variables so that your model has the best data to train on:

  • Lighting settings

  • Image resolution quality

  • Camera angles

  • Backgrounds

  • Product models

  • Product add-on options

Audit the Data

When auditing your dataset, you are checking for formatting, size, and duplicates. This step helps ensure your data gives your AI model the best chance of performing well.

  • Start by removing duplicate images and images that are closely related.

    • Duplicate images are images that are identical, or images whose features are too similar to distinguish from one another.

    • For example, if you have two paint brushes that have the same brush tip and same color body, but are two different brands, then remove one of those images from the dataset.

  • Check the format of your images and make sure they are all consistent. If you have a dataset of JPEGs but find a couple of PNGs, you will need to convert those PNGs or remove them from the dataset.

  • Ensure your images are roughly the same size in height and width.
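
Parts of this audit can be scripted. The sketch below is one possible approach, not a required tool: it flags byte-identical files and files whose extension differs from the most common one. The function names are hypothetical. It only catches exact duplicates, so closely related images (such as the two paint brushes above) still need a manual review, and checking height and width would need an image library, which is not shown here.

```python
import hashlib
from collections import Counter, defaultdict
from pathlib import Path


def find_exact_duplicates(paths):
    """Group files whose bytes are identical (same SHA-256 digest)."""
    by_digest = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        by_digest[digest].append(Path(path))
    # Keep only digests shared by more than one file.
    return [group for group in by_digest.values() if len(group) > 1]


def find_format_outliers(paths):
    """Return files whose extension differs from the most common one.

    Assumption: the extension reflects the real format. Note that
    ".jpg" and ".jpeg" are treated as different extensions here.
    """
    paths = [Path(p) for p in paths]
    if not paths:
        return []
    counts = Counter(p.suffix.lower() for p in paths)
    majority, _ = counts.most_common(1)[0]
    return [p for p in paths if p.suffix.lower() != majority]
```

For each group returned by find_exact_duplicates, keep one file and remove the rest. Files returned by find_format_outliers are candidates to convert or remove.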

Structure the Dataset

Your classes have been defined, images collected, and dataset reviewed. Now you need to structure your dataset to train the Image Classifier.

  • Divide and distribute your data into 3 sets:

    • 70% Training images

    • 15% Validation images

    • 15% Testing images

  • You will need to ensure a balanced distribution of classes within each of the above sets. This is important for optimum model performance and to prevent model bias. For example, using the Item Type classes:

    • 20 images of paint brushes

    • 20 images of pencils

    • 20 images of canvases

    • 20 images of paint ...
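
One way to get both the 70/15/15 split and a balanced class distribution is to split each class separately. The following is a minimal sketch under that assumption. The function name split_by_class and the input shape (a dictionary mapping each class label to its list of image paths) are hypothetical choices for illustration.

```python
import random


def split_by_class(images_by_class, train=0.70, val=0.15, seed=0):
    """Split each class separately so every set keeps the same class mix."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    splits = {"training": {}, "validation": {}, "test": {}}
    for label, images in images_by_class.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        n_train = round(len(shuffled) * train)
        n_val = round(len(shuffled) * val)
        splits["training"][label] = shuffled[:n_train]
        splits["validation"][label] = shuffled[n_train:n_train + n_val]
        # Whatever remains (about 15%) becomes the test set.
        splits["test"][label] = shuffled[n_train + n_val:]
    return splits
```

With 20 images per class, this yields 14 training, 3 validation, and 3 test images for each class.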

Organize the Dataset

The last step in building a great Image Classifier dataset is organizing it.

To organize your dataset, place the images in folders that correspond to the structure of your dataset and to the classes you want to train on.

  • Training images set - Name the main folder "training" so you can identify it as your training dataset.

    • Inside your training folder, your training images should be organized in folders associated with their class label. For example, all Pencil images will be in a folder labeled "pencils".

  • Validation images set - Follow the same structure as your training images folder, except your main folder will be named "validation" to identify this as the validation set.

  • Test images set - This can be a general folder of images without any structure, as this will be used to test the model. Just name the main folder "test" so you can identify these images as the test set.
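
The folder layout described above can also be created with a short script. This is a sketch, not a required tool. It assumes a hypothetical splits dictionary of the form {"training": {"pencils": [paths...]}, "validation": {...}, "test": {...}}, and the function name organize_dataset is illustrative.

```python
import shutil
from pathlib import Path


def organize_dataset(splits, output_dir):
    """Copy images into training/<class>/, validation/<class>/, and a flat test/."""
    output_dir = Path(output_dir)
    for set_name, images_by_class in splits.items():
        for label, images in images_by_class.items():
            # Only training and validation get per-class subfolders.
            if set_name == "test":
                target = output_dir / "test"
            else:
                target = output_dir / set_name / label
            target.mkdir(parents=True, exist_ok=True)
            for image in images:
                # Assumption: file names are unique, otherwise files in the
                # flat test folder would overwrite each other.
                shutil.copy2(image, target / Path(image).name)
```

Originals are copied rather than moved, so the source images stay untouched if the layout needs to be rebuilt.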