September 22, 2022

Building a Good Data Set for webAI™

Written by David Stout

We know that an AI model is only as good as the data used to train it. The capabilities of webAI™ make this particularly true: a large data set will not help you achieve high-performance results if it is mixed with poor-quality or poorly labeled data points. Building a good data set that can effectively train a model means collecting high-quality images and labeling them accurately; quantity should not be the focus.

Key Considerations

Use Case

Understanding your use case is key to determining what modality of data you should use and how many representations are needed to build & train your model accurately.

A Classifier learns everything it sees in the images, including the background, and identifies the vector (image) according to the label you provide.

When training a Classifier:

  • Provide a variety of images (different backgrounds, environments, etc.) where the most consistent item is the one you want the model to learn.
  • Crop images to focus on the part of the image you want the model to identify.
  • You will need a minimum of 2 “classes” (2 distinct labeled items) as well as a 3rd “nothing class” (random images) to train the model when utilizing webAI™.
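The class requirement above can be checked before training. The following is a minimal sketch, assuming each image carries a single text label and that the "nothing class" is labeled "nothing"; both the label name and the function are hypothetical and are not part of webAI™.

```python
from collections import Counter

NOTHING_LABEL = "nothing"  # hypothetical name for the "nothing class"


def check_classifier_labels(labels, nothing_label=NOTHING_LABEL):
    """Return a list of problems found in a list of per-image labels."""
    counts = Counter(labels)
    problems = []
    # Every label other than the "nothing class" counts as a real class.
    real_classes = [name for name in counts if name != nothing_label]
    if len(real_classes) < 2:
        problems.append(
            f"need at least 2 labeled classes, found {len(real_classes)}"
        )
    if counts[nothing_label] == 0:
        problems.append(f"missing '{nothing_label}' class of random images")
    return problems


print(check_classifier_labels(["cheerios", "granola", "nothing"]))  # []
```

An empty list means the label set meets the minimum; it says nothing about whether the images themselves are varied enough.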

An Object Detector learns based on the area of interest that you identify with a bounding box.

When training an Object Detector:

  • Provide various images (different backgrounds, environments, etc.); the more variable the scenes, the better the model learns what’s within the bounding box.
  • Plan for a higher volume of images than a Classifier needs (the quantity can vary based on use case).
  • You will not need a “nothing class” of images.
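Because an Object Detector learns from the bounding box itself, a malformed box is a labeling error worth catching early. Here is a small sketch, assuming boxes are stored as pixel corners (x_min, y_min, x_max, y_max); that layout and the function name are assumptions for illustration, not a webAI™ format.

```python
def box_problems(box, image_width, image_height):
    """Return a list of problems with one corner-style bounding box."""
    x_min, y_min, x_max, y_max = box
    problems = []
    # A box whose corners coincide or are swapped encloses nothing.
    if x_min >= x_max or y_min >= y_max:
        problems.append("box has no area")
    # A box that leaves the frame usually points to a labeling slip.
    if x_min < 0 or y_min < 0 or x_max > image_width or y_max > image_height:
        problems.append("box extends outside the image")
    return problems


print(box_problems((10, 10, 150, 50), image_width=100, image_height=100))
# ['box extends outside the image']
```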

When collecting & labeling data, it is also helpful to think about the context of the use case; think of realistic scenarios when capturing/curating your data.

Example: in retail, you may not want a person who sees a portion of a yellow box to decide that the box contains “Cheerios” until they’ve seen more of the box (e.g. the logo). Likewise, your model shouldn’t identify a portion of a yellow box as “Cheerios” either.

Data Usage in Training & Testing a Model

A complete data set will contain enough images to train and test your model effectively; webAI™ will then split your data into a training and validation set, typically with an 80/20 ratio (note this is customizable).

  • ~80% of the data will be used for training
  • ~20% of the data will be used for validation
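To make the ratio concrete, here is a minimal sketch of an 80/20 split. It illustrates the idea only; it is not how webAI™ performs the split internally, and the function name, its parameters, and the file names are assumptions.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle items and split them into (training, validation) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


images = [f"img_{i:03d}.jpg" for i in range(100)]
train, val = split_dataset(images)
print(len(train), len(val))  # 80 20
```

Shuffling before cutting matters: if images were captured in sequence, an unshuffled split could leave a whole environment out of the training set.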

Data Labeling

Remember that issues often originate in the data set itself, which makes it very difficult to assess a model. With that in mind, consider the following best practices for accurate data labeling:

  • Double- or triple-check the labeling work (one person will always make mistakes).
  • Ensure your data reflects varying environments.
  • Use a tool to lighten the workload.
  • Make sure the tool exports to the data format and file type that your model will train with, or make sure you can convert to them (data format, e.g., COCO; file type, e.g., XML or JSON).
  • Make sure that what you’re labeling makes sense to you (i.e., you have enough information to make a labeling determination).
  • If a human couldn’t make a reasonable determination with that much information, an AI/ML model may not be able to make an accurate determination either.
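As an example of the format conversion mentioned above: many labeling tools export boxes as corner coordinates (Pascal VOC XML does this), while COCO JSON stores each box as [x, y, width, height]. The sketch below converts between the two; the function names are illustrative, and whether your pipeline needs this conversion depends on the tool and format you use.

```python
def corners_to_coco(x_min, y_min, x_max, y_max):
    """Convert corner coordinates to a COCO-style [x, y, width, height] box."""
    if x_max <= x_min or y_max <= y_min:
        raise ValueError("box must have positive width and height")
    return [x_min, y_min, x_max - x_min, y_max - y_min]


def coco_to_corners(box):
    """Convert a COCO-style [x, y, width, height] box back to corners."""
    x, y, width, height = box
    return (x, y, x + width, y + height)


print(corners_to_coco(10, 20, 110, 70))  # [10, 20, 100, 50]
```

Converting a few boxes in both directions and comparing them with the originals is a cheap way to confirm that an export did not silently change the labels.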

High-Quality Data

Building a model for a use case allows you to focus on a specific domain. High-quality images reflect the environment the model will operate in, including sensor/camera quality, pixel density, lighting conditions, etc. Good data will also tolerate some augmentations (rotation, darkening/brightening, etc.) that help a network generalize better.
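The augmentations mentioned above are simple transformations of the pixel values. The sketch below applies two of them to a tiny grayscale image stored as a list of rows; this representation and the function names are assumptions chosen to keep the example dependency-free, and in practice an imaging library would do this work.

```python
def adjust_brightness(pixels, factor):
    """Scale every pixel by factor, clamping results to the 0-255 range."""
    return [
        [max(0, min(255, round(value * factor))) for value in row]
        for row in pixels
    ]


def rotate_90_clockwise(pixels):
    """Rotate the pixel grid a quarter turn clockwise."""
    # Reversing the rows and then transposing gives a clockwise rotation.
    return [list(row) for row in zip(*pixels[::-1])]


image = [[10, 200],
         [50, 120]]
print(adjust_brightness(image, 1.5))  # [[15, 255], [75, 180]]
print(rotate_90_clockwise(image))     # [[50, 10], [120, 200]]
```

Note how the value 200 clips to 255 when brightened: an image that is already near the limits of its range leaves less room for this kind of augmentation.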

Remember that neural networks resize images during training, so your model is working with a scaled version of each image rather than the original pixels. As a result, higher resolution doesn't necessarily mean higher quality.
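A quick calculation shows why. The sketch below estimates how many pixels a feature occupies after resizing; the 224-pixel input width is a hypothetical value used only for illustration, not a documented webAI™ setting.

```python
def size_after_resize(object_px, image_px, input_px):
    """Approximate size in pixels of a feature after the image is resized."""
    return object_px * input_px / image_px


# A 60-pixel-wide logo in a 4000-pixel-wide photo, resized to 224 pixels wide.
print(size_after_resize(60, 4000, 224))  # 3.36

# The same logo in an image cropped to 400 pixels wide.
print(size_after_resize(60, 400, 224))   # 33.6
```

What the model sees depends on how much of the frame the subject fills, which is why a cropped, lower-resolution image can carry more usable detail than a large photo in which the subject is small.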