Computer vision is becoming essential to successful digital transformation efforts, with many compelling uses and many accessible options for building it properly.
In this blog post, you will learn the basics of computer vision technology and how it fits into the world of artificial intelligence (AI), machine learning (ML) and deep learning. You will also learn how computer vision works, its inputs, outputs, and common technologies. We will explain these concepts from the top down, starting with AI and going into more complex technologies along the way.
Artificial Intelligence - machines “thinking” like humans
Any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals is exhibiting AI. In short, AI is a machine attempting to mimic a human. A car that stops when commanded to do so by a human operator applying pressure to the brake pedal does not exhibit artificial intelligence. A vehicle that senses the presence of an obstacle ahead and applies the brakes exhibits some artificial intelligence because it responded in a human-like way.
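To make the braking example concrete, here is a minimal sketch in Python. The function name, the one-second reaction time, and the deceleration value are illustrative assumptions, not figures from any real vehicle system.

```python
def should_brake(obstacle_distance_m, speed_mps,
                 reaction_time_s=1.0, deceleration_mps2=6.0):
    """Decide whether to brake, based on what the vehicle senses.

    All default values are assumed for illustration only.
    """
    # Distance covered before braking starts, plus distance needed to stop.
    stopping_distance_m = (
        speed_mps * reaction_time_s
        + speed_mps ** 2 / (2 * deceleration_mps2)
    )
    # Brake when the obstacle is within the stopping distance.
    return obstacle_distance_m <= stopping_distance_m


print(should_brake(obstacle_distance_m=10, speed_mps=20))   # obstacle is close
print(should_brake(obstacle_distance_m=100, speed_mps=20))  # obstacle is far away
```

The point is that the machine perceives something (distance and speed) and picks the action that serves its goal (not hitting the obstacle). How it arrives at that decision is a separate question.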
Note that the definition of artificial intelligence does not address how the machine achieves this behavior. That is the job of its subsets, such as machine learning.
Machine Learning - structuring data for learning
Artificial intelligence enables a machine to mimic human abilities. Machine learning is a subset of AI that uses data, rather than explicit instructions, to train a machine. A computer system performs a specific task without explicitly coded instructions, relying on patterns and inference instead. Machine learning algorithms are used where it is challenging to develop a conventional coded algorithm for performing the task.
Before the advent of machine learning, programmers would code in machine responses to specific stimuli. A problem occurred if the machine encountered a new set of stimuli, at which point a human coder would have to revise the code to deal with the new inputs. This is not scalable in the age of big data.
Machine learning uses data, rather than code, to learn. The “training data” contains a broad set of inputs and corresponding outputs, or labels, pre-assigned by humans. The machine uses this data to learn named features, or what the expected output is for each input value. Using its newly acquired background knowledge, it can make educated guesses on how to handle a new set of data.
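As a rough illustration of learning from labeled data instead of hand-written rules, here is a minimal sketch of a nearest-neighbor classifier. The training data, the feature choice (height and weight), and the function name are made up for this example, and nearest-neighbor is just one simple technique among many.

```python
import math

# Hypothetical training data: (height_cm, weight_kg) paired with a
# label that a human assigned in advance.
TRAINING_DATA = [
    ((20.0, 4.0), "cat"),
    ((25.0, 5.0), "cat"),
    ((60.0, 30.0), "dog"),
    ((70.0, 35.0), "dog"),
]


def predict(features, training_data=TRAINING_DATA):
    """Return the label of the training example closest to `features`."""
    _, nearest_label = min(
        training_data,
        key=lambda example: math.dist(example[0], features),
    )
    return nearest_label


# Neither input below appears in the training data.
print(predict((22.0, 4.5)))
print(predict((65.0, 32.0)))
```

Notice that there is no rule in the code saying what a cat or a dog is. To handle a new kind of animal, you would add labeled examples rather than rewrite the program.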
So what about deep learning makes it special compared to traditional machine learning?
Deep Learning - finding latent features with unstructured data
Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks, and convolutional neural networks apply to computer vision. These architectures have produced results comparable to, and in some cases superior to, those of human experts.
Many machine learning algorithms rely on clean, structured training data. The training data is “labeled” with the output expected from the machine learning engine. Training consists of tweaking the machine’s parameters until it produces the expected output. Once trained, the machine should produce trustworthy output on real data. One of the key differences between traditional supervised machine learning and deep learning is that traditional supervised machine learning training data requires feature engineering. Deep learning, however, often does not require this step.
Deep learning can consume unstructured training data, facilitating a new method of learning.
Enter the neural network. A neural network is loosely based on neuron structures in the human brain. These deep learning networks are called artificial neural networks (ANN). Deep learning training does not always require pre-labeled training data. The ANN can scan the training data for patterns that may be understood by humans, such as color, or not apparent to humans. The latter are called latent features.
ANNs resemble, but are not identical to, the workings of the human brain. The set of inputs sets off a chain reaction of neurons in the hidden layers that causes outputs to be selected. By changing weights and biases within the hidden layers, the ANN is “trained” to output the correct answer by matching numerous latent features. The mechanics of training are beyond the scope of this blog post, so I will provide a link to an independent article with all the fine-grained details. When adequately trained, an ANN can correctly output the proper response to inputs not previously seen.
Neural networks are designed to find latent features. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. So how does this all relate to computer vision?
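To give a feel for what "changing weights and biases" means, here is a minimal sketch of training a single artificial neuron. This is a deliberate simplification: it has no hidden layers, and it uses the classic perceptron learning rule rather than the backpropagation typically used for deep networks. The function names, the learning rate, and the choice of the logical OR task are all assumptions made for illustration.

```python
def step(value):
    """Threshold activation: the neuron fires (1) or stays silent (0)."""
    return 1 if value > 0 else 0


def predict(weights, bias, inputs):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(examples, epochs=20, learning_rate=0.1):
    """Nudge two weights and one bias whenever the neuron answers wrongly."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            # Each weight moves in proportion to its input and the error.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Toy task: learn the logical OR of two binary inputs.
OR_EXAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(OR_EXAMPLES)

for inputs, target in OR_EXAMPLES:
    print(inputs, "->", predict(weights, bias, inputs), "expected", target)
```

A real ANN stacks many such neurons into layers and adjusts a very large number of weights, but the basic idea is the same: compare the output with the expected answer and adjust the parameters to shrink the error.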
What is computer vision?
Computer vision allows a machine to “see” objects and their environment, much like a human sees. Machines mimic human visual perception and reasoning skills. Computer vision is image or video analysis; it infers knowledge from digital images or videos. However, what is trivial for humans can be extremely complicated for machines.
For example, identifying one object as a car and another as a bike is a trivial task for a human who can draw from years (literally an entire life) of background information. A machine starts as a blank slate and must be trained to make these distinctions. Training makes use of deep learning techniques, as discussed in the section above.
How computer vision works
Three basic steps in the computer vision process
All CV projects follow three necessary steps:
Acquire the image data
Images, even large sets, can be acquired in real time through video, photos, or 3D technology for analysis.
Vectorize the image
Deep learning models automate much of this process, but the models are often trained by first being fed thousands of labeled or pre-identified images.
Understand the image
The final step is the interpretative step to identify or classify the object.
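The three steps above can be sketched in a few lines of Python. This is a toy, not a real pipeline: the image is a hard-coded 4x4 grayscale grid standing in for a camera frame, and the "understand" step is a simple brightness rule standing in for a trained deep learning model. All function names and the threshold are assumptions.

```python
def acquire_image():
    """Step 1: acquire. A stand-in for a camera frame (grayscale, 0-255)."""
    return [
        [0, 0, 0, 0],
        [0, 255, 255, 0],
        [0, 255, 255, 0],
        [0, 0, 0, 0],
    ]


def vectorize(image):
    """Step 2: vectorize. Flatten the grid into floats scaled to 0.0-1.0."""
    return [pixel / 255 for row in image for pixel in row]


def understand(vector, threshold=0.2):
    """Step 3: understand. A toy rule in place of a trained model."""
    average_brightness = sum(vector) / len(vector)
    return "object present" if average_brightness > threshold else "empty scene"


vector = vectorize(acquire_image())
print(len(vector), "values in the vector")
print(understand(vector))
```

In a real project, step 3 is where the trained model lives. The vector produced in step 2 is simply the numeric form that a model can consume.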
The types of computer vision
All computer vision use cases can be broken down into the smaller sub-uses described below. Many CV projects employ more than one of these sub-uses:
Edge detection is a technique used to determine the outside edge of an object or landscape to distinguish better what is in the image.
Image segmentation partitions an image into multiple regions or pieces to be examined separately.
Object detection identifies a specific object in an image. Advanced object detection recognizes many objects in a single image: a football field, an offensive player, a defensive player, a ball, and so on. These models use X, Y coordinates to create a bounding box and identify everything inside the box.
Pattern detection is a process of recognizing repeated shapes, colors, and other visual indicators in images.
Image classification groups images into different categories.
Feature matching is a type of pattern detection that matches similarities in pictures to help classify them.
Facial recognition is an advanced type of object detection that not only recognizes a human face in an image but identifies a specific individual.
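Three of the techniques above (edge detection, image segmentation, and the bounding box used in object detection) can be illustrated on a tiny grayscale grid. This is a simplified sketch with made-up data and hypothetical function names. Production systems typically rely on libraries such as OpenCV and on trained models rather than on fixed thresholds.

```python
# A 4x6 grayscale "image": a bright rectangle on a dark background.
IMAGE = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 9, 9, 9, 0],
    [0, 0, 9, 9, 9, 0],
    [0, 0, 0, 0, 0, 0],
]


def horizontal_edges(image):
    """Edge detection: difference between horizontally adjacent pixels."""
    return [
        [abs(row[x + 1] - row[x]) for x in range(len(row) - 1)]
        for row in image
    ]


def segment(image, threshold=5):
    """Segmentation: split pixels into foreground (True) and background (False)."""
    return [[pixel > threshold for pixel in row] for row in image]


def bounding_box(mask):
    """Smallest box (x_min, y_min, x_max, y_max) around the foreground pixels."""
    points = [
        (x, y)
        for y, row in enumerate(mask)
        for x, is_foreground in enumerate(row)
        if is_foreground
    ]
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


print(horizontal_edges(IMAGE)[1])      # large values mark the object's left and right edges
print(bounding_box(segment(IMAGE)))    # corner coordinates of the box around the object
```

The large values in the edge row mark where the object starts and ends, and the bounding box is expressed in the X, Y coordinates mentioned above.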
How computer vision is used by e-commerce
Narrowing our focus on the types of data encountered in e-commerce, we see four broad categories where computer vision is used as data input:
Documents - beyond paper
“Smart” OCR - preserving tables, for example
Images, pictures or videos
Classifying data of incoming documents
Tracking paper such as invoices, contracts, or receipts
Paper documents still play a significant role in contemporary business. Most data is now electronic, but not all of it. Computer vision can be used to bring data sourced from paper documents into the electronic realm.
Products - physical objects and electronic tracking
Products exist mostly in the physical world. Tracking these products is best done electronically. The interface between the physical products and electronic tracking is an ideal use for computer vision.
Equipment - keeping the lines humming
Monitor for failures, imminent failures
Equipment consists of physical objects that need tracking and maintenance. Much like the products in the section above, the interface between the physical world and the electronic world is a natural fit for computer vision.
People - the unpredictable element
Assembly line optimization
Facial recognition - eliminates buddy punching (when an employee has a coworker clock in for them)
The most unreliable and unpredictable element in any enterprise flow is people. They do not follow a neat set of rules and algorithms that can be tracked using secondary indicators. Monitoring people’s movements in real time is an excellent use for computer vision.
Fixing issues caused by mislabeled products in the catalog
Creating systems that allow users to find items that look visually similar rather than just have similar attributes
Advanced computer vision systems may even be able to help people see how two pieces of an outfit would look together
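As a rough idea of how "visually similar" search can work, here is a minimal sketch that compares items by brightness histogram using cosine similarity. The catalog, the pixel values, and the function names are invented for illustration. Real systems usually compare feature vectors produced by a deep learning model rather than raw histograms.

```python
import math


def color_histogram(pixels, bins=4):
    """Count how many grayscale values (0-255) fall into each of `bins` ranges."""
    histogram = [0] * bins
    for value in pixels:
        histogram[min(value * bins // 256, bins - 1)] += 1
    return histogram


def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; 0.0 means nothing in common."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_pixels, catalog):
    """Return the catalog item whose histogram is closest to the query image."""
    query = color_histogram(query_pixels)
    return max(
        catalog,
        key=lambda name: cosine_similarity(query, color_histogram(catalog[name])),
    )


# Hypothetical catalog: product name mapped to a handful of pixel values.
CATALOG = {
    "black boots": [10, 20, 15, 30, 25, 5],
    "white sneakers": [250, 240, 245, 230, 255, 235],
}

print(most_similar([12, 18, 40, 22, 8, 35], CATALOG))
```

The same comparison can flag catalog problems: if a product's image is far from every other image carrying the same label, the label may be wrong.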
Common technologies for CV development
While computer vision technology development can be complicated, you do not have to do it all on your own. CV development is considerably simplified by the cloud APIs of the three largest cloud vendors.
Google Vision AI: Google Cloud offers two computer vision products, AutoML Vision and Vision API, that use machine learning to help you understand your images with industry-leading prediction accuracy.
Amazon Rekognition: Rekognition provides a number of computer vision capabilities, which can be divided into two categories: Algorithms that are pre-trained on data collected by Amazon or its partners, and algorithms that a user can train on a custom dataset.
Microsoft Computer Vision: The cloud-based Computer Vision API provides developers with access to advanced algorithms for processing images and returning information. By uploading an image or specifying an image URL, Microsoft Computer Vision algorithms can analyze visual content in different ways based on inputs and user choices.
Several resources exist to aid in the development of CV applications. Some are APIs while others are open-source. Listed below is a shortlist of some of the more popular resources available. In my blog “What computer vision services, platforms & solutions are on the market,” I present a survey of service companies who can assist your CV development.
MATLAB: A numerical computing environment and programming language. MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages.
SimpleCV: An open-source framework for building computer vision applications. With it, you get access to several high-powered computer vision libraries such as OpenCV – without having to first learn about bit depths, file formats, color spaces, buffer management, eigenvalues, or matrix versus bitmap storage.
GPUImage: A BSD-licensed iOS library that lets you apply GPU-accelerated filters and other effects to images, live camera video, and movies. GPUImage allows you to write your own custom filters, supports deployment to iOS 4.0, and has a simpler interface.
In this blog post, we introduced the basics of computer vision technology and how it fits into the world of artificial intelligence, machine learning, and deep learning. We also discussed how computer vision works, its inputs, outputs, and common technologies. These concepts have become an important part of e-commerce and other industries. For a more in-depth survey on computer vision in e-commerce, read my blog post, “10 ways computer vision is transforming digital retail.”