What is computer vision and what can it do?
In this blog post, you will learn the basics of computer vision technology and how it fits into the world of artificial intelligence (AI), machine learning (ML) and deep learning. You will also learn how computer vision works, its inputs, outputs, and common technologies. We will explain these concepts from the top down, starting with AI and going into more complex technologies along the way.
Artificial Intelligence - machines “thinking” like humans
Any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals exhibits AI. In short, AI is a machine attempting to mimic a human. A car that stops when a human operator presses the brake pedal does not exhibit artificial intelligence. A vehicle that senses an obstacle ahead and applies the brakes on its own exhibits some artificial intelligence because it responds in a human-like way.
Note that the definition of artificial intelligence does not address how the machine achieved this state. That task is the job of its subsets, such as machine learning.
Machine Learning - structuring data for learning
Artificial intelligence enables a machine to mimic human abilities. Machine learning is a subset of AI that uses data to teach a machine how to learn. Computer systems perform a specific task without explicitly coded instructions, relying on patterns and inference instead. Machine learning algorithms are used where it is challenging to develop a conventional coded algorithm for performing the task.
Before the advent of machine learning, programmers would code in machine responses to specific stimuli. The problem is if the machine encountered a new set of stimuli, a human coder would have to revise the code to deal with the new inputs. This was not scalable in the age of big data.
Machine learning uses data, rather than code, to learn. The “training data” contains a broad set of inputs and their corresponding outputs, or labels, pre-assigned by humans. The machine uses this data to learn which features matter and what the expected output is for each input. With this acquired background knowledge, it can make educated guesses about how to handle new data.
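The idea of learning from labeled training data can be sketched with a toy nearest-neighbor classifier. This is a hypothetical illustration only: the feature values (length and height in meters) and labels are made up, and real systems learn far richer representations.

```python
# Toy supervised learning: labeled training data maps inputs to outputs.
# Features here are made-up (length_m, height_m) measurements.
training_data = [
    ((4.5, 1.5), "car"),
    ((4.2, 1.4), "car"),
    ((1.8, 1.0), "bike"),
    ((1.7, 1.1), "bike"),
]

def classify(sample):
    """Guess a label for a new sample via its nearest labeled neighbor."""
    def distance(features):
        return sum((a - b) ** 2 for a, b in zip(features, sample))
    _, label = min(training_data, key=lambda pair: distance(pair[0]))
    return label

print(classify((4.4, 1.5)))  # a long, low object looks most like a "car"
print(classify((1.9, 1.0)))  # a short one looks most like a "bike"
```

The point is that no rule for "car" or "bike" was ever coded; the labels in the training data supply it.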
So what is the difference between machine learning and deep learning?
Deep Learning - finding latent features with unstructured data
Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks, and convolutional neural networks apply to computer vision. These architectures have produced results comparable to and in some cases, superior to human experts.
Machine learning relies on structured training data. The training data is “labeled” with the output expected from the machine learning engine. Training consists of tweaking the machine’s parameters until it produces the expected output; once trained, the model should produce trustworthy output on real data. One of the key differences between machine learning and deep learning is that machine learning training data must be pre-labeled by humans. Deep learning does not require this step.
Deep learning can consume unstructured training data, necessitating a new method of learning.
Enter the neural network. A neural network is loosely based on the neuron structures of the human brain. These deep learning networks are called artificial neural networks (ANNs). Deep learning training does not require pre-labeled training data. The ANN scans the training data for patterns that may be obvious to humans, such as color, or not apparent to humans at all. The latter are called latent features.
ANNs resemble, but are not identical to, the functions of the human brain. A set of inputs sets off a chain reaction of neurons in the hidden layers that causes outputs to be selected. By adjusting weights and biases within the hidden layers, the ANN is “trained” to output the correct answer by matching numerous latent features. The mechanics of training are beyond the scope of this blog, so I will provide a link to an independent article with all the gory details. Adequately trained, the ANN can output the proper response to inputs it has not previously seen.
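A minimal sketch of the "adjust weights and biases until the output is correct" idea, using a single artificial neuron (a perceptron) trained on the logical AND function. Real ANNs have many hidden layers and use gradient-based training, but the principle is the same: nudge the parameters whenever the output is wrong.

```python
# One artificial neuron: output = step(w1*x1 + w2*x2 + bias).
# Training nudges the weights and bias whenever the output is wrong.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # AND truth table

w = [0.0, 0.0]
bias = 0.0
rate = 0.1  # learning rate

def predict(x):
    total = w[0] * x[0] + w[1] * x[1] + bias
    return 1 if total > 0 else 0

for _ in range(20):  # a few passes over the training data
    for x, target in data:
        error = target - predict(x)
        w[0] += rate * error * x[0]
        w[1] += rate * error * x[1]
        bias += rate * error

print([predict(x) for x, _ in data])  # matches the AND targets: [0, 0, 0, 1]
```

After a handful of passes, the learned weights and bias reproduce the AND table without AND ever being coded explicitly.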
Neural networks are designed to find latent features. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. So how does this all relate to computer vision?
What is computer vision?
Computer vision allows a machine to “see” objects and their environment, much as a human does. Machines mimic human visual perception and reasoning skills. Computer vision is image or video analysis: it infers knowledge from digital images or videos. However, what is trivial for humans can be extremely complicated for machines.
For example, identifying one object as a car and another as a bike is a trivial task for a human who can draw from years, literally their entire life, of background information. A machine starts as a blank slate and must be trained to make these distinctions. Training makes use of deep learning techniques, as discussed in the section above.
How computer vision works
Three basic steps in the computer vision process
All CV projects follow three necessary steps:
- Acquire the image data
Images, even large sets, can be acquired in real-time through video, photos, or 3D technology for analysis.
- Vectorize the image
Deep learning models automate much of this process, but the models are often trained by first being fed thousands of labeled or pre-identified images.
- Understand the image
The final step is the interpretative step to identify or classify the object.
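The three steps above can be walked through with a toy example. Everything here is illustrative: a hand-made 4x4 grid of pixel values stands in for acquisition, and a simple brightness threshold stands in for a trained model.

```python
# Step 1: acquire the image (here, a hand-made 4x4 grayscale image, 0-255).
image = [
    [10, 12, 200, 210],
    [11, 13, 205, 215],
    [ 9, 14, 198, 220],
    [10, 15, 202, 212],
]

# Step 2: vectorize - flatten the 2D pixel grid into a normalized 1D vector,
# the kind of numeric input a model actually consumes.
vector = [pixel / 255 for row in image for pixel in row]

# Step 3: understand - a stand-in "model" classifies by average brightness.
# A real system would feed the vector to a trained network instead.
brightness = sum(vector) / len(vector)
label = "bright scene" if brightness > 0.4 else "dark scene"
print(label)
```

In practice step 2 and step 3 are handled by deep learning frameworks, but the shape of the pipeline is the same: pixels in, numbers in between, an interpretation out.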
The types of computer vision
All computer vision use cases can be broken down into the smaller sub-uses described below. Many CV projects employ more than one of these sub-uses:
- Image segmentation partitions an image into multiple regions or pieces to be examined separately.
- Object detection identifies a specific object in an image. Advanced object detection recognizes many objects in a single image: a football field, an offensive player, a defensive player, a ball, and so on. These models use X, Y coordinates to create a bounding box and identify everything inside it.
- Facial recognition is an advanced type of object detection that not only recognizes a human face in an image but identifies a specific individual.
- Edge detection is a technique used to determine the outside edge of an object or landscape to distinguish better what is in the image.
- Pattern detection is a process of recognizing repeated shapes, colors, and other visual indicators in images.
- Image classification groups images into different categories.
- Feature matching is a type of pattern detection that matches similarities in pictures to help classify them.
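The bounding boxes mentioned under object detection are typically compared using an intersection-over-union (IoU) score. Here is a small sketch with made-up box coordinates in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) bounding boxes."""
    # Corners of the overlapping rectangle, if any.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detector's predicted box vs. a hand-labeled "ground truth" box.
predicted = (10, 10, 50, 50)
ground_truth = (20, 20, 60, 60)
print(round(iou(predicted, ground_truth), 3))  # 0.391 - partial overlap
```

An IoU of 1.0 means a perfect match and 0.0 means no overlap; detection benchmarks commonly count a prediction as correct above some IoU threshold such as 0.5.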
How computer vision is used by e-commerce
Narrowing our focus to the types of data encountered in e-commerce, we see four broad categories where computer vision is used as data input:
Documents - beyond paper
- “Smart” OCR - preserving tables, for example
- Images, pictures or videos
- Classifying data of incoming documents
- Tracking paper such as invoices, contracts, or receipts
Paper documents still play a significant role in contemporary business. Much data is now electronic, but not all of it. Computer vision can be used to put data sourced by paper documents into the electronic realm.
Products - physical objects and electronic tracking
- Product QA
- Inventory control
Products exist mostly in the physical world. Tracking these products is best done electronically. The interface between the physical products and electronic tracking is an ideal use for computer vision.
Equipment - keeping the lines humming
- Monitor for failures, imminent failures
- Flow optimization
Equipment consists of physical objects that need tracking and maintenance. Much like the products in the section above, the interface between the physical world and the electronic world is an ideal use for computer vision.
People - the unpredictable element
- Worker Safety
- Assembly line optimization
- Facial recognition - eliminates buddy punching
The most unreliable and unpredictable element in any enterprise flow is people. They do not follow a neat set of rules and algorithms, so they cannot be tracked using secondary indicators. Monitoring people’s movements in real time is an excellent use for computer vision.
Common technologies for CV development
While computer vision technology development can be complicated, you do not have to do it all yourself. CV development is considerably simplified through the cloud APIs of the three largest cloud vendors.
- Google Vision AI: Google Cloud offers two computer vision products, AutoML Vision and Vision API, that use machine learning to help you understand your images with industry-leading prediction accuracy.
- Amazon Rekognition: Rekognition provides a number of computer vision capabilities, which can be divided into two categories: Algorithms that are pre-trained on data collected by Amazon or its partners, and algorithms that a user can train on a custom dataset.
- Microsoft Computer Vision: The cloud-based Computer Vision API provides developers with access to advanced algorithms for processing images and returning information. By uploading an image or specifying an image URL, Microsoft Computer Vision algorithms can analyze visual content in different ways based on inputs and user choices.
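To give a feel for what calling one of these cloud APIs involves, the sketch below builds (but does not send) a request to Microsoft's Analyze Image endpoint using only the standard library. The endpoint path, API version, query parameter, and header name follow Microsoft's documentation at the time of writing; verify them against the current docs, and note that the resource name and key are placeholders.

```python
import json
import urllib.parse
import urllib.request

# Placeholders - substitute your own Azure resource endpoint and key.
endpoint = "https://my-resource.cognitiveservices.azure.com"
params = urllib.parse.urlencode({"visualFeatures": "Description,Tags"})
url = f"{endpoint}/vision/v3.2/analyze?{params}"

# The API accepts a JSON body pointing at a publicly reachable image URL.
body = json.dumps({"url": "https://example.com/photo.jpg"}).encode()
request = urllib.request.Request(
    url,
    data=body,
    headers={
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": "YOUR-KEY-HERE",  # placeholder
    },
)
# urllib.request.urlopen(request) would send it; we only construct it here.
print(request.get_method(), request.full_url)
```

The other vendors follow the same pattern: an authenticated HTTP request carrying the image (or its URL) in, a JSON description of the detected content back out.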
Several resources exist to aid in the development of CV applications. Some are APIs while others are open-source. Below is a short list of some of the more popular resources available. In my blog “What computer vision services, platforms & solutions are on the market,” I present a survey of service companies who can assist your CV development.
- TensorFlow: A free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library and is also used for machine learning applications such as neural networks.
- Matlab: A numerical computing environment and programming language. MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages.
- OpenCV: A library of programming functions mainly aimed at real-time computer vision. The library is cross-platform and free for use under the open-source BSD license.
- SimpleCV: An open-source framework for building computer vision applications. With it, you get access to several high-powered computer vision libraries such as OpenCV – without having to first learn about bit depths, file formats, color spaces, buffer management, eigenvalues, or matrix versus bitmap storage.
- GPUImage: A BSD-licensed iOS library that lets you apply GPU-accelerated filters and other effects to images, live camera video, and movies. GPUImage allows you to write your own custom filters, supports deployment to iOS 4.0, and offers a simpler interface than working with OpenGL ES directly.
In this blog, I introduced the basics of computer vision technology and how it fits into the world of artificial intelligence, machine learning, and deep learning. I also discussed how computer vision works, its inputs, outputs, and common technologies. These concepts have become an important part of e-commerce and other industries. For a more in-depth survey on computer vision in e-commerce, read my blog post, “10 ways computer vision is transforming digital retail.”