Edge AI for techies

30 October 2018

Artificial Intelligence (AI) is about to enter the traditional industry segments. When that happens we can start to see the real gains of AI, where industry becomes radically more efficient and the environment much cleaner. But currently, many are struggling to bring intelligence to all the small devices that together make up the industry, in other words, to bring AI to the edge.

AI keeps entering new fields at a high pace and with great promises. So far, the connected and already digital industries such as media and advertising, finance and retail have been exploited the most. There is no doubt that AI has created real value in these segments: there are plenty of convenient services and functions nowadays that make our lives a little bit easier and smoother. However, the big and important problems are still ahead of us. The solution to climate and environmental problems is to remove old, dirty and inefficient technology and replace it with clean energy and an efficient industry, the latter commonly referred to as Industry 4.0.

A crucial component of Industry 4.0 is introducing intelligence “on the edge”. The edge refers to the part of our world that is outside the range of high-speed, high-bandwidth connections, which, frankly speaking, is most of our world at the moment. Intelligence on the edge means that even the smallest devices and machines around us are able to sense their environment, learn from it and react to it. This allows, for instance, the machines in a factory to make higher-level decisions, act autonomously and report important flaws or possible improvements back to the user or the cloud. More specifically, we are here focusing on motion intelligence, which constitutes a large part of the industry.

Practically, the sensing part is achieved by having a motion sensor of some sort (e.g. accelerometer, gyro or magnetometer) connected to a small microcontroller unit (MCU). The MCU is loaded with software that has been pre-trained on the typical scenarios that the device will encounter, i.e. on data that has been collected beforehand and fed to the software. This is the learning part, which can also be a continuous process so that the device keeps learning as it encounters new situations. The last part, the reaction of the AI, can be some physical actuation on its immediate environment, or it can be a signal to a human being or the cloud for further action and assistance.
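
As a rough illustration, the pipeline above can be thought of as a simple sense-infer-act loop. The sketch below is a minimal Python mock-up of that idea; `read_imu_sample`, `predict_window` and `actuate` are hypothetical placeholders (here backed by simulated data and a dummy rule), not a real device API.

```python
import numpy as np
from collections import deque

WINDOW = 50          # ~1 second of data at 50 Hz
CHANNELS = 3         # tri-axial accelerometer

def read_imu_sample():
    """Placeholder for reading one sample from the sensor (simulated here)."""
    return np.random.randn(CHANNELS)

def predict_window(window):
    """Placeholder for the pre-trained model running on the MCU."""
    return int(np.mean(window) > 0)            # dummy 'classifier'

def actuate(label):
    """Placeholder for the reaction: actuation, user alert or cloud upload."""
    print("detected class:", label)

buffer = deque(maxlen=WINDOW)                  # sliding buffer of recent samples
for _ in range(200):                           # main sense-infer-act loop
    buffer.append(read_imu_sample())
    if len(buffer) == WINDOW:
        actuate(predict_window(np.asarray(buffer)))
```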

To be able to learn from data, the software is based on some machine learning technique. However, some of the most promising machine learning models widely applied in the cloud and the digital industries, those of the “deep neural network” type, still use too much computing power and memory to fit on the small MCUs on the market.

Deep learning and the time complexity problem
During the last couple of years there has been a large focus on deep learning, and some impressive results have been published, especially in the areas of image and speech recognition. The main attraction of deep learning lies in its ability to learn complex patterns without the need for manual feature engineering. In other words, the deep artificial neural networks (DNNs) are designed to take raw data as input and perform the feature extraction automatically in the first few layers of the network. Their major drawback is that the models are usually massive, with memory footprints of hundreds of megabytes in the area of image recognition.

The application of deep learning to time series classification is relatively new, but there already exists a fair number of interesting and promising studies. The DNN architectures usually considered state-of-the-art for time series classification tasks such as activity or motion recognition are the ones that combine convolutional neural networks (CNNs) with recurrent neural networks (RNNs). To design such a neural network, we need to know what problem we are facing. For motion analysis and classification, we have some type of motion data produced by a sensor and sent for processing to the MCU. Typically, this motion data is accelerometer and gyro data from a microelectromechanical system (MEMS) tri-axial inertial measurement unit (IMU). The data is sampled at approximately 50 Hz or more, which means that we get at least 50 new data points to process from each spatial direction or “channel” every second, 3 × 50 = 150 points in total (or 300 when including the gyro data).
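
As a concrete, purely illustrative example of such a hybrid architecture, the sketch below builds a small Conv1D + LSTM classifier in Keras for windows of 50 samples from 6 IMU channels (accelerometer plus gyro). The layer sizes and the number of classes are assumptions made for the sketch, not a recommended design.

```python
# A minimal CNN + RNN model for IMU windows, assuming TensorFlow/Keras is available.
import tensorflow as tf

WINDOW, CHANNELS, NUM_CLASSES = 50, 6, 4       # illustrative values

model = tf.keras.Sequential([
    # Convolutional front end: learns local features from the raw samples
    tf.keras.layers.Conv1D(32, kernel_size=5, activation="relu",
                           input_shape=(WINDOW, CHANNELS)),
    tf.keras.layers.MaxPooling1D(2),
    # Recurrent part: models how the extracted features evolve over time
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()                                # prints the parameter count
```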

The stream of data is usually processed as a batch of some size, which translates to a time interval or a “window” that holds the last n points that have arrived. This enables the algorithm to see and learn the relationships between data points that are close in time as well as the interactions between the various sensor channels. Window sizes differ a lot but are typically on the order of a second. Consecutive windows overlap, often by 50 % or so, which means that for the algorithm to run with near real-time performance the MCU has to process more than 100 data points in less than a second. With an input of roughly 150 points (or more) as in the example above, the networks have at least on the order of 1 M floating-point parameters, which translates to roughly 4 MB of memory using single precision (1).
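
The windowing itself is straightforward. The sketch below shows one way to cut a continuous 3-channel stream into roughly 1-second windows with 50 % overlap; the sampling rate and window length are the illustrative numbers from the text, and the data is simulated.

```python
import numpy as np

FS = 50                                  # sampling rate in Hz
WINDOW = FS                              # ~1 second window
STEP = WINDOW // 2                       # 50 % overlap

stream = np.random.randn(10 * FS, 3)     # 10 s of simulated 3-channel data

# Cut the stream into overlapping windows of shape (WINDOW, 3)
windows = [stream[i:i + WINDOW]
           for i in range(0, len(stream) - WINDOW + 1, STEP)]
print(len(windows), "windows of shape", windows[0].shape)
```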

If we consider running our machine learning algorithm on an Arm Cortex-M0 MCU, which is one of the smallest MCUs on the market, with a clock frequency of at most 50 MHz and no hardware acceleration for floating-point operations, executing a model with millions of floating-point weights would be far from real-time. Worse yet, the RAM of such a device is at most 32 kB (in practice we typically do not have more than roughly 8 kB or so at our disposal), which would be on the order of 500 times too little.
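
To make the mismatch concrete, here is the back-of-the-envelope calculation behind those numbers, using the illustrative figures from the text:

```python
# Rough memory budget check, using the article's illustrative numbers
params = 1_000_000                      # weights in the network
bytes_per_weight = 4                    # single-precision float
model_size = params * bytes_per_weight  # ~4 MB

usable_ram = 8 * 1024                   # ~8 kB of RAM in practice on a small M0
print("model size: %.1f MB" % (model_size / 1e6))
print("times too large: %.0f" % (model_size / usable_ram))   # ~500x
```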

Possible solutions to the problem
Currently, many see the potential that edge computing can bring to the traditional industry and other areas of our society. To enable intelligence on the edge, a few different strategies have emerged. In short, some of the ongoing trends are:

Exporting TensorFlow graphs
This is not really a solution for edge computing, since it still requires relatively large computing resources; it is instead a popular approach for mobile apps. The deep learning model is prototyped in a deep learning framework such as Keras, Caffe or TensorFlow and then trained on a powerful machine, preferably with many GPUs. Finally, the finished model, or graph, is exported to the device. The model is usually of a size that fits the hardware of a powerful mobile phone. In other words, this is not a viable solution today if one wants to run machine learning models on the smallest MCUs on the market.
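
A minimal sketch of what such an export looks like, assuming TensorFlow 2.x with its built-in TensorFlow Lite converter and a tiny stand-in Keras model used purely for illustration:

```python
# Export a trained Keras model to a TensorFlow Lite flat buffer (sketch only).
import tensorflow as tf

# Stand-in model; in practice this would be the trained network
model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(150,))])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()             # serialized graph as bytes

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
print("exported", len(tflite_model), "bytes")  # real models are far larger than an M0's memory
```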

New memory design
One of the problems of deep learning on edge devices is not only the memory consumption but also the heavy memory access, which both slows the system down (2) and drains the battery. One way forward is to rethink the hardware design and build hardware that suits the software architecture. Companies like Syntiant and Mythic have invented creative approaches that use the levels of charge in flash memory to represent the weights in a neural network (3). Whether or not this will be a fruitful approach remains to be seen.

Lowering the numerical precision
In their quest to make the best dedicated machine learning chips, the chipmakers Intel and Arm have taken the route of lowering the precision in the neural networks (4) (5), basically by quantizing the weights and performing integer or fixed-point arithmetic. This decreases the memory footprint and increases the effective memory bandwidth, which accelerates the computations and also reduces the power consumption. The cost is, they claim, little or no loss in model accuracy. At present, however, it is unlikely that lowering the numerical precision of an off-the-shelf deep learning model will make it fit the smallest MCUs.
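
The core idea can be illustrated without any framework support: map the floating-point weights onto 8-bit integers with a shared scale factor and dequantize on the fly during inference. The sketch below is a toy example of symmetric post-training quantization on random weights, not any vendor's actual implementation.

```python
import numpy as np

# Pretend these are trained 32-bit weights of one layer
w = np.random.randn(256, 64).astype(np.float32)

# Symmetric 8-bit quantization: one scale factor per tensor
scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize to check the error introduced
w_restored = w_int8.astype(np.float32) * scale
print("memory: %d -> %d bytes" % (w.nbytes, w_int8.nbytes))   # 4x smaller
print("max abs error: %.4f" % np.abs(w - w_restored).max())
```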

Redesign networks after training (pruning)
In some sense the neural network architecture is somewhat crude. Its size is set by rules of thumb and guesses, and if it turns out to work well after training we are usually happy. But quite often there are many connections in the network that are barely used, and if those are removed we end up with a smaller model. This, and related techniques, are called pruning. While certainly helpful in many cases, we are usually talking about modest gains, not orders of magnitude in savings.
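
A minimal illustration of magnitude pruning: zero out the weights with the smallest absolute values and check how sparse the layer becomes. This is only a sketch on random weights; a real pruning pipeline would fine-tune the network after this step.

```python
import numpy as np

w = np.random.randn(256, 64).astype(np.float32)   # trained weights of one layer

# Remove the 80 % of connections with the smallest magnitude
threshold = np.quantile(np.abs(w), 0.80)
mask = np.abs(w) >= threshold
w_pruned = w * mask

sparsity = 1.0 - mask.mean()
print("sparsity: %.0f%%" % (100 * sparsity))
# Savings only materialize if the zeros are stored and executed in sparse form,
# which is why pruning rarely buys orders of magnitude on its own.
```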

Working with bit representations and binary operations
There have been some recent successful attempts to recast machine learning models into bit representations. One example is XNOR-Net (6), where the weights of an artificial neural network are not floating-point values but binary, and the convolutions are approximated by binary operations. While this is a clever and efficient approach to problems that cannot be solved well without deep neural networks, such as image recognition, it may still be overkill to design neural networks for other tasks in the area of edge computing.
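
The flavour of the idea can be shown in a few lines: replace a layer's real-valued weights by their signs times a single scaling factor (the mean absolute weight, as in XNOR-Net), so that a dot product reduces to sign flips and additions, or to XNOR and popcount on packed bits in an actual implementation. This is only a toy numpy illustration, not the paper's full scheme.

```python
import numpy as np

w = np.random.randn(64).astype(np.float32)   # real-valued weights of one filter
x = np.random.randn(64).astype(np.float32)   # one input patch

# Binarize: W ~ alpha * sign(W), with alpha = mean(|W|) as in XNOR-Net
alpha = np.abs(w).mean()
w_bin = np.sign(w)

exact = np.dot(w, x)                 # full-precision dot product
approx = alpha * np.dot(w_bin, x)    # binary-weight approximation
print("exact: %.3f  approx: %.3f" % (exact, approx))
# On hardware, w_bin and a binarized input can be packed into bits so the dot
# product becomes an XNOR followed by a popcount.
```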

Inventing new algorithms with memory and CPU usage in mind
As this list shows, there is no shortage of creative ideas and initiatives to make edge AI real. Some attempts are based on tuning or redesigning hardware; others do the same with existing algorithms.

Surprisingly few, however, try to radically redesign the algorithms with CPU usage, memory consumption and accuracy in mind. For instance, one can note that for many problems more traditional machine learning approaches such as ensemble learning (random forests among others) perform as well as, and sometimes even better than, deep neural networks (7), especially if little training data is available (8). (For random forest approaches, the memory consumption can be large unless the trees are made shallow, which has recently been shown to work very well (9).)
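
As a small illustration of that alternative route, the sketch below trains a deliberately shallow random forest on simple hand-crafted window features (per-channel means and standard deviations) using scikit-learn. The data is synthetic, and the depth and tree-count limits are the knobs that keep the memory footprint small; the numbers are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data: 1000 windows of 50 samples x 3 channels, two classes
X_raw = rng.normal(size=(1000, 50, 3))
y = rng.integers(0, 2, size=1000)
X_raw[y == 1] += 0.5                      # make class 1 separable

# Hand-crafted features: mean and std per channel -> 6 features per window
features = np.concatenate([X_raw.mean(axis=1), X_raw.std(axis=1)], axis=1)

# Shallow, small forest: limiting depth and tree count bounds the memory footprint
clf = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0)
clf.fit(features, y)
print("training accuracy: %.2f" % clf.score(features, y))
```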

The Imagimob approach to the edge AI challenge is to take inspiration from the entire machine learning field and combine it with our own innovations to design the best possible algorithm for a given application. The aim is always low CPU and memory consumption, with small MCUs in mind. From the start we have worked with bit representations and binary operations, but instead of using unnecessarily large neural networks for problems where they are not needed, we borrow ideas from e.g. ensemble learning, dimensionality reduction, clustering techniques and deep learning. This approach has proven very fruitful so far in a wide range of AI projects on edge devices.

In the coming years, we will probably see many remarkable leaps forward in the area of machine learning. Imagimob will continue to stay on top of research, think big and think new.

Johan Malm, Ph.D. and specialist in numerical analysis, computational physics and algorithm development. Johan works with AI research at Imagimob.

_____________________________________________________________________________________________
[1] https://www.mdpi.com/1424-8220/16/1/115
[2] https://www.eetindia.co.in/news/article/18070903-memory-crucial-for-edge-ai
[3] https://spectrum.ieee.org/tech-talk/computing/embedded-systems/two-startups-use-processing-in-flash-memory-for-ai-at-the-edge
[4] https://software.intel.com/en-us/articles/lower-numerical-precision-deep-learning-inference-and-training
[5] https://arxiv.org/pdf/1801.06601.pdf
[6] https://arxiv.org/pdf/1603.05279.pdf
[7] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5298639/
[8] https://arxiv.org/pdf/1702.08835.pdf
[9] http://manikvarma.org/pubs/kumar17.pdf