
Bringing ML to Embedded Systems (Literature Survey)

INTRODUCTION

The digital revolution becomes more evident every single day: everything is going digital and becoming internet-connected. As part of this shift, the embedded world also has to evolve to absorb recent techniques and advances in computation. This project focuses on bringing machine learning (ML) techniques to the end devices of embedded systems. Applications can be seen in many different fields, such as
1. Medical diagnosis
2. Virtual personal assistants
3. Video surveillance
4. Industry 4.0
5. Agriculture
6. Autonomous driving
7. Social networking

NEED FOR ML ON END DEVICES

Environmental and agricultural monitoring are natural areas of application for ML models implemented in end devices. Because of the harsh conditions these devices have to endure, size, materials, and energy consumption matter most. Furthermore, because most of them are placed in remote locations, communication is often impossible or too unstable to be reliable. The area these devices cover is generally large, which means that multiple devices have to be spread across it. To make such systems economically viable to deploy, the devices have to be low-cost.
Another field that is generally forgotten is underwater surveillance. Intelligent systems for studying the seas are essential to find new species, study environmental changes and currents, and monitor water quality. One of the most significant constraints in this field is that underwater wireless communication does not work. Therefore, there is no Cloud, fog computing, or mesh available, which means the entire ML model has to be implemented in a single end device. Underwater surveillance devices capable of driving themselves across the sea, making predictions, and knowing when to come to the surface to send the collected data can improve battery life and allow such systems to cover larger areas.
Many areas of application can benefit from implementing the ML layer on the end device, thereby decentralizing it. The end devices of IoT networks are usually resource-constrained, with memories on the order of kilobytes (KB), low clock frequencies, and low power consumption. Resource-constrained MCUs have low processing capability, a low clock frequency, small flash memory and RAM, no caching system, and lack the resources to handle complex mathematical operations. They typically do not have an FPU, a graphics accelerator, or a vision accelerator.
ML and its requirements:
The first concept to keep in mind when discussing ML is that ML is not a single algorithm. To build an ML model, we must choose one algorithm from a set of candidates, and despite the many projects and areas where these algorithms have already been applied, we cannot make any assumption in advance about how a given ML algorithm will perform. Among all the algorithms, the most commonly used are support-vector machines (SVM), nearest neighbors, artificial neural networks (ANN), naive Bayes, and decision trees.

Another key concept is that data is the backbone of any ML model. The model can only perform as well as the quality of the data fed to it during the learning phase. Since data quality is crucial for algorithm performance, most of the model development effort goes into data extraction and cleaning; building and optimizing a model is just a small part of the whole process.

ML development is divided into two main blocks. The first block is model building, where data is fed into the ML algorithm and a model is built from it. The second block is inference, where new data is given to the model and the model provides an output. In the learning phase, the computational resources necessary to achieve the goal are high: the number of calculations is massive, memory consumption is large (and grows with data complexity), and processing power and time to finish the learning phase follow inversely proportional curves.

The important concepts from an embedded perspective are exporters and transpilers. These are software tools that convert code from a higher-level language to a lower-level one. C and C++ are the languages used to program most resource-scarce MCUs and FPGAs, because they allow direct access to and manipulation of the hardware; they also have less overhead than other languages and tend to perform better. There is therefore a discrepancy between the programming languages used to build ML models and the ones used to program the end device. Because high-level languages are used in the Cloud and low-level languages on the end devices, transpilers/exporters are needed to convert high-level code and objects to low-level code. Some examples are:
Sklearn-porter,
Weka-porter,
MATLAB Coder.
However, transpilers have some limitations: they are generic in nature and they lack optimization. In our project, we have mostly worked on optimization techniques.
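To make the role of a transpiler/exporter concrete, the sketch below shows the kind of standalone C++ such a tool might emit for a small decision tree; the feature names, thresholds, and classes are hypothetical and are not the output of any specific tool.

    // Hypothetical transpiled decision tree (illustrative values only).
    // The generated code needs no dynamic memory and no external libraries,
    // so it can be compiled directly for a resource-scarce MCU.
    int predict_class(float temperature, float humidity)
    {
        if (temperature <= 27.5f) {
            if (humidity <= 40.0f)
                return 0;   // class 0
            return 1;       // class 1
        }
        if (humidity <= 55.0f)
            return 1;       // class 1
        return 2;           // class 2
    }

Because the tree is flattened into plain branches, inference costs only a handful of comparisons, but the generated code is generic and leaves platform-specific optimization to the developer.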

OPTIMIZATION TECHNIQUES

Machine Learning Algorithms and Tools for Embedded Systems:
1. ProtoNN: ProtoNN is a novel algorithm created to replace kNN on resource-scarce MCUs. kNN stores the entire training dataset and uses it during the inference phase, computing the distance between the new data point and all stored ones; this is not feasible on memory-limited MCUs. ProtoNN achieves its compression through techniques such as sparse low-dimensional projection, prototypes, and joint optimization. During optimization, the model is therefore constructed to fit a maximum size, instead of being pruned after construction.
2. Bonsai tree: Bonsai is a novel algorithm based on decision trees. It reduces model size by learning a single, sparse, shallow tree whose nodes improve prediction accuracy and can make non-linear predictions. The final prediction is the sum of the predictions each node provides.
3. SeeDot: SeeDot is a domain-specific language (DSL) created to overcome the problem of resource-scarce MCUs not having a floating-point unit (FPU). Since most ML algorithms rely on doubles and floats, calculations without an FPU are performed by emulating IEEE-754 floating point in software. This emulation adds overhead and can cause accuracy loss, which can be fatal for an ML algorithm, especially for a model exported from a different platform. The goal of SeeDot is to turn those floating-point computations into fixed-point ones with little or no accuracy loss.
4. CMSIS-NN: CMSIS-NN is a software library of neural-network kernels developed by Arm for its Cortex-M processor cores. Neural networks implemented with this library can achieve roughly 4x improvements in performance and energy efficiency while minimizing the network's memory footprint. Some of the optimizations that make this possible are fixed-point quantization, improved data transformation (converting between 8-bit and 16-bit data types), data re-use, and a reduction in the number of load instructions during matrix multiplication. A minimal sketch of the fixed-point idea behind SeeDot and CMSIS-NN follows this list.
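The sketch below illustrates the fixed-point representation these tools rely on, assuming a Q8.8 format (8 integer bits, 8 fractional bits); it is not the actual code of SeeDot or CMSIS-NN, and the format choice and saturation handling would differ in practice.

    #include <stdint.h>

    // Q8.8 fixed point: the real value is raw / 256.0.
    typedef int16_t q8_8_t;

    static inline q8_8_t to_fixed(float x)  { return (q8_8_t)(x * 256.0f); }
    static inline float  to_float(q8_8_t x) { return (float)x / 256.0f; }

    // Dot product using integer arithmetic only. Each product of two Q8.8
    // values is in Q16.16, so the accumulated sum is shifted back by 8 bits.
    // Saturation/overflow handling is omitted for brevity.
    q8_8_t dot_q8_8(const q8_8_t *a, const q8_8_t *b, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; ++i)
            acc += (int32_t)a[i] * (int32_t)b[i];
        return (q8_8_t)(acc >> 8);
    }

On an MCU without an FPU, this replaces software-emulated floating-point multiplies with fast integer multiplies, which is where the speed and energy gains come from.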

CASE STUDY

1. Implementation of an SVM for decision making in a communication controller for mobile ad hoc networks (MANETs); the controller automatically learns the relationships among the configuration parameters of a MANET so as to maintain near-optimal configurations in highly dynamic environments.
2. Target platforms: ARMv7 and PPC440, neither of which has special hardware accelerators or an FPU.
3. Package used: Weka.
4. Language: C++.
5. Baseline version: Weka C++, with a translation of the Weka Java items that were not already in the C++ version.
6. After optimization, a 20x improvement over the original implementation was achieved.

Selection of framework and language: There are many available SVM implementations. We considered language (with a preference for C++), licenses, memory usage, and compilation effort, and shortlisted the options available in C++ under licenses appropriate for our work. We did not consider Java implementations because of their higher memory requirements and because Java is not compatible with other software components on our target platform. We also eliminated several packages that rely heavily on malloc() calls, as dynamic memory usage is both slow and likely to cause runtime errors on our RAM-limited hardware. To select the specific package from which to continue development, we ran the suite of benchmark datasets and picked the fastest.

OPTIMIZATIONS

1. Function inlining:
i. The baseline Weka port makes over a billion calls to its eight most frequently called functions.
ii. Small, frequently called functions are forced to be inlined.
iii. This avoids the stack push and pop overhead on every call.
iv. Runtime is reduced by about 20% (see the inlining sketch after this list).
2. Numerical representation:
An SVM relies on many floating-point operations: to build an SVM, the code repeatedly computes a predicted value and its corresponding error, and stops when the error is sufficiently low. It is therefore critical to find a numerical representation that can be computed quickly and still meets the accuracy requirements, since floating-point operations are far more costly in time than integer ones. Our approach was to scale all of the values and kernel parameters by a scaling factor F, and to scale the data (or target values) and the error by F^2. We scale the coefficients and the kernel value k separately, and then de-scale the sum outside the loop; this single floating-point division outside the loop is much cheaper than many inner-loop floating-point multiplies (a scaled-integer sketch appears after this list). While one might expect the integer version to be consistently fastest because it never performs floating-point operations, building a strictly integer SVR model may require more computations overall to converge, and it does harm accuracy.
3. Kernel implementation:
Removing small inefficiencies in the kernel. The kernel function requires calls to pow() and sqrt(), both of which are expensive floating-point computations. The distance computation can be simplified by removing the call to the square root and, with it, the subsequent floating-point multiply (see the kernel sketch after this list). While we expected that removing these functions would reduce computation time, the increased convergence time might yield slower overall performance.
4. Memory vs. computation:
Use of an n x n triangular matrix to cache values. Caching the dot products in a triangular matrix gives a slight performance improvement; when there are more computations to make, caching the dot-product results can reduce overall compute time (a triangular-cache sketch appears after this list). However, the memory cost is probably not worth the small timing improvement.
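The inlining point above can be illustrated with a minimal sketch; the helper name and the GCC/Clang always_inline attribute are illustrative and not taken from the Weka port.

    // Forcing a small, hot helper to be inlined removes the call/return and
    // stack push/pop overhead on each of its (very many) invocations.
    __attribute__((always_inline)) static inline
    double dot3(const double *a, const double *b)
    {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }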
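Next, a minimal sketch of the scaled-integer idea for the inner loop, assuming an illustrative scaling factor F = 1024 and inputs that have been pre-scaled; it is not the case study's actual code.

    #include <stdint.h>

    #define F 1024   /* illustrative scaling factor */

    // coeff[i] and k[i] hold the real values multiplied by F and rounded, so
    // each product is scaled by F*F; one floating-point division outside the
    // loop recovers the real-valued sum.
    double scaled_sum(const int32_t *coeff, const int32_t *k, int n)
    {
        int64_t acc = 0;
        for (int i = 0; i < n; ++i)
            acc += (int64_t)coeff[i] * k[i];   // integer-only inner loop
        return (double)acc / ((double)F * F);  // single de-scaling division
    }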
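The kernel simplification can be sketched as follows, assuming an RBF-style kernel purely for illustration: computing the Euclidean distance with sqrt() and then squaring it again with pow() is redundant, so both calls can be dropped without changing the result.

    #include <math.h>

    // Naive version: sqrt() followed by pow(d, 2) undoes itself.
    double rbf_naive(const double *x, const double *y, int n, double gamma)
    {
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += (x[i] - y[i]) * (x[i] - y[i]);
        double d = sqrt(s);                 // expensive call
        return exp(-gamma * pow(d, 2.0));   // expensive call that cancels the sqrt
    }

    // Simplified version: same mathematical result, no sqrt() or pow().
    double rbf_simplified(const double *x, const double *y, int n, double gamma)
    {
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += (x[i] - y[i]) * (x[i] - y[i]);
        return exp(-gamma * s);
    }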
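Finally, a sketch of the triangular dot-product cache: because dot(i, j) equals dot(j, i), only the lower triangle needs to be stored, costing n*(n+1)/2 entries of RAM instead of n*n. The class and its interface are illustrative.

    #include <utility>
    #include <vector>

    struct DotCache {
        int n;
        std::vector<double> vals;   // packed lower triangle, n*(n+1)/2 entries

        explicit DotCache(int n_) : n(n_), vals((size_t)n_ * (n_ + 1) / 2) {}

        // dot(i, j) == dot(j, i), so only entries with j <= i are stored.
        double &at(int i, int j) {
            if (j > i) std::swap(i, j);
            return vals[(size_t)i * (i + 1) / 2 + j];
        }
    };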

OPTIMIZATION SUMMARY

1. Removing doubles saved at least 50% of the runtime on both platforms.
2. Disabling exceptions (or the function calls that raise them) saves at least 50% of the runtime.
3. Inlining functions saves about 20% of the runtime.
4. Removing sqrt() and pow() does not necessarily reduce runtime or improve accuracy, due to characteristics of the dataset.
5. Adding a cache of the dot-product calculations can improve performance, but at a memory cost.

USE OF FPGA

1. The use of FPGAs can create more dynamic, scalable, and flexible systems that can be reshaped as the target application's needs change.
2. Many hardware technologies can be used in accelerating machine learning algorithms, such as Graphics Processing Unit (GPU) and Field-Programmable Gate Array (FPGA). In particular, FPGA is a promising technology due to its reconfigurability, and real-time processing capability.
3. The FPGA can be used to implement anything from glue logic, to custom IP, to accelerators for computationally intensive algorithms. By taking on some of the processing tasks, FPGAs help to improve system performance, thereby freeing up the MCU from cycle-intensive tasks. FPGAs also provide excellent performance characteristics and lots of flexibility to accommodate changing standards.
4. There are two typical approaches to enhance the performance of an FPGA design: increasing the level of parallelism and revising the bit width of the data representation. To increase the level of parallelism, multiple processing units for the same functionality can be adopted. Regarding the bit width of the data representation, shorter bit width will lead to less resource usage of a building block and higher level of parallelism, but can also lead to unpredictable effects on the accuracy of the machine learning algorithm. Thus, there is usually a tradeoff between having higher accuracy or higher processing rate.
5. Neither standard product microcontrollers nor FPGAs were developed to communicate with each other efficiently. They even use different languages. Thus, interfacing the two can be a challenge. FPGAs do not have any dedicated logic that communicates with microcontrollers. This logic module must be designed from scratch. Second, the communication between the microcontroller and FPGA is asynchronous. Special care is needed to resynchronize the MCU to the FPGA clock domain. Finally, there is an issue of bottlenecks, both at the interface and on the MCU bus. Transferring information between the MCU and the FPGA usually requires cycles on the MCU bus and usually ties up the resource (PIO or EBI) used to effect the transfer. Care must be taken to avoid bottlenecks with external SRAM or Flash and on the MCU bus.

REFERENCES

1. Branco, S.; Ferreira, A.G.; Cabral, J. Machine Learning in Resource-Scarce Embedded Systems, FPGAs, and End-Devices: A Survey.
2. Haigh, K.Z.; Mackay, A.M.; Cook, M.R.; Lin, L.G. Machine Learning for Embedded Systems: A Case Study; BBN Technologies: Cambridge, MA, USA, 2015.
3. Nadeski, M. Bringing Machine Learning to Embedded Systems; White Paper; Texas Instruments: Dallas, TX, USA, 2019.
4. Zhang, C.; Patras, P.; Haddadi, H. Deep Learning in Mobile and Wireless Networking: A Survey. IEEE Commun. Surv. Tutorials 2019, 21, 2224–2287.
5. Praveen Kumar, D.; Amgoth, T.; Annavarapu, C.S.R. Machine learning algorithms for wireless sensor networks: A survey. Inf. Fusion 2019, 49.