


Where End To End Machine Learning Fails

Can E2E be used to solve every Machine Learning problem?

Photo by Su San Lee on Unsplash

One of the most important skills for those who work with Machine Learning is knowing which method is the right choice for a given problem. Some choices are trivial (e.g. supervised or unsupervised, regression or classification) because they are related to the problem formulation itself. However, even after defining what you are trying to solve, there is usually a myriad of algorithms that can be used.

For example, imagine you want to develop a system able to predict a categorical variable. To solve this problem, either a Classification Tree, K-nearest neighbors, or even an Artificial Neural Network can be used. Of course, there is a reason for many different algorithms to exist, even when they solve similar problems: each one has its particularities from which we can benefit.
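As a minimal sketch of this flexibility (scikit-learn and a synthetic dataset are assumptions made purely for illustration), the same classification problem can be handed to any of these algorithms:

```python
# Minimal sketch: one classification problem, three different algorithms.
# scikit-learn and the synthetic dataset are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Classification Tree": DecisionTreeClassifier(max_depth=5),
    "K-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```

Each model solves the same task; which one is preferable depends on the data and the constraints of the problem.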

What makes the task even harder is that solving some problems, like speech recognition and autonomous driving, requires an architecture consisting of many layers (e.g. preprocessing, feature extraction, optimization, prediction, decision making). For each layer, many different algorithms may be used.

The issue is: to achieve better results, changes to the inner layers and their corresponding algorithms have to be applied. However, as each layer is responsible for solving particular tasks, it becomes really difficult to determine how such changes will affect the system as a whole.

End-to-end (E2E) learning refers to training a possibly complex learning system represented by a single model (specifically a Deep Neural Network) that represents the complete target system, bypassing the intermediate layers usually present in traditional pipeline designs.
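To make the contrast concrete, here is a hedged sketch of the two designs (PyTorch and the toy modules are assumptions for illustration, not any specific system):

```python
# Illustrative sketch (PyTorch assumed; all modules are toy stand-ins).
import torch
import torch.nn as nn

x, y = torch.randn(8, 16), torch.randint(0, 3, (8,))

# Traditional pipeline: separate stages, each designed and tuned on its own.
def extract_features(raw):
    return raw.abs().log1p()          # hand-crafted transform, fixed beforehand

predictor = nn.Linear(16, 3)          # trained on top of the fixed features
pipeline_logits = predictor(extract_features(x))

# E2E design: one network from raw input to output, one loss, one optimizer.
e2e_model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 3),
)
loss = nn.CrossEntropyLoss()(e2e_model(x), y)
loss.backward()                       # a single criterion updates every layer
```

In the pipeline, improving the hand-crafted stage means revisiting everything built on top of it; in the E2E model, one training signal adjusts all layers together.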

End-to-end learning

End-to-end learning is a hot topic in the Deep Learning field because it takes advantage of the structure of Deep Neural Networks (DNNs), composed of several layers, to solve complex problems. Similar to the human brain, each DNN layer (or group of layers) can specialize to perform the intermediate tasks necessary for such problems. Tobias Glasmachers explains how E2E is framed in the Deep Learning context [1]:

"This elegant although straightforward and somewhat animate being-force technique [E2E] has been popularized in the context of deep learning. It is a seemingly natural consequence of deep neural architectures blurring the classic boundaries between learning machine and other processing components by casting a possibly complex processing pipeline into the coherent and flexible modeling language of neural networks. "

That alternative approach has been successfully applied to solve many complex problems. Below you can find how E2E is applied to Speech Recognition and Autonomous Driving problems.

Speech Recognition

Photo by Arthur Caranta

The traditional design for a spoken language understanding system is a pipeline structure with several different components, exemplified by the following sequence:

Audio (input) -> feature extraction -> phoneme detection -> word composition -> text transcript (output).

A clear limitation of this pipelined architecture is that each module has to be optimized separately under different criteria. The E2E approach consists in replacing the aforementioned chain with a single Neural Network, allowing the use of a single optimization criterion for enhancing the system:

Audio (input) — — — (NN) — → transcript (output)
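A minimal sketch of what "a single optimization criterion" can look like in practice (PyTorch and a CTC-style loss are assumptions for illustration; real systems are far larger):

```python
# Sketch (PyTorch assumed): one network maps audio features directly to character
# probabilities, and a single CTC loss trains every layer at once.
import torch
import torch.nn as nn

num_chars = 29                      # illustrative alphabet size (blank + letters)
T, N, F = 100, 4, 40                # time steps, batch size, audio feature size

encoder = nn.LSTM(input_size=F, hidden_size=128, num_layers=2)
classifier = nn.Linear(128, num_chars)
ctc = nn.CTCLoss(blank=0)

audio = torch.randn(T, N, F)                    # stand-in for audio features
hidden, _ = encoder(audio)
log_probs = classifier(hidden).log_softmax(-1)  # shape (T, N, num_chars)

targets = torch.randint(1, num_chars, (N, 20))  # stand-in transcripts
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # no separate phoneme or word-composition modules to tune
```

There is no explicit phoneme detector or word composer; any intermediate representation the task needs has to emerge inside the network from that single loss.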

Mike Lewis et al. introduce an E2E learning approach for natural language negotiations [2]. The resulting system is a dialogue agent based on a single Neural Network able to negotiate in order to reach an agreement. This was done by training the NN on a large dataset of human-human negotiation records containing a variety of different negotiation tactics.

Figure from Mike Lewis et al. [2]

Another benefit of the E2E approach is that it is possible to design a model that performs well without deep knowledge about the problem, despite its complexity. Ronan Collobert et al. explain how a unified Neural Network architecture and an appropriate learning algorithm for Natural Language Processing (NLP) can be used to avoid task-specific engineering and the need for lots of prior knowledge [3]:

"[…] nosotros effort to excel on multiple benchmarks while avoiding chore-specific engineering. Instead we use a single learning system able to find acceptable internal representations. […] Our want to avoid task-specific engineered features prevented us from using a big body of linguistic cognition. Instead we reach good performance levels in most of the tasks by transferring intermediate representations discovered on large unlabeled data sets. Nosotros call this approach "almost from scratch" to emphasize the reduced (but still important) reliance on a priori NLP knowledge. "

Autonomous driving

Autonomous driving systems can be classified as a remarkable example of complex systems composed of many layers. Following the architecture proposed by Alexandru Serban et al., we can design an autonomous driving system using five different layers [4]:

Figure from Alexandru Serban et al. [4]

The input data comes from several sensors (cameras, LIDAR, radars, etc.) and is processed in the sensor fusion layer to extract the relevant features (e.g. object detection). With all the information processed and the relevant features extracted, a "world model" is created in the second layer. That model comprises the complete picture of the surrounding environment together with the vehicle's internal state.

From this model, the system must choose which decisions to make in the behavior layer. According to the vehicle's goals, it raises multiple behavior options based on the system policy and selects the best one by applying some optimization criterion.

With the decisions taken, the system determines, in the planning layer, the maneuvers the vehicle must execute to satisfy the chosen behavior and, finally, the control values are sent to the actuator interface modules in the vehicle control layer.
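To make the layered design tangible, here is a purely illustrative, runnable sketch of that flow; all names and the toy logic are assumptions, not Serban et al.'s interfaces:

```python
# Toy sketch of the five-layer flow described above (names and logic assumed).
from dataclasses import dataclass

@dataclass
class WorldModel:
    obstacles: list      # features extracted by the sensor fusion layer
    speed: float         # part of the vehicle's internal state

def sensor_fusion(camera, lidar) -> list:
    """Layer 1: merge raw sensor data and extract relevant features."""
    return [obj for obj in camera + lidar if obj["confidence"] > 0.5]

def world_modeling(obstacles, speed) -> WorldModel:
    """Layer 2: complete picture of the environment plus internal state."""
    return WorldModel(obstacles=obstacles, speed=speed)

def behavior(world: WorldModel) -> str:
    """Layer 3: raise behavior options and pick one by a simple criterion."""
    return "brake" if world.obstacles and world.speed > 10 else "keep_lane"

def planning(decision: str) -> dict:
    """Layer 4: turn the chosen behavior into target maneuver values."""
    return {"target_speed": 0.0} if decision == "brake" else {"target_speed": 30.0}

def vehicle_control(maneuver: dict) -> dict:
    """Layer 5: convert the planned maneuver into actuator commands."""
    return ({"throttle": 0.0, "brake": 1.0} if maneuver["target_speed"] == 0.0
            else {"throttle": 0.3, "brake": 0.0})

camera, lidar = [{"confidence": 0.9}], [{"confidence": 0.4}]
world = world_modeling(sensor_fusion(camera, lidar), speed=15.0)
print(vehicle_control(planning(behavior(world))))   # {'throttle': 0.0, 'brake': 1.0}
```

Each layer can be developed, tested, and replaced separately, which is exactly what a single E2E network gives up.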

Photo by Bram Van Oost on Unsplash

In the paper "End to End Learning for Self-Driving Cars", Mariusz Bojarski et al. propose an E2E system capable of controlling an autonomous car directly from the pixels provided by the embedded cameras [5]. The system was able to learn internal representations of intermediate steps, such as detecting useful road features, with only the human steering angle as the training signal. The use of Convolutional Neural Networks (CNNs) plays an important role in the proposed system because of their capacity to extract useful features from image data:

"The breakthrough of CNNs is that features are learned automatically from preparation examples. The CNN arroyo is especially powerful in paradigm recognition tasks considering the convolution operation captures the 2D nature of images."

The designed CNN goes beyond pattern recognition to learn the entire processing pipeline needed to steer a car. The network architecture consists of nine layers, including a normalization layer, five convolutional layers, and three fully connected layers. The system was trained using real driving data recorded in central New Jersey, Illinois, Michigan, Pennsylvania, and New York. The following figure shows the block diagram of the training system design:

Figure from Mariusz Bojarski et al. [5]
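For readers who think in code, here is a hedged sketch of a network in the spirit of the architecture just described (PyTorch is an assumption; layer sizes beyond what the text states are illustrative, and this is not the authors' implementation):

```python
# Sketch (PyTorch assumed) of a steering network with a normalization layer,
# five convolutional layers, and fully connected layers ending in a single
# steering output. Illustrative only; not the authors' code.
import torch
import torch.nn as nn

class SteeringCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.BatchNorm2d(3),                        # stand-in normalization layer
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, kernel_size=3), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(100), nn.ReLU(),            # LazyLinear infers input size
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 10), nn.ReLU(),
            nn.Linear(10, 1),                         # predicted steering command
        )

    def forward(self, images):
        return self.regressor(self.features(images))

model = SteeringCNN()
frames = torch.randn(4, 3, 66, 200)                   # input resolution assumed
steering = model(frames)                              # one value per frame
loss = nn.functional.mse_loss(steering.squeeze(1), torch.zeros(4))
loss.backward()                                       # steering alone as the signal
```

The point of the sketch is the training setup: raw camera frames in, a steering value out, and nothing in between that is labeled or optimized separately.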

With approximately 72 hours of driving data, the system was able to learn how to steer the car on different road types and in different weather conditions:

"A pocket-size amount of training data from less than a hundred hours of driving was sufficient to train the automobile to operate in diverse atmospheric condition, on highways, local and residential roads in sunny, cloudy, and rainy atmospheric condition. The CNN is able to learn meaningful road features from a very sparse preparation point (steering alone). The system learns for instance to notice the outline of a road without the need of explicit labels during grooming."

Limitations of E2E

If using a single DNN between input and output works for the examples above, why not use it as a general approach for solving every Machine Learning problem?

There are many reasons that make E2E an infeasible option in different cases:

  • A huge amount of data is necessary: The incorporation of some prior knowledge into the training is considered a key element that allows an increase in performance in many applications. Since E2E learning does not integrate this prior knowledge, more training examples must be provided.
  • Difficult to improve or modify the system: If some structural change must be applied (e.g. increasing the input dimensions by adding more features), the old model is of no use and the whole DNN has to be replaced and trained all over again.
  • Highly efficient available modules cannot be used: Many techniques are efficient at solving specific tasks. For example, state-of-the-art object recognition modules are widely available, but as soon as one is integrated into an E2E system, the system cannot be considered E2E anymore.
  • Difficult to validate: If a high level of validation is necessary, E2E may become infeasible. Due to the complex architecture, the potential number of input/output pairs can be large enough to make validation impossible. This is particularly important for sectors like the automotive industry.

On top of these issues, E2E may not work at all for some applications, as shown in [1]:

"We have demonstrated that finish-to-terminate learning tin be very inefficient for grooming neural network models composed of multiple non-trivial modules. End-to-end learning tin can even suspension down entirely; in the worst case none of the modules manages to learn. In contrast, each module is able to learn if the other modules are already trained and their weights frozen. This suggests that grooming of complex learning machines should proceed in a structured fashion, training elementary modules offset and contained of the rest of the network. "

Conclusion

End-to-end is indisputably a great tool for solving elaborate tasks. The idea of using a single model that specializes in predicting the outputs directly from the inputs allows the development of otherwise extremely complex systems that can be considered state-of-the-art. However, every enhancement comes with a cost: while celebrated in the academic field, the industry is still reluctant to use E2E to solve its problems due to the need for a large amount of training data and the difficulty of validation.

References

[1] Glasmachers, Tobias. "Limits of end-to-end learning." arXiv preprint arXiv:1704.08305 (2017).

[2] Lewis, Mike, et al. "Deal or no deal? End-to-end learning for negotiation dialogues." arXiv preprint arXiv:1706.05125 (2017).

[3] Collobert, Ronan, et al. "Natural language processing (almost) from scratch." Journal of Machine Learning Research 12 (2011): 2493–2537.

[4] Serban, Alexandru Constantin, Erik Poll, and Joost Visser. "A Standard Driven Software Architecture for Fully Autonomous Vehicles." 2018 IEEE International Conference on Software Architecture Companion (ICSA-C). IEEE, 2018.

[5] Bojarski, Mariusz, et al. "End to end learning for self-driving cars." arXiv preprint arXiv:1604.07316 (2016).

Source: https://towardsdatascience.com/e2e-the-every-purpose-ml-method-5d4f20dafee4
