Page 52 - Proceedings of the 2018 ITU Kaleidoscope
2018 ITU Kaleidoscope Academic Conference
management areas, can improve the resilience of the system. A demonstrator evaluating the presented concepts is introduced in section 7, and in section 8 we present our conclusion and discuss future work.

2 ROBUSTNESS AND RESILIENCY IN MOBILE NETWORKS

Resiliency is the capability of a system to recover to a stable, functioning state after failure or adverse events [3]. It is not the same as robustness. A robust system is strongly designed to withstand any foreseen problems or failures, but may be too rigid and fail to survive and adapt in case of unforeseen circumstances, which are inevitably bound to happen in complex systems. For example, a farmer may prepare his crop against fire, flooding and local pests, but the crop can still be destroyed by a foreign plant virus introduced into the environment. Paradoxically, a very robust system can be more susceptible to failure due to its increased rigidity and complexity [3]. Modern telephone networks are often said to be (together with electric power grids) among the largest and most complex human-created systems, and their distributed nature makes them even more complex to manage and predict. Therefore, simple robust design principles (redundancy etc.) are not sufficient to ensure the ultra-reliable, highly available network performance required for many critical future use cases, for example remote surgery.

Resilient systems, on the other hand, follow principles that allow them to recover even in case of completely unforeseen disastrous events. For example, by diversifying the crop, a farmer can ensure that a new plant virus will not be able to wipe out all the production. Typically, resilient systems follow a number of design principles [3]:

Monitoring and adaptation: Resilient systems must be responsive to change, and for that they need to monitor the system and detect changes early. An automatic anomaly detection system can profile and learn normal behavior at runtime and detect deviations from it, giving an early warning even in case of unforeseen circumstances. If connected to a diagnosis function, it can also trigger automatic corrective or mitigating actions. SON self-healing functions based on anomaly detection and diagnosis are discussed in the following sections.

Redundancy, decoupling and modularity: In addition to duplicating capacity for redundancy, resilient systems often have a decoupled and decentralized structure. In 5G RAN, one such approach is RAN multi-connectivity. It is often utilized to increase the throughput, but can also be used to exploit the inherent macro-diversity effect of multiple simultaneous connections, such that the probability that at least one connection is sufficiently strong is increased [6].

Focusing: When changes are detected, resilient systems may focus on the problematic area to respond to a problem or a change. In network management, excess resources can be deployed where unexpected events are detected. Especially in telco cloud deployments, more fault management functions can be placed in the more problematic areas.

Diverse at the edge, but simple at the core: Resilient systems may be very complex, but share simple common properties. The Internet, for example, consists of a very large number of very diverse services, some of them extremely complex, but they all communicate and interface through a simple set of shared protocols. In future mobile networks, a common data plane can provide such simplicity at the core. The data providers and consumers may operate in vastly different scopes and time durations, but are able to communicate with each other using Service-Oriented Architecture (SOA) principles in a cloud-native architecture utilizing a common data sharing bus.

3 SELF-HEALING IN MOBILE NETWORKS

The simplest self-healing solutions are rule-based systems, where specified automated corrective workflows are triggered when given trigger conditions are fulfilled. Such systems, however, can work reliably only on anticipated problems and typically fail to perform well in completely unforeseen circumstances. Furthermore, the creation and maintenance of the rule base is expensive and laborious. It may even make the system more rigid and thus less resilient.

The rules that determine which corrective action to trigger could be learned using machine learning, framed as a classification problem. Each state is classified either as normal or as a degraded state connected to one of the corrective workflows. However, since the anomalous states are, by definition, rare, the detection model is learned on a skewed training dataset. It may also fail to recognize new, unforeseen problematic states. Another problem is the availability of such labelled training datasets.

Therefore, self-healing functions are often implemented as a four-stage process: profiling the normal states of the system, detecting deviations from the normal (anomalies), diagnosis and acting. The advantage of learning the normal behavior is that any deviations from it, even unforeseen ones, can be detected. On the other hand, not all deviations are degradations, and so a diagnosis function is required to diagnose the detected anomalies and connect them to possible corrective actions. Additionally, to adapt to trend and seasonal changes in the normal network behavior, for example to the evolution in the network traffic characteristics, the profiles for the normal states need to be continuously updated.

In Radio Access Networks (RANs), resources are typically more scarce and it is often not possible to achieve the desired level of resilience simply by overprovisioning resources. The available spectrum, for example, is limited and cannot be extended. Therefore, in addition to methods like multi-connectivity [5], self-healing solutions can be especially important in RAN to enable the required level of reliability.
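
As an illustration only, the four-stage process described in this section (profiling, anomaly detection, diagnosis, acting) might be sketched as follows. This is not the implementation evaluated in this paper: the exponentially weighted profile, the z-score threshold of 3, the KPI names and the workflow names are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Optional


@dataclass
class Profile:
    """Stage 1: running profile of one KPI's normal behavior (assumed model)."""
    mean: float
    var: float
    alpha: float = 0.05  # forgetting factor; an assumed value

    def zscore(self, x: float) -> float:
        """Distance of a sample from the profile, in standard deviations."""
        return (x - self.mean) / sqrt(self.var) if self.var > 0 else 0.0

    def update(self, x: float) -> None:
        """Exponentially weighted mean and variance, so that the profile
        follows trend and seasonal changes in normal behavior."""
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)


# Stage 3 lookup: hypothetical mapping from an anomaly signature to a
# corrective workflow. Signatures missing from the table are deviations
# that are not known degradations.
WORKFLOWS = {
    ("drop_rate", "high"): "restart_cell",
    ("throughput", "low"): "rebalance_load",
}


def diagnose(kpi: str, z: float) -> Optional[str]:
    """Stage 3: connect a detected anomaly to a corrective workflow, if any."""
    direction = "high" if z > 0 else "low"
    return WORKFLOWS.get((kpi, direction))


def self_heal_step(kpi: str, profile: Profile, x: float,
                   threshold: float = 3.0) -> Optional[str]:
    """Process one KPI sample and return the workflow to trigger, or None."""
    z = profile.zscore(x)
    if abs(z) > threshold:       # Stage 2: deviation from the normal profile
        return diagnose(kpi, z)  # Stage 4 (acting) would execute this workflow
    profile.update(x)            # only normal samples refresh the profile
    return None


profile = Profile(mean=0.01, var=1e-6)
for sample in (0.011, 0.05, -0.05):
    print(sample, self_heal_step("drop_rate", profile, sample))
```

In this sketch an anomalous sample is kept out of the profile so that a persistent fault is not learned as normal behavior. A real system would also need a policy for accepting genuine long-term shifts, which is outside the scope of the example.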