Advanced Anomaly Detection capabilities and components of the CyberSANE System

Cyberattacks, like anything else we can think of, can be described in two ways. The more evident one is to define a cyberattack by a list of its attributes, saying for example that “it affects TCP ports 139 and 445” or that “it can be performed by an uncredentialed attacker”. Classical pattern-based Intrusion Detection Systems (IDS) work on this basis, using a set of standardized and well-defined attributes to detect cyberattacks and either stop them or alert the security analyst.
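As a rough illustration of attribute-based detection, the sketch below matches network events against a list of signatures. The `Event` and `Signature` classes, their fields and the rule name are hypothetical and chosen only to mirror the two example attributes quoted above; real IDS rule languages are far richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A simplified network event (hypothetical schema)."""
    protocol: str
    dst_port: int
    authenticated: bool


@dataclass(frozen=True)
class Signature:
    """An attack described by a fixed list of attributes."""
    name: str
    protocol: str
    dst_ports: frozenset
    uncredentialed_only: bool

    def matches(self, event: Event) -> bool:
        if event.protocol != self.protocol or event.dst_port not in self.dst_ports:
            return False
        # If the rule targets uncredentialed attackers, authenticated events are ignored.
        return not (self.uncredentialed_only and event.authenticated)


# Illustrative rule built from the attributes quoted in the text.
SIGNATURES = [
    Signature("smb-uncredentialed", "tcp", frozenset({139, 445}), True),
]


def match_signatures(event, signatures=SIGNATURES):
    """Return the names of all signatures that the event satisfies."""
    return [sig.name for sig in signatures if sig.matches(event)]
```

An unauthenticated TCP event on port 445 matches the rule, whereas the same event on port 443 does not. Anything that is not described by a rule goes unnoticed, which is the limitation the rest of this article addresses.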

Alternatively, definitions can be constructed in a negative way, following the medieval apophatic logic or ‘via negationis’: in other words, by saying what the defined element is not. Given a good definition of what legitimate actions in our network should look like, we can assume that actions that deviate from this definition are part of a cyberattack.

Following this second approach, an IDS can trigger an alert when abnormal behaviour is detected, which gives it the ability to detect previously unseen cyberattacks simply because they include an unexpected chain of actions. This kind of detection mechanism, called anomaly detection, has been thoroughly studied and applied in industry, and is now combined with modern techniques such as deep learning or adversarial machine learning.

Anomaly detection is one of the key approaches used in HybridNet, the component of the CyberSANE Platform in charge of security analysis. Instead of relying only on a value exceeding a defined numerical threshold or on the identification of actions outside a static whitelist, as many classical anomaly detection methods do, the anomaly detection engine in HybridNet is based on advanced techniques such as machine learning, on automatic profiling of normal network behaviour, and on adaptation to future changes.
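The difference between a static threshold and a learned, adaptive profile can be sketched in a few lines. The example below is not HybridNet's algorithm: it uses an exponentially weighted moving average as a simple stand-in for profiling, and the class name, parameters and default values are assumptions made for illustration.

```python
def static_threshold_alert(value, limit):
    """Classical check: alert only when a fixed limit is exceeded."""
    return value > limit


class AdaptiveBaseline:
    """Learns a running mean and variance and flags strong deviations.

    alpha  -- how quickly the profile follows new data (assumed value)
    k      -- number of standard deviations tolerated (assumed value)
    warmup -- observations required before any alert is raised
    """

    def __init__(self, alpha=0.1, k=3.0, warmup=5):
        self.alpha = alpha
        self.k = k
        self.warmup = warmup
        self.mean = None
        self.var = 0.0
        self.seen = 0

    def update(self, value):
        """Return True if value is anomalous; otherwise absorb it into the profile."""
        if self.mean is None:
            self.mean = float(value)
            self.seen = 1
            return False
        deviation = value - self.mean
        std = max(self.var ** 0.5, 1e-9)
        is_anomaly = self.seen >= self.warmup and abs(deviation) > self.k * std
        if not is_anomaly:
            # Exponentially weighted update of mean and variance.
            self.mean += self.alpha * deviation
            self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
            self.seen += 1
        return is_anomaly
```

With a profile learned from values around 100, an observation of 500 is flagged even though it would pass a static limit of 1000. In this sketch, flagged values are not absorbed into the profile, so it follows gradual drift but not abrupt regime changes; that is a design choice of the example, not a statement about HybridNet.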

Three components of HybridNet work on anomaly detection: L-ADS (Live Anomaly Detection System), CARMEN and SiVi. Each of them applies its own set of learning algorithms to the data, with the shared objective of presenting valuable alert information to the security analyst while avoiding false positives.

The first tool, L-ADS, uses a predictive strategy based on unsupervised machine learning methods. L-ADS builds a model of users and applications, establishing behaviour patterns that are later used to analyse and identify anomalies. The analysis is done within the protected network, where L-ADS can monitor where all data is sent and where it comes from, forming a global image of the network state. This image is then compared with incoming data to perform attack detection.
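The idea of building a profile from unlabelled traffic and comparing new data against it can be illustrated with a deliberately simple frequency model. This is not the method used by L-ADS, whose algorithms are not detailed here; the `FlowProfile` class, the `(source, destination, port)` flow format and the `min_support` parameter are all assumptions.

```python
from collections import Counter, defaultdict


class FlowProfile:
    """Unsupervised profile of which destinations each source normally contacts."""

    def __init__(self, min_support=0.01):
        # Flows seen in less than this fraction of a source's traffic count as rare.
        self.min_support = min_support
        self.counts = defaultdict(Counter)  # source -> Counter of (destination, port)

    def fit(self, flows):
        """Learn from unlabelled (source, destination, port) tuples."""
        for src, dst, port in flows:
            self.counts[src][(dst, port)] += 1
        return self

    def is_anomalous(self, src, dst, port):
        """Compare an incoming flow with the learned image of the network."""
        seen = self.counts.get(src)
        if not seen:
            return True  # a source that was never profiled
        total = sum(seen.values())
        return seen[(dst, port)] / total < self.min_support
```

A host that has only ever contacted an internal web server and a DNS resolver would be flagged the first time it opens a connection to an unfamiliar external address, without any rule describing that specific attack.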

For its part, CARMEN performs anomaly detection with two different engines. The first one applies a set of classification algorithms to connection data, whereas the second one applies time series analysis to aggregated numerical information from the network. These engines work separately but communicate with each other and with the rest of the modules in HybridNet, contributing their perspective to the overall picture of the state of the network.

The first of CARMEN’s engines takes as input a continuous flow of the URLs and domains to which users within the network connect. A first extractor module pre-processes the input URLs and domains to extract a set of attributes and calculate new metrics from them. Attributes are either inherent or external: inherent attributes are extracted directly from the collected URLs or domains, while external attributes relate to the metrics of other activity registered by CARMEN. For instance, inherent attributes would be the ratio between numbers or the presence of special characters. Examples of external attributes would be the number of requests made during the last month or the amount of data sent or received.
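A minimal sketch of such an extractor is shown below. The attribute names, the set of characters treated as special, and the shape of the request log are assumptions made for illustration; in particular, "ratio between numbers" is interpreted here as the share of digits in the URL text, which may differ from CARMEN's actual metric.

```python
from urllib.parse import urlsplit

# Characters counted as "special" in this sketch (an assumed choice).
SPECIAL_CHARS = set("-_~%@=&+")


def inherent_attributes(url):
    """Attributes computed from the URL or domain string alone."""
    # Bare domains have no scheme; prefix "//" so urlsplit sees a network location.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""
    text = host + parts.path + ("?" + parts.query if parts.query else "")
    length = len(text)
    digits = sum(c.isdigit() for c in text)
    return {
        "length": length,
        "digit_ratio": digits / length if length else 0.0,
        "special_chars": sum(c in SPECIAL_CHARS for c in text),
        "subdomain_depth": max(host.count(".") - 1, 0),
    }


def external_attributes(host, request_log):
    """Attributes derived from other recorded activity.

    request_log is a hypothetical list of (host, bytes_sent, bytes_received)
    tuples covering the period of interest, for example the last month.
    """
    rows = [row for row in request_log if row[0] == host]
    return {
        "requests": len(rows),
        "bytes_sent": sum(row[1] for row in rows),
        "bytes_received": sum(row[2] for row in rows),
    }
```

The two dictionaries can be merged into a single feature vector per URL before being passed to the models.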

All the collected metrics then feed a combination of statistical and machine learning models that determine, for each URL, the probability of its being related to an already known APT group. Using several combined models helps improve the accuracy of the analysis, with the results from the individual algorithms weighted and aggregated. Once the models have been trained on offline data, new incoming URLs can be analysed in real time using a sliding window approach. Models can be periodically retrained to adapt to new patterns of APT campaigns as they are discovered by security analysts. Apart from its direct function in HybridNet as an anomaly-based threat detector, this engine can be applied to threat hunting, as the analyst can use it offline, feeding it a list of suspicious domains and analysing the results.
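The weighting and aggregation of several model outputs can be sketched as a weighted average. The models, weights and window size below are placeholders, not CARMEN's trained models, and keeping the most recent scores in a fixed-size buffer is only one possible reading of the sliding window mentioned above.

```python
from collections import deque


def aggregate(scores, weights):
    """Weighted average of per-model probabilities (both dicts keyed by model name)."""
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total


class SlidingWindowScorer:
    """Scores incoming feature vectors and remembers only the latest results."""

    def __init__(self, models, weights, window=100):
        self.models = models      # name -> callable(features) -> probability in [0, 1]
        self.weights = weights    # name -> relative weight of that model
        self.recent = deque(maxlen=window)  # older scores are dropped automatically

    def score(self, features):
        scores = {name: model(features) for name, model in self.models.items()}
        probability = aggregate(scores, self.weights)
        self.recent.append(probability)
        return probability

    def window_mean(self):
        """Average score over the current window, 0.0 if nothing was scored yet."""
        return sum(self.recent) / len(self.recent) if self.recent else 0.0
```

For example, a statistical model returning 0.2 with weight 1 and a machine learning model returning 0.8 with weight 3 aggregate to 0.65. Retraining would replace the callables in `models` without changing the scoring loop.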

The second engine in CARMEN is, as mentioned, based on time series analysis. The power of this engine is that it can analyse any kind of time series, as long as its parameters are adjusted to the characteristics of the series. This allows the engine to be adapted to the information coming from different elements in HybridNet. Moreover, data can be creatively aggregated to form a heterogeneous time series. Combining data from different sources, such as log activity and network traffic, can bring the detection capabilities of the CyberSANE platform closer to a holistic perspective in which the whole network and its assets are considered as a whole. Under this perspective, any event deviating from normal behaviour would be considered harmful to the system.
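One simple way to see why combining sources helps is to standardize each series and add up the deviations per time slot. The sketch below does exactly that; the scoring rule (sum of absolute z-scores), the function names and the threshold are assumptions for illustration and are not taken from CARMEN.

```python
from statistics import mean, pstdev


def zscores(series):
    """Standardize a series; a constant series carries no deviation at all."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)
    return [(x - mu) / sigma for x in series]


def combined_anomalies(sources, threshold=3.0):
    """Flag time slots whose summed deviation across all sources is too high.

    sources maps a source name (e.g. "failed_logins") to a list of values,
    one per time slot; all lists must have the same length.
    """
    lengths = {len(series) for series in sources.values()}
    if len(lengths) != 1:
        raise ValueError("all sources must cover the same time slots")
    standardized = [zscores(series) for series in sources.values()]
    flagged = []
    for slot in range(lengths.pop()):
        score = sum(abs(z[slot]) for z in standardized)
        if score > threshold:
            flagged.append(slot)
    return flagged
```

With two five-slot series `[1, 1, 1, 1, 11]` and `[0, 0, 0, 0, 5]`, neither series alone exceeds the threshold in the last slot (each deviates by two standard deviations), but the combined score of 4 does, so the slot is flagged only when both sources are considered together.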

The architecture of CARMEN’s second engine is very similar to that of the first, with an initial extraction phase followed by the application of the trained models. Apart from collecting data, the extractor also performs temporal aggregation according to the intervals configured in the engine. Aggregating a time series into time slots can produce a more stable signal, although care must be taken not to miss short-lived anomalies that could affect the final system. The engine has been extensively tested on failed login attempt events and on other firewall alerts.
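The trade-off between a stable signal and short-lived anomalies can be made concrete with a small aggregation helper. The function name, the `(timestamp, value)` event format and the slot sizes are assumptions for this sketch.

```python
from collections import defaultdict


def aggregate_slots(events, slot_seconds, reducer=sum):
    """Group (timestamp_seconds, value) events into fixed-width time slots.

    Returns a list of (slot_start, aggregated_value) pairs sorted by time.
    Slots without any event are omitted rather than reported as zero.
    """
    buckets = defaultdict(list)
    for timestamp, value in events:
        slot_start = int(timestamp // slot_seconds) * slot_seconds
        buckets[slot_start].append(value)
    return [(start, reducer(values)) for start, values in sorted(buckets.items())]
```

Consider failed login counts `[(0, 1), (30, 1), (65, 20), (119, 1), (130, 1)]`. With 60-second slots the burst of 20 stands out as a slot total of 21 against totals of 2 and 1. With a single 180-second slot and an averaging reducer, the same burst is diluted to a mean of 4.8, while passing `reducer=max` preserves the peak of 20. Choosing the interval and the reducer together is therefore part of tuning the engine to the series.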

Finally, SiVi is the anomaly detection engine in HybridNet that is closest to the human analyst. It is a visual system capable of monitoring and detecting cyberattacks in real time, reporting results through a set of visualization graphs that provide a quick and accurate overview of the current network state. Its operation is also based on machine learning algorithms, which perform detection on real-time data and feed the results into a visual analytics component, the key element in SiVi. This component applies different methods to the detection results to show the security state of the network to the security analyst through a set of interactive visualization elements, such as graphs, activity gauges and dependency wheels. Having this information in real time allows the analyst to respond quickly to detected anomalies and to take adequate actions to mitigate their effects.

All these engines together constitute the core of advanced anomaly detection in HybridNet, defining a global strategy for the CyberSANE Platform: being able to detect the unknown, that is, the set of cyberattacks that, while perhaps having certain aspects in common with known ones, have never been seen by security analysts and, as such, cannot be detected by rules describing their attributes.

Defining something using only the attributes that it does not have is as old as human thinking, just as anomaly detection is as old as cybersecurity as a discipline, starting with the pioneering work of Dorothy E. Denning in the 1980s. In the last two decades, anomaly detection has been enriched by new developments in machine learning and data analysis, making the detection of the non-self more reliable, to the extent of becoming a practical reality in working systems. CyberSANE partners have focused their research effort on the development of new advanced anomaly detection methods, integrated into the HybridNet component and incorporating innovative ways of profiling the normal behaviour of a network and detecting the actions that deviate from it.