Web crawlers, data mining and extraction of knowledge from news articles and the Dark Web

Thousands of news articles are written and published every day by news publishers across the world. These articles cover many topics, but most of them describe situations and events that are currently unfolding.

Some of these events relate to cyber-attacks, malicious activities and breached data. They are of interest to information security analysts because they can give the participating CyberSANE organisations additional insights into cyber incidents, attacks or breached datasets. However, this information is hidden like a needle in a haystack, and the articles are written in many different languages.

Our goal in the project is to develop a system for searching, extracting and analysing this information. We have developed tools that monitor and collect articles from more than 75,000 news sources, including not only established news media but also blogs. Our tools collect and automatically analyse articles in more than 40 languages.

When analysing articles, we try to extract events. We define an event as any significant happening in the world that was reported by a sufficient number of news publishers. From each group of articles identified as belonging to one event, we extract the available event information: time, location, entities (people and organisations) and category. Because we use semantic processing technologies, we can analyse articles in different languages.
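The event definition above can be illustrated with a minimal Python sketch. It is not the project's implementation: it assumes that an upstream step has already grouped articles and assigned each one a `cluster_id`, and the `Article` fields, the `extract_events` function and the `min_publishers` threshold are all hypothetical names chosen for illustration. A group becomes an event only when enough distinct publishers reported it, and the event information is then aggregated from the group.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Article:
    publisher: str
    published: datetime
    cluster_id: str  # hypothetical: assigned by an upstream clustering step
    location: str
    category: str
    entities: set = field(default_factory=set)  # people and organisations


def most_common(values):
    """Return the most frequent value in an iterable."""
    return Counter(values).most_common(1)[0][0]


def extract_events(articles, min_publishers=3):
    """Keep article groups reported by at least `min_publishers` distinct
    publishers and aggregate time, location, entities and category."""
    groups = defaultdict(list)
    for article in articles:
        groups[article.cluster_id].append(article)

    events = []
    for cluster_id, group in groups.items():
        publishers = {a.publisher for a in group}
        if len(publishers) < min_publishers:
            continue  # not reported widely enough to count as an event
        events.append({
            "cluster_id": cluster_id,
            "time": min(a.published for a in group),  # earliest report
            "location": most_common(a.location for a in group),
            "category": most_common(a.category for a in group),
            "entities": sorted(set().union(*(a.entities for a in group))),
            "publisher_count": len(publishers),
        })
    return events
```

The threshold of three publishers is an arbitrary default for the sketch; the text only says "a sufficient number".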

Collecting and processing these data allows us to perform different analyses and visualisations, which can help information security analysts search for information about cybersecurity incidents and cybersecurity-related reports in news articles and blog posts published on the internet.

Our work also focuses on the Dark Web, the part of the internet that is not visible to search engines and requires special software to access. While not everything on the Dark Web is illegal, it is in many cases a gathering place for both conventional crime and cybercrime. This motivated us to develop tools for crawling and analysing Dark Web content.

We have developed tools for crawling Dark Web content, which allow us to collect data about cybersecurity events (e.g. hacks, SQL injection, DDoS attacks), leaked data (e.g. pwned email accounts and breached personal data such as job titles, names, phone numbers, physical addresses and social media profiles) and blacklisted email servers.

Crawling the Dark Web is not easy, because its content is usually not openly advertised. The collected data are processed with machine learning algorithms such as clustering and classification, and we also perform semantic annotation of the content. These data help information security analysts search for information about cybersecurity incidents on the Dark Web, and also allow us to notify analysts about possible data breaches or the preparation of illegal activities on the Dark Web.
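As an illustration of the classification step, here is a minimal multinomial Naive Bayes text classifier in pure Python. It is only a sketch of one common technique, not necessarily the algorithm used in the project; the class name, the `leak` and `attack` labels and the training sentences are invented toy examples.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> training documents
        self.vocab = set()

    def fit(self, samples):
        for text, label in samples:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example only.
TRAINING = [
    ("database dump with emails and passwords for sale", "leak"),
    ("leaked credentials and phone numbers", "leak"),
    ("ddos service for hire booter", "attack"),
    ("sql injection exploit targets login form", "attack"),
]

classifier = NaiveBayesClassifier().fit(TRAINING)
label = classifier.predict("fresh dump of emails and passwords")
```

A real pipeline would train on a much larger labelled corpus and would typically use an established machine learning library rather than a hand-written model.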

During the project, we have developed and implemented methods and algorithms for knowledge extraction, clustering and text classification of data from unstructured and structured sources. Our tools collect most global news stories in different languages in real time, semantically enrich them and process them automatically. We have also developed an API for collecting data from news articles, in order to identify and automatically process articles reporting on cyber-attacks, malicious activities, breached data and other types of cybersecurity incidents.

We have also developed APIs for collecting data from the Dark Web and for identifying pwned emails, breached data from personal accounts and blacklisted mail servers.
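To give an idea of what identifying pwned emails can involve, the following Python sketch scans a piece of collected text for email addresses and reports those that belong to a set of monitored domains. It does not reflect the actual CyberSANE APIs: the `find_exposed_emails` function, its parameters and the simplified email pattern are assumptions made for illustration.

```python
import re

# Simplified email pattern; an assumption that is adequate for a sketch,
# not a full RFC 5322 address parser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_exposed_emails(dump_text, monitored_domains):
    """Return the sorted, de-duplicated email addresses found in
    `dump_text` whose domain is one of `monitored_domains`."""
    domains = {d.lower() for d in monitored_domains}
    hits = set()
    for match in EMAIL_RE.findall(dump_text):
        email = match.lower()
        if email.rsplit("@", 1)[1] in domains:
            hits.add(email)
    return sorted(hits)


# Invented sample text in a "user:password" layout, for illustration only.
sample = "alice@example.org:hunter2\nbob@other.net:pass\nALICE@example.org:abc"
exposed = find_exposed_emails(sample, ["example.org"])
```

Matching on domain lets an organisation be notified only about its own accounts, which is one plausible way to turn raw leaked data into an alert for analysts.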

Integrating our tools into the CyberSANE platform will help us generate reports and analyses of global cybercrime activity and provide them to other CyberSANE components.

While the internet and the Dark Web contain a large and messy mix of information (structured, unstructured and in different languages), the purpose of our tools is to extract meaningful knowledge from it. This knowledge could be used to analyse cybersecurity incidents and to identify potential threats to critical infrastructures.