Clustering large amounts of email with MinHash: an open-source locality-sensitive hash
Information on this media
Number of views: 15
Creation date: July 4, 2023
Speakers: Nicolas Berveglieri
License: CC BY-SA v4
Description
Over the last decades, worldwide connectivity has increased exponentially, and email is one of its key indicators. In 2022, more than 340 billion emails were sent on average each day, an increase of about 5% over the previous year. Because the reach of email is so broad, it has increasingly been used in recent years to carry out a wide variety of cybersecurity attacks.
On the one hand, targeted attacks such as spear phishing or Business Email Compromise (BEC) can be disastrous for companies and are responsible for millions of dollars in losses each year. These attacks are usually fine-tuned to deceive the victim, and are thus very hard to detect automatically. Furthermore, they are very rare compared to other types of email attack (about 1 in 100,000 emails). On the other hand, spam and phishing campaigns are broad attacks that usually target large groups of email addresses. Campaign attacks typically consist of bulk emails that share a similar template and are sent en masse in the hope of hitting just a small fraction of their targets, prioritizing the quantity of attacks sent over their quality (about 80% of emails sent every day are spam).
For cybersecurity providers such as Vade, the challenge is to detect and block these campaigns as fast as possible. While the emails in a campaign used to be exactly the same and thus relatively easy to catch, attackers increasingly add noise and tricks to fool detection algorithms while preserving the visual appearance of the email. As a consequence, interest in the nearest neighbor problem has grown. The nearest neighbor problem (NNP) is an optimization problem that arises in many kinds of data-driven tools. In particular, detecting duplicate or near-duplicate documents is a critical application of the NNP.
A similarity search problem usually involves a large collection of objects, each characterized by a set of features and representable as a point in a high-dimensional attribute space. Given a query document, the task is to find its most similar documents in the database. Solving this exactly by comparing the query against every stored document has a cost that grows with the size of the collection, which remains infeasible in reasonable time for a large, continuous flow of email.
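To make the similarity search concrete, here is a minimal brute-force sketch. It assumes documents are compared by the Jaccard similarity of their word shingles, which is the quantity MinHash estimates; the abstract does not say which features the talk uses, so `shingles`, `jaccard`, `most_similar`, and the shingle size `k` are illustrative names and choices only.

```python
def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(query: str, corpus: list[str], k: int = 3) -> tuple[int, float]:
    """Brute-force scan: score the query against every document in the corpus."""
    q = shingles(query, k)
    scores = [jaccard(q, shingles(doc, k)) for doc in corpus]
    best = max(range(len(corpus)), key=scores.__getitem__)
    return best, scores[best]


corpus = [
    "your invoice is ready please click the link below",
    "meeting moved to friday at ten",
    "lottery winner claim your prize now",
]
best, score = most_similar("your invoice is ready please click the link now", corpus)
```

Every query touches every stored document, so the cost grows with the size of the collection. That is the cost locality-sensitive hashing is meant to avoid.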
In this presentation, we present a full pipeline for clustering emails received in a continuous flow, from the raw email to the clusters, using MinHash (https://en.wikipedia.org/wiki/MinHash), an open-source locality-sensitive hashing algorithm. The presentation is organized as follows:
- Explain how to extract key data from the email and remove the content added to fool the clustering algorithm.
- Explain normalization through open-source tools such as "https://www.npmjs.com/package/sanitize-html". This helps reduce the noise-to-information ratio in the email.
- Present locality-sensitive hashing through the open-source MinHash algorithm, which creates fingerprints that collide for similar emails.
- Present the "Bucketization" technique to cluster the fingerprints.
- Present results on real email data.
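The steps above can be sketched end to end as follows. This is a toy illustration under stated assumptions, not the pipeline presented in the talk: `normalize` is a crude regex stand-in for a real sanitizer such as sanitize-html, the MinHash signature uses seeded BLAKE2b hashes from Python's standard library, and "bucketization" is interpreted here as the classic LSH banding scheme. All function names and the `NUM_PERM` and `BANDS` values are hypothetical.

```python
import hashlib
import re
from collections import defaultdict

NUM_PERM = 64  # signature length (illustrative choice)
BANDS = 16     # number of LSH bands; each band covers NUM_PERM // BANDS rows


def normalize(body: str) -> str:
    """Crude stand-in for HTML sanitization (assumption, not sanitize-html)."""
    text = re.sub(r"<[^>]+>", " ", body)  # drop HTML tags
    text = re.sub(r"\d+", "0", text)      # collapse variable numbers
    return re.sub(r"\s+", " ", text).strip().lower()


def shingles(text: str, k: int = 3) -> set[str]:
    """Set of k-word shingles; short texts become a single shingle."""
    words = text.split()
    if len(words) < k:
        return {text} if text else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed: int, token: str) -> int:
    """Seeded 64-bit hash that is stable across runs, unlike built-in hash()."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(tokens: set[str], num_perm: int = NUM_PERM) -> tuple[int, ...]:
    """Signature: per seed, the minimum hash over all tokens (0 if empty)."""
    return tuple(
        min((seeded_hash(seed, t) for t in tokens), default=0)
        for seed in range(num_perm)
    )


def bucket_keys(signature: tuple[int, ...], bands: int = BANDS) -> list[tuple]:
    """Split the signature into bands; each (band index, rows) is a bucket key."""
    rows = len(signature) // bands
    return [(b, signature[b * rows:(b + 1) * rows]) for b in range(bands)]


def cluster(emails: list[str]) -> list[set[int]]:
    """Merge emails that share at least one bucket, using union-find."""
    parent = list(range(len(emails)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    buckets: dict[tuple, int] = {}
    for idx, body in enumerate(emails):
        signature = minhash(shingles(normalize(body)))
        for key in bucket_keys(signature):
            if key in buckets:
                parent[find(idx)] = find(buckets[key])
            else:
                buckets[key] = idx

    groups: dict[int, set[int]] = defaultdict(set)
    for i in range(len(emails)):
        groups[find(i)].add(i)
    return list(groups.values())


emails = [
    "<p>Your invoice 4821 is ready, click here</p>",
    "<div>Your  invoice 7730 is ready, click here</div>",
    "Team lunch is moved to Friday noon",
]
clusters = cluster(emails)
```

In this example the first two emails differ only in markup and an order number, so they normalize to the same text and land in one cluster, while the third stays alone. With real data, the number of bands and rows per band controls how similar two emails must be before they are likely to share a bucket.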
PhD from Université de Lille, INRIA (French) and MODO (Japanese) labs, specialized in large-scale optimization assisted by machine learning tools.
Now working at Vade as a research engineer.