Identification and Anonymization of Named Entities in Unstructured Information Sources for Use in Social Engineering Detection

2026-04-10 | Machine Learning

Machine Learning, Artificial Intelligence
AI summary

The authors developed a system to collect and analyze data from the Telegram app while following privacy laws like GDPR. They turned audio messages into text with speech-to-text models, of which Parakeet performed best. Then, they used different methods, including Microsoft's Presidio and transformer-based AI models, to find and hide sensitive information in the data. They also created metrics to check that the data keeps its structure after personal details are hidden. This helps cybercrime research stay both useful and legal.

Telegram, GDPR, speech-to-text, Named Entity Recognition, Parakeet, Microsoft Presidio, transformer models, data anonymization, cybersecurity, audio transcription
Authors
Carlos Jimeno Miguel, Raul Orduna, Francesco Zola
Abstract
This study addresses the challenge of creating datasets for cybercrime analysis while complying with regulations such as the General Data Protection Regulation (GDPR) and Organic Law 10/1995 of the Penal Code. To this end, it proposes a system for collecting text, audio, and images from the Telegram platform; implements speech-to-text transcription models that incorporate signal enhancement techniques; and evaluates several Named Entity Recognition (NER) solutions, including Microsoft Presidio and AI models built on a transformer-based architecture. Experimental results indicate that Parakeet achieves the best performance in audio transcription, while the proposed NER solutions achieve the highest F1-score values in detecting sensitive information. In addition, anonymization metrics are presented for evaluating how well the structural coherence of the data is preserved, while simultaneously guaranteeing the protection of personal information and supporting cybersecurity research within the current legal framework.
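
As an illustration of the two ideas the abstract combines, the following minimal sketch replaces detected entity spans with typed placeholders and scores a detector with an entity-level F1. It is not the authors' implementation: the `Entity` type, the label names, the placeholder format, and the exact-match scoring rule are all assumptions made for this example, and the sample sentence is invented.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # hypothetical label set, e.g. "PERSON", "PHONE"


def anonymize(text: str, entities: list[Entity]) -> str:
    """Replace each span with a typed placeholder such as <PERSON>.

    Assumes the spans do not overlap.
    """
    out = text
    # Work from right to left so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        out = out[:ent.start] + f"<{ent.label}>" + out[ent.end:]
    return out


def f1_score(gold: set[Entity], predicted: set[Entity]) -> float:
    """Entity-level F1, counting a hit only on exact span and label match."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented example: two annotated entities, one of which the detector finds.
text = "Call Ana at 555-0100 tomorrow."
gold = {Entity(5, 8, "PERSON"), Entity(12, 20, "PHONE")}
predicted = {Entity(5, 8, "PERSON")}

print(anonymize(text, list(gold)))
print(round(f1_score(gold, predicted), 3))
```

Typed placeholders keep the sentence readable after redaction, which is one simple way to preserve structure; the paper's own coherence metrics may be defined differently.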