Integrating Advanced OCR and NLP Techniques for Enhanced Text Extraction and Image Plagiarism Detection
Abstract
This study addresses the misuse and impersonation of digital content, covering both text and images. It presents an approach to detecting image misuse that first applies optical character recognition (OCR) to extract any text present in an image. The extracted text is then analyzed for originality using advanced Natural Language Processing (NLP) techniques, in particular Transformer-based models such as BERT. Comparing the extracted text against large-scale databases improves the detection of potential misuse. In addition, the study investigates how the Attentional Generative Adversarial Network (AttnGAN) renders textual descriptions as images, expanding our understanding of text-to-image generation. The results indicate that combining OCR with NLP improves accuracy in identifying image misuse, while BERT provides deeper insight into content originality. Furthermore, AttnGAN demonstrated the ability to generate high-quality images from text input efficiently, which advances the understanding of digital content creation and originality. This work introduces an approach to content detection that combines OCR, NLP, and image generation for detected content, and it promotes conscientious sharing practices in academia, law, and authorship.
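The comparison step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: it assumes the OCR stage has already produced a plain text string, and it substitutes bag-of-words cosine similarity for BERT embeddings so that the example runs with the Python standard library alone. The names `find_matches` and `reference_texts`, and the 0.8 threshold, are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the token-count vectors of two texts.

    Stand-in for comparing BERT embeddings; returns 0.0 if either text
    contains no tokens.
    """
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_matches(extracted_text, reference_texts, threshold=0.8):
    """Return (reference_id, score) pairs at or above the threshold.

    extracted_text: text assumed to come from an OCR step.
    reference_texts: dict mapping a reference id to its text
                     (hypothetical stand-in for a large database).
    Results are sorted by descending similarity.
    """
    scored = [
        (ref_id, cosine_similarity(extracted_text, ref_text))
        for ref_id, ref_text in reference_texts.items()
    ]
    matches = [(ref_id, score) for ref_id, score in scored if score >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    references = {
        "doc1": "The quick brown fox jumps over the lazy dog",
        "doc2": "Transformers changed natural language processing",
    }
    ocr_output = "the quick brown fox jumps over the lazy dog!"
    for ref_id, score in find_matches(ocr_output, references):
        print(ref_id, round(score, 3))
```

In a fuller pipeline, `cosine_similarity` would operate on sentence embeddings from a Transformer model rather than on token counts, which would let it flag paraphrased text as well as near-verbatim copies.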