Researchers at Seoul National University of Science and Technology have developed PV2DOC, a revolutionary tool to transform presentation videos into summarized, structured documents, enhancing accessibility and efficiency.
Researchers at Seoul National University of Science and Technology led by Hyuk-Yoon Kwon, an associate professor in the Department of Industrial & Information Systems Engineering, have announced a pioneering tool that could revolutionize how we consume and manage presentation-style video content. Named PV2DOC, this innovative software converts lengthy presentation videos into concise, structured documents, enabling users to access and comprehend critical information more efficiently.
Presentation videos combining slides, graphics and spoken explanations have surged in popularity, especially during the COVID-19 pandemic. While engaging, these videos are often cumbersome, requiring viewers to sit through entire recordings to glean specific details and occupying significant storage space.
PV2DOC addresses these pain points by transforming video data into organized PDFs that consolidate both audio and visual elements. Unlike existing summarizers, which require a transcript and become ineffective without one, PV2DOC extracts and merges information directly from the video itself.
“For users who need to watch and study numerous videos, such as lectures or conference presentations, PV2DOC generates summarized reports that can be read within two minutes,” Kwon said in a news release. “Additionally, PV2DOC manages figures and tables separately, connecting them to the summarized content so users can refer to them when needed.”
PV2DOC operates through a multi-step process involving advanced image and audio processing techniques. The tool captures video frames at one-second intervals and identifies unique visuals using the structural similarity index measure (SSIM). It then applies two object detection models, Mask R-CNN and YOLOv5, to recognize figures, tables and other key elements. Any fragmented images are combined using a figure-merge technique.
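The frame-deduplication step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PV2DOC's implementation: it computes a single global SSIM score per frame pair (production SSIM is usually computed over sliding windows, e.g. via scikit-image), and the 0.95 similarity threshold is an assumed value, not one reported by the researchers.

```python
import numpy as np

def ssim(x: np.ndarray, y: np.ndarray,
         c1: float = 6.5025, c2: float = 58.5225) -> float:
    """Global structural similarity between two grayscale frames.

    c1 and c2 are the standard stabilizing constants
    (K1=0.01, K2=0.03 with an 8-bit dynamic range of 255).
    """
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def unique_frames(frames, threshold: float = 0.95):
    """Keep a frame only when it differs enough from the last kept frame."""
    kept = []
    for frame in frames:
        if not kept or ssim(kept[-1], frame) < threshold:
            kept.append(frame)
    return kept
```

In a real pipeline the `frames` sequence would come from sampling the video once per second (e.g. with OpenCV's `cv2.VideoCapture`), and only the surviving unique frames would be passed on to the object-detection stage.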
For text extraction, the software leverages Google’s Tesseract engine for optical character recognition (OCR), organizing the extracted text into structured formats with headings and paragraphs. Simultaneously, audio content is transcribed using the Whisper model, an open-source speech-to-text tool. The transcribed text is then summarized using the TextRank algorithm.
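The TextRank stage can be illustrated with a small self-contained sketch. This is a simplified rendering of the algorithm (Mihalcea and Tarau's word-overlap sentence similarity plus the weighted PageRank update), not PV2DOC's own code; the naive sentence splitter, damping factor and iteration count are illustrative assumptions.

```python
import math
import re
from itertools import combinations

def textrank_summary(text: str, n_sentences: int = 2,
                     damping: float = 0.85, iterations: int = 50) -> str:
    """Rank sentences with a simplified TextRank and return the top ones."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]

    # Edge weight: word overlap normalized by sentence lengths.
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        overlap = len(words[i] & words[j])
        if overlap:
            w = overlap / (math.log(len(words[i]) + 1) +
                           math.log(len(words[j]) + 1))
            weights[i][j] = weights[j][i] = w

    # Power iteration of the weighted PageRank update.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / sum(weights[j]) * scores[j]
                for j in range(n) if weights[j][i] and sum(weights[j])
            )
            for i in range(n)
        ]

    # Report the top-ranked sentences in their original order.
    top = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences])
    return " ".join(sentences[i] for i in top)
```

In PV2DOC this step would run on the Whisper transcript, so the summary reflects what the presenter actually said rather than only what appears on the slides.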
The result is a Markdown document convertible into a PDF, presenting the video’s information in a clear, accessible manner that aligns with the video’s original structure.
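The final assembly step might look like the following hypothetical sketch: gathering the transcript summary, the structured slide text and the extracted figure references into one Markdown string, which a standard converter such as Pandoc could then render to PDF. The function name and section layout here are illustrative assumptions, not the tool's actual output format.

```python
def build_markdown(title, slide_sections, summary, figure_paths):
    """Assemble extracted pieces into a single Markdown report.

    slide_sections: list of (heading, body) pairs from the OCR stage.
    figure_paths:   image files saved by the figure-detection stage.
    """
    lines = [f"# {title}", "", "## Summary", "", summary, ""]
    for heading, body in slide_sections:
        lines += [f"## {heading}", "", body, ""]
    if figure_paths:
        lines += ["## Figures", ""]
        lines += [f"![figure {i + 1}]({p})" for i, p in enumerate(figure_paths)]
        lines.append("")
    return "\n".join(lines)
```

Keeping figures in their own section, linked from the text, mirrors the behavior Kwon describes: readers skim the two-minute summary and jump to a figure only when they need it.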
“This software simplifies data storage and facilitates data analysis for presentation videos by transforming unstructured data into a structured format, thus offering significant potential from the perspectives of information accessibility and data management,” Kwon added. “It provides a foundation for more efficient utilization of presentation videos.”
Looking ahead, the research team plans to enhance PV2DOC further by training a large language model, akin to ChatGPT. The goal is to offer a question-answering service, allowing users to interact with the video content more dynamically and obtain accurate, contextually relevant responses to their queries.
The development of PV2DOC marks a significant step forward in information technology, promising to streamline the consumption and storage of presentation videos. Its capacity to transform vast amounts of unstructured video data into manageable, searchable documents could have extensive applications in educational, corporate and research settings worldwide.