Using GPT-4o for Web Archiving

Generative AI offers unprecedented efficiency in metadata creation, yet the human touch still leads in quality. Will this innovation reshape digital preservation?

Research: Web Archives Metadata Generation with GPT-4o: Challenges and Insights. Image Credit: Shutterstock AI

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

A research paper recently posted on the arXiv preprint* server explored using generative artificial intelligence (AI), specifically generative pre-trained transformer 4 omni (GPT-4o), to automate metadata generation for web archive (WARC) files. The researchers in Singapore focused on the Web Archive Singapore (WAS) initiative, aiming to meet the growing demand for efficient and cost-effective metadata creation in the expanding digital landscape. They highlighted the potential for substantial cost savings and efficiency gains, as well as the critical limitations of AI-generated metadata, including lower accuracy than human-curated content.

Large Language Models (LLMs) for Digital Preservation

The rapid evolution of the digital landscape necessitates effective methods for preserving online heritage. The Resource Discovery department of the National Library Board Singapore (NLB), which is responsible for cataloging collections including WAS, faces the challenge of managing a rapidly expanding web archive. Traditional manual metadata creation is labor-intensive and resource-heavy, making it unsustainable given the large volume of data. This challenge has prompted the exploration of advanced technologies, such as LLMs like GPT-4o, to automate this costly and time-consuming process. With their advanced natural language processing capabilities, these LLMs offer new potential for tasks such as summarization and content generation, but they face obstacles in reliability and precision.

Automated Metadata Generation Using GPT-4o

In this paper, the authors developed and evaluated an automated system for generating titles and abstracts as metadata for WAS. They aimed to improve metadata creation efficiency while ensuring the accuracy of the automated outputs. Using 112 WARC files from WAS, the researchers adopted a systematic approach, including data collection, preparation, and token reduction through three key heuristics. To minimize processing costs with GPT-4o, they employed data reduction techniques such as prioritizing content from "About" pages, selecting content from the shortest URLs, and applying regex filtering to limit token count. HTML content was extracted from the WARC files using Python libraries like WARCIO and BeautifulSoup, which facilitated the capture of relevant metadata while excluding unnecessary elements.
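As a rough illustration of the extraction step, the sketch below reads a WARC file with warcio and strips markup with BeautifulSoup. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names is_html and iter_html_pages, the choice of tags to discard, and the decision to keep only HTML response records are all assumptions made for this example.

```python
from typing import Iterator, Optional, Tuple


def is_html(content_type: Optional[str]) -> bool:
    """Return True when an HTTP Content-Type header denotes an HTML page."""
    return bool(content_type) and content_type.lower().startswith("text/html")


def iter_html_pages(warc_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (url, visible_text) for each HTML response record in a WARC file."""
    # Third-party imports are kept local so is_html works without them installed.
    from bs4 import BeautifulSoup
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # Only HTTP responses carry page content; skip requests, metadata, etc.
            if record.rec_type != "response" or record.http_headers is None:
                continue
            if not is_html(record.http_headers.get_header("Content-Type")):
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            # Assumed choice: drop elements that carry no descriptive text.
            for tag in soup(["script", "style", "noscript"]):
                tag.decompose()
            yield url, soup.get_text(separator=" ", strip=True)
```

Calling dict(iter_html_pages("example.warc.gz")), where the file name is a placeholder, would give a URL-to-text mapping that later reduction steps can work on.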

These three heuristics were inspired by professional cataloging practices. The researchers also used prompt engineering, crafting prompts both with and without specialized rules for different types of websites (e.g., corporate sites, personal blogs, and property listings). This allowed them to compare results across varied types of web content.
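The three heuristics and the rule-based prompts could be sketched roughly as follows. Everything here is an assumption for illustration: the article does not give the actual regex patterns, the order in which the heuristics were applied, the character limit, or the prompt wording, so ABOUT_RE, URL_RE, SITE_RULES, and the three functions are hypothetical.

```python
import re
from typing import Dict, Optional, Tuple
from urllib.parse import urlparse

ABOUT_RE = re.compile(r"about", re.IGNORECASE)  # assumed marker of an "About" page
URL_RE = re.compile(r"https?://\S+")            # assumed noise: inline links
WS_RE = re.compile(r"\s+")

# Placeholder rules; the real specialized rules are not given in the article.
SITE_RULES = {
    "corporate": "Name the organisation and its main line of business.",
    "blog": "Name the author, if stated, and the main topics of the blog.",
    "property": "Name the development and its location.",
}


def select_page(pages: Dict[str, str]) -> Tuple[str, str]:
    """Prefer 'About' pages; among the candidates, take the shortest URL."""
    about = [url for url in pages if ABOUT_RE.search(urlparse(url).path)]
    candidates = about or list(pages)
    url = min(candidates, key=lambda u: (len(u), u))
    return url, pages[url]


def reduce_text(text: str, max_chars: int = 4000) -> str:
    """Regex filtering: drop inline URLs, collapse whitespace, cap the length."""
    text = URL_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()[:max_chars]


def build_prompt(text: str, site_type: Optional[str] = None) -> str:
    """Build a prompt with or without a rule for the given site type."""
    prompt = "Write a title and a short abstract for the archived website below."
    rule = SITE_RULES.get(site_type) if site_type else None
    if rule:
        prompt += "\nRule: " + rule
    return prompt + "\n\nContent:\n" + text
```

Comparing the output of build_prompt(text) with build_prompt(text, "corporate") mirrors the with-rules and without-rules comparison described above.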

Evaluation and Analysis of Automated Metadata

Both automated and manual evaluation methods were employed to assess the quality of the generated metadata. For automated evaluation, metrics such as Levenshtein Distance and BERTScore were used to measure how closely the generated text matched the human-written reference. Manual evaluation involved eight trained catalogers who compared AI-generated metadata with human-created metadata, with McNemar's test used to check whether the difference in accuracy was statistically significant.
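For readers unfamiliar with the first of these metrics, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a standard dynamic-programming version written for this article. The normalized similarity helper is an assumed convenience, not something the paper is stated to use. BERTScore is omitted because it requires a pretrained language model.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest
```

A generated title can then be scored against the cataloger's title with similarity(generated, reference).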

The automated approach achieved a 99.9% reduction in token count and associated costs compared to processing entire WARC files. However, manual evaluation using Cochran's Q and McNemar's tests revealed statistically significant differences (p = 0.02) between LLM-generated and human-generated metadata, indicating that human-created metadata was more accurate and relevant. The analysis also highlighted several challenges associated with LLM-generated content, including frequent accuracy issues, hallucinations, and translation errors. Approximately 19.6% of AI-generated titles and abstracts contained inaccuracies, compared to only 6.3% of human-generated metadata. This discrepancy underscores the need for ongoing refinement of LLMs to mitigate errors and enhance content reliability. Despite these challenges, the authors emphasized the potential of LLMs in archiving, stating that the technology can help streamline workflows and allow human catalogers to focus on more complex tasks requiring expertise.
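McNemar's test compares two methods on the same set of items by looking only at the discordant pairs, that is, the items where one method was judged accurate and the other was not. The sketch below implements the exact (binomial) form of the test. It is illustrative only: the article does not report the discordant counts or say which variant of the test the authors ran, so the function name mcnemar_exact and the counts in the usage line are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs where only the first method was accurate
    c: pairs where only the second method was accurate
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, not taken from the paper.
p_value = mcnemar_exact(3, 12)
```

A small p-value suggests that the two sets of metadata differ in accuracy by more than chance would explain.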

Key Applications and Implications

The findings of this research have significant implications that extend beyond web archiving. The presented techniques and insights could benefit various fields requiring large-scale data management and metadata generation, such as digital libraries, museums, and educational institutions. By integrating generative AI into their workflows, these organizations could streamline operations, reduce costs, and enhance access to digital content. This automation, when complemented by human oversight, offers a promising pathway to achieving both efficiency and reliability in large-scale metadata management.

Conclusion and Future Directions

In summary, this study represents a significant advancement in applying AI-driven solutions to WARCs, showcasing both the potential and limitations of GPT-4o for automated metadata generation. While the approach offers considerable efficiency and cost savings, it highlights the essential role of human oversight in maintaining metadata quality and accuracy. Future directions include refining prompt engineering methods, improving data reduction heuristics, and considering smaller, specialized models to address privacy concerns. A collaborative approach that combines the strengths of LLMs and human expertise will also be crucial for achieving accurate digital preservation and reliable metadata generation.


Journal reference:
  • Preliminary scientific report. Huang, A. Y., Nair, A., Goh, Z. R., & Liu, T. (2024). Web Archives Metadata Generation with GPT-4o: Challenges and Insights. ArXiv. https://arxiv.org/abs/2411.05409
Written by Muhammad Osama

Muhammad Osama is a full-time data analytics consultant and freelance technical writer based in Delhi, India. He specializes in transforming complex technical concepts into accessible content. He has a Bachelor of Technology in Mechanical Engineering with specialization in AI & Robotics from Galgotias University, India, and he has extensive experience in technical content writing, data science and analytics, and artificial intelligence.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Osama, Muhammad. (2024, November 14). Using GPT-4o for Web Archiving. AZoAi. Retrieved on December 12, 2024 from https://www.azoai.com/news/20241114/Using-GPT-4o-for-Web-Archiving.aspx.

  • MLA

    Osama, Muhammad. "Using GPT-4o for Web Archiving". AZoAi. 12 December 2024. <https://www.azoai.com/news/20241114/Using-GPT-4o-for-Web-Archiving.aspx>.

  • Chicago

    Osama, Muhammad. "Using GPT-4o for Web Archiving". AZoAi. https://www.azoai.com/news/20241114/Using-GPT-4o-for-Web-Archiving.aspx. (accessed December 12, 2024).

  • Harvard

    Osama, Muhammad. 2024. Using GPT-4o for Web Archiving. AZoAi, viewed 12 December 2024, https://www.azoai.com/news/20241114/Using-GPT-4o-for-Web-Archiving.aspx.
