
Interactive Media Systems, TU Wien

pdf2table: A Method to Extract Table Information from PDF Files

By Burcu Yildiz, Katharina Kaiser, and Silvia Miksch

Abstract

Tables are a common structuring element in many documents, such as PDF files. To reuse such tables, appropriate methods need to be developed that capture both the structure and the content information. We have developed several heuristics which together recognize and decompose tables in PDF files and store the extracted data in a structured data format (XML) for easier reuse. Additionally, we implemented a prototype that lets the user adjust the extracted data. Our work shows that purely heuristic-based approaches can achieve good results, especially for lucid tables.
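To illustrate the last step described in the abstract, storing an extracted table as XML, here is a minimal Python sketch. It is not the pdf2table implementation: the function name `table_to_xml`, the element names (`table`, `header`, `body`, `row`, `cell`), the `column` attribute, and the sample data are all assumptions made for illustration, and the sketch presumes the table has already been recognized and decomposed into a header and rows of cell strings.

```python
import xml.etree.ElementTree as ET


def table_to_xml(header, rows):
    """Serialize an already-extracted table to an XML string.

    The element and attribute names are hypothetical, not the
    schema used by pdf2table.
    """
    table = ET.Element("table")

    # One <cell> per column title.
    head = ET.SubElement(table, "header")
    for name in header:
        ET.SubElement(head, "cell").text = name

    # One <row> per data row; each cell records the column it belongs to.
    body = ET.SubElement(table, "body")
    for row in rows:
        row_el = ET.SubElement(body, "row")
        for name, value in zip(header, row):
            cell = ET.SubElement(row_el, "cell", {"column": name})
            cell.text = value

    return ET.tostring(table, encoding="unicode")


# Hypothetical sample data standing in for a table extracted from a PDF.
xml_text = table_to_xml(
    ["Drug", "Dose"],
    [["Aspirin", "100 mg"], ["Ibuprofen", "400 mg"]],
)
print(xml_text)
```

Tagging each cell with its column keeps the structure information alongside the content, so a downstream consumer can query values by column without re-deriving the layout.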

Reference

B. Yildiz, K. Kaiser, S. Miksch: "pdf2table: A Method to Extract Table Information from PDF Files"; Talk: Indian International Conference on Artificial Intelligence (IICAI), India; 12-20-2005 - 12-22-2005; in: "Proceedings of the 2nd Indian International Conference on Artificial Intelligence", (2005), ISBN: 0-9727412-1-6; Paper ID 441.
