Interactive Media Systems, TU Vienna

Accelerating Structured Web Crawling without Losing Data

By Boutros El-Gamil and W. Winiwarter


The trade-off between the size of the retrieved data and crawling time is a well-known dilemma in the structured Web crawling community. The real challenge within this dilemma is to optimize the settings of a given wrapper so that it obtains the maximum available data in the shortest possible time. In this paper, we try to tune these settings by introducing a threaded algorithm that guarantees access to all available detail pages within the crawling scope. Using this algorithm, we then try to reduce the time consumed by the crawler through simple adjustments of the sleeping time after each detail page visit.
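
The idea summarized above can be illustrated with a minimal Python sketch. It is not the algorithm from the paper, only one plausible reading of it under stated assumptions: a pool of worker threads takes detail-page URLs from a shared queue, re-queues any page whose fetch fails so that no page is silently dropped, and sleeps for a configurable time after each visit. The function name `crawl_detail_pages`, its parameters (`num_threads`, `sleep_seconds`, `max_retries`), and the caller-supplied `fetch` function are all hypothetical.

```python
import queue
import threading
import time


def crawl_detail_pages(urls, fetch, num_threads=4, sleep_seconds=0.0, max_retries=3):
    """Visit every URL in `urls` using a pool of worker threads.

    `fetch` is a caller-supplied (hypothetical) function that returns the
    page content for a URL or raises an exception on failure.
    Returns (results, failed): a dict url -> content, and a list of URLs
    that still failed after `max_retries` attempts.
    """
    pending = queue.Queue()
    for url in urls:
        pending.put((url, 0))  # (url, number of attempts so far)

    results = {}
    failed = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url, attempts = pending.get_nowait()
            except queue.Empty:
                return  # nothing left for this worker to do
            try:
                page = fetch(url)
                with lock:
                    results[url] = page
            except Exception:
                if attempts + 1 < max_retries:
                    # Re-queue instead of dropping, so the page is not lost.
                    pending.put((url, attempts + 1))
                else:
                    with lock:
                        failed.append(url)
            # Sleeping time after each detail page visit (the tunable setting).
            time.sleep(sleep_seconds)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, failed
```

In this sketch, lowering `sleep_seconds` shortens the total crawl time, while the re-queue step is what keeps a failed visit from turning into missing data; any page that exhausts its retries is reported in `failed` rather than omitted without notice.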


Boutros El-Gamil, W. Winiwarter: "Accelerating Structured Web Crawling without Losing Data"; in: "Proceedings of the International Conference on Information Integration and Web-based Applications", ACM, 2013, ISBN: 978-1-4503-2113-6, 5 pages.