CSM-399 - Providing Robust Access to Data in Web Pages

Robinson, Jerome (2004) CSM-399 - Providing Robust Access to Data in Web Pages. Technical Report. CSM-399, University of Essex, Colchester.

Abstract

Much useful e-commerce information is available on web pages, especially those generated in response to queries to web servers. For programs that need to use this information, the problem is how to ‘screen-scrape’ the data from the web page into machine-usable data structures. Wrappers for web data sources rely on knowledge of the page layout to extract data accurately, so they fail if the page format changes. This paper describes a fast method for wrapper production, together with a method for automatically detecting a page format change before it causes data access to fail. The method works for pages that contain collections of items, such as lists, tables and hierarchical structures. It uses a representation of HTML documents that makes repetitive features apparent. This provides fully automatic wrapper production for one class of web pages, and rapid interactive production for others.
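
The abstract does not specify the document representation the report uses, so the following is only an illustrative sketch of the general idea, not the report's method. It assumes that each element can be summarised by its path of enclosing tags, that a path occurring more than once marks a repeated item (a row, a list entry), and that a change in the set of repeated paths signals a format change. The names TagPathCollector, repeated_paths and format_changed are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Counts how often each root-to-element tag path occurs in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []         # currently open tags, outermost first
        self.paths = Counter()  # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def repeated_paths(html):
    """Return the tag paths that occur at least twice (assumed to mark items)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return {path for path, count in collector.paths.items() if count >= 2}


def format_changed(reference_html, current_html):
    """Assumed heuristic: the layout changed if the repeated paths differ."""
    return repeated_paths(reference_html) != repeated_paths(current_html)
```

Under these assumptions a wrapper could store the repeated paths of a known-good page and call format_changed on each newly fetched page before extracting data. A page with the same layout but different row contents compares as unchanged, while a table rewritten as a list compares as changed.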

Item Type: Monograph (Technical Report)
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Faculty of Science and Health > Computer Science and Electronic Engineering, School of
Depositing User: Julie Poole
Date Deposited: 27 Feb 2014 11:52
Last Modified: 27 Feb 2014 11:52
URI: http://repository.essex.ac.uk/id/eprint/8682
