Robinson, Jerome (2004) CSM-399 - Providing Robust Access to Data in Web Pages. Technical Report. CSM-399, University of Essex, Colchester.
Abstract
Much useful e-commerce information is available on web pages, especially those created by queries to web servers. For a program to use that information, the data must be ‘screen-scraped’ off the web page into machine-usable data structures. Wrappers for web data sources rely on knowledge of the page layout to extract data accurately, so they fail if the page format changes. This paper describes a fast method for wrapper production and a method for automatically detecting page format change before it causes data access to fail. The method works for pages that contain collections of items, such as lists, tables and hierarchical structures. It uses a representation of HTML documents that makes repetitive features apparent. This provides fully automatic wrapper production for one class of web pages, and rapid interactive production for others.
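The abstract does not specify the document representation used in the report. As a rough illustration of the general idea only, the sketch below assumes one simple representation: each element is recorded by its tag path from the document root, so that repeated items (table rows, list entries) show up as paths with high counts. The names `TagPathCounter`, `repeated_paths` and `format_changed` are hypothetical and are not taken from the report.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCounter(HTMLParser):
    """Counts how often each root-to-element tag path occurs (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open elements
        self.paths = Counter()   # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray end tags; otherwise pop back to the matching open tag.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def repeated_paths(html, min_count=2):
    """Return the tag paths that occur at least min_count times."""
    parser = TagPathCounter()
    parser.feed(html)
    parser.close()
    return {p: n for p, n in parser.paths.items() if n >= min_count}


def format_changed(old_html, new_html, min_count=2):
    """Assumed heuristic: the layout changed if the set of repeated paths differs."""
    return set(repeated_paths(old_html, min_count)) != set(repeated_paths(new_html, min_count))


page_a = ("<html><body><table><tr><td>A</td><td>1</td></tr>"
          "<tr><td>B</td><td>2</td></tr></table></body></html>")
page_b = ("<html><body><table><tr><td>A</td><td>1</td></tr>"
          "<tr><td>B</td><td>2</td></tr><tr><td>C</td><td>3</td></tr></table></body></html>")
page_c = "<html><body><ul><li>A 1</li><li>B 2</li></ul></body></html>"

print(repeated_paths(page_a))
print(format_changed(page_a, page_b))  # same layout, one more row
print(format_changed(page_a, page_c))  # table replaced by a list
```

In this sketch, a page that gains or loses rows keeps the same set of repeated paths, whereas a page whose table is replaced by a list does not, which is the kind of change a wrapper would need to be warned about. A real detector would likely need a finer comparison than set equality.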
Item Type: | Monograph (Technical Report) |
---|---|
Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
Divisions: | Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
Depositing User: | Julie Poole |
Date Deposited: | 27 Feb 2014 11:52 |
Last Modified: | 27 Feb 2014 11:52 |
URI: | http://repository.essex.ac.uk/id/eprint/8682 |
Available files
Filename: csm-399.PDF