Robinson, Jerome (2004) CSM-398 - Data Extraction from Web Data Sources. Technical Report. CSM-398, University of Essex, Colchester.
Abstract
This paper explains the basic data structures used in a new page analysis technique for creating wrappers (data extractors) for the result pages that web sites produce in response to user queries submitted via web page forms. The key structure, called a tpGrid, is a representation of the web page that is easier to analyse than the raw HTML code. The analysis looks for repetition patterns of sets of tagSets, which are defined in the paper.
| Item Type: | Monograph (Technical Report) |
|---|---|
| Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
| Divisions: | Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
| Depositing User: | Julie Poole |
| Date Deposited: | 27 Feb 2014 11:51 |
| Last Modified: | 27 Feb 2014 11:51 |
| URI: | http://repository.essex.ac.uk/id/eprint/8681 |
Available files
Filename: csm-398.PDF