Performance Evaluation of Algorithms for Newspaper Article Identification.

LAURA, Luigi
2011-01-01

Abstract

A typical modern newspaper recognition system operates in four distinct phases: i) page segmentation (also called page decomposition or zoning), the process of decomposing a page into its structural and logical units (called regions or zones); ii) region (or zone) labeling, in which the previously identified units are labeled according to their type (title, text, image, or line); iii) article identification (also called tracking or clustering), in which all the units that belong to a single article are clustered together; and iv) reading order identification, in which each item in an article is assigned its position in the article's reading order. Several works in the literature describe algorithms and metrics for the first two phases, page segmentation and region labeling, which indeed play a crucial role in the whole process. Few results, however, focus on article identification, a difficult task mainly because of the rich and complex variety of newspaper layouts. In this paper we propose a methodology to evaluate newspaper article identification algorithms. Our approach is based on well-established tools from graph theory: we reduce the newspaper article clustering problem to a specific graph clustering problem, which is then evaluated using the appropriate coverage and performance measures. The advantages of our approach are twofold. On one side, the proposed measures correctly capture the fact that not all errors are equal, i.e. some errors are worse than others, and scores are assigned accordingly. On the other side, we show how to reverse the reduction in order to exploit the large number of graph clustering algorithms available: given a graph clustering algorithm, obtaining a fully working newspaper article identification algorithm only requires defining a similarity measure between the units in the article. We provide some examples, using a specifically designed dataset.
Finally, we would like to point out that both our dataset, together with its ground-truth base, and the software tool that implements the proposed approach are freely available.
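
As a rough illustration of the two measures named in the abstract, the sketch below computes the standard unweighted definitions of coverage and performance for a graph clustering. The abstract does not say how the paper builds its graph or whether edges are weighted, so everything here is an assumption made for illustration: the region names, the ground-truth graph that links every pair of regions of the same article, and the function names `coverage` and `performance`.

```python
from itertools import combinations


def coverage(edges, cluster_of):
    """Fraction of edges whose two endpoints fall in the same cluster."""
    if not edges:
        return 1.0
    intra = sum(1 for u, v in edges if cluster_of[u] == cluster_of[v])
    return intra / len(edges)


def performance(nodes, edges, cluster_of):
    """Fraction of node pairs that the clustering classifies correctly:
    linked pairs placed together plus unlinked pairs kept apart."""
    edge_set = {frozenset(e) for e in edges}
    pairs = list(combinations(nodes, 2))
    if not pairs:
        return 1.0
    correct = 0
    for u, v in pairs:
        same_cluster = cluster_of[u] == cluster_of[v]
        linked = frozenset((u, v)) in edge_set
        if same_cluster == linked:
            correct += 1
    return correct / len(pairs)


# Hypothetical page with two articles: A = {t1, p1, p2}, B = {t2, p3}.
nodes = ["t1", "p1", "p2", "t2", "p3"]

# Assumed ground-truth graph: one edge per pair of regions in the same article.
edges = [("t1", "p1"), ("t1", "p2"), ("p1", "p2"), ("t2", "p3")]

# Candidate output of an article identification algorithm:
# paragraph p2 is wrongly attached to article B.
candidate = {"t1": 0, "p1": 0, "p2": 1, "t2": 1, "p3": 1}

# The correct clustering, for comparison.
truth = {"t1": 0, "p1": 0, "p2": 0, "t2": 1, "p3": 1}

print(coverage(edges, candidate))            # 0.5
print(performance(nodes, edges, candidate))  # 0.6
print(coverage(edges, truth))                # 1.0
print(performance(nodes, edges, truth))      # 1.0
```

In this toy case the single misplaced paragraph cuts two of the four ground-truth edges and creates two spurious pairings, which is why both scores drop. A weighted variant, in which some edges count more than others, would be one way to express that some errors are worse than others, but whether the paper does this is not stated in the abstract.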
2011
978-0-7695-4520-2

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14086/380