ON THE USEFULNESS OF HTML META ELEMENTS FOR WEB RETRIEVAL

Arslan, Ahmet

doi:10.18038/estubtda.615103

Ahmet ARSLAN (Eskişehir Technical University, Faculty of Engineering, Department of Computer Engineering, Eskişehir, Türkiye)

Eskişehir Technical University Journal of Science and Technology A - Applied Sciences and Engineering

Year: 2020 Volume: 21 Issue: 1 Pages: 182 - 198 Text Language: English DOI: 10.18038/estubtda.615103 Index Date: 03-08-2021

Abstract:

Web retrieval studies have mostly used the URL, title, body, and anchor text fields to represent Web documents. On the other hand, HTML standards provide a rich set of elements to define different parts of a Web page. For example, meta elements are used to provide structured metadata about a Web page, not to end users but to browsers or crawlers. However, it is unclear whether meta tags are useful for Web retrieval, as most previous studies leveraged only the URL, title, body, and anchor text fields. In this work, we examine the usefulness of two meta tags, namely keywords and description, based on the ad-hoc tasks of previous TREC studies. Through experiments on the standard TREC Web datasets and several query sets, our results using state-of-the-art term-weighting models show that using the description field systematically increases retrieval effectiveness, to a statistically significant degree most of the time. By contrast, using the keywords field may cause a significant deterioration in retrieval effectiveness for certain term-weighting models.
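
As an illustration of the two fields the abstract refers to, the following minimal sketch pulls the keywords and description meta tags out of a page using only the Python standard library. The names `MetaFieldParser` and `extract_meta_fields` are hypothetical and chosen for this example; this is not the indexing pipeline used in the study, only one plausible way to obtain the two fields before indexing them separately.

```python
from html.parser import HTMLParser


class MetaFieldParser(HTMLParser):
    """Collect the content of <meta name="keywords"> and <meta name="description">.

    Hypothetical helper for illustration; not taken from the paper.
    """

    WANTED = ("keywords", "description")

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        # Attribute values may be None when an attribute has no value.
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        # Keep only the first occurrence of each wanted meta tag.
        if name in self.WANTED and name not in self.fields:
            self.fields[name] = attr.get("content", "").strip()


def extract_meta_fields(html):
    """Return a dict with whichever of the two meta fields the page provides."""
    parser = MetaFieldParser()
    parser.feed(html)
    return parser.fields
```

Each extracted value could then be indexed as its own field next to the URL, title, body, and anchor text, so that its contribution to ranking can be switched on or off per experiment.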

Document Type: Article Article Type: Research Article Access Type: Open Access

APA	Arslan A (2020). ON THE USEFULNESS OF HTML META ELEMENTS FOR WEB RETRIEVAL. Eskişehir Technical University Journal of Science and Technology A - Applied Sciences and Engineering, 21(1), 182 - 198. 10.18038/estubtda.615103
Chicago	Arslan Ahmet ON THE USEFULNESS OF HTML META ELEMENTS FOR WEB RETRIEVAL. Eskişehir Technical University Journal of Science and Technology A - Applied Sciences and Engineering 21, no.1 (2020): 182 - 198. 10.18038/estubtda.615103
MLA	Arslan Ahmet ON THE USEFULNESS OF HTML META ELEMENTS FOR WEB RETRIEVAL. Eskişehir Technical University Journal of Science and Technology A - Applied Sciences and Engineering, vol.21, no.1, 2020, pp.182 - 198. 10.18038/estubtda.615103
AMA	Arslan A ON THE USEFULNESS OF HTML META ELEMENTS FOR WEB RETRIEVAL. Eskişehir Technical University Journal of Science and Technology A - Applied Sciences and Engineering. 2020; 21(1): 182 - 198. 10.18038/estubtda.615103
Vancouver	Arslan A ON THE USEFULNESS OF HTML META ELEMENTS FOR WEB RETRIEVAL. Eskişehir Technical University Journal of Science and Technology A - Applied Sciences and Engineering. 2020; 21(1): 182 - 198. 10.18038/estubtda.615103
IEEE	Arslan A "ON THE USEFULNESS OF HTML META ELEMENTS FOR WEB RETRIEVAL." Eskişehir Technical University Journal of Science and Technology A - Applied Sciences and Engineering, 21, pp.182 - 198, 2020. 10.18038/estubtda.615103
ISNAD	Arslan, Ahmet. "ON THE USEFULNESS OF HTML META ELEMENTS FOR WEB RETRIEVAL". Eskişehir Technical University Journal of Science and Technology A - Applied Sciences and Engineering 21/1 (2020), 182-198. https://doi.org/10.18038/estubtda.615103