Abstract
Traditional information extraction methods rely mainly on visual-feature-assisted techniques; because they do not consider the hierarchical dependencies within the paragraph structure, they miss some important information. This paper proposes an integrated approach for extracting academic information from conference Web pages. First, Web pages are segmented into text blocks by a new hybrid page segmentation algorithm that combines visual features with DOM structure. Then, these text blocks are labeled by a Tree-structured Random Fields model, which differentiates block functions using visual features, semantic features and hierarchical dependencies. Finally, an additional post-processing step tunes the initial annotation results. Experimental results on real-world data sets demonstrate that the proposed method extracts the needed academic information from conference Web pages effectively and accurately.
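As a rough illustration of the first stage only, the sketch below splits a page into text blocks using DOM structure alone and records each block's nesting depth as a crude stand-in for hierarchical position. It is not the paper's hybrid algorithm: the abstract does not specify the segmentation rules, and visual features (which require a rendered page) are omitted. The names `BLOCK_TAGS`, `BlockSegmenter` and `segment` are hypothetical and chosen for this example.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as block boundaries. The paper's actual
# segmentation rules, including its visual cues, are not given in the abstract.
BLOCK_TAGS = {"div", "p", "li", "td", "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockSegmenter(HTMLParser):
    """Collects text blocks and the block-tag depth at which each one appears."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current nesting depth of block-level tags
        self.buffer = []    # text fragments of the block being built
        self.blocks = []    # finished (depth, text) pairs

    def _flush(self):
        # Join fragments and normalize whitespace; skip empty blocks.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.blocks.append((self.depth, text))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()   # text seen so far belongs to the enclosing block
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        self.buffer.append(data)


def segment(html):
    """Return a list of (depth, text) blocks for an HTML string."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    parser._flush()         # emit any trailing text outside block tags
    return parser.blocks


if __name__ == "__main__":
    page = "<div><h2>Important Dates</h2><p>Submission: <b>May 1</b></p></div>"
    for depth, text in segment(page):
        print(depth, text)
```

In a fuller pipeline, the depth values could be replaced by explicit parent-child links between blocks, which is the kind of structure a tree-structured labeling model would consume.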
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
You, Y., Xu, G., Cao, J., Zhang, Y., Huang, G. (2013). Leveraging Visual Features and Hierarchical Dependencies for Conference Information Extraction. In: Ishikawa, Y., Li, J., Wang, W., Zhang, R., Zhang, W. (eds) Web Technologies and Applications. APWeb 2013. Lecture Notes in Computer Science, vol 7808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37401-2_41
DOI: https://doi.org/10.1007/978-3-642-37401-2_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37400-5
Online ISBN: 978-3-642-37401-2
eBook Packages: Computer Science, Computer Science (R0)