Course:LIBR557/2020WT2/markup language

From UBC Wiki

Markup Language

A markup language is a system for noting and distinguishing the attributes of a document. Its name originates from the act of “marking up” paper manuscripts with revisions to both the content and the structure of a document. In the context of information retrieval, markup languages are meta-languages that describe data in meaningful ways across applications (Lee et al., 2021; Winters, 2005) by separating a document’s content from its structure (Combs, 2012; Lee et al., 2021).

Markup languages are made up of elements (delimited by “tags”) and attributes that distinguish parts of a document’s content and specify the “purpose and meaning” of specific passages in the document (Eito-Brun, 2018). Markup languages usually use angle brackets (< and >) to indicate a tag. For example, take the following sentence:

The quick brown fox jumped over the lazy dog

In a markup language, describing this sentence as a single paragraph means wrapping it in the opening tag <p> and the closing tag </p>, as seen below:

<p>The quick brown fox jumped over the lazy dog</p>
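As a rough illustration of how software separates the element name from the content it wraps, the paragraph above can be read with Python's standard xml.etree.ElementTree module. This is only a minimal sketch; the variable names are illustrative and are not part of any markup standard.

```python
import xml.etree.ElementTree as ET

# The marked-up sentence from the example above.
markup = "<p>The quick brown fox jumped over the lazy dog</p>"

# Parse the string into an element object.
element = ET.fromstring(markup)

# The tag name describes the structure; the text is the content.
print("Element:", element.tag)
print("Content:", element.text)
```

The parser reports the element name and the sentence separately, which is the separation of structure from content described above.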

Commonly Used Markup Languages

There are many types of markup languages, used for different purposes and by different users. Two of the most commonly used markup languages within information retrieval are HTML and XML.

HTML

HTML, or Hypertext Markup Language, is a markup language designed in the early 1990s by Tim Berners-Lee. It specializes in describing how documents are to be structured when displayed as webpages, using a set of predefined tags (elements) (Eito-Brun, 2018; Winters, 2005). The World Wide Web Consortium (W3C) recommends that HTML documents be encoded using the Unicode Standard and a Unicode Transformation Format such as UTF-8. Unfortunately, HTML’s predefined tags describe only the structure of a document, not the kind of information its content carries, which constrains information retrieval.

XML

XML, or eXtensible Markup Language, is a markup language that was developed by various IT companies, including Microsoft and International Business Machines Corporation (IBM), under the direction of the World Wide Web Consortium (W3C) and released in 1998 (Eito-Brun, 2018).

XML is able to describe and encode the contents of a document, and it can state what type of document it is by specifying a Document Type Definition (DTD). DTDs are models or schemas of tagging for documents that have similar purposes and structures and contain the same type of information. Invoices, medical records and articles are all different document types (Eito-Brun, 2018). Indicating a document type lets users apply a tag set or schema shared by documents with common characteristics to describe and encode the content of a document. XML also gives users the flexibility to use tags outside of the schema, depending on their information management needs (Eito-Brun, 2018; Lee et al., 2021).
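To make the idea of a document type more concrete, the sketch below defines a small invoice document with an internal DTD and reads its fields in Python. The element names (invoice, customer, date, total) and the currency attribute are hypothetical and are not taken from any published schema. Note also that Python's xml.etree.ElementTree parses the DTD declarations but does not validate the document against them, so this shows only how descriptive tags expose the type of information, not how validation works.

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice document type; all names are illustrative only.
INVOICE_XML = """<!DOCTYPE invoice [
  <!ELEMENT invoice (customer, date, total)>
  <!ELEMENT customer (#PCDATA)>
  <!ELEMENT date (#PCDATA)>
  <!ELEMENT total (#PCDATA)>
  <!ATTLIST total currency CDATA #REQUIRED>
]>
<invoice>
  <customer>Example Library</customer>
  <date>2021-03-01</date>
  <total currency="CAD">125.50</total>
</invoice>"""

# ElementTree reads the document but does NOT check it against the DTD.
root = ET.fromstring(INVOICE_XML)

# Each child tag names the kind of information it holds.
record = {child.tag: child.text for child in root}

# Attributes carry extra detail about an element.
currency = root.find("total").get("currency")

print(root.tag, record, currency)
```

Because each tag names the information it holds, a program can ask for the customer or the total directly instead of guessing from the document's layout.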

Markup Language in Information Retrieval

XML is a popular markup language for document retrieval within information management systems. Due to its flexibility in tagging and its ability to describe the type of information within the contents of a document, XML has become a standard for data communication (Bertino & Catania, 2001). Because XML can convey both the type of information a document holds and the type of document it is, encoding metadata about documents becomes much more convenient.

XML is also used for indexing. In XML indexing, tags are “inserted” into an XML document to signal indexable terms and topics (Combs, 2012). These terms and topics can then be used to index the document within a database, so that it can be recalled later when a query entered in a search interface uses related index terms. Such indexing is visible in search modes such as faceted search.
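The indexing process described above can be sketched roughly as follows. This is a simplified illustration and not the method described by Combs (2012): the tag name indexterm, the two sample documents, and the helper functions build_index and search are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical documents in which index terms are wrapped in <indexterm> tags.
DOCUMENTS = {
    "doc1": "<article><body>Uses of <indexterm>XML</indexterm> in "
            "<indexterm>archives</indexterm>.</body></article>",
    "doc2": "<article><body><indexterm>XML</indexterm> and "
            "<indexterm>metadata</indexterm>.</body></article>",
}


def build_index(documents):
    """Map each tagged term (lowercased) to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        # iter() visits every <indexterm> element at any depth.
        for term in root.iter("indexterm"):
            index[term.text.strip().lower()].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents indexed under the query term."""
    return sorted(index.get(query.strip().lower(), set()))


index = build_index(DOCUMENTS)
print(search(index, "XML"))
print(search(index, "metadata"))
```

A faceted search interface could build on the same idea by keeping a separate index for each kind of tag, for example one for topics and one for authors.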

Strengths and Weaknesses

Markup languages are useful for providing structural data for documents across multiple applications. HTML is particularly well suited to the digital environment of the Internet, as it is specialized to display webpages as the coder intends across application platforms. XML owes much of its utility to its tagging flexibility and its interoperability across information management systems.

Some weaknesses are that HTML’s predefined set of elements is restrictive for users, and that it is not able to describe the content of a document, such as the type of information it carries. XML, while less restrictive and able to encode metadata, has been found to be too “verbose”, which leads to XML documents being weighed down by nonessential structural data and metadata (Lee et al., 2021).
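One way to see this verbosity is to serialize the same small record as XML and as compact JSON and compare the lengths. The record and its field names below are hypothetical, and a single toy example is only an illustration, not a measurement of XML in general.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record; the field names are illustrative only.
record = {"customer": "Example Library", "date": "2021-03-01", "total": "125.50"}

# Serialize as XML: every field needs an opening and a closing tag.
root = ET.Element("invoice")
for name, value in record.items():
    ET.SubElement(root, name).text = value
xml_text = ET.tostring(root, encoding="unicode")

# Serialize the same data as compact JSON for comparison.
json_text = json.dumps({"invoice": record}, separators=(",", ":"))

print(len(xml_text), "characters of XML")
print(len(json_text), "characters of JSON")
```

In this sketch the extra characters in the XML version come from repeating each element name in its closing tag, which is the kind of structural overhead described above.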

Bibliography

Bertino, E., & Catania, B. (2001). Integrating XML and databases. IEEE Internet Computing, 5(4), 84–88. https://doi.org/10.1109/4236.939454

Combs, M. (2012). XML indexing. Indexer, 30(1), 47–52. https://doi.org/10.3828/indexer.2012.9

Eito-Brun, R. (2018). Chapter 1 - XML: The Basis of the Language. In R. Eito-Brun (Ed.), XML-based Content Management (pp. 1–30). Chandos Publishing. https://doi.org/10.1016/B978-0-08-100204-9.00001-9

Lee, J., Anjos, E., & Satti, S. R. (2021). SJSON: A succinct representation for JSON documents. Information Systems, 97, 101686. https://doi.org/10.1016/j.is.2020.101686

Winters, R. (2005). XML Marks the Future for Electronic Records. Information Management Journal, 39(6), 64–68.