Python > Working with Data > File Formats > XML (Extensible Markup Language) - parsing with `xml.etree.ElementTree`, `lxml`

Parsing XML with xml.etree.ElementTree

This snippet demonstrates how to parse XML data using Python's built-in xml.etree.ElementTree module. This module provides a simple and efficient way to navigate and extract data from XML documents. We'll cover basic parsing, attribute access, and text extraction.

Basic XML Parsing with ElementTree

This code snippet first imports the xml.etree.ElementTree module. It then defines an XML string representing a bookstore. The ET.fromstring() function parses the XML string into an ElementTree object. We then use root.findall('book') to find all 'book' elements within the document. For each book, we extract the text content of the 'title', 'author', 'year', and 'price' elements using the find() method and the .text attribute. Finally, we access the 'category' attribute of the 'book' element using the get() method. The extracted data is then printed to the console.

import xml.etree.ElementTree as ET

xml_data = '''
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J.K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
'''

root = ET.fromstring(xml_data)

for book in root.findall('book'):
    title = book.find('title').text
    author = book.find('author').text
    year = book.find('year').text
    price = book.find('price').text
    category = book.get('category') # Accessing attribute

    print(f"Category: {category}, Title: {title}, Author: {author}, Year: {year}, Price: {price}")

Concepts Behind the Snippet

XML Structure: XML documents have a hierarchical structure with elements, attributes, and text content. Elements are enclosed in angle brackets (e.g., <book>). Attributes provide additional information about elements (e.g., category="cooking"). Text content is the data contained within an element (e.g., Everyday Italian).

ElementTree: ElementTree represents the XML document as a tree structure. The root element is the top-level element of the tree.

Finding Elements: The findall() method searches for all elements with a given tag name under a specific element. The find() method searches for the first element with a given tag name under a specific element.

Accessing Attributes: The get() method retrieves the value of an attribute of an element.

Accessing Text Content: The .text attribute retrieves the text content of an element.

Real-Life Use Case

This technique is widely used for reading configuration files, processing data from web services (APIs that return XML), and parsing data from legacy systems that use XML for data exchange. For instance, many scientific instruments or financial systems output data in XML format, which can then be processed using Python and xml.etree.ElementTree.

Best Practices

Error Handling: Wrap your parsing code in a try...except block to handle potential errors such as malformed XML or missing elements. This will prevent your program from crashing.

Namespaces: If your XML document uses namespaces, you'll need to account for them when searching for elements. This typically involves specifying the namespace in the tag names.

Large Files: For very large XML files, consider using iterative parsing (xml.etree.ElementTree.iterparse) to avoid loading the entire document into memory at once.

When to Use ElementTree

xml.etree.ElementTree is a good choice for relatively small to medium-sized XML files where simplicity and ease of use are important. It's part of Python's standard library, so you don't need to install any external dependencies. However, for very large files or complex XML structures, lxml often provides better performance and more features.

Memory Footprint

ElementTree loads the entire XML document into memory as a tree structure. This can be a concern for very large files. For large files, use iterative parsing or consider lxml, which is generally more memory efficient.

Alternatives

lxml: A third-party library that provides a faster and more feature-rich XML processing experience. It supports XPath and XSLT, and offers better performance for large XML files. We show a code sample in the next snippet.

minidom (xml.dom.minidom): Another built-in Python module for XML processing. It is less commonly used than xml.etree.ElementTree and lxml.

Beautiful Soup: Primarily used for parsing HTML, but can also be used to parse XML. It's particularly useful for handling malformed or inconsistent XML.

Pros

  • Part of Python's standard library (no external dependencies)
  • Simple and easy to use
  • Suitable for small to medium-sized XML files

Cons

  • Can be slow for large XML files
  • Limited support for advanced XML features (e.g., XPath, XSLT) compared to lxml
  • Loads the entire XML document into memory

FAQ

  • How do I handle namespaces in XML with ElementTree?

    When dealing with XML documents that use namespaces, you need to specify the namespace when searching for elements. You can do this by creating a dictionary that maps namespace prefixes to their corresponding URIs, and then using this dictionary in your find() and findall() methods. For example: namespaces = {'prefix': 'http://example.com/namespace'} root.find('prefix:element_name', namespaces)
  • How do I handle errors when parsing XML?

    Wrap your parsing code in a try...except block to catch potential xml.etree.ElementTree.ParseError exceptions. This will allow you to gracefully handle malformed XML documents.