Python > Working with Data > File Formats > XML (Extensible Markup Language) - parsing with `xml.etree.ElementTree`, `lxml`
Parsing XML with xml.etree.ElementTree
This snippet demonstrates how to parse XML data using Python's built-in xml.etree.ElementTree
module. This module provides a simple and efficient way to navigate and extract data from XML documents. We'll cover basic parsing, attribute access, and text extraction.
Basic XML Parsing with ElementTree
This code snippet first imports the xml.etree.ElementTree
module. It then defines an XML string representing a bookstore. The ET.fromstring()
function parses the XML string into an ElementTree object. We then use root.findall('book')
to find all 'book' elements within the document. For each book, we extract the text content of the 'title', 'author', 'year', and 'price' elements using the find()
method and the .text
attribute. Finally, we access the 'category' attribute of the 'book' element using the get()
method. The extracted data is then printed to the console.
import xml.etree.ElementTree as ET
xml_data = '''
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="children">
<title lang="en">Harry Potter</title>
<author>J.K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
'''
root = ET.fromstring(xml_data)
for book in root.findall('book'):
title = book.find('title').text
author = book.find('author').text
year = book.find('year').text
price = book.find('price').text
category = book.get('category') # Accessing attribute
print(f"Category: {category}, Title: {title}, Author: {author}, Year: {year}, Price: {price}")
Concepts Behind the Snippet
XML Structure: XML documents have a hierarchical structure with elements, attributes, and text content. Elements are enclosed in angle brackets (e.g., ElementTree: ElementTree represents the XML document as a tree structure. The Finding Elements: The Accessing Attributes: The Accessing Text Content: The <book>
). Attributes provide additional information about elements (e.g., category="cooking"
). Text content is the data contained within an element (e.g., Everyday Italian
).root
element is the top-level element of the tree.findall()
method searches for all elements with a given tag name under a specific element. The find()
method searches for the first element with a given tag name under a specific element.get()
method retrieves the value of an attribute of an element..text
attribute retrieves the text content of an element.
Real-Life Use Case
This technique is widely used for reading configuration files, processing data from web services (APIs that return XML), and parsing data from legacy systems that use XML for data exchange. For instance, many scientific instruments or financial systems output data in XML format, which can then be processed using Python and xml.etree.ElementTree
.
Best Practices
Error Handling: Wrap your parsing code in a Namespaces: If your XML document uses namespaces, you'll need to account for them when searching for elements. This typically involves specifying the namespace in the tag names. Large Files: For very large XML files, consider using iterative parsing (try...except
block to handle potential errors such as malformed XML or missing elements. This will prevent your program from crashing.xml.etree.ElementTree.iterparse
) to avoid loading the entire document into memory at once.
When to Use ElementTree
xml.etree.ElementTree
is a good choice for relatively small to medium-sized XML files where simplicity and ease of use are important. It's part of Python's standard library, so you don't need to install any external dependencies. However, for very large files or complex XML structures, lxml
often provides better performance and more features.
Memory Footprint
ElementTree loads the entire XML document into memory as a tree structure. This can be a concern for very large files. For large files, use iterative parsing or consider lxml
, which is generally more memory efficient.
Alternatives
lxml: A third-party library that provides a faster and more feature-rich XML processing experience. It supports XPath and XSLT, and offers better performance for large XML files. We show a code sample in the next snippet. minidom (xml.dom.minidom): Another built-in Python module for XML processing. It is less commonly used than Beautiful Soup: Primarily used for parsing HTML, but can also be used to parse XML. It's particularly useful for handling malformed or inconsistent XML.xml.etree.ElementTree
and lxml
.
Pros
Cons
lxml
FAQ
-
How do I handle namespaces in XML with ElementTree?
When dealing with XML documents that use namespaces, you need to specify the namespace when searching for elements. You can do this by creating a dictionary that maps namespace prefixes to their corresponding URIs, and then using this dictionary in yourfind()
andfindall()
methods. For example:namespaces = {'prefix': 'http://example.com/namespace'} root.find('prefix:element_name', namespaces)
-
How do I handle errors when parsing XML?
Wrap your parsing code in atry...except
block to catch potentialxml.etree.ElementTree.ParseError
exceptions. This will allow you to gracefully handle malformed XML documents.