5. xml files

XML stands for eXtensible Markup Language.
XML is a markup language.
XML was designed to store and transport data in a software- and hardware-independent way.

Warning

The XML modules are not secure against erroneous or maliciously constructed data.
If you need to parse untrusted or unauthenticated data use the defusedxml Package.
defusedxml is a pure Python package that prevents any potentially malicious operation.

5.1. xml structure

XML stores data in a structured way; the author must define both the tags and the document structure.
XML is a tree of elements.
In XML, an element is a pair of tags; a start tag and an end tag, usually with text between them. e.g. <name>Joe</name>
Elements can contain attributes within the start tag as key value pairs. e.g <name id=”1”>Joe</name>
Elements can contain text between the tags and other elements. e.g <name id=”1”>Joe<nickname>Joey</nickname></name>
The examples used will not include attributes (in order to simplify the coding).
An example of xml for employees is below.
<employees> is the root tag.
<office_worker> is a child tag of the root tag.
<firstName> is a child tag of the office_worker tag.
<?xml version="1.0" encoding="UTF-8" ?>
<employees>
    <office_worker>
        <firstName>John</firstName>
        <lastName>Doe</lastName>
        <gender>Male</gender>
    </office_worker>
    <office_worker>
        <firstName>Peter</firstName>
        <lastName>Jones</lastName>
        <gender>Male</gender>
    </office_worker>
    <writer>
        <firstName>Anna</firstName>
        <lastName>Smith</lastName>
        <gender>Female</gender>
    </writer>
</employees>

5.2. Parse xml file

Import a part of the xml library using: import xml.etree.ElementTree as ET
Use ET.parse to parse an XML source into an element tree.
ET.parse(source, parser=None)
Parameters:
  • source – a filename or file object containing XML data.

  • parser – an optional parser instance. If not given, the standard XMLParser parser is used.

Returns an ElementTree instance.

Use getroot() to get the root of the element tree object.
ET.getroot()
Returns the root element for the tree, ET.

ET.findall(match, namespaces=None)
Parameters:
  • match – may be a tag name or a path

  • namespaces – is an optional mapping from namespace prefix to full name. Pass ‘’ as prefix to move all unprefixed tag names in the expression into the given namespace.

Finds all matching subelements, by tag name or path, starting at the root of the tree.
Returns a list containing all matching elements in document order.

The code below opens the xml file, tree = ET.parse(xml_file_path) and parses it into the variable tree.
found_elememnts = tree.findall("office_worker") is used to get a list containing the office_worker data as a list of element objects.
for elem in found_elememnts: iterates over each office_worker list of elements so the text of each element can be obtained.
e.g. elem.find("firstName").text
import xml.etree.ElementTree as ET

xml_file_path = "files/employees.xml"
tree = ET.parse(xml_file_path)


employee_type = "office_worker"
found_elememnts = tree.findall(employee_type)
for elem in found_elememnts:
    f_name = elem.find("firstName").text
    l_name = elem.find("lastName").text
    gen = elem.find("gender").text
    print(f'{f_name} {l_name} is a {gen.lower()} {employee_type}')

employee_type = "writer"
found_elememnts = tree.findall(employee_type)
for elem in found_elememnts:
    f_name = elem.find("firstName").text
    l_name = elem.find("lastName").text
    gen = elem.find("gender").text
    print(f'{f_name} {l_name} is a {gen.lower()} {employee_type}')
The print output is below:
John Doe is a male office_worker
Peter Jones is a male office_worker
Anna Smith is a female writer

Tasks

  1. Avoid code duplication above by using a list of employee types and iterating over them.

Avoid code duplication above by using a list of employee types and iterating over them.

import xml.etree.ElementTree as ET

xml_file_path = "files/employees.xml"
tree = ET.parse(xml_file_path)


employee_types = ["office_worker", "writer"]
for employee_type in employee_types:
    found_elememnts = tree.findall(employee_type)
    for elem in found_elememnts:
        f_name = elem.find("firstName").text
        l_name = elem.find("lastName").text
        gen = elem.find("gender").text
        print(f'{f_name} {l_name} is a {gen.lower()} {employee_type}')

5.3. Edit tag text

Elements can be referred to by numeric indices from the root or from the parent.
e.g. The text “Peter”, the firstName of the second employee (who is at at index 1 of the root), is at root[1][0].text.
<?xml version="1.0" encoding="UTF-8" ?>
<employees>
    ...
    <office_worker>
        <firstName>Peter</firstName>
        ...
    </office_worker>
</employees>
The code below changes “Peter” to “Pete” and saves the change to the file employees2.xml
import xml.etree.ElementTree as ET

xml_file_path = "files/employees.xml"
xml_file_path2 = "files/employees2.xml"
tree = ET.parse(xml_file_path)
root = tree.getroot()

root[2][0].text = "Pete"
tree.write(xml_file_path2)

5.4. Fixing indenting

ET.indent(tree, space='  ', level=0)
Parameters:
  • tree – can be an Element or ElementTree.

  • space – space is the whitespace string that will be inserted for each indentation level, two space characters by default

  • level – For indenting partial subtrees inside of an already indented tree, pass the initial indentation level as level.

Appends whitespace to the subtree to indent the tree visually. This can be used to generate pretty-printed XML output.

5.5. New Elements

ET.Element(tag, attrib={}, **extra)
Parameters:
  • tag – tag is the element name

  • attrib – attrib is an optional dictionary, containing element attributes.

  • extra – extra contains additional attributes, given as keyword arguments.

Appends whitespace to the subtree to indent the tree visually. This can be used to generate pretty-printed XML output.
Just use: ET.Element(tag).

5.6. Append element

Elements can be appended to the root.
e.g. root.append(emp4_xml)
employer4 = ET.Element("writer") creates an element for the new employee.
child = ET.Element("firstName") creates an element for the firstName.
child.text = "Jane" adds the text value to firstName element.
employer4.append(child) appends the firstName element to the employer4 element.
The other child elements are then built and appended to the employer4 element befor eit is finally appended to the root.
import xml.etree.ElementTree as ET

xml_file_path = "files/employees.xml"
xml_file_path2 = "files/employees2.xml"
tree = ET.parse(xml_file_path)
root = tree.getroot()

employer4 = ET.Element("writer")

child = ET.Element("firstName")
child.text = "Jane"
employer4.append(child)

child = ET.Element("lastName")
child.text = "Austin"
employer4.append(child)

child = ET.Element("gender")
child.text = "Female"
employer4.append(child)

root.append(employer4)
ET.indent(tree, space='    ', level=0)
tree.write(xml_file_path2)
The code above appends the xml below:
<writer>
    <firstName>Jane</firstName>
    <lastName>Austin</lastName>
    <gender>Female</gender>
</writer>

Tasks

  1. Write a definition to append the employee data from a python dictionary to the xml and use it to add the employee dictionary: {“firstName”:”Jane”,”lastName”:”Austin”,”gender”:”Female”}

Write a definition to append the employee data from a python dictionary to the xml and use it to add the employee dictionary: {“firstName”:”Jane”,”lastName”:”Austin”,”gender”:”Female”} using a tag of “writer”

import xml.etree.ElementTree as ET

xml_file_path = "files/employees.xml"
xml_file_path2 = "files/employees2.xml"
tree = ET.parse(xml_file_path)
root = tree.getroot()

def append_dict_to_xml(tag, d):
    # create new element
    elem = ET.Element(tag)
    for key, val in d.items():
        # create child element
        child = ET.Element(key)
        child.text = val
        elem.append(child)
    return elem

emp4_dict = {"firstName":"Jane","lastName":"Austin","gender":"Female"}
tag = "writer"
emp4_xml = append_dict_to_xml(tag, emp4_dict)
root.append(emp4_xml)

ET.indent(tree, space='    ', level=0)
tree.write(xml_file_path2)
<employees>
    <office_worker>
        <firstName>John</firstName>
        <lastName>Doe</lastName>
        <gender>Male</gender>
    </office_worker>
    <office_worker>
        <firstName>Peter</firstName>
        <lastName>Jones</lastName>
        <gender>Male</gender>
    </office_worker>
    <writer>
        <firstName>Anna</firstName>
        <lastName>Smith</lastName>
        <gender>Female</gender>
    </writer>
    <writer>
        <firstName>Jane</firstName>
        <lastName>Austin</lastName>
        <gender>Female</gender>
    </writer>
</employees>