How to parse and modify XML files with Python

Reading guide

This article demonstrates how to parse XML files with Python. The background: in software development the usual practice is "convention over configuration", but how do you check that the conventions are actually followed? Take Maven as an example. Maven provides the maven-enforcer-plugin, which can enforce a customizable set of rules. The task here is to use Python to insert the maven-enforcer-plugin configuration into a pom file.

Python environment: 3.8

This article was first published at https://russellgao.cn/python-maven-enforcer-plugins/ (when reprinting, please credit the original source).

Goal

Given a pom.xml file, use Python to insert the maven-enforcer-plugin configuration into it, then check the result with mvn validate. The pom.xml of the open source project arthas is used as the example: https://raw.githubusercontent.com/alibaba/arthas/master/pom.xml

Maven enforcer plugin is configured as follows:

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-enforcer-plugin</artifactId>
<version>3.0.0-M3</version>
<executions>
  <execution>
    <id>enforce-no-snapshots</id>
    <goals>
      <goal>enforce</goal>
    </goals>
    <configuration>
      <rules>
        <requireReleaseDeps>
          <message>No Snapshots Allowed!</message>
        </requireReleaseDeps>
      </rules>
      <fail>true</fail>
    </configuration>
  </execution>
</executions>
</plugin>

The task is to merge this maven-enforcer-plugin block into the pom.xml file. The rest of the article shows how to automate that with Python.

Parsing

Parsing uses the xml package from the standard library (xml.etree.ElementTree). For detailed usage, refer to the official documentation.

Load data:

import xml.etree.ElementTree as ET
tree = ET.parse('pom.xml')
root = tree.getroot()

or

import xml.etree.ElementTree as ET
with open('pom.xml', encoding='utf-8') as f:
    # fromstring() returns the root Element directly, not an ElementTree
    root = ET.fromstring(f.read())
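
The two loading styles return different types, which is easy to trip over. A minimal sketch, using a small inline document (made up for illustration) in place of a real pom.xml:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the contents of a pom.xml file.
XML_TEXT = "<project><artifactId>demo</artifactId></project>"

# ET.parse() accepts a filename or file object and returns an ElementTree.
tree = ET.parse(io.StringIO(XML_TEXT))
root_from_parse = tree.getroot()

# ET.fromstring() accepts a string and returns the root Element directly.
root_from_string = ET.fromstring(XML_TEXT)

# Wrap a bare Element in an ElementTree when tree-level methods
# such as write() are needed.
wrapped_tree = ET.ElementTree(root_from_string)

print(type(tree).__name__, root_from_parse.tag)
print(type(root_from_string).__name__, root_from_string.tag)
```

If the script later calls write(), parsing with ET.parse() is usually the more convenient choice because the tree object is already available.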

Iterate over the child nodes:

for child in root:
    print(child.tag)

### The output is as follows: 
{http://maven.apache.org/POM/4.0.0}modelVersion
{http://maven.apache.org/POM/4.0.0}parent
{http://maven.apache.org/POM/4.0.0}licenses
{http://maven.apache.org/POM/4.0.0}scm
{http://maven.apache.org/POM/4.0.0}developers
{http://maven.apache.org/POM/4.0.0}groupId
{http://maven.apache.org/POM/4.0.0}artifactId
{http://maven.apache.org/POM/4.0.0}version
{http://maven.apache.org/POM/4.0.0}packaging
{http://maven.apache.org/POM/4.0.0}name
{http://maven.apache.org/POM/4.0.0}description
{http://maven.apache.org/POM/4.0.0}url
{http://maven.apache.org/POM/4.0.0}modules
{http://maven.apache.org/POM/4.0.0}profiles
{http://maven.apache.org/POM/4.0.0}properties
{http://maven.apache.org/POM/4.0.0}dependencyManagement
{http://maven.apache.org/POM/4.0.0}build

Notice that every tag is prefixed with {http://maven.apache.org/POM/4.0.0}. Why?

The original pom file has a namespace:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">

The xmlns attribute sets the default namespace to http://maven.apache.org/POM/4.0.0, so the namespace has to be included whenever we look up child nodes.
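
As a minimal sketch of what "including the namespace" means, the snippet below looks up the same element in two equivalent ways. The inline document is a trimmed-down, made-up pom, and the prefix "m" is an arbitrary choice for illustration:

```python
import xml.etree.ElementTree as ET

POM_TEXT = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <build><plugins/></build>
</project>"""

root = ET.fromstring(POM_TEXT)

# Style 1: put the namespace URI in braces in front of the tag name.
build_a = root.find('{http://maven.apache.org/POM/4.0.0}build')

# Style 2: pass a prefix-to-URI mapping as the second argument.
ns = {'m': 'http://maven.apache.org/POM/4.0.0'}
build_b = root.find('m:build', ns)

# Without any namespace, nothing matches and find() returns None.
build_c = root.find('build')

print(build_a is build_b, build_c)
```

The rest of this article uses the first style, with the braced URI kept in a variable.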

Find child nodes (child elements): ET provides the findall() and find() methods for locating child elements, where:

  • findall() finds only elements with the specified tag that are direct children of the current element
  • find() returns the first child with the specified tag; Element.text then accesses the element's text content, and Element.get() accesses the element's attributes

pomNamespace = '{http://maven.apache.org/POM/4.0.0}'
build_node = root.find('%sbuild' % pomNamespace)
plugins_node = build_node.find('{}plugins'.format(pomNamespace))
plugins_node_all = plugins_node.findall('{}plugin'.format(pomNamespace))
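
One practical use of find() and findall() is checking whether a plugin is already declared before inserting it again. The sketch below is an assumption about how such a check might look, not part of the original script; the helper name has_plugin and the inline pom are hypothetical:

```python
import xml.etree.ElementTree as ET

NS = '{http://maven.apache.org/POM/4.0.0}'

POM_TEXT = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
      </plugin>
      <plugin>
        <artifactId>maven-enforcer-plugin</artifactId>
      </plugin>
    </plugins>
  </build>
</project>"""


def has_plugin(plugins_node: ET.Element, artifact_id: str) -> bool:
    """Return True if a direct <plugin> child has the given artifactId."""
    for plugin in plugins_node.findall('{}plugin'.format(NS)):
        artifact = plugin.find('{}artifactId'.format(NS))
        if artifact is not None and artifact.text == artifact_id:
            return True
    return False


root = ET.fromstring(POM_TEXT)
# A path with several steps needs the namespace on every step.
plugins_node = root.find('{0}build/{0}plugins'.format(NS))
print(has_plugin(plugins_node, 'maven-enforcer-plugin'))
print(has_plugin(plugins_node, 'maven-surefire-plugin'))
```

Note that find() returns None when nothing matches, so the result should be checked before accessing .text.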

Add a new node:

def build_plugin_node(e: ET.Element, namespace: str) -> None:
    """
    Append a maven-enforcer-plugin <plugin> element to the parent node.
    :param e: parent node (the <plugins> element)
    :param namespace: namespace in '{uri}' form, prepended to every tag
    :return: None
    """
    node = ET.SubElement(e, '{}plugin'.format(namespace))
    ET.SubElement(node, '{}groupId'.format(namespace)).text = 'org.apache.maven.plugins'
    ET.SubElement(node, '{}artifactId'.format(namespace)).text = 'maven-enforcer-plugin'
    ET.SubElement(node, '{}version'.format(namespace)).text = '3.0.0-M3'
    executions = ET.SubElement(node, '{}executions'.format(namespace))
    execution = ET.SubElement(executions, '{}execution'.format(namespace))
    ET.SubElement(execution, '{}id'.format(namespace)).text = 'enforce-versions'
    ET.SubElement(execution, '{}phase'.format(namespace)).text = 'validate'
    goals = ET.SubElement(execution, '{}goals'.format(namespace))
    ET.SubElement(goals, '{}goal'.format(namespace)).text = 'enforce'
    cfg = ET.SubElement(execution, '{}configuration'.format(namespace))
    rules = ET.SubElement(cfg, '{}rules'.format(namespace))
    ET.SubElement(cfg, '{}fail'.format(namespace)).text = 'true'
    require_release_deps = ET.SubElement(rules, '{}requireReleaseDeps'.format(namespace))
    ET.SubElement(require_release_deps, '{}message'.format(namespace)).text = 'No Snapshots Allowed!'
    excludes = ET.SubElement(require_release_deps, '{}excludes'.format(namespace))
    ET.SubElement(excludes, '{}exclude'.format(namespace)).text = '${project.groupId}:*'


build_plugin_node(plugins_node, pomNamespace)
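
Writing one SubElement call per tag gets long quickly. As an alternative sketch (not the approach used in this article), the same kind of structure could be described as a nested dict and built recursively. The helper add_from_dict is hypothetical, and it assumes that no tag repeats under the same parent, since dict keys are unique:

```python
import xml.etree.ElementTree as ET

NS = '{http://maven.apache.org/POM/4.0.0}'


def add_from_dict(parent: ET.Element, spec: dict, namespace: str = NS) -> None:
    """Append children described by a nested dict.

    A dict value creates a nested element; any other value becomes text.
    """
    for tag, value in spec.items():
        child = ET.SubElement(parent, '{}{}'.format(namespace, tag))
        if isinstance(value, dict):
            add_from_dict(child, value, namespace)
        else:
            child.text = str(value)


plugins = ET.Element('{}plugins'.format(NS))
add_from_dict(plugins, {
    'plugin': {
        'groupId': 'org.apache.maven.plugins',
        'artifactId': 'maven-enforcer-plugin',
        'version': '3.0.0-M3',
        'executions': {
            'execution': {
                'id': 'enforce-versions',
                'phase': 'validate',
                'goals': {'goal': 'enforce'},
            },
        },
    },
})

plugin = plugins.find('{}plugin'.format(NS))
print([child.tag.replace(NS, '') for child in plugin])
```

The explicit version above remains easier to follow for a single fixed plugin; a dict-driven helper mainly pays off when several different blocks have to be generated.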

Write the modified tree to a new file:

tree.write('pom-new.xml')

After writing the file, you may be surprised to find that the output does not look quite right:

<ns0:plugin>
    <ns0:groupId>org.apache.maven.plugins</ns0:groupId>
    <ns0:artifactId>maven-enforcer-plugin</ns0:artifactId>
    <ns0:version>3.0.0-M3</ns0:version>
    <ns0:executions>
        <ns0:execution>
            <ns0:id>enforce-versions</ns0:id>
            <ns0:phase>validate</ns0:phase>
            <ns0:goals>
                <ns0:goal>enforce</ns0:goal>
            </ns0:goals>
            <ns0:configuration>
                <ns0:rules>
                    <ns0:requireReleaseDeps>
                        <ns0:message>No Snapshots Allowed!</ns0:message>
                        <ns0:excludes>
                            <ns0:exclude>${project.groupId}:*</ns0:exclude>
                        </ns0:excludes>
                    </ns0:requireReleaseDeps>
                </ns0:rules>
                <ns0:fail>true</ns0:fail>
            </ns0:configuration>
        </ns0:execution>
    </ns0:executions>
</ns0:plugin>

Why are there so many ns0 prefixes? Once again the namespace is the cause. It has to be registered before the tree is written:

ns_name = ""
ns_value = "http://maven.apache.org/POM/4.0.0"
ET.register_namespace(ns_name, ns_value)
tree.write('pom-new.xml')

Only then can we get a satisfactory result.
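
The effect can be reproduced on a tiny document. The sketch below serializes the same element before and after the registration; the inline document is made up for illustration:

```python
import xml.etree.ElementTree as ET

POM_NS = 'http://maven.apache.org/POM/4.0.0'
root = ET.fromstring('<project xmlns="{}"><build/></project>'.format(POM_NS))

# With no prefix registered for the URI, the serializer invents one (ns0).
before = ET.tostring(root, encoding='unicode')

# Register the empty prefix so the URI is written as the default namespace.
ET.register_namespace('', POM_NS)
after = ET.tostring(root, encoding='unicode')

print(before)
print(after)
```

register_namespace() changes a module-wide registry that is consulted at serialization time, so it may be called before or after parsing, as long as it runs before the tree is written.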

The complete script is as follows:

#!/usr/bin/env python3

import xml.etree.ElementTree as ET

def build_plugin_node(e: ET.Element, namespace: str) -> None:
    """
    Append a maven-enforcer-plugin <plugin> element to the parent node.
    :param e: parent node (the <plugins> element)
    :param namespace: namespace in '{uri}' form, prepended to every tag
    :return: None
    """
    node = ET.SubElement(e, '{}plugin'.format(namespace))
    ET.SubElement(node, '{}groupId'.format(namespace)).text = 'org.apache.maven.plugins'
    ET.SubElement(node, '{}artifactId'.format(namespace)).text = 'maven-enforcer-plugin'
    ET.SubElement(node, '{}version'.format(namespace)).text = '3.0.0-M3'
    executions = ET.SubElement(node, '{}executions'.format(namespace))
    execution = ET.SubElement(executions, '{}execution'.format(namespace))
    ET.SubElement(execution, '{}id'.format(namespace)).text = 'enforce-versions'
    ET.SubElement(execution, '{}phase'.format(namespace)).text = 'validate'
    goals = ET.SubElement(execution, '{}goals'.format(namespace))
    ET.SubElement(goals, '{}goal'.format(namespace)).text = 'enforce'
    cfg = ET.SubElement(execution, '{}configuration'.format(namespace))
    rules = ET.SubElement(cfg, '{}rules'.format(namespace))
    ET.SubElement(cfg, '{}fail'.format(namespace)).text = 'true'
    require_release_deps = ET.SubElement(rules, '{}requireReleaseDeps'.format(namespace))
    ET.SubElement(require_release_deps, '{}message'.format(namespace)).text = 'No Snapshots Allowed!'
    excludes = ET.SubElement(require_release_deps, '{}excludes'.format(namespace))
    ET.SubElement(excludes, '{}exclude'.format(namespace)).text = '${project.groupId}:*'


def main():
    pomNamespace = '{http://maven.apache.org/POM/4.0.0}'
    ns_name = ""
    ns_value = "http://maven.apache.org/POM/4.0.0"
    ET.register_namespace(ns_name, ns_value)
    tree = ET.parse('pom.xml')
    root = tree.getroot()
    build_node = root.find('%sbuild' % pomNamespace)
    plugins_node = build_node.find('{}plugins'.format(pomNamespace))
    build_plugin_node(plugins_node, pomNamespace)
    tree.write('pom-new.xml')

if __name__ == "__main__":
    main()

Once the new file pom-new.xml has been generated, you can call the mvn command to run the check, for example:

mvn validate -f pom-new.xml

This step is often used in CI pipelines, where it helps check whether developers comply with the relevant code conventions.
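
In a CI job the Maven call can be driven from the same Python script. The following is a minimal sketch under the assumption that mvn is available on the PATH; the function names mvn_validate_command and run_validate are hypothetical:

```python
import subprocess
from typing import List


def mvn_validate_command(pom_path: str) -> List[str]:
    """Build the argument list for `mvn validate -f <pom_path>`."""
    return ['mvn', 'validate', '-f', pom_path]


def run_validate(pom_path: str) -> bool:
    """Run Maven and return True when it exits with status 0.

    Requires mvn on the PATH; raises FileNotFoundError otherwise.
    """
    result = subprocess.run(mvn_validate_command(pom_path))
    return result.returncode == 0


# Only the command is printed here; call run_validate() where Maven is installed.
print(' '.join(mvn_validate_command('pom-new.xml')))
```

A CI step could then exit with a non-zero status whenever run_validate('pom-new.xml') returns False.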

Extensions

Common maven-enforcer-plugin rules

Official references to common rules: http://maven.apache.org/enforcer/enforcer-rules/index.html

  • Whether a release version depends on SNAPSHOT versions
  • Java version
  • Maven version
  • Check environment variables
  • ....
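
Further rules can be added to the rules element in the same way as requireReleaseDeps. The sketch below assumes the requireJavaVersion and requireMavenVersion rules, each with a version child, as described in the official rule reference. The helper add_version_rule is hypothetical, and the version ranges are example values only:

```python
import xml.etree.ElementTree as ET

NS = '{http://maven.apache.org/POM/4.0.0}'


def add_version_rule(rules: ET.Element, rule_name: str, version_range: str) -> ET.Element:
    """Append <rule_name><version>version_range</version></rule_name> to <rules>."""
    rule = ET.SubElement(rules, '{}{}'.format(NS, rule_name))
    ET.SubElement(rule, '{}version'.format(NS)).text = version_range
    return rule


rules = ET.Element('{}rules'.format(NS))
# Example ranges in Maven version-range syntax; adjust to the project's needs.
add_version_rule(rules, 'requireJavaVersion', '[1.8,)')
add_version_rule(rules, 'requireMavenVersion', '[3.5.0,)')

print([rule.tag.replace(NS, '') for rule in rules])
```

Check the rule reference linked above for the parameters each rule actually accepts before relying on this shape.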

Other uses of xml.etree.ElementTree

Modify an element's text field:

plugins_node.text = 'plugins-test'

Add attributes to elements:

# Attribute values must be strings, otherwise writing the tree raises a TypeError
plugins_node.set('k1', 'v1')
plugins_node.set('k2', '4')
plugins_node.set('k3', 'true')

Delete element:

modelVersion = root.find('{}modelVersion'.format(pomNamespace))
root.remove(modelVersion)
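
The three operations can be tried together on a small document. This is a self-contained sketch on a made-up document without a namespace, so the tags can be used as plain names:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<project><modelVersion>4.0.0</modelVersion><name>old</name></project>'
)

# Modify the text of an element.
root.find('name').text = 'new'

# Attribute values must be strings; convert other types explicitly.
root.set('k1', 'v1')
root.set('k2', str(4))

# remove() only works on a direct child of the element it is called on.
root.remove(root.find('modelVersion'))

result = ET.tostring(root, encoding='unicode')
print(result)
```

To delete a deeper element, first locate its parent and call remove() on that parent.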

Tags: Apache Maven

Posted by renu on Mon, 11 Apr 2022 21:19:57 +0300