Python Lxml

In this article, you'll learn the basics of parsing an HTML document using Python and the LXML library.

Introduction

Data is the most important ingredient in programming. It comes in all shapes and forms. Sometimes it is placed inside documents such as CSV or JSON files, but sometimes it is stored on the internet or in databases. Some of it is stored, transferred, or processed in the XML format, which is in many ways similar to HTML, yet its purpose is to transfer and store data, whereas HTML's main purpose is to display data. On top of that, HTML and XML are written in similar ways. Despite their differences and similarities, they supplement each other very well.

Both XPath and XML are engineered by the same organization, the W3C, which makes XPath a natural choice for traversing XML documents. Since one of the programming principles that pushes you toward success is 'don't reinvent the wheel', we are going to refer to the W3C (https://www.w3.org/) consortium documents and sources regarding the syntax and operators in our examples, to bring the concept of XPath closer to people wishing to understand it better and use it on real-life problems.

The IT industry has accepted the XML way of transferring data as one of its principles. Imagine that one of your tasks was to gather information from the internet. Copying and pasting is one of the simplest tools to use (and programmers use it regularly as well), but it might only get us some simple data from the web, and the process can become painfully repetitive. If we need more robust data, or have more web pages to gather it from, we might be inclined to use more advanced Python packages to automate our data gathering.

Before we start looking into scraping tools and strategies, it is good to know that scraping might not be legal in all cases; it is therefore highly suggested that we check the terms of service of a particular website, and the copyright law of the region in which the website operates.

For the purposes of harvesting web data, we will be using several Python libraries that allow us to do just that. The first of them is the requests module, which sends an HTTP request and returns a response object. It is only needed if we want to scrape content from the internet; if we are parsing a static XML file, it is not necessary.

There are many parsing modules; LXML, Scrapy, and BeautifulSoup are some of them. Saying which one is better is tricky, since their size and functionality differ from one another. For example, BeautifulSoup is more complex and serves you with more functionality, while LXML and Scrapy are lightweight and can help you traverse documents using XPath and CSS selectors.

There are certain pitfalls when trying to travel through a document using XPath. A common mistake when trying to parse XML with XPath notation is to reach for the BeautifulSoup library; in fact that is not possible, since it does not contain XPath traversing methods. For those purposes we shall use the LXML library.

The requests library is used in case we want to download the HTML markup from a particular website.

The first step would be to install the necessary packages. Through the pip install notation, all of the modules above can be installed rather easily.

Necessary steps:

  1. pip install lxml (xpath module is a part of lxml library)
  2. pip install requests (in case the content is on a web page)

The best way to explain XML parsing is to picture it through examples.

What is XPath?

XML and HTML documents are structurally composed of nodes, forming a family-tree-like structure. The topmost instance, the original ancestor in each tree, is called the root node, and it has no superior nodes. Subordinate nodes are respectively called children, while siblings are elements at the same level as the children. The other terms used in navigating and traversing through the tree are ancestors and descendants, which in essence reflect node relationships the same way we do in real-world family trees.
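To make the vocabulary concrete, here is a minimal sketch using lxml's etree module (the sample document is invented for illustration):

```python
from lxml import etree

# A tiny made-up document: <library> is the root node, the two <book>
# elements are its children (and each other's siblings).
xml = b"""
<library>
  <book><title>A</title></book>
  <book><title>B</title></book>
</library>
"""

root = etree.fromstring(xml)
print(root.tag)                        # the root node: 'library'
children = list(root)                  # its children: two <book> elements
print([child.tag for child in children])
print(children[0].getparent().tag)     # a child's parent is the root again
```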

XPath is a query language that helps us navigate and select the node elements within a node tree. In essence, it is a step map that we need to make to reach certain elements in the tree. The single parts of this step map are called the location steps, and each of these steps would lead us to a certain part of the document.

The terminology used for orientation along the axis (with regards to the current node) is very intuitive since it uses regular English expressions related to real-life family tree relationships.

XPath Selector

An XPath selector is the condition with which we navigate through an XML document. It describes relationships as a hierarchical order of the instances included in our path. By combining different segments of XPath syntax, it helps us traverse to the desired parts of the document. The selector is a part of the XPath query language; by simply adding different criteria, it leads us to different elements in the document tree. The best way to learn the selector syntax and operators is to implement them in an example, and in order to configure an XPath selector, it is essential to know the XPath syntax. An XPath selector is compiled using the etree or html module, both of which are included in the LXML package; the only difference is whether we are parsing an XML or an HTML document.
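As a sketch of that difference, both modules expose the same fromstring/xpath interface (the snippets below are invented for illustration):

```python
from lxml import etree, html

# Parsing XML with the etree module:
xml_doc = etree.fromstring(b"<catalog><item>apple</item></catalog>")
print(xml_doc.xpath("/catalog/item/text()"))   # ['apple']

# Parsing HTML with the html module, which tolerates unclosed tags:
html_doc = html.fromstring("<ul><li>one<li>two</ul>")
print(html_doc.xpath("//li/text()"))           # ['one', 'two']
```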

The selector works similarly to a find method in that it allows you to select a relative path to the element rather than an absolute one, which makes the whole traversal less prone to errors in case the absolute path gets too complicated.

XPath Syntax

XPath syntax can be divided into several groups. To get an exact grasp of the material presented, we are going to apply the expressions and functions listed below to our sample document. In this learning session, we are going to use a website dedicated to scraping exercises.

Node selection:

Using '..' and '.' we can move between levels as we desire. The two-dot notation leads us from wherever we are to our parent element, whereas the one-dot notation points to the current node.

The routes we travel from the 'context node' (our reference node), the starting point of our search, are called 'axes', and the double slash // denotes one of them: it starts traversing from the first instance of the given node, however deep it is nested. This way of path selection is called 'relative path selection'. To be certain that the // expression works, it must be followed by an asterisk (*) or a tag name. By inspecting an element and copying its XPath value, we get the absolute path instead.
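A short sketch of these notations on an invented document:

```python
from lxml import etree

doc = etree.fromstring(b"""
<shop>
  <shelf>
    <book><title>Dune</title></book>
  </shelf>
</shop>
""")

# Relative path selection: // finds <title> no matter how deep it is nested.
print(doc.xpath("//title/text()"))                 # ['Dune']

# The same element through its absolute path:
print(doc.xpath("/shop/shelf/book/title/text()"))  # ['Dune']

# '.' is the current node, '..' climbs to the parent:
title = doc.xpath("//title")[0]
print(title.xpath("./text()"))      # ['Dune'] -- current node
print(title.xpath("..")[0].tag)     # 'book'   -- parent element
```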

XPath Functions and Operators

There are six common operators used inside an XPath query. Operators are noted the same way as in plain Python and serve the same purpose. The functions are meant to aid the search for desired elements or their content.

To add more functionality to our XPath expressions, we can use some of the LXML library's functions. Everything written in between the '[]' is called a predicate, and it is used to describe the search path more precisely. The most frequently used functions are contains() and starts-with(); these functions and their results are displayed in the table below.
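As a sketch of how these two predicates behave (the document is invented for illustration):

```python
from lxml import etree

doc = etree.fromstring(b"""
<menu>
  <dish id="starter-soup">Tomato soup</dish>
  <dish id="main-steak">Steak</dish>
  <dish id="starter-salad">Green salad</dish>
</menu>
""")

# starts-with() filters on the beginning of an attribute value:
print(doc.xpath("//dish[starts-with(@id, 'starter')]/text()"))
# ['Tomato soup', 'Green salad']

# contains() filters on a substring, here inside the text content:
print(doc.xpath("//dish[contains(text(), 'soup')]/@id"))
# ['starter-soup']
```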

Going Up and Down the Axis

The conventional syntax used to traverse up and down the XPath axes is axis::ElementName (for example, ancestor::div).

To reach the elements placed above or below our current node, we might use some of the following axes.
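For instance, climbing up with ancestor and sideways with following-sibling (the document is invented for illustration):

```python
from lxml import etree

doc = etree.fromstring(b"""
<family>
  <parent>
    <child>first</child>
    <child>second</child>
  </parent>
</family>
""")

first = doc.xpath("//child[1]")[0]

# axis::ElementName -- climbing up and moving sideways from <child>first</child>:
print([el.tag for el in first.xpath("ancestor::*")])    # ['family', 'parent']
print(first.xpath("following-sibling::child/text()"))   # ['second']
```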

A Simple Example

The goal of this scraping exercise is to scrape all the book genres listed on the left-hand side of the website. It is almost always necessary to view the page source and inspect some of the elements that we are aiming to scrape.
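The exercise site's markup is not reproduced here, so as a sketch, assume the genre sidebar looks roughly like this (the structure and class names below are assumptions, not copied from the actual site):

```python
from lxml import html

# A stand-in for the page source; on the real site this HTML would come
# from the network instead.
page = """
<aside class="sidebar">
  <ul class="nav nav-list">
    <li><a href="/travel">Travel</a></li>
    <li><a href="/mystery">Mystery</a></li>
    <li><a href="/classics">Classics</a></li>
  </ul>
</aside>
"""

tree = html.fromstring(page)
genres = [g.strip() for g in tree.xpath("//ul[contains(@class, 'nav-list')]/li/a/text()")]
print(genres)   # ['Travel', 'Mystery', 'Classics']
```

When scraping the live page, the tree would instead be built from the downloaded markup, e.g. `tree = html.fromstring(requests.get(url).content)`.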

I had to parse 400mb of XML for some client work and I tried a few different strategies. Here's what I ended up with.

Not too long ago I was writing a Flask service for a client that had to interact with a SOAP API (gross, I know), and one of the goals of this service was to take a bunch of XML data and then compare -> manipulate -> save it to a database.

Most requests were less than 20MB, in which case the first solution I used (the xmltodict Python library) was fine and dandy, but once I had to deal with 400mb of data, things got quite slow.

Suddenly it was taking 80 seconds to convert an XML string into a proper data structure that I could iterate over and access fields on. This was the main bottleneck of the service.

After I spent a few hours researching how to improve the parsing speed, I landed on using the lxml library and I was able to bring the parse time down from 80 seconds to 4 seconds which is a 20x improvement.

Following Along? Getting Set Up

This article will have a few code snippets and if you plan to follow along you will need to install the xmltodict library as well as the lxml library so we can compare both libraries.

Creating a directory to store a few files:

It doesn't matter where you create this directory but we will be creating a few Python files, an XML file and optionally a Dockerfile.

A Dockerfile that you can use:

Since I'm a big fan of Docker, here's a Dockerfile that you can use to get up and running quickly. If you're not using Docker and already have a Python 3.x development environment set up then you can install these packages on your system directly.

Create a new Dockerfile and make it look like this:
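The original Dockerfile is not preserved in this copy, so here is a minimal sketch consistent with the steps described below (the base image and layout are assumptions):

```Dockerfile
FROM python:3.8-slim-buster

# lxml leans on libxml2/libxslt C code; python3-lxml pulls in what we need.
RUN apt-get update && apt-get install -y python3-lxml \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

RUN pip install xmltodict==0.12.0 lxml==4.4.1

COPY . .

CMD ["python3"]
```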

It's worth pointing out that the lxml library requires apt installing python3-lxml on Debian based systems. One of the reasons why lxml is so fast is because it uses that package's C code to do most of the heavy lifting for parsing XML.

The 2 Python libraries we're installing are pinned versions of both tools: pip install xmltodict==0.12.0 lxml==4.4.1.

Building the Docker image:

Now we need to build our Docker image from our Dockerfile.

It will take a few minutes to build and when it's done we'll have an image named pythonxml.

Creating a Python script to generate a ~250mb sample XML file:

Creating a large XML file by hand would be lame so I whipped up a simple script to generate a ~250mb file for us. This XML file will be the file we run our benchmarks on.

You'll want to create a new file called generatexml.py and put this in it:

If you're a Python developer I'm sure you can make sense of the above. How this script generates the sample file isn't too important. Just know it creates a sample.xml file in the current directory with 2 million entries.

The reason I generated so many is that there are very few XML attributes. In my real XML file I had almost 50 XML attributes and over 100,000 items. I also had closer to a 400mb file, but I wanted to keep it a bit smaller for this isolated benchmark.

Running the Python script to generate a sample 250mb XML file:

Since I'm running everything in Docker I am running a Docker command but if you're not using Docker then you can just run python3 generatexml.py.

That command should finish running in less than a minute and produce similar output to:

It took a while for me since I'm running all of this inside of WSL (v1) with Docker for Windows and I didn't write it to my SSD. Have to protect those write cycles!

And if you look in your current directory, you should see:

In my case it generated a 243mb sample.xml file.

You can investigate it by running less sample.xml and paging up / down to view it. Press q to cancel the less tool:

Cool, so now we have our sample data. The next step is to run a few parsing benchmarks against it using 3 different XML parsing strategies.

Creating a Python script to parse the sample XML file:

The last thing we need to set up is the parsexml.py file to demonstrate how to parse the XML file and also benchmark it.

Create a new parsexml.py and make it look like this:
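The original listing is not preserved in this copy; here is a hypothetical sketch of its shape: three parsing strategies, timed with default_timer and selected by a command line argument (the function names are assumptions):

```python
import sys
from timeit import default_timer

def parse_xmltodict(xml_bytes):
    import xmltodict
    return xmltodict.parse(xml_bytes)

def parse_stdlib_etree(xml_bytes):
    import xml.etree.ElementTree as ET
    return ET.fromstring(xml_bytes)

def parse_lxml_etree(xml_bytes):
    from lxml import etree
    return etree.fromstring(xml_bytes)   # lxml wants bytes, not str

STRATEGIES = {
    "xmltodict": parse_xmltodict,
    "etree": parse_stdlib_etree,
    "lxml": parse_lxml_etree,
}

def benchmark(strategy, path="sample.xml"):
    with open(path, "rb") as f:
        data = f.read()
    start = default_timer()
    result = STRATEGIES[strategy](data)
    print(f"{strategy} took {default_timer() - start:.2f}s")
    return result

if __name__ == "__main__" and len(sys.argv) > 1:
    benchmark(sys.argv[1])
```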

We'll go over this in a little more detail when comparing the results.

But the basic idea is we read in the sample.xml file and then parse it using 1 of the 3 strategies. We also use the default_timer function from Python's timeit module to track how long it took to do the work.

I know there's more robust ways to run benchmarks but this gets the job done for this use case.

A specific parsing strategy can be run depending on what command line argument we pass in, and those can be found near the bottom of the script.

xmltodict vs Python's Standard Library vs lxml

Now the fun part. Comparing the numbers:

Python Lxml Library

With a ~250mb sample it's not quite a 20x difference, but it was 20x with a 400mb sample. Still, even in this case, it's about a 15x improvement, which is a huge win.

What's interesting is both Python's standard library and lxml have an etree library and the lxml variant is pretty close to having the same API as the standard library except it's a bit more optimized.

If you look at the code in the parsexml.py file, both versions are the same. The only difference is that lxml expects your file to be sent in as bytes instead of a string.
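A quick sketch of that one difference (it shows up when the string carries an encoding declaration):

```python
import xml.etree.ElementTree as ET
from lxml import etree

xml_str = '<?xml version="1.0" encoding="UTF-8"?><root><item/></root>'

# The standard library happily parses the string:
print(ET.fromstring(xml_str).tag)   # 'root'

# lxml refuses a str that carries an encoding declaration...
try:
    etree.fromstring(xml_str)
except ValueError as error:
    print(error)

# ...so hand it bytes instead:
print(etree.fromstring(xml_str.encode()).tag)   # 'root'
```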

It's also worth pointing out you can parse files directly with etree instead of first opening a file and passing in its value to etree.fromstring. For that, look in the docs for etree.parse or even etree.iterparse if you want to read the file in chunks instead of all at once.

Using iterparse could be handy for dealing with massive files that don't fit in memory or even reading it in from a stream using the requests library if it's the result of an API call.
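A sketch of the iterparse approach on a tiny stand-in file (for the real benchmark you would point it at sample.xml instead):

```python
from lxml import etree

# Write a small stand-in file; imagine the ~250mb sample.xml here instead.
with open("stream.xml", "wb") as f:
    f.write(b"<root><item id='1'/><item id='2'/><item id='3'/></root>")

count = 0
# iterparse yields each element as its closing tag is read, so the whole
# document never has to sit in memory at once.
for _event, element in etree.iterparse("stream.xml", tag="item"):
    count += 1
    element.clear()   # free the element we are done with

print(count)   # 3
```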

How Do You Iterate over the XML with All 3 Strategies?

This is starting to get a bit beyond the scope of this blog post but here's the basics.

With xmltodict:

Produces this output:

Since it's a dictionary you can do whatever you can do with a Python dictionary. Attributes are nested dictionaries and all of this is included in xmltodict's docs.
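Since the original snippet is not preserved in this copy, here is a hypothetical sketch of what that iteration looks like (the document shape is invented):

```python
import xmltodict

xml = b"""
<root>
  <item id="1" price="9.99"/>
  <item id="2" price="19.99"/>
</root>
"""

data = xmltodict.parse(xml)

# Repeated elements become a list of dicts; attributes are keyed
# with an '@' prefix.
for item in data["root"]["item"]:
    print(item["@id"], item["@price"])
```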

With etree (both standard library and lxml):

Produces this output:

Here we can reach into the properties of the book and get anything we want. There is comprehensive documentation available in Python's docs as well as lxml's documentation.
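Again, the original snippet is not preserved in this copy, so here is a hypothetical sketch (the document shape is invented):

```python
from lxml import etree

xml = b"""
<root>
  <item id="1" price="9.99"/>
  <item id="2" price="19.99"/>
</root>
"""

root = etree.fromstring(xml)

# iter() and get() work the same way in both the standard library's
# ElementTree and lxml.etree:
for item in root.iter("item"):
    print(item.get("id"), item.get("price"))
```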

So that's all there is to it. I chose to use lxml and so far it's working out great.

What are your favorite tips for parsing XML in Python? Let me know below.




