Parsing XML in Python



XML: XML stands for Extensible Markup Language. It was designed to store and transport data, and to be both human-readable and machine-readable. Its design goals emphasize simplicity, generality, and usability across the Internet.
The XML file that will be parsed in this tutorial is actually an RSS feed.

RSS: RSS (Rich Site Summary, often called Really Simple Syndication) uses a family of standard web feed formats to publish frequently updated information such as blog posts, news headlines, audio, and video. An RSS document is plain text in XML format.

  • The RSS format itself is relatively easy to read, both for automated processes and for humans.
  • The RSS feed processed in this tutorial is the top-news feed of a popular news website. Our goal is to process this RSS feed (or XML file) and save it in another format for future use.

Python module used: This article focuses on Python's built-in xml module for parsing XML, and in particular on the ElementTree XML API of this module.
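As a quick orientation before the full program, here is a minimal sketch of the ElementTree API on a tiny in-memory document. The XML string and the names in it (library, book, id) are made up for illustration and are not part of the feed used in this tutorial.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document, used only to illustrate the API
SAMPLE = """<library>
    <book id="1">Dive Into XML</book>
    <book id="2">Feeds in Practice</book>
</library>"""

# fromstring() parses a string and returns the root element directly
root = ET.fromstring(SAMPLE)

# every element exposes its tag name, its attributes and its text
summary = [(book.tag, book.attrib["id"], book.text) for book in root]

print(root.tag)
for tag, book_id, text in summary:
    print(tag, book_id, text)
```

ET.parse(), used in the program below, does the same job for a file on disk and returns a tree object whose getroot() method gives the root element.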

Implementation:

 

# Python code to illustrate parsing XML files
# import required modules
import csv
import requests
import xml.etree.ElementTree as ET


def loadRSS():
    # URL of the RSS feed
    url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'

    # create an HTTP response object from the given URL
    resp = requests.get(url)

    # save the XML file
    with open('topnewsfeed.xml', 'wb') as f:
        f.write(resp.content)


def parseXML(xmlfile):
    # create element tree object
    tree = ET.parse(xmlfile)

    # get root element
    root = tree.getroot()

    # create an empty list for news items
    newsitems = []

    # iterate over news items
    for item in root.findall('./channel/item'):
        # empty news dictionary
        news = {}

        # iterate over child elements of item
        for child in item:
            # special handling for the namespaced element media:content
            if child.tag == '{http://search.yahoo.com/mrss/}content':
                news['media'] = child.attrib['url']
            else:
                # child.text is already a str in Python 3
                news[child.tag] = child.text

        # append news dictionary to news items list
        newsitems.append(news)

    # return news items list
    return newsitems


def savetoCSV(newsitems, filename):
    # specify the fields for the CSV file
    fields = ['guid', 'title', 'pubDate', 'description', 'link', 'media']

    # write to the CSV file; the csv module expects newline=''
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        # create a csv dict writer object; skip child tags not listed in fields
        writer = csv.DictWriter(csvfile, fieldnames=fields, extrasaction='ignore')

        # write headers (field names)
        writer.writeheader()

        # write data rows
        writer.writerows(newsitems)


def main():
    # download the RSS feed from the web to update the existing XML file
    loadRSS()

    # parse the XML file
    newsitems = parseXML('topnewsfeed.xml')

    # store news items in a CSV file
    savetoCSV(newsitems, 'topnews.csv')


if __name__ == "__main__":
    # calling the main function
    main()

The above code will:

  • Download the RSS feed from the specified URL and save it as an XML file.
  • Parse the XML file and store the news as a list of dictionaries, where each dictionary is a single news item.
  • Save the news items to a CSV file.

Let's try to understand the code piece by piece:

    • Loading and saving an RSS feed
       def loadRSS():
           # URL of the RSS feed
           url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'

           # create an HTTP response object from the given URL
           resp = requests.get(url)

           # save the XML file
           with open('topnewsfeed.xml', 'wb') as f:
               f.write(resp.content)

      Here we first create an HTTP response object by sending an HTTP request to the URL of the RSS feed. The content of the response now contains the XML file data, which we save as topnewsfeed.xml in the local directory.
      For more information on how the requests module works, see the requests documentation.
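The function above assumes that the request always succeeds. A slightly more defensive sketch is shown below. It swaps requests for the standard library's urllib.request so that it has no third-party dependency, and the function name download_feed, its parameters, and the default timeout are assumptions made for illustration, not part of the original program.

```python
import urllib.request


def download_feed(url, filename, timeout=10):
    """Download the resource at url and save its raw bytes to filename.

    Hypothetical helper: the name, parameters and default timeout are
    illustrative assumptions.
    """
    # urlopen raises URLError (or its subclass HTTPError) on failure
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()

    # write the bytes unchanged so the encoding declared in the XML stays valid
    with open(filename, "wb") as f:
        f.write(data)

    # return the number of bytes saved
    return len(data)
```

Calling download_feed('http://www.hindustantimes.com/rss/topnews/rssfeed.xml', 'topnewsfeed.xml') would play the role of loadRSS(), provided the feed URL is still reachable.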
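
    • Parsing XML
      In parseXML(), ET.parse() reads the saved file into an element tree and getroot() returns its root element. root.findall() with the path ./channel/item then selects every item element nested under channel. You can read more about the supported XPath syntax in the ElementTree documentation.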
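To see what a path expression such as ./channel/item selects, the following sketch runs it on a small in-memory feed. The feed content is invented for illustration; only the rss/channel/item nesting mirrors the structure the program relies on.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed; real feeds carry many more fields
FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First story</title></item>
    <item><title>Second story</title></item>
  </channel>
</rss>"""

root = ET.fromstring(FEED)

# './channel/item' means: from the root, go to each <channel> child,
# then select each <item> child of that channel
items = root.findall("./channel/item")

# findtext() returns the text of the first matching child element
titles = [item.findtext("title") for item in items]
print(titles)
```

Note that the title of the channel itself is not selected, because the path only matches item elements.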

       for item in root.findall('./channel/item'):
           # empty news dictionary
           news = {}

           # iterate over child elements of item
           for child in item:
               # special handling for the namespaced element media:content
               if child.tag == '{http://search.yahoo.com/mrss/}content':
                   news['media'] = child.attrib['url']
               else:
                   news[child.tag] = child.text

           # append news dictionary to news items list
           newsitems.append(news)

      We are iterating over the item elements, each of which contains one news item. For each of them we create an empty news dictionary in which we will store all the available data about that item. To iterate over the children of an element, we simply iterate over the element itself, like this:

       for child in item: 
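A minimal sketch of this iteration on a single item element is given below. The element and its field values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical item element with three plain children
ITEM = """<item>
  <title>Sample headline</title>
  <link>http://example.com/story</link>
  <pubDate>Thu, 12 Jan 2017 12:33:04 GMT</pubDate>
</item>"""

item = ET.fromstring(ITEM)

# iterating over an element yields its direct children in document order
pairs = [(child.tag, child.text) for child in item]

# len() of an element is the number of its direct children
print(len(item))
for tag, text in pairs:
    print(tag, "->", text)
```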

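      A sample item element in the feed contains, besides plain children such as title and link, a child from the media namespace.

      Namespaced tags have to be handled separately, because the namespace prefix is expanded to its full URI when the document is parsed. So we do something like this: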

       if child.tag == '{http://search.yahoo.com/mrss/}content':
           news['media'] = child.attrib['url']

      child.attrib is a dictionary of all the attributes associated with an element. Here we are interested in the url attribute of the media:content tag.
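The expansion of the prefix and the attribute lookup can be seen in a short sketch. The item below is invented for illustration. The namespaces mapping passed to find() is an alternative to spelling out the expanded tag by hand and is not used in the original program.

```python
import xml.etree.ElementTree as ET

# Hypothetical item; in a real feed the xmlns:media declaration
# usually sits on the <rss> root element
ITEM = """<item xmlns:media="http://search.yahoo.com/mrss/">
  <title>Sample headline</title>
  <media:content url="http://example.com/picture.jpg" />
</item>"""

item = ET.fromstring(ITEM)

# after parsing, the prefix is replaced by the namespace URI in braces
tags = [child.tag for child in item]
print(tags)

# find() accepts a prefix-to-URI mapping, so the prefixed name can be used
ns = {"media": "http://search.yahoo.com/mrss/"}
media = item.find("media:content", ns)
print(media.attrib["url"])
```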
      Now for all other children, we just do:

       news[child.tag] = child.text

      child.tag contains the name of the child element, and child.text stores the text inside it. So finally, a sample item element is converted to a dictionary that looks like this:

       {'description': 'Ignis has a tough competition already, from Hyun....,
        'guid': 'http://www.hindustantimes.com/autos/maruti-ignis-launch....,
        'link': 'http://www.hindustantimes.com/autos/maruti-ignis-launch....,
        'media': 'http://www.hindustantimes.com/rf/image_size_630x354/HT/...,
        'pubDate': 'Thu, 12 Jan 2017 12:33:04 GMT',
        'title': 'Maruti Ignis launches on Jan 13: Five cars that threa.....}

      We then append this dictionary to the newsitems list.
      Finally, the list is returned.
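One edge case is worth noting: child.text is None for an empty element such as <description/>. The sketch below is a variant of the loop body that guards against this. The helper name item_to_dict and the choice to store an empty string are assumptions, not part of the original program.

```python
import xml.etree.ElementTree as ET

MEDIA_CONTENT = "{http://search.yahoo.com/mrss/}content"


def item_to_dict(item):
    """Convert one <item> element into a flat dictionary (hypothetical helper)."""
    news = {}
    for child in item:
        if child.tag == MEDIA_CONTENT:
            # .get() returns None instead of raising if url is missing
            news["media"] = child.get("url")
        else:
            # empty elements have text None; store "" instead
            news[child.tag] = child.text or ""
    return news


# invented sample item for demonstration
item = ET.fromstring(
    '<item xmlns:media="http://search.yahoo.com/mrss/">'
    "<title>Sample headline</title>"
    "<description/>"
    '<media:content url="http://example.com/picture.jpg"/>'
    "</item>"
)
print(item_to_dict(item))
```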

    • Saving the data to a CSV file
      Now we simply save the list of news items to a CSV file with the savetoCSV() function, so that it can easily be used or modified in the future. To learn more about writing dictionaries to a CSV file, see the documentation of csv.DictWriter.
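The following sketch shows the same DictWriter steps writing to an in-memory buffer instead of a file, so the result can be inspected directly. The rows are invented, and extrasaction='ignore' tells DictWriter to skip keys that are not in the field list instead of raising ValueError.

```python
import csv
import io

fields = ["guid", "title", "link"]

# invented rows; the second one has a key that is not in fields
rows = [
    {"guid": "1", "title": "First story", "link": "http://example.com/1"},
    {"guid": "2", "title": "Second, with comma", "link": "http://example.com/2",
     "category": "extra"},
]

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
writer.writeheader()
writer.writerows(rows)

# values containing a comma are quoted automatically
print(buffer.getvalue())
```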

    The hierarchical XML data has now been converted to a simple CSV file in which all news items are stored as rows of a table. This also makes the data easier to extend.
    You can also use the JSON-like list of dictionaries directly in your applications. This is a good alternative for extracting data from websites that provide RSS feeds but no public API.
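For instance, the list of dictionaries can be serialized with the standard json module. The news items below are invented placeholders standing in for the output of parseXML(), assuming its values are plain strings.

```python
import json

# placeholder data shaped like the list returned by parseXML()
newsitems = [
    {"title": "First story", "link": "http://example.com/1"},
    {"title": "Second story", "link": "http://example.com/2"},
]

# dumps() produces a JSON string; indent makes it human-readable
text = json.dumps(newsitems, indent=2)
print(text)

# loads() restores the same list of dictionaries
restored = json.loads(text)
```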


    What's next?

    • You can take a look at the other RSS feeds of the news website used in the example above, and try to build an extended version of the example that parses those feeds as well.
    • Are you a cricket fan? Then an RSS feed with cricket updates should interest you! You can parse its XML to collect real-time cricket information and use it to generate desktop notifications.

    This article is provided by Nikhil Kumar.