XML: XML stands for Extensible Markup Language. It was designed to store and transport data in a form that is both human-readable and machine-readable. Its design goals therefore emphasize simplicity, versatility, and usability on the Internet.
The XML file that will be parsed in this tutorial is actually an RSS feed.
RSS: RSS (Rich Site Summary, often called Really Simple Syndication) uses a family of standard feed formats to publish frequently updated information such as blog posts, news headlines, audio, and video. An RSS feed is plain text in XML format.
- The RSS format itself is relatively easy to read by both automated processes and humans.
- The RSS processed in this tutorial is the feed of top news stories from a popular news site. You can check it out here. Our goal is to process this RSS feed (an XML file) and save it in another format for future use.
Python module being used: this article focuses on using the built-in xml module in Python to parse XML, and specifically on the ElementTree XML API of this module.
Implementation:
The program will:
- Download the RSS feed from the specified URL and save it as an XML file.
- Parse the XML file to save the news as a list of dictionaries, where each dictionary is a separate news item.
- Save the news to a CSV file.
Let’s try to understand the code piece by piece:
- Loading and saving an RSS feed
import requests

def loadRSS():
    # url of rss feed
    url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'
    # creating HTTP response object from given url
    resp = requests.get(url)
    # saving the xml file
    with open('topnewsfeed.xml', 'wb') as f:
        f.write(resp.content)
Here we first created an HTTP response object by sending an HTTP request to the RSS feed URL. The response content now contains the XML file data, which we save as topnewsfeed.xml in our local directory.
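If the requests package is not installed, the same download step can be sketched with the standard library only. This is an alternative to the loadRSS() function above, not the code the article uses; the function name download_feed and its parameters are hypothetical.

```python
import urllib.request


def download_feed(url, dest_path, timeout=10):
    """Fetch url and write the raw response bytes to dest_path."""
    # urlopen raises an exception on network or HTTP errors,
    # so a failed request never produces a half-written file here.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    # Write in binary mode so the XML bytes are saved unchanged.
    with open(dest_path, "wb") as f:
        f.write(data)
    return len(data)
```

Calling download_feed(feed_url, 'topnewsfeed.xml') with the feed URL shown earlier would produce the same local file that loadRSS() writes.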
For more information on how the requests module works, follow this article:
- Parsing the XML file
tree = ET.parse(xmlfile)
Here ET refers to xml.etree.ElementTree, and we create an ElementTree object by parsing the passed xmlfile.
root = tree.getroot()
The getroot() function returns the root of the tree as an Element object.
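To see what the root Element offers, here is a minimal sketch that parses a small in-memory document instead of a file. The SAMPLE string is a hypothetical, heavily trimmed feed used only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed feed used only for illustration.
SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First story</title></item>
  </channel>
</rss>"""

# ET.fromstring parses a string and returns the root Element directly;
# ET.parse(path).getroot() gives the same kind of object for a file on disk.
root = ET.fromstring(SAMPLE)

print(root.tag)               # rss
print(root.attrib)            # {'version': '2.0'}
print([c.tag for c in root])  # ['channel']
```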
for item in root.findall('./channel/item'):
Now, once you look at the structure of your XML file, you will notice that we are only interested in the item elements.
./channel/item is XPath syntax (XPath is a language for addressing parts of an XML document). Here we want to find all item grandchildren that sit under the channel children of the root element (denoted by ".").
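The difference between a few path expressions can be sketched on a small in-memory feed. The FEED string below is hypothetical and only mimics the rss/channel/item nesting described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed with the same rss > channel > item nesting.
FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First story</title></item>
    <item><title>Second story</title></item>
  </channel>
</rss>"""

root = ET.fromstring(FEED)

# './channel/item': start at root ('.'), step into each <channel> child,
# then collect its <item> children.
items = root.findall("./channel/item")
titles = [item.findtext("title") for item in items]

# 'item' alone matches only direct children of root, so it finds nothing here.
direct = root.findall("item")

# './/item' matches <item> elements at any depth below root.
anywhere = root.findall(".//item")

print(titles)
print(len(direct), len(anywhere))
```

The explicit ./channel/item form is the stricter choice: it only picks up items that sit exactly where an RSS feed is expected to put them.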
You can read more about the supported XPath syntax here.
for item in root.findall('./channel/item'):
    # empty news dictionary
    news = {}
    # iterate child elements of item
    for child in item:
        # special checking for namespace object content:media
        if child.tag == '{http://search.yahoo.com/mrss/}content':
            news['media'] = child.attrib['url']
        else:
            news[child.tag] = child.text.encode('utf8')
    # append news dictionary to news items list
    newsitems.append(news)
Now we know we are looping through the item elements, where each item element contains one news story. So we create an empty news dictionary in which we will store all available data about that story. To iterate over each child of an element, we simply iterate over the element itself, like this:
for child in item:
Now look at the sample item element here:
We have to handle namespaced tags separately, because the parser expands the namespace prefix to its full URI. So we do something like this:
if child.tag == '{http://search.yahoo.com/mrss/}content':
    news['media'] = child.attrib['url']
child.attrib is a dictionary of all the attributes associated with an element. Here we are interested in the url attribute of the media:content namespaced tag.
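The namespace expansion can be observed directly on a small example. The ITEM string and its example.com URL are hypothetical; only the namespace URI comes from the code above. The sketch also shows a second way to address the tag, by passing a prefix mapping to find().

```python
import xml.etree.ElementTree as ET

# Hypothetical item; only the namespace URI is taken from the tutorial code.
ITEM = """<item xmlns:media="http://search.yahoo.com/mrss/">
  <title>Example story</title>
  <media:content url="http://example.com/image.jpg" />
</item>"""

item = ET.fromstring(ITEM)

# After parsing, the 'media:' prefix is replaced by the namespace URI in braces.
tags = [child.tag for child in item]
print(tags)

# Option 1: compare against the expanded tag, as the tutorial code does.
MEDIA_CONTENT = "{http://search.yahoo.com/mrss/}content"
url_by_tag = next(c.attrib["url"] for c in item if c.tag == MEDIA_CONTENT)

# Option 2: pass a prefix-to-URI mapping so the short form works in find().
ns = {"media": "http://search.yahoo.com/mrss/"}
media = item.find("media:content", ns)
url_by_prefix = media.get("url") if media is not None else None

print(url_by_tag, url_by_prefix)
```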
Now for all other children, we just do:
news[child.tag] = child.text.encode('utf8')
child.tag contains the name of the child. child.text stores all text inside this child element. So finally, the sample item element is converted to a dictionary and looks like this:
{'description': 'Ignis has a tough competition already, from Hyun ....,
 'guid': 'http://www.hindustantimes.com/autos/maruti-ignis-launch ....,
 'link': 'http://www.hindustantimes.com/autos/maruti-ignis-launch ....,
 'media': 'http://www.hindustantimes.com/rf/image_size_630x354/HT/ ...,
 'pubDate': 'Thu, 12 Jan 2017 12:33:04 GMT',
 'title': 'Maruti Ignis launches on Jan 13: Five cars that threa .....}
Then we just add this dictionary to the list of news items.
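Putting the pieces together, here is a sketch of the parsing loop as a standalone function for Python 3. It is a variant, not the article's exact code: the name parse_items is hypothetical, and it keeps child.text as str instead of calling .encode('utf8'), because in Python 3 encoding produces bytes objects. It also tolerates empty elements, whose text is None. The FEED string is made up for illustration.

```python
import xml.etree.ElementTree as ET

MEDIA_CONTENT = "{http://search.yahoo.com/mrss/}content"


def parse_items(root):
    """Return one dict per <item> found under ./channel of the given root."""
    newsitems = []
    for item in root.findall("./channel/item"):
        news = {}
        for child in item:
            if child.tag == MEDIA_CONTENT:
                # .get() returns None instead of raising if 'url' is missing.
                news["media"] = child.get("url")
            else:
                # child.text is None for empty elements; keep str, not bytes.
                news[child.tag] = (child.text or "").strip()
        newsitems.append(news)
    return newsitems


# Hypothetical feed used only to exercise the function.
FEED = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Example story</title>
      <description></description>
      <media:content url="http://example.com/image.jpg" />
    </item>
  </channel>
</rss>"""

print(parse_items(ET.fromstring(FEED)))
```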
Finally, this list is returned.
- Saving the data to a CSV file
Now we simply save the news list to a CSV file with the savetoCSV() function, so that it can be easily used or modified in the future. To learn more about writing dictionary elements to a CSV file, see this article:
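The body of savetoCSV() is not shown here, so the following is only a sketch of how such a function might look with csv.DictWriter. The name save_to_csv is hypothetical, the column list is an assumption taken from the keys of the sample dictionary in this article, and the values are assumed to be str rather than bytes.

```python
import csv

# Assumed column names, taken from the keys of the sample dictionary.
FIELDS = ["guid", "title", "pubDate", "description", "link", "media"]


def save_to_csv(newsitems, filename):
    """Write a list of news dicts to filename, one row per item."""
    # newline="" lets the csv module control line endings itself.
    with open(filename, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(
            csvfile,
            fieldnames=FIELDS,
            restval="",             # fill columns that an item does not have
            extrasaction="ignore",  # skip keys that are not listed in FIELDS
        )
        writer.writeheader()
        writer.writerows(newsitems)
```

Setting restval and extrasaction is a defensive choice: feed items do not always carry the same set of child elements, and without these options a missing or unexpected key would either leave the row inconsistent or raise a ValueError.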
This is what our formatted data looks like:

As you can see, the hierarchical XML file data has been converted to a simple CSV file, so all news stories are saved as a table. This also makes it easier to extend the database.
You can also use the JSON-like data directly in your applications. This is a good alternative for fetching data from websites that do not provide a public API but do provide RSS feeds.
All the code and files used in the article above can be found here.
What’s next?
- You can take a look at the other RSS feeds of the news site used in the above example. You can try to create an extended version of the above example by analyzing other RSS feeds as well.
- Are you a fan of cricket? Then this RSS feed should interest you! You can parse this XML file to collect real-time cricket information and use it to generate desktop notifications!
This article is provided by Nikhil Kumar.