Pandas read_xml() method test strategies

StackOverflow

Currently, pandas I/O tools does not maintain a read_xml() method and the counterpart to_xml(). However, read_json proves tree-like structures can be implemented for dataframe import and read_html for markup formats.

If the pandas team does consider such a read_xml method for a future pandas version, what implementation would they pursue: parsing with built-in xml.etree.ElementTree with its iterfind() or iterparse() functions or the third-party module, lxml with its XPath 1.0 and XSLT 1.0 methods?

Below are my test runs for four method types on a simple, flat, element-centric XML input. All are set up for generalized parsing for any second level children of root and each method should yield exact same pandas dataframe. All but the last calls pd.Dataframe() on list of dictionaries. The XSLT method transforms XML to CSV for casted StringIO() in pd.read_csv().

Question (multi-part)

  • PERFORMANCE: How do you explain the slower iterparse often recommended for larger files as file is iteratively parsed? Is it partly due to the if logic checks?

  • MEMORY: Do CPU memory correlate with timings in I/O calls? XSLT and XPath 1.0 tend not to scale well with larger XML documents as entire file must be read in memory to be parsed.

  • STRATEGY: Is list of dictionaries an optimal strategy for Dataframe() call? See these interesting answers: generator version and a iterwalk user-defined version. Both upcast lists to dataframe.

Input Data (Stack Overflow"s current top users by year of which our pandas friends are included)

<?xml version="1.0" encoding="utf-8"?>
<stackoverflow>
  <topusers>
    <user>Gordon Linoff</user>
    <link>http://www.stackoverflow.com//users/1144035/gordon-linoff</link>
    <location>New York, United States</location>
    <year_rep>5,985</year_rep>
    <total_rep>499,408</total_rep>
    <tag1>sql</tag1>
    <tag2>sql-server</tag2>
    <tag3>mysql</tag3>
  </topusers>
  <topusers>
    <user>Günter Zöchbauer</user>
    <link>http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer</link>
    <location>Linz, Austria</location>
    <year_rep>5,835</year_rep>
    <total_rep>154,439</total_rep>
    <tag1>angular2</tag1>
    <tag2>typescript</tag2>
    <tag3>javascript</tag3>
  </topusers>
  <topusers>
    <user>jezrael</user>
    <link>http://www.stackoverflow.com//users/2901002/jezrael</link>
    <location>Bratislava, Slovakia</location>
    <year_rep>5,740</year_rep>
    <total_rep>83,237</total_rep>
    <tag1>pandas</tag1>
    <tag2>python</tag2>
    <tag3>dataframe</tag3>
  </topusers>
  <topusers>
    <user>VonC</user>
    <link>http://www.stackoverflow.com//users/6309/vonc</link>
    <location>France</location>
    <year_rep>5,577</year_rep>
    <total_rep>651,397</total_rep>
    <tag1>git</tag1>
    <tag2>github</tag2>
    <tag3>docker</tag3>
  </topusers>
  <topusers>
    <user>Martijn Pieters</user>
    <link>http://www.stackoverflow.com//users/100297/martijn-pieters</link>
    <location>Cambridge, United Kingdom</location>
    <year_rep>5,337</year_rep>
    <total_rep>525,176</total_rep>
    <tag1>python</tag1>
    <tag2>python-3.x</tag2>
    <tag3>python-2.7</tag3>
  </topusers>
  <topusers>
    <user>T.J. Crowder</user>
    <link>http://www.stackoverflow.com//users/157247/t-j-crowder</link>
    <location>United Kingdom</location>
    <year_rep>5,258</year_rep>
    <total_rep>508,310</total_rep>
    <tag1>javascript</tag1>
    <tag2>jquery</tag2>
    <tag3>java</tag3>
  </topusers>
  <topusers>
    <user>akrun</user>
    <link>http://www.stackoverflow.com//users/3732271/akrun</link>
    <location></location>
    <year_rep>5,188</year_rep>
    <total_rep>229,553</total_rep>
    <tag1>r</tag1>
    <tag2>dplyr</tag2>
    <tag3>dataframe</tag3>
  </topusers>
  <topusers>
    <user>Wiktor Stribi?ew</user>
    <link>http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew</link>
    <location>Warsaw, Poland</location>
    <year_rep>4,948</year_rep>
    <total_rep>158,134</total_rep>
    <tag1>regex</tag1>
    <tag2>javascript</tag2>
    <tag3>c#</tag3>
  </topusers>
  <topusers>
    <user>Darin Dimitrov</user>
    <link>http://www.stackoverflow.com//users/29407/darin-dimitrov</link>
    <location>Sofia, Bulgaria</location>
    <year_rep>4,936</year_rep>
    <total_rep>709,683</total_rep>
    <tag1>c#</tag1>
    <tag2>asp.net-mvc</tag2>
    <tag3>asp.net-mvc-3</tag3>
  </topusers>
  <topusers>
    <user>Eric Duminil</user>
    <link>http://www.stackoverflow.com//users/6419007/eric-duminil</link>
    <location></location>
    <year_rep>4,854</year_rep>
    <total_rep>12,557</total_rep>
    <tag1>ruby</tag1>
    <tag2>ruby-on-rails</tag2>
    <tag3>arrays</tag3>
  </topusers>
  <topusers>
    <user>alecxe</user>
    <link>http://www.stackoverflow.com//users/771848/alecxe</link>
    <location>New York, United States</location>
    <year_rep>4,723</year_rep>
    <total_rep>233,368</total_rep>
    <tag1>python</tag1>
    <tag2>selenium</tag2>
    <tag3>protractor</tag3>
  </topusers>
  <topusers>
    <user>Jean-François Fabre</user>
    <link>http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre</link>
    <location>Toulouse, France</location>
    <year_rep>4,526</year_rep>
    <total_rep>30,027</total_rep>
    <tag1>python</tag1>
    <tag2>python-3.x</tag2>
    <tag3>python-2.7</tag3>
  </topusers>
  <topusers>
    <user>piRSquared</user>
    <link>http://www.stackoverflow.com//users/2336654/pirsquared</link>
    <location>Bellevue, WA, United States</location>
    <year_rep>4,482</year_rep>
    <total_rep>41,183</total_rep>
    <tag1>pandas</tag1>
    <tag2>python</tag2>
    <tag3>dataframe</tag3>
  </topusers>
  <topusers>
    <user>CommonsWare</user>
    <link>http://www.stackoverflow.com//users/115145/commonsware</link>
    <location>Who Wants to Know?</location>
    <year_rep>4,475</year_rep>
    <total_rep>616,135</total_rep>
    <tag1>android</tag1>
    <tag2>java</tag2>
    <tag3>android-intent</tag3>
  </topusers>
  <topusers>
    <user>Quentin</user>
    <link>http://www.stackoverflow.com//users/19068/quentin</link>
    <location>United Kingdom</location>
    <year_rep>4,464</year_rep>
    <total_rep>509,365</total_rep>
    <tag1>javascript</tag1>
    <tag2>html</tag2>
    <tag3>css</tag3>
  </topusers>
  <topusers>
    <user>Jon Skeet</user>
    <link>http://www.stackoverflow.com//users/22656/jon-skeet</link>
    <location>Reading, United Kingdom</location>
    <year_rep>4,348</year_rep>
    <total_rep>921,690</total_rep>
    <tag1>c#</tag1>
    <tag2>java</tag2>
    <tag3>.net</tag3>
  </topusers>
  <topusers>
    <user>Felix Kling</user>
    <link>http://www.stackoverflow.com//users/218196/felix-kling</link>
    <location>Sunnyvale, CA</location>
    <year_rep>4,324</year_rep>
    <total_rep>411,535</total_rep>
    <tag1>javascript</tag1>
    <tag2>jquery</tag2>
    <tag3>asynchronous</tag3>
  </topusers>
  <topusers>
    <user>matt</user>
    <link>http://www.stackoverflow.com//users/341994/matt</link>
    <location></location>
    <year_rep>4,313</year_rep>
    <total_rep>220,515</total_rep>
    <tag1>swift</tag1>
    <tag2>ios</tag2>
    <tag3>xcode</tag3>
  </topusers>
  <topusers>
    <user>Psidom</user>
    <link>http://www.stackoverflow.com//users/4983450/psidom</link>
    <location>Atlanta, GA, United States</location>
    <year_rep>4,236</year_rep>
    <total_rep>36,950</total_rep>
    <tag1>python</tag1>
    <tag2>pandas</tag2>
    <tag3>r</tag3>
  </topusers>
  <topusers>
    <user>Martin R</user>
    <link>http://www.stackoverflow.com//users/1187415/martin-r</link>
    <location>Germany</location>
    <year_rep>4,195</year_rep>
    <total_rep>269,380</total_rep>
    <tag1>swift</tag1>
    <tag2>ios</tag2>
    <tag3>swift3</tag3>
  </topusers>
  <topusers>
    <user>Barmar</user>
    <link>http://www.stackoverflow.com//users/1491895/barmar</link>
    <location>Arlington, MA</location>
    <year_rep>4,179</year_rep>
    <total_rep>289,989</total_rep>
    <tag1>javascript</tag1>
    <tag2>php</tag2>
    <tag3>jquery</tag3>
  </topusers>
  <topusers>
    <user>Alexey Mezenin</user>
    <link>http://www.stackoverflow.com//users/1227923/alexey-mezenin</link>
    <location>??????</location>
    <year_rep>4,142</year_rep>
    <total_rep>31,602</total_rep>
    <tag1>laravel</tag1>
    <tag2>php</tag2>
    <tag3>laravel-5.3</tag3>
  </topusers>
  <topusers>
    <user>BalusC</user>
    <link>http://www.stackoverflow.com//users/157882/balusc</link>
    <location>Amsterdam, Netherlands</location>
    <year_rep>4,046</year_rep>
    <total_rep>703,046</total_rep>
    <tag1>java</tag1>
    <tag2>jsf</tag2>
    <tag3>servlets</tag3>
  </topusers>
  <topusers>
    <user>GurV</user>
    <link>http://www.stackoverflow.com//users/6348498/gurv</link>
    <location></location>
    <year_rep>4,016</year_rep>
    <total_rep>7,932</total_rep>
    <tag1>sql</tag1>
    <tag2>mysql</tag2>
    <tag3>sql-server</tag3>
  </topusers>
  <topusers>
    <user>Nina Scholz</user>
    <link>http://www.stackoverflow.com//users/1447675/nina-scholz</link>
    <location>Berlin, Deutschland</location>
    <year_rep>3,950</year_rep>
    <total_rep>61,135</total_rep>
    <tag1>javascript</tag1>
    <tag2>arrays</tag2>
    <tag3>object</tag3>
  </topusers>
  <topusers>
    <user>JB Nizet</user>
    <link>http://www.stackoverflow.com//users/571407/jb-nizet</link>
    <location>Saint-Etienne, France</location>
    <year_rep>3,923</year_rep>
    <total_rep>418,780</total_rep>
    <tag1>java</tag1>
    <tag2>hibernate</tag2>
    <tag3>java-8</tag3>
  </topusers>
  <topusers>
    <user>Frank van Puffelen</user>
    <link>http://www.stackoverflow.com//users/209103/frank-van-puffelen</link>
    <location>San Francisco, CA</location>
    <year_rep>3,920</year_rep>
    <total_rep>86,520</total_rep>
    <tag1>firebase</tag1>
    <tag2>firebase-database</tag2>
    <tag3>android</tag3>
  </topusers>
  <topusers>
    <user>dasblinkenlight</user>
    <link>http://www.stackoverflow.com//users/335858/dasblinkenlight</link>
    <location>United States</location>
    <year_rep>3,886</year_rep>
    <total_rep>475,813</total_rep>
    <tag1>c#</tag1>
    <tag2>java</tag2>
    <tag3>c++</tag3>
  </topusers>
  <topusers>
    <user>Tim Biegeleisen</user>
    <link>http://www.stackoverflow.com//users/1863229/tim-biegeleisen</link>
    <location>Singapore</location>
    <year_rep>3,814</year_rep>
    <total_rep>77,211</total_rep>
    <tag1>sql</tag1>
    <tag2>mysql</tag2>
    <tag3>java</tag3>
  </topusers>
  <topusers>
    <user>Greg Hewgill</user>
    <link>http://www.stackoverflow.com//users/893/greg-hewgill</link>
    <location>Christchurch, New Zealand</location>
    <year_rep>3,796</year_rep>
    <total_rep>529,137</total_rep>
    <tag1>git</tag1>
    <tag2>python</tag2>
    <tag3>git-pull</tag3>
  </topusers>
  <topusers>
    <user>unutbu</user>
    <link>http://www.stackoverflow.com//users/190597/unutbu</link>
    <location></location>
    <year_rep>3,735</year_rep>
    <total_rep>401,595</total_rep>
    <tag1>python</tag1>
    <tag2>pandas</tag2>
    <tag3>numpy</tag3>
  </topusers>
  <topusers>
    <user>Hans Passant</user>
    <link>http://www.stackoverflow.com//users/17034/hans-passant</link>
    <location>Madison, WI</location>
    <year_rep>3,688</year_rep>
    <total_rep>672,118</total_rep>
    <tag1>c#</tag1>
    <tag2>.net</tag2>
    <tag3>winforms</tag3>
  </topusers>
  <topusers>
    <user>Jonathan Leffler</user>
    <link>http://www.stackoverflow.com//users/15168/jonathan-leffler</link>
    <location>California, USA</location>
    <year_rep>3,649</year_rep>
    <total_rep>455,157</total_rep>
    <tag1>c</tag1>
    <tag2>bash</tag2>
    <tag3>unix</tag3>
  </topusers>
  <topusers>
    <user>paxdiablo</user>
    <link>http://www.stackoverflow.com//users/14860/paxdiablo</link>
    <location></location>
    <year_rep>3,636</year_rep>
    <total_rep>507,043</total_rep>
    <tag1>c</tag1>
    <tag2>c++</tag2>
    <tag3>bash</tag3>
  </topusers>
  <topusers>
    <user>Pranav C Balan</user>
    <link>http://www.stackoverflow.com//users/3037257/pranav-c-balan</link>
    <location>Ramanthali, Kannur, Kerala, India</location>
    <year_rep>3,604</year_rep>
    <total_rep>64,476</total_rep>
    <tag1>javascript</tag1>
    <tag2>jquery</tag2>
    <tag3>html</tag3>
  </topusers>
  <topusers>
    <user>Suragch</user>
    <link>http://www.stackoverflow.com//users/3681880/suragch</link>
    <location>Hohhot, China</location>
    <year_rep>3,580</year_rep>
    <total_rep>71,032</total_rep>
    <tag1>swift</tag1>
    <tag2>ios</tag2>
    <tag3>android</tag3>
  </topusers>
</stackoverflow>

Python Methods

import xml.etree.ElementTree as et
import pandas as pd
from io import StringIO
from lxml import etree as lxet

def read_xml_iterfind():
    tree = et.parse("Input.xml")

    data = []
    inner = {}
    for el in tree.iterfind("./*"):
        for i in el.iterfind("*"):
            inner[i.tag] = i.text
        data.append(inner)
        inner = {}

    df = pd.DataFrame(data)

def read_xml_iterparse():
    data = []
    inner = {}
    i = 1
    for (ev, el) in et.iterparse(path):
        if i <= 2:
           first_tag = el.tag

        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)            
            inner = {}

        if el.text is not None and len(el.text.strip()) > 0:
            inner[el.tag] = el.text
    i += 1

    df = pd.DataFrame(data)    

def read_xml_lxml_xpath():     
    tree = lxet.parse("Input.xml")

    data = []
    inner = {}
    for el in tree.xpath("/*/*"):
        for i in el:
            inner[i.tag] = i.text
        data.append(inner)
        inner = {}

    df = pd.DataFrame(data)

def read_xml_lxml_xsl():     
    xml = lxet.parse("Input.xml")

    xslstr = """
    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output version="1.0" encoding="UTF-8" indent="yes"  method="text"/>
        <xsl:strip-space elements="*"/>

        <!-- HEADERS -->
        <xsl:template match = "/*">
            <xsl:for-each select="*[1]/*">
              <xsl:value-of select="local-name()" />
                <xsl:choose>
                   <xsl:when test="position() != last()">
                      <xsl:text>,</xsl:text>
                   </xsl:when>
                   <xsl:otherwise>
                      <xsl:text>&#xa;</xsl:text>
                   </xsl:otherwise>                              
                </xsl:choose>   
            </xsl:for-each>
            <xsl:apply-templates/>
        </xsl:template>

        <!-- DATA ROWS (COMMA-SEPARATED) -->
        <xsl:template match="/*/*" priority="2">    
            <xsl:for-each select="*">
              <xsl:if test="position() = 1">
                   <xsl:text>&quot;</xsl:text>
              </xsl:if>
              <xsl:value-of select="." />
                <xsl:choose>
                   <xsl:when test="position() != last()">
                      <xsl:text>&quot;,&quot;</xsl:text>
                   </xsl:when>
                   <xsl:otherwise>
                      <xsl:text>&quot;&#xa;</xsl:text>
                   </xsl:otherwise>                              
                </xsl:choose>
            </xsl:for-each>
        </xsl:template>

    </xsl:transform>
    """
    xsl = lxet.fromstring(xslstr)

    transform = lxet.XSLT(xsl)
    newdom = transform(xml)

    df = pd.read_csv(StringIO(str(newdom)))

Timings (with current XML and XML with 25 times the children (i.e., 900 StackOverflow user records)

# SHORTER FILE
python -mtimeit -s"import readxml_test_runs as test" "test.read_xml_iterfind()"
100 loops, best of 3: 3.87 msec per loop

python -mtimeit -s"import readxml_test_runs as test" "test.read_xml_iterparse()"
100 loops, best of 3: 5.5 msec per loop

python -mtimeit -s"import readxml_test_runs as test" "test.read_xml_lxml_xpath()"
100 loops, best of 3: 3.86 msec per loop

python -mtimeit -s"import readxml_test_runs as test" "test.read_xml_lxml_xsl()"
100 loops, best of 3: 5.68 msec per loop

# LARGER FILE
python -mtimeit -n"100" -s"import readxml_test_runs as test" "test.read_xml_iterfind()"
100 loops, best of 3: 36 msec per loop

python -mtimeit -n"100" -s"import readxml_test_runs as test" "test.read_xml_iterparse()"
100 loops, best of 3: 78.9 msec per loop

python -mtimeit -n"100" -s"import readxml_test_runs as test" "test.read_xml_lxml_xpath()"
100 loops, best of 3: 32.7 msec per loop

python -mtimeit -n"100" -s"import readxml_test_runs as test" "test.read_xml_lxml_xsl()"
100 loops, best of 3: 51.4 msec per loop




Get Solution for free from DataCamp guru