Decode HTML entities in Python string?

StackOverflow

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string

>>> print text
&pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?

Answer rating: 610

Python 3.4+

Use html.unescape():

import html
print(html.unescape("&pound;682m"))

FYI html.parser.HTMLParser.unescape was deprecated in Python 3.4 and was supposed to be removed in 3.5, although it was left in by mistake. It was finally removed in Python 3.9, so use html.unescape() on any current Python 3.
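As a rough illustration of what html.unescape() covers, the sketch below runs it over a named entity, decimal and hexadecimal numeric references, and a double-escaped string. The helper name decode_entities is hypothetical and only wraps the standard-library call. The inputs are made-up samples, not taken from the question.

```python
import html


def decode_entities(text):
    """Hypothetical wrapper: replace HTML character references with Unicode."""
    return html.unescape(text)


# Named entity, as in the question.
print(decode_entities("&pound;682m"))    # £682m
# Decimal and hexadecimal numeric references for the same character.
print(decode_entities("&#163;682m"))     # £682m
print(decode_entities("&#xA3;682m"))     # £682m
# Only one level of escaping is undone per call.
print(decode_entities("&amp;pound;"))    # &pound;
```

The last case matters if the input was escaped twice: a second call would be needed to get from "&pound;" to "£".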


Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

>>> try:
...     # Python 2.6-2.7 
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
... 
>>> h = HTMLParser()
>>> print(h.unescape("&pound;682m"))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape("&pound;682m"))
£682m
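If the same code has to run on both old and new interpreters, one possible approach is to prefer html.unescape() and fall back to HTMLParser only when it is missing. This is a sketch under that assumption, not something the answer above prescribes. The module-level name unescape is just a convenient alias.

```python
try:
    # Python 3.4+: the supported, non-deprecated function.
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3: html exists, but html.unescape does not.
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    # Bind the instance method so callers use one name on every version.
    unescape = HTMLParser().unescape

print(unescape("&pound;682m"))
```

Trying html.unescape first keeps the code working on Python 3.9 and later, where HTMLParser.unescape no longer exists.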
