How to remove \xa0 from string in Python?


I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode characters representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into spaces? I guess the more generalized question would be, is there a way to remove Unicode formatting?

I tried using: line = line.replace(u"\xa0", " "), as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):

EDIT: The problem seems to be resolved by str.replace(u"\xa0", " ").encode("utf-8"), but just doing .encode("utf-8") without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?

\xa0 is actually the non-breaking space in Latin-1 (ISO 8859-1), also chr(160). You should replace it with a space.

string = string.replace(u"\xa0", u" ")
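
As a minimal sketch of the replacement above: the helper name replace_nbsp and the sample string are made up for illustration, and the u"" prefixes are kept so the snippet should run on both Python 2.7 and Python 3.

```python
def replace_nbsp(text):
    """Return text with every no-break space (U+00A0) turned into a plain space."""
    return text.replace(u"\xa0", u" ")


# Hypothetical sample, similar to what get_text() might return.
sample = u"Hello\xa0world\xa0!"
cleaned = replace_nbsp(sample)

# U+00A0 is code point 160, which is why chr(160) is mentioned above.
nbsp_code_point = ord(u"\xa0")
```

Note that the backslash matters: u"xa0" without it is just the three characters x, a and 0.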

When you call .encode("utf-8"), the unicode string is encoded to UTF-8, in which every code point is represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0.
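
The byte counts can be checked directly. This is only an illustrative sketch; the variable names and the extra sample characters (a, the euro sign, an emoji) are chosen here and are not from the question.

```python
nbsp = u"\xa0"

# In UTF-8 the no-break space takes two bytes: 0xC2 0xA0.
utf8_bytes = nbsp.encode("utf-8")

# In Latin-1 the same character is the single byte 0xA0.
latin1_bytes = nbsp.encode("latin-1")

# UTF-8 length for code points from different ranges.
samples = [u"a", u"\xa0", u"\u20ac", u"\U0001F600"]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

This is where the stray \xc2 comes from: it is the first of the two UTF-8 bytes for \xa0, so it shows up whenever the encoded bytes are printed or decoded with the wrong codec.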

Read up on

Please note: this answer is from 2012. Python has moved on, and you should be able to use unicodedata.normalize now.

There are many useful things in Python's unicodedata library. One of them is the .normalize() function.


import unicodedata
new_str = unicodedata.normalize("NFKD", unicode_str)

Replace NFKD with one of the other normalization forms (NFC, NFD or NFKC) if you don't get the results you're after.
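
To see how the forms differ, here is a small sketch. The sample string is hypothetical; it mixes an accented letter with no-break spaces to show a side effect of NFKD.

```python
import unicodedata

# Hypothetical input: "café au lait" with no-break spaces between the words.
text = u"caf\xe9\xa0au\xa0lait"

# The compatibility forms (NFKD, NFKC) map U+00A0 to a plain space.
nfkd = unicodedata.normalize("NFKD", text)
nfkc = unicodedata.normalize("NFKC", text)

# The canonical forms (NFC, NFD) leave U+00A0 in place.
nfc = unicodedata.normalize("NFC", text)
```

NFKD also decomposes the accented letter into "e" plus a combining accent (U+0301), which changes the length of the string. If that is unwanted, NFKC gives plain spaces while keeping the accented letter composed.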

