How to remove \xa0 from string in Python?

StackOverflow

I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode characters representing spaces. Is there an efficient way to remove all of them in Python 2.7 and change them into spaces? I guess the more generalized question would be: is there a way to remove Unicode formatting?

I tried using: line = line.replace(u"\xa0", " "), as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):

EDIT: The problem seems to be resolved by str.replace(u"\xa0", " ").encode("utf-8"), but just doing .encode("utf-8") without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?

Answer rating: 334

\xa0 is actually the non-breaking space in Latin-1 (ISO 8859-1), also chr(160). You should replace it with a space.

string = string.replace(u"\xa0", u" ")
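
As a minimal sketch of the replacement above, written for Python 3 (where every str is already Unicode, so the u prefix is optional); the sample text is made up for illustration:

```python
# Hypothetical sample: text as it might come back from get_text(),
# with non-breaking spaces (U+00A0) between the words.
text = "Hello\xa0world\xa0!"

# \xa0 is code point 160, the non-breaking space.
assert ord("\xa0") == 160

# Swap every non-breaking space for an ordinary space.
cleaned = text.replace("\xa0", " ")
print(repr(cleaned))  # 'Hello world !'
```

In Python 2.7 the same call works on unicode objects, provided both arguments are unicode literals as in the line above.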

Calling .encode("utf-8") encodes the Unicode string to UTF-8, in which every code point is represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0.
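
The byte counts can be checked directly. This is a Python 3 sketch; the characters other than \xa0 are arbitrary examples chosen to show each UTF-8 length:

```python
nbsp = "\xa0"

# Encoding U+00A0 as UTF-8 yields two bytes: 0xC2 followed by 0xA0.
encoded = nbsp.encode("utf-8")
print(encoded)       # b'\xc2\xa0'
print(len(encoded))  # 2

# UTF-8 uses between 1 and 4 bytes per code point.
sizes = {ch: len(ch.encode("utf-8")) for ch in ["a", "\xa0", "\u20ac", "\U0001F600"]}
print(sorted(sizes.values()))  # [1, 2, 3, 4]

# Reading those two bytes back as Latin-1 gives two characters,
# which is where the stray \xc2 in the question comes from.
misread = encoded.decode("latin-1")
print(repr(misread))  # 'Â\xa0'
```

So the \xc2 is not an extra character in the text; it is the first byte of the UTF-8 encoding of the non-breaking space.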

Read up on http://docs.python.org/howto/unicode.html.

Please note: this answer is from 2012. Python has moved on, and you should be able to use unicodedata.normalize now.

Answer rating: 275

There are many useful things in Python's unicodedata library. One of them is the .normalize() function.

Try:

import unicodedata
new_str = unicodedata.normalize("NFKD", unicode_str)

If you don't get the results you're after, replace NFKD with one of the other normalization forms listed in the unicodedata documentation (NFC, NFD, NFKC).
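
To see how the forms differ on a non-breaking space, here is a small Python 3 sketch; the sample string is invented for illustration. Only the compatibility forms (NFKD, NFKC) map \xa0 to a plain space, and NFKD also splits accented letters into a base letter plus a combining mark:

```python
import unicodedata

# Hypothetical sample with an accented letter and two non-breaking spaces.
sample = "caf\u00e9\xa0au\xa0lait"

nfkd = unicodedata.normalize("NFKD", sample)
nfkc = unicodedata.normalize("NFKC", sample)
nfc = unicodedata.normalize("NFC", sample)

# Compatibility forms turn U+00A0 into an ordinary space.
print("\xa0" in nfkd)  # False
print("\xa0" in nfkc)  # False

# Canonical forms leave the non-breaking space alone.
print("\xa0" in nfc)   # True

# NFKD decomposes the accented e into 'e' + U+0301, so the string gets longer;
# NFKC recomposes it and keeps the original length.
print(len(sample), len(nfkd), len(nfkc))  # 12 13 12
```

If only the non-breaking spaces should change and everything else must stay byte-for-byte the same, the plain replace() call is the narrower tool; normalize() may also rewrite other compatibility characters.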




