I am currently using Beautiful Soup to parse an HTML file and calling
get_text(), but it seems like I'm being left with a lot of \xa0 Unicode characters representing spaces. Is there an efficient way to change all of them into regular spaces in Python 2.7? I guess the more general question would be: is there a way to remove Unicode formatting?
I tried using:
line = line.replace(u'\xa0', ' '), as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):
EDIT: The problem seems to be resolved by
str.replace(u'\xa0', ' ').encode('utf-8'), but just doing
replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?
\xa0 is actually the non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a space.
string = string.replace(u'\xa0', u' ')
When you call .encode('utf-8'), it encodes the Unicode string to UTF-8, which represents every Unicode code point with 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0.
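To make the two steps concrete, here is a minimal sketch, assuming a made-up sample string (`text` and the other variable names are only for illustration). It shows the replacement on a Unicode string and then what the non-breaking space looks like once it is encoded to UTF-8:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text containing a non-breaking space (U+00A0).
text = u"Hello\xa0world"

# Replace the non-breaking space with an ordinary space.
# Both arguments are Unicode strings, so no implicit decoding happens.
cleaned = text.replace(u"\xa0", u" ")

# Encoding U+00A0 on its own shows the two UTF-8 bytes mentioned above.
nbsp_bytes = u"\xa0".encode("utf-8")

print(repr(cleaned))
print(repr(nbsp_bytes))
```

If you skip the replacement and only encode, those two bytes end up in the output, which is probably where the stray \xc2 comes from.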
Read up on http://docs.python.org/howto/unicode.html.
Please note: this answer is from 2012. Python has moved on, and you should be able to use
There are many useful things in Python's
unicodedata library. One of them is the normalize() function:
new_str = unicodedata.normalize("NFKD", unicode_str)
Replace NFKD with one of the other normalization forms (NFC, NFD, NFKC) if you don't get the results you're after.
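As a rough illustration of how the choice of form matters, here is a small sketch using a made-up sample string (`unicode_str` and its contents are assumptions for the example). Both NFKD and NFKC turn the non-breaking space into a plain space, but NFKD also splits accented letters into a base letter plus a combining mark, while NFKC recomposes them:

```python
import unicodedata

# Hypothetical sample: an accented letter followed by a non-breaking space.
unicode_str = u"caf\xe9\xa0au lait"

# NFKD: compatibility decomposition. U+00A0 becomes a regular space,
# and u"\xe9" is split into "e" followed by the combining acute accent U+0301.
nfkd_str = unicodedata.normalize("NFKD", unicode_str)

# NFKC: same compatibility mapping, then canonical recomposition,
# so the accented letter is kept as a single code point.
nfkc_str = unicodedata.normalize("NFKC", unicode_str)

print(repr(nfkd_str))
print(repr(nfkc_str))
```

If all you want is to get rid of \xa0 and leave everything else untouched, the plain replace() is the more targeted option.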