Brett Code

Python 2.7.5

Stripping HTML of Unicode, UTF, and Special Characters



This is the code I came up with to strip random web content of unwanted garbage (HTML entities, Unicode, UTF, <, ?, and so on). It lowercases the text and lets through only the letters a-z and single spaces.
It may not be elegant (don't ask me how much time I wasted trying to be elegant), but it works.


#To be used to clean up Title and comment entries
def scrubHTML(text):
   
    #common garbage on the pages I'm looking at,
    #add more entries as needed, or delete entirely if overkill for your application

    text = text.replace(";quot;", " ")
    text = text.replace(";#039;", " ")
    text = text.replace("&amp", " ")

    #text to lower makes the remainder easier to scrub
    text = text.lower()
   
    #list of letters & spaces
    #This is ALL that's getting through
    #everything else is blocked
    #THIS IS THE MAIN WORKING PART

    l1 = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k","l", "m"]
    l2 = ["n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", " "]
    letters = l1 + l2
    tempString = ""
    for bP in text:
        if bP in letters:
            tempString += bP
   
    #This is probably overkill, but computers are fast
    #Between the two routines, all leading, trailing, and double spaces are removed       
    for r in range(0,10):
        tempString = tempString.replace("  "," ")
    tempString = tempString.strip()

    return tempString
#end def scrubHTML(text):
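
For comparison, here is a more compact sketch of the same whitelist idea. It is not the code from above, just one way it might be condensed. The name scrub_html_compact, the ALLOWED set, and the strip_tags flag are all made up for this example. One thing to be aware of: because letters are the only thing getting through, letters inside tags get through too (so "<b>" leaves a "b" behind). The optional strip_tags step is an addition that goes beyond the original routine and uses a simple regular expression, which is only a rough match for real HTML.

```python
import re
import string

# Hypothetical helper names; none of these appear in the original routine.
ALLOWED = set(string.ascii_lowercase + " ")


def scrub_html_compact(text, strip_tags=False):
    # Optional extra step (an assumption, not in the original):
    # blank out anything that looks like <...> before filtering.
    if strip_tags:
        text = re.sub(r"<[^>]*>", " ", text)

    # Same leftover-entity replacements as scrubHTML, in the same order.
    for junk in (";quot;", ";#039;", "&amp"):
        text = text.replace(junk, " ")

    text = text.lower()

    # Whitelist: keep only a-z and the space character.
    kept = "".join(ch for ch in text if ch in ALLOWED)

    # split() with no arguments breaks on runs of whitespace and ignores
    # leading/trailing whitespace, so the join collapses every run of
    # spaces to one and trims both ends.
    return " ".join(kept.split())


print(scrub_html_compact("Tom &amp; Jerry: 2014!"))        # tom jerry
print(scrub_html_compact("<b>Hi</b>"))                     # bhib
print(scrub_html_compact("<b>Hi</b>", strip_tags=True))    # hi
```

The split and join pair replaces both the ten-pass replace loop and the strip() call, and it handles a run of spaces of any length.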




Copyright © 2014 Brett Paufler