HTMLParser Basics

Let’s assume we have some html text. For example we might’ve gotten it the following way.

import requests

url = 'https://en.wikipedia.org/wiki/Data_science'
text = requests.get(url).content.decode('utf-8')

# Let's see the beggining of this text
print(text[:1000])

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-available" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Data science - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-c

Now let’s say our goal is to extract all the text inside all tags except for those inside <script> and <style> tags.

One way to achieve this goal is to make a parser object. Then the idea is that if we have our parser parser then we can run parser.feed(text) and we can then get our result text from parser.res.

To create this object we will create a class that inherits from the class HTMLParser. It is from this class we will inherit the feed method. My understanding of what happens when it is called is that it goes through the text in order but in a chunked up way such that tags and data inside tags are seen as atomic elements. And it runs a corresponding function based on if that atomic element is a start tag, end tag or data. Also for each atomic tag element there seems to be the method lower available, which extracts the name of the tag in the form of a string.

Thus we can create our custom parser and get the parsed text as following.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    script = False
    res = ""
    def handle_starttag(self, tag, attrs):
        if tag.lower() in ["script","style"]:
            self.script = True
    def handle_endtag(self, tag):
        if tag.lower() in ["script","style"]:
            self.script = False
    def handle_data(self, data):
        if str.strip(data)=="" or self.script:
            return
        # We also in this example choose to remove [ edit ]
        # in the following way.
        self.res += ' '+data.replace('[ edit ]','')

parser = MyHTMLParser()
parser.feed(text)
text = parser.res
print(text[:1000])

 Data science - Wikipedia Jump to content Main menu Main menu move to sidebar hide 
        Navigation
     Main page Contents Current events Random article About Wikipedia Contact us Donate 
        Contribute
     Help Learn to edit Community portal Recent changes Upload file Languages Language links are at the top of the page. Search Search Create account Log in Personal tools Create account Log in 
        Pages for logged out editors  learn more Contributions Talk Contents move to sidebar hide (Top) 1 Foundations Toggle Foundations subsection 1.1 Relationship to statistics 2 Etymology Toggle Etymology subsection 2.1 Early usage 2.2 Modern usage 3 Data Science and Data Analysis 4 History 5 See also 6 References Toggle the table of contents Data science 46 languages العربية Azərbaycanca বাংলা Български Català Čeština Deutsch Eesti Ελληνικά Español Esperanto Euskara فارسی Français Galego 한국어 Հայերեն हिन्दी Bahasa Indonesia IsiZulu Italiano עברית ಕನ್ನಡ Қазақша Latviešu Македонски Bahasa Melayu မြန်မာဘာသာ Ned