Let’s assume we have some HTML text. For example, we might have fetched it as follows.
import requests

url = 'https://en.wikipedia.org/wiki/Data_science'
text = requests.get(url).content.decode('utf-8')

# Let's see the beginning of this text
print(text[:1000])
Now let’s say our goal is to extract the text inside every tag except <script> and <style> tags.
One way to achieve this is to build a parser object. The idea is that, given a parser object named parser, we call parser.feed(text) and then read the resulting text from parser.res.
To create this object, we define a class that inherits from HTMLParser, which is where the feed method comes from. My understanding is that when feed is called, it walks through the text in order, splitting it into chunks so that each start tag, each end tag, and each run of text between tags is treated as one atomic element. For each element it then calls the matching handler method: handle_starttag, handle_endtag, or handle_data. The tag argument passed to the two tag handlers is an ordinary Python string holding the tag name, so lower is simply the usual str.lower method. HTMLParser already converts tag names to lowercase before calling the handlers, so calling lower here is a harmless safeguard rather than a requirement.
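To make the dispatching concrete, here is a minimal sketch that only records which handler is called and with what arguments. The class name EventLogger and the sample HTML string are made up for illustration; the handler names are the ones HTMLParser defines.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Illustrative helper: records each handler call that feed() triggers."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # tag is a plain str; attrs is a list of (name, value) pairs
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
# The upper-case P is deliberate: the parser hands us the name in lowercase.
logger.feed('<P class="intro">Hello <b>world</b></P>')
logger.close()

for event in logger.events:
    print(event)
```

Running it prints one tuple per element, in document order, which is a quick way to see what the parser treats as a single chunk.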
Thus we can create our custom parser and get the parsed text as follows.
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    script = False
    res = ""

    def handle_starttag(self, tag, attrs):
        if tag.lower() in ["script", "style"]:
            self.script = True

    def handle_endtag(self, tag):
        if tag.lower() in ["script", "style"]:
            self.script = False

    def handle_data(self, data):
        if str.strip(data) == "" or self.script:
            return
        # We also in this example choose to remove [ edit ]
        # in the following way.
        self.res += ' ' + data.replace('[ edit ]', '')

parser = MyHTMLParser()
parser.feed(text)
text = parser.res
print(text[:1000])
Data science - Wikipedia Jump to content Main menu Main menu move to sidebar hide
Navigation
Main page Contents Current events Random article About Wikipedia Contact us Donate
Contribute
Help Learn to edit Community portal Recent changes Upload file Languages Language links are at the top of the page. Search Search Create account Log in Personal tools Create account Log in
Pages for logged out editors learn more Contributions Talk Contents move to sidebar hide (Top) 1 Foundations Toggle Foundations subsection 1.1 Relationship to statistics 2 Etymology Toggle Etymology subsection 2.1 Early usage 2.2 Modern usage 3 Data Science and Data Analysis 4 History 5 See also 6 References Toggle the table of contents Data science 46 languages العربية Azərbaycanca বাংলা Български Català Čeština Deutsch Eesti Ελληνικά Español Esperanto Euskara فارسی Français Galego 한국어 Հայերեն हिन्दी Bahasa Indonesia IsiZulu Italiano עברית ಕನ್ನಡ Қазақша Latviešu Македонски Bahasa Melayu မြန်မာဘာသာ Ned
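If you want to check the approach without hitting the network, a small inline document is enough. The sketch below is a variant of the parser above, not the same code: the names VisibleTextParser, extract_text, and sample are hypothetical, it collects chunks in a list instead of concatenating strings, and it strips each chunk before joining, so the spacing differs slightly from the output above.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Hypothetical variant: keeps text chunks in a list, skips script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skipping = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skipping = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skipping = False

    def handle_data(self, data):
        # Ignore whitespace-only chunks and anything inside script/style
        if self.skipping or not data.strip():
            return
        self.parts.append(data.strip())


def extract_text(html):
    """Return the text of html outside script/style, joined by single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = """
<html><head><title>Demo</title>
<style>body { color: red; }</style>
<script>console.log("hidden");</script>
</head><body><h1>Heading</h1><p>Some <i>visible</i> text.</p></body></html>
"""

print(extract_text(sample))  # prints: Demo Heading Some visible text.
```

Collecting chunks in a list and joining once at the end avoids repeatedly rebuilding a growing string, which can matter on large pages.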