Researching how to make the email scraper
Recently I was given the task of finding a couple of emails from various websites. Right off the bat I knew Python was going to be my main tool for this email scraper. While researching what I could use to achieve this, I ran into Beautiful Soup. Beautiful Soup is a Python package for parsing HTML and XML documents. It builds a parse tree for each parsed page, which can then be used to extract data from the HTML, and that makes it useful for web scraping. This seemed like the perfect tool for the job. After a few hours of reading tutorials and trying a few things out, this is what I came up with.
How the email scraper works
First I imported the requests and bs4 modules. I then pulled the HTML from the website I wanted to grab the email from and stored it in the response variable. Next I sent the content over to Beautiful Soup to parse, searching only for links that contain email addresses (mailto links). Lastly, I set up a for loop to print each result and append it to a CSV file called website_emails.
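The steps above can be sketched roughly as follows. This is a minimal reconstruction, not necessarily the exact script: the function names (email_from_mailto, find_mailto_emails, append_to_csv, scrape), the .csv extension on the file name, and the 10-second timeout are all assumptions made for illustration. The two third-party packages are imported inside the functions that need them so the plain-Python helpers still work on their own.

```python
import csv
from urllib.parse import unquote


def email_from_mailto(href):
    """Return the address from a mailto: link, or None for any other link."""
    href = href.strip()
    if not href.lower().startswith("mailto:"):
        return None
    # Drop the "mailto:" prefix and anything after "?" (e.g. ?subject=Hi).
    address = href[len("mailto:"):].split("?", 1)[0]
    return unquote(address) or None


def find_mailto_emails(html):
    """Parse the HTML and collect unique addresses from mailto links."""
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    emails = []
    for link in soup.find_all("a", href=True):
        email = email_from_mailto(link["href"])
        if email and email not in emails:
            emails.append(email)
    return emails


def append_to_csv(emails, path="website_emails.csv"):
    """Append one address per row to the CSV file (file name is an assumption)."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for email in emails:
            writer.writerow([email])


def scrape(url, path="website_emails.csv"):
    """Fetch a page, print every mailto address found, and append them to the CSV."""
    import requests  # third-party: pip install requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    emails = find_mailto_emails(response.text)
    for email in emails:
        print(email)
    append_to_csv(emails, path)
    return emails
```

To try it, call scrape() with the address of the page you want to check, for example scrape("https://example.com/contact") (a placeholder URL).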
Not all websites list their email addresses as a “mailto” link. Some post them as regular text, so the scraper won't pick them up. Also, some websites, mine included, obfuscate the email address. Web administrators have come up with clever ways to protect against scrapers like this one, such as writing out email addresses (e.g., help [at] gmail [dot] com) or using embedded images of the email address. Feel free to leave a comment if you know a clever way around these issues.
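One possible way around the first two issues, offered as a rough sketch rather than a complete fix, is to run a regular expression over the page's visible text after undoing the most common written-out forms. The names here (EMAIL_RE, AT_RE, DOT_RE, deobfuscate, find_text_emails) are made up for illustration, the email pattern is deliberately simple and will miss unusual addresses, and only bracketed or parenthesized "at" and "dot" are handled. Addresses embedded as images are out of reach for this approach.

```python
import re

# Simplified email pattern; good enough for a sketch, not a full validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Written-out forms such as "[at]", "(at)", "[dot]" and "(dot)",
# together with any whitespace around them.
AT_RE = re.compile(r"\s*[\[(]\s*at\s*[\])]\s*", re.IGNORECASE)
DOT_RE = re.compile(r"\s*[\[(]\s*dot\s*[\])]\s*", re.IGNORECASE)


def deobfuscate(text):
    """Turn 'help [at] gmail [dot] com' into 'help@gmail.com'."""
    text = AT_RE.sub("@", text)
    return DOT_RE.sub(".", text)


def find_text_emails(text):
    """Return unique email addresses found in plain text, in order of appearance."""
    found = []
    for match in EMAIL_RE.findall(deobfuscate(text)):
        if match not in found:
            found.append(match)
    return found
```

With Beautiful Soup, the text to pass in could come from soup.get_text(), so the same parsed page can be checked for both mailto links and plain-text addresses.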
Interested in Learning Python?
Python Crash Course is a fast-paced, thorough introduction to Python that will have you writing programs, solving problems, and making things that work in no time.
In the first half of the book, you’ll learn about basic programming concepts, such as lists, dictionaries, classes, and loops, and practice writing clean and readable code with exercises for each topic. You’ll also learn how to make your programs interactive and how to test your code safely before adding it to a project. In the second half of the book, you’ll put your new knowledge into practice with three substantial projects: a Space Invaders–inspired arcade game, data visualizations with Python’s super-handy libraries, and a simple web app you can deploy online.