nitneihtnotes: Scrape all links on a webpage using BeautifulSoup in Python

Saturday, January 10, 2015

Scrape all links on a webpage using BeautifulSoup in Python

What I am trying to do is manipulate an HTML document.
After searching around and trying the wget utility on Ubuntu, the html2text library, and some XML manipulation libraries such as DOM and etree, which were quite hard going for a while, I found BeautifulSoup, a very nice library for scraping the web and parsing HTML documents.
Documentation for BeautifulSoup can be found here.
Here are some of its most useful features:

Easy navigation through an HTML document
Easy handling of tags, names, and attributes
Searching ...
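
To make these features concrete, here is a small sketch that tries each of them on an inline document, so no network access is needed. It assumes beautifulsoup4 is already installed; the sample HTML and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for illustration.
html = """
<html><head><title>Sample page</title></head>
<body>
<p class="intro">Hello <a href="/about" id="about-link">about</a></p>
<p>Second paragraph</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigation: step through the tree by tag name.
title_text = soup.head.title.string

# Tag name and attributes: a tag can be indexed like a dictionary.
first_link = soup.find("a")
tag_name = first_link.name
href = first_link["href"]
link_id = first_link.get("id")

# Searching: find_all accepts a tag name plus attribute filters.
intro_paragraphs = soup.find_all("p", class_="intro")

print(title_text)
print(tag_name, href, link_id)
print(len(intro_paragraphs))
```

Indexing with first_link["href"] raises KeyError when the attribute is missing, while first_link.get("href") returns None, which is why the scraping loop later in this post uses get.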

Let's do it.

> sudo pip install beautifulsoup4
> sudo pip install requests

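import requests
from bs4 import BeautifulSoup as bfs

# Ask for the page to scrape (on Python 2, use raw_input instead of input)
url = input("Enter the URL you want to scrape: ")

# Download the page and take its HTML as text
r = requests.get(url)
data = r.text

# Parse with Python's built-in parser, named explicitly to avoid a bs4 warning
soup = bfs(data, "html.parser")

# Print the href attribute of every <a> tag
for link in soup.find_all('a'):
    print(link.get('href'))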
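
One thing to watch for: the loop above prints None for any <a> tag that has no href, and relative links come out exactly as written in the page. The sketch below is one possible way to handle both, not part of the original script. The helper name extract_links, the sample HTML, and the example.com base URL are all made up for illustration, and it runs on an inline string so it can be tried without a network connection.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that have no href attribute at all.
    for anchor in soup.find_all("a", href=True):
        # urljoin resolves relative links against the page's own URL.
        links.append(urljoin(base_url, anchor["href"]))
    return links


# Hypothetical sample input, made up for illustration.
sample_html = """
<a href="/docs">Docs</a>
<a href="https://example.org/page">External</a>
<a name="top">No href here</a>
"""

print(extract_links(sample_html, "https://example.com/blog/"))
```

With a real page you would pass r.text as the HTML and the URL you requested as the base.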

You can also print the text content of every <p> tag:

for paragraph in soup.find_all('p'):
    print(paragraph.text)
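
Paragraph text often carries stray whitespace, and some <p> tags are empty. As a rough sketch of one way to tidy that up, the helper below uses get_text with a separator and strip=True, then drops empty results. The name paragraph_texts and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def paragraph_texts(html):
    """Return the cleaned text of every non-empty <p> tag."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for paragraph in soup.find_all("p"):
        # Strip each text fragment and join the pieces with single spaces.
        text = paragraph.get_text(" ", strip=True)
        if text:
            texts.append(text)
    return texts


# Hypothetical sample input, made up for illustration.
sample_html = "<p>  First <b>bold</b> line </p><p>   </p><p>Second</p>"

for text in paragraph_texts(sample_html):
    print(text)
```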

That's it.
