Chém gió

Saturday, January 10, 2015

Scrape all links on a web page using BeautifulSoup in Python

What I am trying to do is manipulate an HTML document.
After trying the wget utility on Ubuntu, the html2text library, and some XML libraries such as DOM and etree, all of which felt quite hardcore for a while, I found BeautifulSoup, a very cool library for scraping the web and parsing HTML documents.
Documentation for BeautifulSoup can be found on its official site.
Here are some awesome features of BeautifulSoup:

  • Easy navigation through an HTML document
  • Easy handling of tags, names, and attributes
  • Searching ...
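
To make the features above concrete, here is a minimal sketch, assuming Python 3 and the built-in "html.parser" backend. The sample HTML string and all variable names are made up for illustration only; they are not from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = """
<html><head><title>Demo page</title></head>
<body>
  <p class="intro">Hello <b>world</b></p>
  <a id="first" href="/about">About</a>
  <a href="https://example.com/">Example</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigation: walk down the tree with attribute-style access.
title_text = soup.head.title.string

# Tags, names, attributes: soup.a is the first <a> tag in the document.
first_link = soup.a
tag_name = first_link.name       # the tag's name
href = first_link["href"]        # dictionary-style attribute access
link_id = first_link.get("id")   # .get() returns None if the attribute is missing

# Searching: find() returns the first match, find_all() returns every match.
intro = soup.find("p", class_="intro")
intro_text = intro.get_text()
all_hrefs = [a.get("href") for a in soup.find_all("a")]

print(title_text, tag_name, href, link_id, intro_text, all_hrefs)
```

Parsing an inline string like this is a handy way to try things out before pointing the code at a real page.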

Let's do it.

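> sudo pip install beautifulsoup4
> sudo pip install requests

import requests
from bs4 import BeautifulSoup as bfs

# On Python 2, use raw_input() instead of input()
url = input("Enter the URL you want to scrape: ")
r = requests.get(url)
data = r.text
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = bfs(data, "html.parser")
# Print the href attribute of every <a> tag
for link in soup.find_all('a'):
    print(link.get('href'))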

Or you can print the text content of each <p> tag:

# Print the text content of every <p> tag
for paragraph in soup.find_all('p'):
    print(paragraph.text)
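
One thing to watch for: the link loop prints each href exactly as it is written in the page, so relative links stay relative, and an <a> tag without an href prints None. A possible way to handle both, sketched here under the assumption of Python 3, is to filter with href=True and resolve each link with urljoin from the standard library. The function name absolute_links and the sample HTML are hypothetical, made up for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolute_links(html, base_url):
    """Return every href in html as an absolute URL (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True matches only <a> tags that actually have an href attribute.
    for a in soup.find_all("a", href=True):
        # urljoin resolves a relative href against the page's own URL.
        links.append(urljoin(base_url, a["href"]))
    return links


# Hypothetical sample markup: one root-relative link, one anchor
# without an href, and one link relative to the current directory.
sample = (
    '<a href="/docs">Docs</a> '
    '<a name="top">Top</a> '
    '<a href="page2.html">Next</a>'
)
result = absolute_links(sample, "https://example.com/blog/index.html")
for link in result:
    print(link)
```

With the scraper above, you would pass r.text as html and the url you entered as base_url.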

That's it.
