After searching around with Ubuntu's wget utility, the html2text lib, and some XML manipulation libs like DOM and etree (quite hardcore for a while), I found BeautifulSoup, a very cool lib for scraping the web and parsing HTML documents.
Documentation for BeautifulSoup can be found here
Here are just a few of BeautifulSoup's awesome features:
- Easy navigation through an HTML document
- Easy handling of tags, names, and attributes
- Searching ...
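To make the features above concrete, here is a small sketch that runs on an inline HTML string, so it needs no network access. The HTML snippet and the variable names are made up for illustration; the `"html.parser"` argument selects Python's built-in parser.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet, used only for illustration.
html = """
<html><head><title>Demo page</title></head>
<body>
<p class="intro">Hello <b>world</b></p>
<a id="first" href="/about">About</a>
<a href="https://example.com/">Example</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigation: walk down the tree by tag name.
title_text = soup.title.string       # text inside <title>
bold_text = soup.body.p.b.string     # text inside the <b> of the first <p>

# Tags, names, attributes: a tag behaves a bit like a dict of its attributes.
first_link = soup.a                  # first <a> tag in the document
tag_name = first_link.name
link_id = first_link["id"]
intro_classes = soup.p["class"]      # class is multi-valued, so this is a list

# Searching: find_all returns every match, find returns the first one.
hrefs = [a.get("href") for a in soup.find_all("a")]
intro = soup.find("p", class_="intro").get_text()

print(title_text, bold_text, tag_name, link_id, intro_classes, hrefs, intro)
```

Note that `a.get("href")` returns `None` when the attribute is missing, while `a["href"]` would raise a `KeyError`, which is why `get` is the safer choice when looping over arbitrary pages.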
Let's do it.
> sudo pip install beautifulsoup4
> sudo pip install requests
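import requests
from bs4 import BeautifulSoup as bfs

# On Python 2, use raw_input() instead of input().
url = input("Enter the URL you want to scrape: ")
r = requests.get(url)
data = r.text
# Name the parser explicitly; otherwise bs4 picks one and prints a warning.
soup = bfs(data, "html.parser")

# Print the href of every <a> tag.
for link in soup.find_all('a'):
    print(link.get('href'))

Or you can search for the text content of the <p> tags:

for p in soup.find_all('p'):
    print(p.text)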
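Many pages use relative links such as `/about`, so the printed hrefs are not always usable as-is. Below is a minimal sketch (Python 3) that wraps the two loops into functions and resolves links against a base URL with the standard library's `urljoin`. The function names `extract_links` and `extract_paragraphs` and the sample HTML are hypothetical, not part of BeautifulSoup.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return the absolute URL of every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors that have no href attribute at all.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def extract_paragraphs(html):
    """Return the text of every <p> tag with whitespace collapsed."""
    soup = BeautifulSoup(html, "html.parser")
    return [" ".join(p.get_text().split()) for p in soup.find_all("p")]


# Made-up sample page, used only for illustration.
sample = """
<p> First   paragraph </p>
<p>Second <b>paragraph</b></p>
<a href="/about">About</a>
<a href="post.html">Post</a>
<a name="no-href">Anchor without href</a>
"""

links = extract_links(sample, "https://example.com/blog/")
paragraphs = extract_paragraphs(sample)
print(links)
print(paragraphs)
```

To use it on a live page, pass `r.text` as `html` and the URL you requested as `base_url`.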
That's it.