nitneihtnotes: May 2015

Tuesday, May 5, 2015

What to watch out for when scraping website content

I was trying to get a list of products for a keyword on the Amazon website, and it had to be done automatically.
I used Beautiful Soup and urllib2.
But what I saw on the website and what I scraped were slightly different (the website showed more items).
After googling around, I found that when we use an automated tool to scrape a website, we have to pretend to be a browser by adding a User-agent header to the request. By default urllib2 identifies itself as Python-urllib, and some sites serve different content to such clients. We can set the header as follows:

>>> import urllib2
>>> from BeautifulSoup import BeautifulSoup
>>> # Build an opener that sends a browser-like User-agent header
>>> opener = urllib2.build_opener()
>>> opener.addheaders = [('User-agent', 'Mozilla/5.0')]
>>> url = "http://www.amazon.com/Sony-D6653-International-Version-Warranty/dp/B00TLAIFDE/ref=sr_1_1?ie=UTF8&qid=1430831439&sr=8-1&keywords=sony+z3"
>>> response = opener.open(url)
>>> page = response.read()
>>> # Parse the downloaded HTML
>>> soup = BeautifulSoup(page)

Works like a charm :D
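
urllib2 only exists in Python 2. On Python 3 the same trick can probably be sketched with urllib.request from the standard library. This is only a minimal sketch: the helper names build_request and fetch and the 10-second timeout are my own assumptions for illustration, not part of any library.

```python
import urllib.request

# Browser-like value, the same one used in the urllib2 session above.
DEFAULT_USER_AGENT = "Mozilla/5.0"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request object that carries a custom User-Agent header.

    build_request is a hypothetical helper name, not a library function.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=DEFAULT_USER_AGENT, timeout=10):
    """Download the page at url and return the raw response body as bytes.

    The timeout value is an arbitrary assumption.
    """
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling fetch(url) returns the page as bytes, which can then be handed to Beautiful Soup in the same way as page above. Whether a site returns the same content it shows in a real browser still depends on the site.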

Chit-chat