Web scraping using BeautifulSoup in Python : EAN number vs price from a German e-commerce website

Input : List of product category URLs, each containing the products to obtain information from.

https://www.duo-shop.de/de-DE/List/4/0/0/

https://www.duo-shop.de/de-DE/List/5/0/0/

https://www.duo-shop.de/de-DE/List/70/0/0/

https://www.duo-shop.de/de-DE/List/259/0/0/

https://www.duo-shop.de/de-DE/List/72/0/0/

https://www.duo-shop.de/de-DE/List/73/0/0/

https://www.duo-shop.de/de-DE/List/9/0/0/

https://www.duo-shop.de/de-DE/List/690/0/0/

https://www.duo-shop.de/de-DE/List/329/0/0/

https://www.duo-shop.de/de-DE/List/537/0/0/


Task : Get the EAN number and the associated price of each product.


Output : Spreadsheet (or CSV) with EAN and price columns. A separate file for each product category.


We begin by investigating the website to be scraped. It's a German e-commerce website selling a wide range of products (the product type doesn't matter for what we're trying to achieve here). All product listings follow the same HTML structure, so if we can get the information for one EAN, we can repeat that for every product in a category to get its complete data, and then iterate through the entire list of categories above.



BeautifulSoup supports several parsers for the website content. We will use lxml for our purpose.

import requests
from bs4 import BeautifulSoup
req = requests.get(url)
data = BeautifulSoup(req.content, 'lxml')

Our initial strategy is to go inside one of the products and find the unique features of the elements we need. There are two pieces of information to obtain: EAN and price.

EAN : The EAN is available under the Details section of the product page. A quick inspect element showed a <tr> element with the CSS class name 'even' as the wrapping element. The text itself is inside a td element, but to distinguish it from the other td elements we have to match on an empty style attribute. Once we get to this element, we just access the text of the first result.

ean_div = data.select_one('tr.even')
ean = ean_div.find_all("td",{"style": ""})[0].text

PRICE : The price is available inside an outer div element with the class names price and price-marker. This turned out to be less challenging than initially thought, since the list of returned child elements had only one div element, which gives us the price we need.

price_div = data.select_one('div.price.price-marker')
price_next = price_div.find_all("div")
# the only child div holds the price text
price = price_next[0].text
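
The scraped price is a string in German notation, such as '6,99 €'. If the prices are to be sorted or compared later, it may help to convert them to numbers. The following is a minimal sketch using only the standard library; parse_price is a hypothetical helper name, and it assumes that a dot only ever groups thousands and a comma marks the decimals.

```python
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Convert a German-formatted price such as '6,99 €' to a Decimal."""
    # Drop the currency sign and surrounding whitespace.
    cleaned = text.replace("€", "").strip()
    # Assumption: '.' groups thousands and ',' marks the decimals.
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return Decimal(cleaned)


print(parse_price("6,99 €"))
print(parse_price("1.299,00 €"))
```

Decimal is used instead of float so that amounts such as 6,99 are stored exactly.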

Now that we can get the information correctly from the product page, we have to do the exact same thing for each product in the category page. Our next step is to get a list of all products in the category page. Inspect element showed they are inside <a> elements with the class name fn. So we use the find_all function to get this list and launch a new HTTP request for each element, using its href attribute.

content = data.find_all("a", class_="fn")
for i in content:
    req1 = requests.get(url + i['href'])
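
Concatenating url and href works as long as every href is a path that continues url. If the form of the links is not known in advance, urljoin from the standard library may be a safer way to build the product URL. This is only a sketch: product_url is a hypothetical helper, and the href values below are made up for illustration, not taken from the site.

```python
from urllib.parse import urljoin


def product_url(page_url: str, href: str) -> str:
    """Resolve a link found on page_url into an absolute URL."""
    return urljoin(page_url, href)


category_url = "https://www.duo-shop.de/de-DE/List/4/0/0/"

# Hypothetical href values; the real ones come from the <a class="fn"> elements.
print(product_url(category_url, "/de-DE/Product/123/"))
print(product_url(category_url, "https://www.duo-shop.de/de-DE/Product/123/"))
```

Both calls resolve to the same absolute address, whether the link is root-relative or already absolute.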

Now that we have the required information and a way to iterate through the category page, we need to write it to a CSV file, with each row corresponding to one pair of EAN and price. The csv library is a good choice for this. We build a dictionary from the above data, open the output file, and use writerow(my_dict) to write each row. We also print each row to the command line to see what is being written to the file.

my_dict = dict({"ean": ean, "price": price})
print(my_dict)
writer.writerow(my_dict)
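
The snippet above does not show how writer is created. Since writerow receives a dictionary, one plausible setup is csv.DictWriter, sketched below. The field order and the header row are assumptions, and an in-memory buffer stands in for the real file so the example runs without touching the disk.

```python
import csv
import io

rows = [
    {"ean": "0853653006987", "price": "6,99 €"},
    {"ean": "4893156034168", "price": "28,98 €"},
]

# Stand-in for: open("0.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["ean", "price"])
writer.writeheader()  # assumption: a header row is wanted
for my_dict in rows:
    writer.writerow(my_dict)

csv_text = buffer.getvalue()
print(csv_text)
```

Because the prices contain a comma, the csv module wraps them in double quotes, so each row still has exactly two columns when the file is opened in a spreadsheet.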

Once a file is successfully created from one category page, we just need to iterate through our list of category page URLs to create one file per category.

for idx, sub in enumerate(sub_url):
    req = requests.get(url+sub)
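
The index from enumerate is a natural choice for the file name, which matches the 0.csv and 1.csv names in the output. A rough sketch of the outer loop could look like this; write_category_files and its scrape_category argument are hypothetical names, and the scraping itself is passed in as a function so the sketch stays independent of the network.

```python
import csv
from pathlib import Path


def write_category_files(sub_urls, scrape_category, out_dir="."):
    """Write one CSV per category, named after the category's index.

    scrape_category(sub) is assumed to return dicts with 'ean' and 'price'.
    """
    written = []
    for idx, sub in enumerate(sub_urls):
        path = Path(out_dir) / f"{idx}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=["ean", "price"])
            writer.writeheader()
            for row in scrape_category(sub):
                writer.writerow(row)
        written.append(path)
    return written
```

Opening the file with newline="" lets the csv module control line endings itself, which avoids blank lines between rows on Windows.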

Thus, after our code completes, we will have a set of CSV files in the current directory, which can be opened and checked for verification. The final code output looks like this :

0.csv
{'ean': '0853653006987', 'price': '6,99 €'}
{'ean': '0853653006970', 'price': '6,99 €'}
{'ean': '0857560006016', 'price': '6,99 €'}
{'ean': '0857560006009', 'price': '6,99 €'}
{'ean': '0857560006047', 'price': '7,99 €'}
{'ean': '0857560006030', 'price': '7,99 €'}
{'ean': '0853653006956', 'price': '6,99 €'}
{'ean': '0857560006023', 'price': '10,99 €'}
{'ean': '0853653006994', 'price': '6,99 €'}
{'ean': '0853653006963', 'price': '6,99 €'}
{'ean': '0853653006949', 'price': '6,99 €'}
{'ean': '4893156034168', 'price': '28,98 €'}
{'ean': '4018928632810', 'price': '21,68 €'}
{'ean': '4018928685557', 'price': '15,99 €'}
{'ean': '4011898090918', 'price': '13,99 €'}
{'ean': '4011898099812', 'price': '13,99 €'}
{'ean': '4011898091519', 'price': '13,99 €'}
{'ean': '4011898091410', 'price': '13,99 €'}
1.csv
{'ean': '4251288487137', 'price': '4,49 €'}
{'ean': '4251288487359', 'price': '6,49 €'}
{'ean': '4251288436807', 'price': '7,49 €'}
{'ean': '4033874240551', 'price': '18,95 €'}
{'ean': '5018206068408', 'price': '25,39 €'}
{'ean': '4033874310735', 'price': '2,49 €'}
{'ean': '4018206947377', 'price': '1,69 €'}
{'ean': '4033874795099', 'price': '1,69 €'}
{'ean': '4250741377381', 'price': '0,99 €'}
{'ean': '0073228125008', 'price': '299,00 €'}
{'ean': '0073228125039', 'price': '333,00 €'}
{'ean': '0073228125077', 'price': '299,00 €'}
{'ean': '4027521001619', 'price': '2,49 €'}
{'ean': '4250164830432', 'price': '14,99 €'}
{'ean': '4007735002602', 'price': '1,49 €'}
{'ean': '3660016420338', 'price': '5,00 €'}
{'ean': '3660016420161', 'price': '9,99 €'}
{'ean': '3660016420512', 'price': '12,71 €'}
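
Besides opening the files by hand, the scraped EANs can be sanity-checked with the EAN-13 check digit. This is an optional extra and not part of the scraper above; is_valid_ean13 is a hypothetical helper name. The rule is standard: weight the first twelve digits alternately by 1 and 3, and the last digit must bring the total up to a multiple of 10.

```python
def is_valid_ean13(code: str) -> bool:
    """Return True if code is 13 digits with a correct EAN-13 check digit."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(ch) for ch in code]
    # Weights alternate 1, 3, 1, 3, ... across the first twelve digits.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


print(is_valid_ean13("4011898090918"))
print(is_valid_ean13("4011898090919"))
```

A value that fails this check most likely means the selector picked up the wrong table cell for that product.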

Final scraped data available here.
Follow me @sndpwrites.
