Simple web scraping with Python
The situation: I wanted to extract the chemical identifiers of a set of ~350 chemicals offered by a vendor, to compare them to another list. Unfortunately, there is no catalog that neatly tabulates this information, but there is a product catalog PDF that has the list of product numbers. The detailed information on each product (including the chemical identifier) can be found on the vendor’s website at vendor.com/product/[product_no]. Let me show you how to solve this problem with bash and Python.
Let’s break the problem down into steps:
- Extract list of product numbers (call it list A)
- Iterate over list A and scrape the chemical id of each product to get a list (call it list B)
- Compare list B with desired list C
Steps 1 and 3 look easy – just some text manipulation. Step 2 is basically the automated version of going to the product webpage, copy-pasting the chemical identifier, and repeating that ~350 times (yup, not going to do that by hand).
Step 1
I have a PDF catalog that looks like this:
| Plate | Well | Product | Product No. |
|---|---|---|---|
| 1 | A1 | chemical x | 1111 |
| 1 | A2 | chemical y | 2222 |
| … | … | … | … |
And of course, when copy-pasted to a text file, it is messed up…
$ cat temp
1
A1
chemical x
1111
1
A2
chemical y
2222
...
Well, that is quite easy to fix. If we are sure that each table row became exactly 4 lines, we can do some bash magic – paste reads one line from standard input for every -, so four dashes merge each group of 4 consecutive lines into one tab-separated row:
$ paste - - - - < temp > catalog.tsv
and we will get
$ cat catalog.tsv
1 A1 chemical x 1111
1 A2 chemical y 2222
...
But beware of empty cells! An empty cell may cause a table row to become fewer than 4 lines and shift all the data after it. This is why I chose paste in this case, even though we could have just extracted every 4th line with $ sed -n '0~4p' temp (GNU sed). With paste, a quick glance is enough to verify that the data is reformatted to look like the original table.
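To make that glance more rigorous, here is a small sanity check along the same lines (note -F'\t': the product names contain spaces, so we split on tabs only). It prints every line of catalog.tsv that does not have exactly 4 tab-separated fields, so no output means every row lined up:
$ awk -F'\t' 'NF != 4' catalog.tsv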
So, having checked that the reformatted table looks fine, extract the product number, i.e. the 4th column (splitting on tabs, for the same reason as above):
$ awk -F'\t' '{print $4}' catalog.tsv > prod_no.txt # list A
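As another cheap check, the number of extracted product numbers should match the ~350 products in the catalog:
$ wc -l prod_no.txt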
Step 2
Let’s do a test by scraping the webpage of one product. Go to the page in your browser and use “Inspect element” to look at the HTML underneath. I found my chemical identifier nicely contained in a <div> tag which has the id inchiKey.
Make sure you have the packages requests and BeautifulSoup.
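Both are one pip install away if missing (the BeautifulSoup package on PyPI is called beautifulsoup4):
$ pip install requests beautifulsoup4
Now run this Python script: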
import requests
from bs4 import BeautifulSoup

def get_chemical_id(prod_no):
    r = requests.get("https://www.vendor.com/product/" + prod_no)
    soup = BeautifulSoup(r.text, 'html.parser')
    # the identifier lives in <div id="inchiKey"> on this vendor's pages
    inchi_key = soup.find('div', attrs={'id': 'inchiKey'})
    # invalid product numbers have no such div, so guard against None
    print(inchi_key.text.rstrip() if inchi_key else '')

get_chemical_id('1111')
Do you get the correct chemical identifier? If so, it’s time to wrap this in a loop that iterates over the list of product numbers:
with open('prod_no.txt', 'r') as prod_no_list:
    for prod_no in prod_no_list:
        prod_no = prod_no.rstrip()  # strip the trailing newline
        print(prod_no, end='')
        get_chemical_id(prod_no)
Together with the chemical id, I printed out the product number again to ensure correspondence – some product numbers may be invalid and thus won’t yield a chemical id! Printing the product number next to each result guards against that going unnoticed.
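One more caveat: firing ~350 requests in a tight loop is not kind to the vendor’s server and may get you rate-limited. Here is a minimal variant of the loop with a pause between requests (the 0.5 s delay is an arbitrary choice, adjust as needed):
import time

with open('prod_no.txt', 'r') as prod_no_list:
    for prod_no in prod_no_list:
        prod_no = prod_no.rstrip()
        print(prod_no, end='')
        get_chemical_id(prod_no)
        time.sleep(0.5)  # be polite: pause between requests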
I get something like this as the output:
1111
chemical id x
2222
chemical id y
...
Notice the extraneous spaces and blank lines. Instead of trying to wrangle Python into producing consistently formatted output, I cleaned up with bash – it’s much easier:
$ sed 's/[[:space:]]//g; /^$/d' output > output_clean
Confirm that each product number corresponds to a chemical identifier, then extract just the identifiers like in Step 1:
$ paste - - < output_clean | awk '{print $2}' > chem_id.txt # list B
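One way to actually confirm that correspondence (again assuming GNU sed): pull out every odd line of output_clean, which should reproduce list A exactly. Any diff output points to a product number whose identifier went missing:
$ sed -n '1~2p' output_clean | diff - prod_no.txt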
Step 3
Easy – here list_c.txt holds the desired list C, and chem_id.txt from Step 2 is list B:
$ comm <(sort list_c.txt) <(sort chem_id.txt) # C ∪ B, split into 3 columns
$ comm -12 <(sort list_c.txt) <(sort chem_id.txt) # C ∩ B
comm outputs 3 columns: (1) C-B, (2) B-C, (3) C ∩ B. The flag -12 suppresses columns 1 and 2. You can similarly suppress the other columns to output what you need.
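For example, to list the desired chemicals that the vendor does not offer (C-B), suppress columns 2 and 3:
$ comm -23 <(sort list_c.txt) <(sort chem_id.txt) # C-B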
Bottom line
- Verify, verify your data at every step
- Freely switch between bash and Python according to your needs