
BASH: Scrape File Links (HREFs) From URL and Save Files to Directory

By Xandermar LLC, June 1, 2022

This code example uses BASH to show how to get all PDF, DOC or XLS links from a URL and download the files to a specified directory.


The Explanation of the Script:


1) Create a new BASH script file

$ vi scraper.sh
                

2) Declare the file as a BASH file and clear the terminal

#!/bin/bash
clear
                

3) Collect the info for the script

# This is the URL of the webpage that has the files to scrape
read -e -r -p "URL to scrape: " url
# The location where wget will save the files
read -e -r -p "Path to store files (/path/to/files/dir_name or ./dir_name): " path
# To only download from a specific domain, we need to add a filter
read -e -r -p "Only scrape from what domain? (e.g. www.dangibson.me - no http/https or '/'): " domain
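
The domain prompt asks for a bare host name with no scheme and no slash. As a minimal sketch, the input could be cleaned up instead of relying on the user to follow that rule. The function name `normalize_domain` is hypothetical and not part of the original script.

```shell
#!/bin/bash

# Hypothetical helper: reduce user input such as "https://www.example.com/"
# to the bare host name "www.example.com".
normalize_domain() {
    local d="$1"
    d="${d#*://}"   # drop a leading scheme such as http:// or https://
    d="${d%%/*}"    # drop the first slash and everything after it
    printf '%s\n' "$d"
}

# Example: prints www.example.com
normalize_domain "https://www.example.com/"
```

In the script, this could be applied right after the prompt with `domain=$(normalize_domain "$domain")`.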
                

4) The curl GET command

# Finds absolute links only - if there are relative (or local) links, you'll need to modify this line
# curl sends a GET request by default; the URL is quoted so characters like & and ? are passed through intact
c=$(curl --silent "$url" | grep -Eo "https?://[a-zA-Z0-9./?=_%:-]*")
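
For pages that use relative links, one possible modification is to pull the `href` values out of the HTML and resolve each one against the page URL. The sketch below is only an illustration under stated assumptions: `extract_hrefs` and `resolve_link` are hypothetical names, only double-quoted `href="..."` attributes are recognized, and `..` segments, `#fragment` links, and `mailto:` links are not handled.

```shell
#!/bin/bash

# Hypothetical helper: print the value of every double-quoted href read from stdin.
extract_hrefs() {
    grep -Eo 'href="[^"]*"' | sed -E 's/^href="(.*)"$/\1/'
}

# Hypothetical helper: resolve_link BASE_URL HREF
# Prints HREF as an absolute URL, using BASE_URL (the scraped page) as context.
resolve_link() {
    local base="${1%%[?#]*}" href="$2"   # ignore any query string or fragment on the base
    local origin path dir
    origin=$(printf '%s\n' "$base" | grep -Eo '^https?://[^/]+')   # scheme://host
    path="${base#"$origin"}"                                       # e.g. /docs/index.html
    if [[ -z "$path" ]]; then dir="/"; else dir="${path%/*}/"; fi  # e.g. /docs/
    case "$href" in
        http://*|https://*) printf '%s\n' "$href" ;;                    # already absolute
        //*) printf '%s:%s\n' "${base%%:*}" "$href" ;;                  # protocol-relative
        /*)  printf '%s%s\n' "$origin" "$href" ;;                       # root-relative
        *)   printf '%s%s%s\n' "$origin" "$dir" "$href" ;;              # page-relative
    esac
}

# Usage sketch (needs network access, so it is left commented out):
# curl --silent "$url" | extract_hrefs | while IFS= read -r href; do
#     resolve_link "$url" "$href"
# done
```

The resolved links can then go through the same domain and extension filter as the absolute links.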
                

5) Create an array of all the links curl returned

# Adds each link as an array element, one per line of grep output (mapfile needs BASH 4 or later)
mapfile -t entity_arr <<< "$c"
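
A page often links to the same file more than once, which would make the script download it repeatedly. As a hedged sketch, the array could be filled through `sort -u` so each link appears only once. The sample value of `c` below is made up for illustration; in the script it comes from the curl command in step 4.

```shell
#!/bin/bash

# Sample data standing in for the curl/grep output (hypothetical URLs).
c=$'https://example.com/a.pdf\nhttps://example.com/b.doc\nhttps://example.com/a.pdf'

# Read one link per line, skipping blanks and duplicates.
entity_arr=()
while IFS= read -r link; do
    if [[ -n "$link" ]]; then
        entity_arr+=("$link")
    fi
done < <(printf '%s\n' "$c" | sort -u)

echo "${#entity_arr[@]} unique links"
```

The `while read` loop also works on BASH versions that lack `mapfile`. Note that `sort -u` changes the order of the links, which does not matter for downloading.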
                

6) Print each matching link and save the file to the specified location

# For each array element, we print the link for the user and save the file to the specified location using WGET
# Variables are quoted so links and paths with special characters are passed to wget unchanged
for i in "${entity_arr[@]}"
do
    if [[ $i == *"$domain"* ]]; then
        if [[ $i == *".pdf"* ]] || [[ $i == *".doc"* ]] || [[ $i == *".xls"* ]]; then
            echo "$i"
            wget -P "$path" "$i" >/dev/null 2>&1
        fi
    fi
done
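
The filter above matches `.pdf`, `.doc`, or `.xls` anywhere in the link, so it also accepts `.docx` and `.xlsx`, and it can match a link whose host name merely contains the text. A stricter check is sketched below as one possible variation, not the author's method: the hypothetical function `is_wanted_file` compares the host exactly and tests the extension at the end of the path, ignoring case and any query string. It assumes the link has no port number or user info in the host part.

```shell
#!/bin/bash

# Hypothetical helper: is_wanted_file URL DOMAIN
# Succeeds when URL is on DOMAIN and its path ends in pdf, doc, docx, xls or xlsx.
is_wanted_file() {
    local url="$1" domain="$2"
    local rest host path
    local re='\.([Pp][Dd][Ff]|[Dd][Oo][Cc][Xx]?|[Xx][Ll][Ss][Xx]?)$'
    rest="${url#*://}"       # drop the scheme
    host="${rest%%/*}"       # text up to the first slash
    path="${rest#"$host"}"   # everything after the host
    path="${path%%[?#]*}"    # drop any query string or fragment
    [[ "$host" == "$domain" ]] || return 1
    [[ "$path" =~ $re ]]
}

# Example: prints "match" because the path ends in .PDF once the query is removed
if is_wanted_file "https://www.example.com/files/Report.PDF?dl=1" "www.example.com"; then
    echo "match"
fi
```

Inside the loop, the two nested `if` tests could then be replaced by a single `if is_wanted_file "$i" "$domain"; then`.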
                

7) Finally, let the user know the script has completed

echo 'Done'
                

Here’s the Entire Script:


#!/bin/bash
clear
# This is the URL of the webpage that has the files to scrape
read -e -r -p "URL to scrape: " url
# The location where wget will save the files
read -e -r -p "Path to store files (/path/to/files/dir_name or ./dir_name): " path
# To only download from a specific domain, we need to add a filter
read -e -r -p "Only scrape from what domain? (e.g. www.dangibson.me - no http/https or '/'): " domain
# Finds absolute links only - if there are relative (or local) links, you'll need to modify this line
c=$(curl --silent "$url" | grep -Eo "https?://[a-zA-Z0-9./?=_%:-]*")
# Adds each link as an array element, one per line of grep output (mapfile needs BASH 4 or later)
mapfile -t entity_arr <<< "$c"
# For each array element, we print the link for the user and save the file to the specified location using WGET
for i in "${entity_arr[@]}"
do
    if [[ $i == *"$domain"* ]]; then
        if [[ $i == *".pdf"* ]] || [[ $i == *".doc"* ]] || [[ $i == *".xls"* ]]; then
            echo "$i"
            wget -P "$path" "$i" >/dev/null 2>&1
        fi
    fi
done
echo 'Done'
                
