This example uses a Bash script to grab all of the PDF, DOC, or XLS links from a URL and download the files to a specified directory.
Explanation of the Script:
1) Create a new BASH script file
$: vi scraper.sh
2) Declare the file as a BASH file
#!/bin/bash
clear;
3) Collect the info for the script
# This is the URL to the webpage that has the files to scrape
read -e -p "URL to scrape: " url;
# The location of where curl will save the files
read -e -p "Path to store files (/path/to/files/dir_name or ./dir_name): " path;
# To only download from a specific domain, we need to add a filter
read -e -p "Only scrape from what domain? (i.e. www.dangibson.me - no http/https or '/'): " domain;
4) The curl GET command
# Finds an absolute link - if there are relative (or local) links, you'll need to modify this line
c=$(curl -X GET "${url}" --silent | grep -Eo "(http|https)://[a-zA-Z0-9./?=_%:-]*");
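That comment is worth taking seriously: pages that link with relative hrefs will slip past this pattern, since it only matches links that start with http or https. One rough way to pick those up, sketched below, is to pull every quoted href, keep the ones that are not absolute, and glue them onto the page URL. This is an assumption-heavy sketch of mine (it presumes hrefs are double-quoted and relative to the URL you entered), not part of the original script:
# Sketch: also collect relative hrefs and convert them to absolute URLs
rel=$(curl --silent "${url}" | grep -Eo 'href="[^"]+"' | sed -e 's/^href="//' -e 's/"$//' | grep -Ev '^https?://');
for r in $rel; do
    c="$c ${url%/}/${r#/}";
done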
5) Create an array with all curl entities
# Word splitting on whitespace turns each link into an array element
entity_arr=($c)
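Since the array is built by plain word splitting, it is easy to confirm how many links were actually captured before the download loop runs. A quick check like this one (my addition) avoids a silent, empty run:
# Optional: report the link count and stop if nothing was found
echo "Found ${#entity_arr[@]} links";
if [[ ${#entity_arr[@]} -eq 0 ]]; then
    echo "Nothing to download - check the URL and try again.";
    exit 1;
fi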
6) Print the split string and save the file to a specified location
# For each array element, we print the response for the user and save the files to the specified location using WGET
for i in "${entity_arr[@]}"
do
    if [[ $i == *"$domain"* ]]; then
        if [[ $i == *".pdf"* ]] || [[ $i == *".doc"* ]] || [[ $i == *".xls"* ]]; then
            echo "$i"
            wget "$i" -P "$path" >/dev/null 2>&1
        fi
    fi
done
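The substring tests above match the extension anywhere in the link, which also happens to catch .docx and .xlsx. If you would rather anchor the extension to the end of the URL, a regex test is one alternative - the variant below is my own tweak, not the original check:
# Alternative filter: match the extension only at the very end of the link
if [[ $i =~ \.(pdf|docx?|xlsx?)$ ]]; then
    echo "$i"
    wget "$i" -P "$path" >/dev/null 2>&1
fi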
7) Finally, let the user know the script has completed
echo 'Done'
Here’s the Entire Script:
#!/bin/bash
clear;
# This is the URL to the webpage that has the files to scrape
read -e -p "URL to scrape: " url;
# The location of where curl will save the files
read -e -p "Path to store files (/path/to/files/dir_name or ./dir_name): " path;
# To only download from a specific domain, we need to add a filter
read -e -p "Only scrape from what domain? (i.e. www.dangibson.me - no http/https or '/'): " domain;
# Finds an absolute link - if there are relative (or local) links, you'll need to modify this line
c=$(curl -X GET "${url}" --silent | grep -Eo "(http|https)://[a-zA-Z0-9./?=_%:-]*");
# Word splitting on whitespace turns each link into an array element
entity_arr=($c)
# For each array element, we print the response for the user and save the files to the specified location using WGET
for i in "${entity_arr[@]}"
do
    if [[ $i == *"$domain"* ]]; then
        if [[ $i == *".pdf"* ]] || [[ $i == *".doc"* ]] || [[ $i == *".xls"* ]]; then
            echo "$i"
            wget "$i" -P "$path" >/dev/null 2>&1
        fi
    fi
done
echo 'Done'
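Save the file, make it executable, and run it - the script prompts for all three values interactively. The URL and directory in this sample session are placeholders:
$: chmod +x scraper.sh
$: ./scraper.sh
URL to scrape: https://www.dangibson.me/files
Path to store files (/path/to/files/dir_name or ./dir_name): ./downloads
Only scrape from what domain? (i.e. www.dangibson.me - no http/https or '/'): www.dangibson.me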