Get latest data from Internet Archive
This article is a translation of the following article of mine:
Original: Internet Archiveから最新データを取得する(curl,xmllint,jq)
* Translated automatically by Google.
* Please note that some links or referenced content in this article may be in Japanese.
* Comments in the code are basically in Japanese.
by bokumin
Do you know about the Internet Archive? It has recently been the subject of a lawsuit and considerable controversy, but it is a non-profit organization founded in 1996, and you may have used it without even knowing it.
This article explains how to obtain the latest data from the Internet Archive using the Linux commands curl, xmllint, and jq.
What is the Internet Archive?
Internet Archive is an American organization that operates the Wayback Machine, a well-known archive viewing service for WWW and multimedia materials. Headquarters are located in the Richmond District of San Francisco, California. Archives contain copies of web pages (web archives) that are collected automatically by programs or manually by users, and are called “WWW snapshots.” Other items include software, movies, books, and recorded data (including recordings of live performances with permission from music bands, etc.). The archive provides these materials free of charge. (Excerpt from Wikipedia)
Originally it started out as a way to preserve web pages, but nowadays software, movies, books, audio recordings, and other materials uploaded by users are also made available to the public.
Is the Internet Archive illegal?
As a non-profit library based on fair use principles, the Internet Archive has a public interest role in preserving cultural and historical materials.
The bottom line is that it is legal based on the same principle as borrowing and reading books for free at a regular library.
In addition, an opt-out system is provided for rights holders, so while it cannot be said that everything is completely free of legal issues, using the Internet Archive in the normal way is not illegal.
As the name suggests, the Internet Archive is a site that is published and usable on the Internet.
Therefore, its data can also be retrieved with commands such as curl.
*Please refrain from continuous downloading with scripts and the like, as this puts a load on the other party's server.
Now we will actually retrieve data from the Internet Archive.
Get data from the Internet Archive
This time, I would like to obtain a list of the latest data from the Internet Archive.
The URL to generate the Internet Archive RSS feed is as follows.
https://archive.org/services/collection-rss.php
This URL returns the most recent additions to the Internet Archive as an RSS feed (a specific collection can also be requested with the collection parameter, as noted later).
curl -s https://archive.org/services/collection-rss.php
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:media="http://search.yahoo.com/mrss/">
<channel>
<link>https://archive.org</link>
<title>Internet Archive</title>
<description>The most recent additions to the Internet Archive collections. This RSS feed is generated dynamically</description>
Since it is an RSS feed, retrieving it with curl returns data with an XML structure.
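Each <item> in this feed carries, among other fields, a title, a link to the item's details page, and a category. Simplified (other fields omitted), an item looks roughly like this, using the first entry from the output shown later:

<item>
  <title>Пёс-2</title>
  <link>https://archive.org/details/NTV_20241114_052500_Pyos-2</link>
  <category>movies/TV-NTV</category>
</item>

These three tags are exactly what the scripts below extract.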
We will format this output using the xmllint command.
Installing xmllint
# For Ubuntu/Debian (xmllint is provided by the libxml2-utils package)
sudo apt-get install libxml2-utils
# For SUSE (xmllint is provided by the libxml2-tools package)
sudo zypper install libxml2-tools
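After installation, you can check that the command is available with:

xmllint --version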
Get and format RSS feed
Next, use xmllint --format to pretty-print the feed so that it is easy to read.
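As an aside, xmllint can also extract individual elements directly with its --xpath option; the script below sticks to --format plus grep/sed, but a minimal sketch of the --xpath approach looks like this:

curl -s "https://archive.org/services/collection-rss.php" | \
xmllint --xpath 'string(/rss/channel/title)' -

This prints only the channel title ("Internet Archive") from the feed.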
The whole retrieval and formatting step is done with the following shell script.
#!/bin/bash
# Create the output folder
WORKDIR="$HOME/bin/internet-archive/latest-archive"
mkdir -p "$WORKDIR"
cd "$WORKDIR"
# Fetch the feed, pretty-print it, keep only the title/link/category lines,
# then strip the XML tags and leading whitespace
curl -s "https://archive.org/services/collection-rss.php" | \
xmllint --format - | \
grep -E "<title>|<link>|<category>" | \
sed 's/<[^>]*>//g' | \
sed 's/^[ \t]*//' > "$WORKDIR/latest_uploads.txt"
# Display the saved list (joining pairs of lines)
awk 'NR%2{printf "%s - ",$0;next;}1' "$WORKDIR/latest_uploads.txt"
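If you save the script under any name you like, for example get-latest.sh, you can run it like this:

chmod +x get-latest.sh
./get-latest.sh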
If you look at the contents of latest_uploads.txt, you will see that the latest Internet Archive feed has been retrieved successfully.
~/bin/internet-archive/latest-archive> cat latest_uploads.txt
# title
# link (URL)
# category
Пёс-2
https://archive.org/details/NTV_20241114_052500_Pyos-2
movies/TV-NTV
2024-11-14-ew
https://archive.org/details/2024-11-14-ew
texts/eugeneweekly
independent-media-central-america 2024-11-04T13:17:00PST to 2024-11-04T05:56:24PST
https://archive.org/details/IMCA-20241104131700-crawler01
web/independent-media-central-america
SUDAN_20241114_063000
https://archive.org/details/SUDAN_20241114_063000
movies/TV-SUDAN
Archive-It Crawl Data: Partner 2517 Collection 22185 Crawl Job 2046385
https://archive.org/details/ARCHIVEIT-22185-2024103116-00001
web/ArchiveIt-Collection-22185
Harakiri (Deluxe Edition)
https://archive.org/details/serj_tankian_harakiri_deluxe_edition_2012-01-01
audio/opensource_audio
Create files for each category
As you can see from latest_uploads.txt above, this is still a huge amount of data, so we will create a shell script that writes a text file for each category.
※ By specifying https://archive.org/services/collection-rss.php?collection=<any category name>, you can fetch an already categorized feed directly, but the amount of data becomes huge compared to https://archive.org/services/collection-rss.php, so I reused latest_uploads.txt, which is easier to control.
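For reference, fetching a single collection's feed directly looks like the following ("nasa" here is just one example of a collection identifier):

curl -s "https://archive.org/services/collection-rss.php?collection=nasa" | xmllint --format - | head

The script below works on the already saved latest_uploads.txt instead.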
#!/bin/bash
INPUT_FILE="latest-archive/latest_uploads.txt"
OUTPUT_DIR="latest-archive"
# Variables holding the entry currently being built up
current_title=""
current_link=""
current_category=""
# Process the file line by line.
# latest_uploads.txt lists each entry as three consecutive lines:
# title, link (https://archive.org/details/...), category (e.g. movies/TV-NTV)
while IFS= read -r line; do
    # Skip empty lines
    [[ -z $line ]] && continue
    if [[ $line == https://archive.org/details/* ]]; then
        # A details URL is the link of the current entry
        current_link="$line"
    elif [[ -n $current_link ]]; then
        # The line right after a link is the category; the entry is complete
        current_category="$line"
        if [[ -n $current_title ]]; then
            # One file per top-level category (movies, texts, web, audio, ...)
            category_file="${OUTPUT_DIR}/${current_category%%/*}_latest.txt"
            echo "Title: $current_title" >> "$category_file"
            echo "Link: $current_link" >> "$category_file"
            echo "" >> "$category_file"
        fi
        current_title=""
        current_link=""
        current_category=""
    else
        # Anything else is the title of the next entry
        current_title="$line"
    fi
done < "$INPUT_FILE"
When you run the script above, text files split by category will be created in the latest-archive directory.
~/bin/internet-archive/latest-archive> ls
# It is OK if categorized text files like these have been generated
audio_latest.txt data_latest.txt latest_uploads.txt texts_latest.txt
collection_latest.txt image_latest.txt movies_latest.txt web_latest.txt
~/bin/internet-archive/latest-archive> cat data_latest.txt
Title: Archive-It Crawl Data: Partner 1067 Collection 23094 Crawl Job 2050793
Link: https://archive.org/details/ARCHIVEIT-23094-2024111407-00000
Title: Nat'l Security Adviser & White House Press Sec. Hold Briefing
Link: https://archive.org/details/CSPAN_20241114_062200_Natl_Security_Adviser__White_House_Press_Sec._Hold_Briefing
Title: spn2-20241114072909
Link: https://archive.org/details/spn2-20241114072909
You can also open these links and download the files directly.
For those who want to complete everything in the terminal, I have also created a shell script that performs the download.
Installing jq
# For SUSE
sudo zypper install jq
# For Ubuntu/Debian
sudo apt-get install jq
By retrieving the metadata first, you can see exactly where the files you need are and what they contain, which avoids downloading unnecessary files.
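For example, you can take a quick look at an item's metadata like this, using the identifier that follows /details/ in its URL (the item here is the one used in the download example further below); each entry of the files array has fields such as .name and .format:

curl -s "https://archive.org/metadata/sematary-truey-jeans-rainbow-bridge-3" | \
jq -r '.files[] | [.name, .format] | @tsv'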
URL encoding is required so that files with special characters or spaces in their names will still be downloaded correctly.
#!/bin/bash
# Usage: ./download-archive.sh <archive.org details URL>
url="$1"
if [ -z "$url" ]; then
    echo "Please set Archive-URL"
    exit 1
fi
# The identifier is the part of the URL after /details/
identifier=$(echo "$url" | sed 's|.*/details/||')
# Fetch the metadata and show the list of files
echo "Available files:"
files=$(curl -s "https://archive.org/metadata/$identifier" | \
    jq -r '.files[] | select(.name!="") | .name' | \
    nl -w1 -s'. ')
echo "$files"
echo -e "\nEnter number to download (1,2,...): "
read number
# Get the file name corresponding to the chosen number
filename=$(echo "$files" | awk -v num="$number" '$1 == num"." {$1=""; print $0}' | xargs)
if [ -n "$filename" ]; then
    echo "Downloading: $filename"
    # URL-encode the file name so spaces and special characters survive
    encoded_filename=$(printf '%s' "$filename" | perl -MURI::Escape -ne 'chomp; print uri_escape($_)')
    curl -L "https://archive.org/download/$identifier/$encoded_filename" \
        -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" \
        --output "$filename"
    filesize=$(stat -f%z "$filename" 2>/dev/null || stat -c%s "$filename" 2>/dev/null)
fi
Here is an example of actual use.
sh download-archive.sh https://archive.org/details/sematary-truey-jeans-rainbow-bridge-3
Available files:
1. SEMATARY - TRUEY JEANS [RAINBOW BRIDGE 3].mp4
2. __ia_thumb.jpg
3. sematary-truey-jeans-rainbow-bridge-3.thumbs/SEMATARY - TRUEY JEANS [RAINBOW BRIDGE 3]_000001.jpg
4. sematary-truey-jeans-rainbow-bridge-3.thumbs/SEMATARY - TRUEY JEANS [RAINBOW BRIDGE 3]_000058.jpg
5. sematary-truey-jeans-rainbow-bridge-3.thumbs/SEMATARY - TRUEY JEANS [RAINBOW BRIDGE 3]_000090.jpg
Enter number to download (1,2,...):
2
Downloading: __ia_thumb.jpg
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 8851 100 8851 0 0 5105 0 0:00:01 0:00:01 --:--:-- 0
Lastly, please note the following points when using it.
- Considering the server load, refrain from continuous downloading
- An opt-out system is in place for rights holders
- General use is legal, but please use it appropriately
End