Get latest data from Internet Archive
This article is a translation of the following article of mine:
Original: Internet Archiveから最新データを取得する(curl,xmllint,jq)
* Translated automatically by Google.
* Please note that some links or referenced content in this article may be in Japanese.
* Comments in the code are basically in Japanese.
by bokumin
Do you know about the Internet Archive? It has recently been the subject of a lawsuit and considerable controversy, but it is a non-profit organization founded in 1996, and you may have used it without even knowing it.
This article explains how to obtain the latest data from the Internet Archive using the Linux commands curl, xmllint, and jq.
What is the Internet Archive?
Internet Archive is an American organization that operates the Wayback Machine, a well-known archive viewing service for WWW and multimedia materials. Headquarters are located in the Richmond District of San Francisco, California. Archives contain copies of web pages (web archives) that are collected automatically by programs or manually by users, and are called “WWW snapshots.” Other items include software, movies, books, and recorded data (including recordings of live performances with permission from music bands, etc.). The archive provides these materials free of charge. (Excerpt from Wikipedia)
Originally it started out as a way to preserve web pages, but nowadays software, movies, books, audio recordings, and other materials uploaded by users are also made available to the public.
Is the Internet Archive illegal?
As a non-profit library based on fair use principles, the Internet Archive has a public interest role in preserving cultural and historical materials.
The bottom line is that it is legal based on the same principle as borrowing and reading books for free at a regular library.
In addition, an opt-out system is provided for rights holders, so while it cannot be said that everything is completely free of legal issues, using the Internet Archive in the normal way is not illegal.
As the name suggests, the Internet Archive is a site that is published and usable on the Internet.
Therefore, its data can also be retrieved with commands such as curl.
*Please refrain from continuous downloading with scripts and the like, as this puts a load on the other party's server.
Now we will actually retrieve data from the Internet Archive.
Get data from the Internet Archive
This time, I would like to obtain a list of the latest data from the Internet Archive.
The URL to generate the Internet Archive RSS feed is as follows.
https://archive.org/services/collection-rss.php
This URL returns the most recent additions to the Internet Archive as an RSS feed (a specific collection can also be requested with the collection parameter, as noted later).
curl -s https://archive.org/services/collection-rss.php
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:media="http://search.yahoo.com/mrss/">
<channel>
<link>https://archive.org</link>
<title>Internet Archive</title>
<description>The most recent additions to the Internet Archive collections. This RSS feed is generated dynamically</description>
Since it is an RSS feed, retrieving it with curl returns data with an XML structure.
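Each <item> in this feed carries, among other fields, a title, a link to the item's details page, and a category. Simplified (other fields omitted), an item looks roughly like this, using the first entry from the output shown later:

<item>
  <title>Пёс-2</title>
  <link>https://archive.org/details/NTV_20241114_052500_Pyos-2</link>
  <category>movies/TV-NTV</category>
</item>

These three tags are exactly what the scripts below extract.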
We will format this output using the xmllint command.
Installing xmllint
# For Ubuntu/Debian (xmllint is provided by the libxml2-utils package)
sudo apt-get install libxml2-utils
# For SUSE (xmllint is provided by the libxml2-tools package)
sudo zypper install libxml2-tools
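After installation, you can check that the command is available with:

xmllint --version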
Get and format RSS feed
Next, use xmllint --format to pretty-print the feed so that it is easy to read.
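As an aside, xmllint can also extract individual elements directly with its --xpath option; the script below sticks to --format plus grep/sed, but a minimal sketch of the --xpath approach looks like this:

curl -s "https://archive.org/services/collection-rss.php" | \
xmllint --xpath 'string(/rss/channel/title)' -

This prints only the channel title ("Internet Archive") from the feed.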
The whole retrieval and formatting step is done with the following shell script.
#!/bin/bash
# Create the output folder
WORKDIR="$HOME/bin/internet-archive/latest-archive"
mkdir -p "$WORKDIR"
cd "$WORKDIR"
# Fetch the feed, pretty-print it, keep only the title/link/category lines,
# then strip the XML tags and leading whitespace
curl -s "https://archive.org/services/collection-rss.php" | \
xmllint --format - | \
grep -E "<title>|<link>|<category>" | \
sed 's/<[^>]*>//g' | \
sed 's/^[ \t]*//' > "$WORKDIR/latest_uploads.txt"
# Display the saved list (joining pairs of lines)
awk 'NR%2{printf "%s - ",$0;next;}1' "$WORKDIR/latest_uploads.txt"
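If you save the script under any name you like, for example get-latest.sh, you can run it like this:

chmod +x get-latest.sh
./get-latest.sh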
If you look at the contents of latest_uploads.txt, you will see that the latest Internet Archive feed has been retrieved successfully.
~/bin/internet-archive/latest-archive> cat latest_uploads.txt
# title
# link (URL)
# category
Пёс-2
https://archive.org/details/NTV_20241114_052500_Pyos-2
movies/TV-NTV
2024-11-14-ew
https://archive.org/details/2024-11-14-ew
texts/eugeneweekly
independent-media-central-america 2024-11-04T13:17:00PST to 2024-11-04T05:56:24PST
https://archive.org/details/IMCA-20241104131700-crawler01
web/independent-media-central-america
SUDAN_20241114_063000
https://archive.org/details/SUDAN_20241114_063000
movies/TV-SUDAN
Archive-It Crawl Data: Partner 2517 Collection 22185 Crawl Job 2046385
https://archive.org/details/ARCHIVEIT-22185-2024103116-00001
web/ArchiveIt-Collection-22185
Harakiri (Deluxe Edition)
https://archive.org/details/serj_tankian_harakiri_deluxe_edition_2012-01-01
audio/opensource_audio
Create files for each category
As you can see from latest_uploads.txt above, this is still a huge amount of data, so we will create a shell script that writes a text file for each category.
※ By specifying https://archive.org/services/collection-rss.php?collection=<any category name>, you can fetch an already categorized feed directly, but the amount of data becomes huge compared to https://archive.org/services/collection-rss.php, so I reused latest_uploads.txt, which is easier to control.
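For reference, fetching a single collection's feed directly looks like the following ("nasa" here is just one example of a collection identifier):

curl -s "https://archive.org/services/collection-rss.php?collection=nasa" | xmllint --format - | head

The script below works on the already saved latest_uploads.txt instead.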
#!/bin/bash
INPUT_FILE="latest-archive/latest_uploads.txt"
OUTPUT_DIR="latest-archive"
# Variables holding the entry currently being built up
current_title=""
current_link=""
current_category=""
# Process the file line by line.
# latest_uploads.txt lists each entry as three consecutive lines:
# title, link (https://archive.org/details/...), category (e.g. movies/TV-NTV)
while IFS= read -r line; do
    # Skip empty lines
    [[ -z $line ]] && continue
    if [[ $line == https://archive.org/details/* ]]; then
        # A details URL is the link of the current entry
        current_link="$line"
    elif [[ -n $current_link ]]; then
        # The line right after a link is the category; the entry is complete
        current_category="$line"
        if [[ -n $current_title ]]; then
            # One file per top-level category (movies, texts, web, audio, ...)
            category_file="${OUTPUT_DIR}/${current_category%%/*}_latest.txt"
            echo "Title: $current_title" >> "$category_file"
            echo "Link: $current_link" >> "$category_file"
            echo "" >> "$category_file"
        fi
        current_title=""
        current_link=""
        current_category=""
    else
        # Anything else is the title of the next entry
        current_title="$line"
    fi
done < "$INPUT_FILE"
When you run the script above, text files split by category will be created in the latest-archive directory.
~/bin/internet-archive/latest-archive> ls
# It is OK if categorized text files like these have been generated
audio_latest.txt data_latest.txt latest_uploads.txt texts_latest.txt
collection_latest.txt image_latest.txt movies_latest.txt web_latest.txt
~/bin/internet-archive/latest-archive> cat data_latest.txt
Title: Archive-It Crawl Data: Partner 1067 Collection 23094 Crawl Job 2050793
Link: https://archive.org/details/ARCHIVEIT-23094-2024111407-00000
Title: Nat'l Security Adviser & White House Press Sec. Hold Briefing
Link: https://archive.org/details/CSPAN_20241114_062200_Natl_Security_Adviser__White_House_Press_Sec._Hold_Briefing
Title: spn2-20241114072909
Link: https://archive.org/details/spn2-20241114072909
You can also open these links and download the files directly.
For those who want to complete everything in the terminal, I have also created a shell script that performs the download.
Installing jq
# For SUSE
sudo zypper install jq
# For Ubuntu/Debian
sudo apt-get install jq
By retrieving the metadata first, you can see exactly where the files you need are and what they contain, which avoids downloading unnecessary files.
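For example, you can take a quick look at an item's metadata like this, using the identifier that follows /details/ in its URL (the item here is the one used in the download example further below); each entry of the files array has fields such as .name and .format:

curl -s "https://archive.org/metadata/sematary-truey-jeans-rainbow-bridge-3" | \
jq -r '.files[] | [.name, .format] | @tsv'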
URL encoding is required so that files with special characters or spaces in their names will still be downloaded correctly.
#!/bin/bash
# Usage: ./download-archive.sh <archive.org details URL>
url="$1"
if [ -z "$url" ]; then
    echo "Please set Archive-URL"
    exit 1
fi
# The identifier is the part of the URL after /details/
identifier=$(echo "$url" | sed 's|.*/details/||')
# Fetch the metadata and show the list of files
echo "Available files:"
files=$(curl -s "https://archive.org/metadata/$identifier" | \
    jq -r '.files[] | select(.name!="") | .name' | \
    nl -w1 -s'. ')
echo "$files"
echo -e "\nEnter number to download (1,2,...): "
read number
# Get the file name corresponding to the chosen number
filename=$(echo "$files" | awk -v num="$number" '$1 == num"." {$1=""; print $0}' | xargs)
if [ -n "$filename" ]; then
    echo "Downloading: $filename"
    # URL-encode the file name so spaces and special characters survive
    encoded_filename=$(printf '%s' "$filename" | perl -MURI::Escape -ne 'chomp; print uri_escape($_)')
    curl -L "https://archive.org/download/$identifier/$encoded_filename" \
        -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" \
        --output "$filename"
    filesize=$(stat -f%z "$filename" 2>/dev/null || stat -c%s "$filename" 2>/dev/null)
fi
Here is an example of actual use.
sh download-archive.sh https://archive.org/details/sematary-truey-jeans-rainbow-bridge-3
Available files:
1. SEMATARY - TRUEY JEANS [RAINBOW BRIDGE 3].mp4
2. __ia_thumb.jpg
3. sematary-truey-jeans-rainbow-bridge-3.thumbs/SEMATARY - TRUEY JEANS [RAINBOW BRIDGE 3]_000001.jpg
4. sematary-truey-jeans-rainbow-bridge-3.thumbs/SEMATARY - TRUEY JEANS [RAINBOW BRIDGE 3]_000058.jpg
5. sematary-truey-jeans-rainbow-bridge-3.thumbs/SEMATARY - TRUEY JEANS [RAINBOW BRIDGE 3]_000090.jpg
Enter number to download (1,2,...):
2
Downloading: __ia_thumb.jpg
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 8851 100 8851 0 0 5105 0 0:00:01 0:00:01 --:--:-- 0
Lastly, please note the following points when using it.
- Considering the server load, refrain from continuous downloading
- An opt-out system is in place for rights holders
- General use is legal, but please use it appropriately
End