bokumin.org


Viewing wget 2.20 Through Code.

This article is an English translation of my original Japanese article:

 

 

* Translated automatically by Google.
* Please note that some links or referenced content in this article may be in Japanese.
* Comments in the code were originally written in Japanese.

 

by bokumin

 


 

wget 2.20 was released on November 24th (local time). This update brings major improvements to core functionality, so I would like to introduce them here.

 

You can check the full list of changes at the URL below:
https://gitlab.com/gnuwget/wget2/-/blob/master/NEWS?ref_type=heads

 

 

 

Below is a brief summary of the updates in this release. I hope it is helpful.

 

 

Security-related

 

Don’t log URI userinfo to logs(https://gitlab.com/gnuwget/wget2/-/commit/dc8966d9060533264501bcd269fec4b2dc443df2)

 

A safe_uri field has been added to the IRI structure. It holds the URI without the user information (userinfo) from the URL.

 

wget_iri *wget_iri_parse(const char *url, const char *encoding)
{
    // ... (parsing omitted)

    // If the URL contains credentials
    if (iri->userinfo) {
        iri->safe_uri = create_safe_uri(iri);
    } else {
        // No userinfo: use the original URI as is
        iri->safe_uri = iri->uri;
    }

    return iri;
}

 

wget https://user:password@example.com/file

 

This prevents credentials from remaining in the log file when a user name and password are given in the URL to retrieve a file. It should also reduce the number of findings raised in security audits.
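To illustrate the idea of a "safe" URI, here is a minimal sketch that removes the userinfo part from a URL string. The function strip_userinfo is a hypothetical helper written for this article; it is not wget2's create_safe_uri and may differ from it in details.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of wget2): returns a newly allocated copy of
 * `url` with any "user:password@" part removed, or NULL if allocation fails.
 * The caller must free the result. */
char *strip_userinfo(const char *url)
{
    /* The authority starts after "://" (or at the start if no scheme). */
    const char *authority = strstr(url, "://");
    authority = authority ? authority + 3 : url;

    /* The authority ends at the first '/', '?' or '#'. */
    size_t auth_len = strcspn(authority, "/?#");

    /* Userinfo, if present, ends at the last '@' inside the authority. */
    const char *at = NULL;
    for (size_t i = 0; i < auth_len; i++)
        if (authority[i] == '@')
            at = authority + i;

    size_t prefix_len = (size_t)(authority - url);
    const char *rest = at ? at + 1 : authority;

    char *out = malloc(prefix_len + strlen(rest) + 1);
    if (!out)
        return NULL;

    memcpy(out, url, prefix_len);   /* scheme and "://" */
    strcpy(out + prefix_len, rest); /* host, path, query, fragment */
    return out;
}
```

A logger would then print the stripped string instead of the original URL, so an '@' that appears only in the path is left untouched while credentials in the authority are dropped.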

 

Disable explicit OCSP requests by default(https://gitlab.com/gnuwget/wget2/-/commit/c341fcd1dfd57b3cf5a1f5acb84784571fff3a20)

 

	.check_certificate = 1,
	.check_hostname = 1,
#ifdef WITH_OCSP
	// Changed to disabled by default
	// .ocsp = 1, // before
	.ocsp = 0, // after
	.ocsp_stapling = 1,
#endif
	.ca_type = WGET_SSL_X509_FMT_PEM,

 

Fix segfault when OCSP response is missing(https://gitlab.com/gnuwget/wget2/-/commit/c556a3226aca0e99191b52218117b7967889a9bf)

 

A check was added to make sure the response object and its body are not NULL.

 

// Before
certid = OCSP_cert_to_id(EVP_sha1(), subject_cert, issuer_cert);

if (!(ocspreq = send_ocsp_request(ocsp_uri, certid, &resp)))
    return -1;

// After
certid = OCSP_cert_to_id(EVP_sha1(), subject_cert, issuer_cert);

if (!(ocspreq = send_ocsp_request(ocsp_uri, certid, &resp)) || !resp || !resp->body)
    return -1;

 

Fix OCSP verification of first intermediate certificate(https://gitlab.com/gnuwget/wget2/-/commit/53a8a88e8479fca04fb17f923b0f40781ee6a253)

 

		// if (config.ocsp && it > nvalid) { // before
		// Fixes an off-by-one error
		if (config.ocsp && it >= nvalid) { // after
			char fingerprint[64 * 2 +1];
			int revoked;

 

To summarize the OCSP-related changes:
* Explicit OCSP is now disabled by default.
* Fixed a crash when the OCSP response is missing.
* Improved the OCSP verification of intermediate certificates.

 

Disable TCP Fast Open by default(https://gitlab.com/gnuwget/wget2/-/commit/7a945d31aeb34fc73cf86a494673ae97e069d84d)

 

	.max_redirect = 20,
	.max_threads = 5,
	.dns_caching = 1,
	// .tcp_fastopen = 1, // removed: the default changed from 1 to 0
	// we use 'Wget' here for compatibility, see https://github.com/rockdaboot/wget2/issues/314
	.user_agent = "Wget/"PACKAGE_VERSION,
	.verbose = 1,

 

TCP Fast Open (TFO) is now disabled by default. It seems to have been disabled because of security and privacy concerns (it could be used for tracking by third parties) and because of problems with middleboxes that do not support TFO.

 

Allow option --no-tcp-fastopen to work on Linux kernels >= 4.11(https://gitlab.com/gnuwget/wget2/-/commit/7929bf887c69ffdcbdfb525825bffba4c9e5d6e8)

 

The --no-tcp-fastopen option has been fixed so that it works on Linux kernels 4.11 and later. This leaves the choice to the user.

 

Limit cases where methods are redirected to GET(https://gitlab.com/gnuwget/wget2/-/commit/329d1282caa9ae58105a6b6832138050c492dc28)

 

Method changes during redirects are now strictly controlled. Automatic conversion to GET is limited to POST requests that receive a 301, 302, or 303 response; previously, every 3xx response except 307 switched the request to GET. This prevents unintended method changes on redirect.

 

	// Before
	if (resp->code / 100 == 3 && resp->code != 307)
		job->redirect_get = 1;

	// After
	if (!wget_strcasecmp_ascii(resp->req->method, "POST")) {
		if (resp->code == 301 || resp->code == 302 || resp->code == 303)
			job->redirect_get = 1;
	}
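The rule in the excerpt above can be written as a small stand-alone predicate. This is only a sketch that mirrors the lines quoted here; the function name redirect_becomes_get is hypothetical, and the real commit may cover more cases than this excerpt shows.

```c
#include <stdbool.h>
#include <strings.h>

/* Hypothetical helper: decides whether a redirected request is re-sent as GET.
 * Mirrors the excerpt: only POST is converted, and only on 301, 302 or 303. */
bool redirect_becomes_get(const char *method, int status)
{
    if (strcasecmp(method, "POST") != 0)
        return false;

    return status == 301 || status == 302 || status == 303;
}
```

With this rule a PUT that is redirected with 301 keeps its method, and a POST redirected with 307 is also left unchanged.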

 

Bug fixes

 

Don’t truncate file when -c and -O are combined(https://gitlab.com/gnuwget/wget2/-/commit/1cb578e3e9e86b32f9a5157a598d8ff0de44bd3c)

 

Fixed an issue where the output file was truncated when -c (continue a partial download) and -O (specify the output file name) were used together. A download can now be resumed midway even when an output file name is specified.

 

// Before
		} else {
// After: truncate only when this is not a continued download
		} else if (!config.continue_download) {
			int fd = open(config.output_document, O_WRONLY | O_TRUNC | O_BINARY);

			if (fd != -1)

 

Fix downloading multiple files via HTTP/2(https://gitlab.com/gnuwget/wget2/-/commit/ec27488feadd44b5e126592bc18ff2441f8cae5a)

 

This is literally a fix for downloading multiple files over HTTP/2. Previously, disconnections and crosstalk occurred because stream state was not managed properly within a single TCP connection. The commit history shows that the following part was improved.

 

struct http2_stream_context {
    wget_http_connection *conn;
    wget_http_response *resp;
    wget_decompressor *decompressor;
};

 

*conn → Holds a reference to the HTTP/2 connection and tracks which connection each stream belongs to
*resp → Manages the response data of each stream separately
*decompressor → Manages the decompression state of compressed response data independently for each stream

This allows appropriate management of each stream and enables appropriate tracking of response processing.

 

Fix redirections with --no-parent(https://gitlab.com/gnuwget/wget2/-/commit/55a4c145c80325b2fb0b1fb3768f31094154e5d3)

 

Previously, when --no-parent was specified, the parent URL was not checked properly and valid URLs were skipped as well. This has been fixed.

 

if (config.recursive && !config.parent && !(flags & URL_FLG_REQUISITE)) {
    // Ascending to the parent directory is not allowed by default
    bool ok = false;

    // Check whether the URL matches at least one of the parent directories
    for (int it = 0; it < wget_vector_size(parents); it++) {
        wget_iri *parent = wget_vector_get(parents, it);

        if (!wget_strcmp(parent->host, iri->host)) {
            if (!parent->dirlen || !wget_strncmp(parent->path, iri->path, parent->dirlen)) {
                ok = true;
                break;
            }
        }
    }

    if (!ok) {
        info_printf(_("URL '%s' not followed (parent ascending not allowed)\n"), url);
        goto out;
    }
}
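The core of the check above is a prefix comparison against the directory part of each parent URL. The sketch below isolates that comparison. The function name path_allowed is hypothetical, and it assumes that dirlen means the length of the parent's directory part including the trailing slash, which is how the excerpt appears to use it.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if `path` lies inside the directory part of
 * `parent_path`. `parent_dirlen` is assumed to be the length of that
 * directory part, including its trailing '/'. */
bool path_allowed(const char *parent_path, size_t parent_dirlen, const char *path)
{
    /* A parent located at the top level allows every path on the host. */
    if (parent_dirlen == 0)
        return true;

    return strncmp(parent_path, path, parent_dirlen) == 0;
}
```

For a parent of docs/a/index.html (directory part docs/a/, 7 characters), docs/a/img/x.png is followed while docs/b/x.png is rejected.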

 

Fix --no-parent for denormalized paths(https://gitlab.com/gnuwget/wget2/-/commit/9aeab55d09f9df833bca4467b0a209cea2901ede)

 

The wget_iri_parse() function now calls the normalize_path() function to normalize the path part of the IRI.

 

		c = *s;
		if (c) *s++ = 0;
		wget_iri_unescape_inline((char *)iri->path);
		normalize_path((char *)iri->path); // added
	}
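Normalization matters for --no-parent because a path such as dir/../secret would pass a simple prefix check against dir/ although it really points outside of it. The following is a rough sketch of dot-segment removal for a relative path. remove_dot_segments is a hypothetical function written for illustration; it is not wget2's normalize_path and does not try to cover every edge case.

```c
#include <string.h>

/* Hypothetical sketch (not wget2's normalize_path): collapses "." and ".."
 * segments of a relative path such as "a/b/../c" in place. */
void remove_dot_segments(char *path)
{
    char *out = path;      /* write position */
    const char *in = path; /* read position, never behind `out` */

    while (*in) {
        size_t len = strcspn(in, "/"); /* length of the current segment */
        int has_slash = (in[len] == '/');

        if (len == 1 && in[0] == '.') {
            /* "." : drop the segment */
        } else if (len == 2 && in[0] == '.' && in[1] == '.') {
            /* ".." : remove the previously written segment, if any */
            if (out > path) {
                out--; /* step back over the trailing '/' */
                while (out > path && out[-1] != '/')
                    out--;
            }
        } else {
            /* Ordinary segment: copy it, plus its '/' if there is one */
            memmove(out, in, len);
            out += len;
            if (has_slash)
                *out++ = '/';
        }

        in += len + has_slash;
    }

    *out = '\0';
}
```

After normalization, a/b/../c becomes a/c, so the prefix comparison used for --no-parent sees the directory the URL actually refers to.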

 

Fix status 8 for failed redirection of robots.txt(https://gitlab.com/gnuwget/wget2/-/commit/2b1f266ca639b7712a973c1512a2611d5fce7930)

 

Previously, wget2 incorrectly exited with status code 8 when the request for robots.txt was answered with a 302 redirect and the redirected request then returned a 404 error. A 404 error for robots.txt is supposed to be ignored, but this was not handled properly when a redirect was involved. The following line has been added:

 

new_job->robotstxt = job->robotstxt;

 

This will now copy the robotstxt flag from the original job to the new job when redirecting. Redirected requests will now be recognized as robots.txt, and 404 errors will be handled appropriately.

 

Fix IPv6 address representation(https://gitlab.com/gnuwget/wget2/-/commit/ff881ed20182950accf77cf70bcaf51ec75d1a87)

 

if (sscanf(buf, "%63[0-9.:] %255[a-zA-Z0-9.-]", ip, name) != 2) // before
if (sscanf(buf, "%63s %255s", ip, name) != 2) // after

 

The previous code used the character class [0-9.:] for the address, which rejected some characters that are valid in IPv6 addresses (hexadecimal letters, [ ], and so on). It now uses %s, as shown above.

 

*The field width limits remain: 63 characters for the address and 255 characters for the host name.
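The difference between the two format strings is easy to try in isolation. The sketch below wraps each sscanf call in a small function; the names parse_entry_old and parse_entry_new are hypothetical and exist only for this comparison.

```c
#include <stdbool.h>
#include <stdio.h>

/* Old pattern: the address field accepts only digits, '.' and ':'. */
bool parse_entry_old(const char *buf, char ip[64], char name[256])
{
    return sscanf(buf, "%63[0-9.:] %255[a-zA-Z0-9.-]", ip, name) == 2;
}

/* New pattern: both fields accept any run of non-whitespace characters. */
bool parse_entry_new(const char *buf, char ip[64], char name[256])
{
    return sscanf(buf, "%63s %255s", ip, name) == 2;
}
```

A line such as "fe80::1 router.lan" is rejected by the old pattern because the address starts with a hexadecimal letter, while the new pattern splits it into the two fields. An IPv4 line such as "192.0.2.1 example.com" is accepted by both.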

 

Fix --dns-cache-preload for IPv6 (https://gitlab.com/gnuwget/wget2/-/commit/ff881ed20182950accf77cf70bcaf51ec75d1a87)

 

Same modification as above.

 

Fix --restrict-file-names to be backwards compatible with wget 1.x(https://gitlab.com/gnuwget/wget2/-/commit/284954553613f75e57ef107ceaa06ae4d9dd8c59)

 

This means that --restrict-file-names now behaves the same way as in wget 1.x.

 

--restrict-file-names=windows,ascii,lowercase // multiple modes can now be combined

 

Several improvements to the WolfSSL code(https://gitlab.com/gnuwget/wget2/-/commit/1d6632a31c5fbec2145762c5fffcf31af313e47a)

 

Missing semicolons were added.

 

// Before
	XFREE(subject, 0, DYNAMIC_TYPE_OPENSSL)
	XFREE(issuer, 0, DYNAMIC_TYPE_OPENSSL)
// After
	XFREE(subject, 0, DYNAMIC_TYPE_OPENSSL);
	XFREE(issuer, 0, DYNAMIC_TYPE_OPENSSL);

 

New features

 

Support connecting with HTTP/1.0 proxies(https://gitlab.com/gnuwget/wget2/-/commit/f5344eb415a8b221e1b887d02b31090c6459bfd8)

 

Previously, only HTTP/1.1 was allowed, but now HTTP/1.0 is also allowed. This improves compatibility with various proxy servers.

 

	if (wget_strncasecmp_ascii(sbuf, "HTTP/1.1 200", 12)) { // before
	if (wget_strncasecmp_ascii(sbuf, "HTTP/1.1 200", 12) && wget_strncasecmp_ascii(sbuf, "HTTP/1.0 200", 12)) { // after
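Note that the condition above is the failure branch: it is taken when the status line matches neither prefix. Written as a positive check, the same test might look like the sketch below. The name proxy_connect_succeeded is hypothetical, and the POSIX function strncasecmp stands in for wget_strncasecmp_ascii.

```c
#include <stdbool.h>
#include <strings.h>

/* Hypothetical helper: true if the proxy answered the CONNECT request with
 * a 200 status line, sent as either HTTP/1.1 or HTTP/1.0. */
bool proxy_connect_succeeded(const char *status_line)
{
    return strncasecmp(status_line, "HTTP/1.1 200", 12) == 0
        || strncasecmp(status_line, "HTTP/1.0 200", 12) == 0;
}
```

Only the first 12 characters are compared, so the reason phrase that follows the status code does not matter.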

 

Ignore 1xx HTTP responses for HTTP/1.1(https://gitlab.com/gnuwget/wget2/-/commit/fa638f597c3eefa3cc87493debe4cf075aed3c55)

 

A process has been implemented that effectively ignores 1xx responses and waits for the next response.

 

skip_1xx: 
if (nread < 4) 
    continue;

if (nread - nbytes <= 4) 

if (H_10X(resp->code)) { 
    wget_http_free_response(&resp); 
    p += 4; 
    // Compute the number of bytes read so far
    nbytes = nread -= (p - buf); 
    // Move the remaining data to the start of the buffer
    memmove(buf, p, nread + 1); 
    goto skip_1xx; // ignore the interim response; no body is expected
}
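The same idea can be shown on a plain string buffer: while the response at the front of the buffer has a 1xx status, skip past its terminating blank line and look at the next one. This is a simplified sketch with a hypothetical function name; it does not reuse wget2's buffer handling and assumes the buffer starts with a status line.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the first response in `buf`
 * whose status is not 1xx, or NULL if no complete final response header
 * has been received yet. Interim responses are assumed to have no body. */
const char *skip_interim_responses(const char *buf)
{
    for (;;) {
        /* Status line looks like "HTTP/1.1 100 Continue". */
        const char *sp = strchr(buf, ' ');
        if (!sp)
            return NULL;

        int code = atoi(sp + 1);
        if (code / 100 != 1)
            return buf; /* final response found */

        /* Skip the interim response up to and including the blank line. */
        const char *end = strstr(buf, "\r\n\r\n");
        if (!end)
            return NULL;

        buf = end + 4;
    }
}
```

If only "100 Continue" has arrived so far, the function returns NULL and the caller keeps reading, which corresponds to waiting for the next response.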

 

Fix ignoring connect timeout (regression)(https://gitlab.com/gnuwget/wget2/-/commit/21f41932af46faa9a144b7025d99270353021a61)

 

	// Split the timeout (milliseconds) into seconds and microseconds
	struct timeval tv = {
		.tv_sec = tcp->connect_timeout / 1000,
		.tv_usec = tcp->connect_timeout % 1000 * 1000
	};
	// Set the socket option SO_SNDTIMEO
	if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) == -1)
		error_printf(_("Failed to set socket option SO_SNDTIMEO\n"));
}

 

This is a regression fix rather than a new feature. It appears that a send timeout (SO_SNDTIMEO) derived from the connect timeout is now set on the socket, which prevents a connection attempt from blocking for a long time.
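The conversion used for the timeout can be checked on its own. The sketch below assumes, as the division by 1000 suggests, that the timeout is given in milliseconds; the function name timeout_ms_to_timeval is hypothetical.

```c
#include <sys/time.h>

/* Hypothetical helper: converts a timeout in milliseconds into the
 * seconds + microseconds pair that setsockopt() expects for SO_SNDTIMEO. */
struct timeval timeout_ms_to_timeval(int timeout_ms)
{
    struct timeval tv = {
        .tv_sec = timeout_ms / 1000,           /* whole seconds */
        .tv_usec = (timeout_ms % 1000) * 1000  /* remainder in microseconds */
    };

    return tv;
}
```

For example, a timeout of 2500 ms becomes 2 seconds plus 500000 microseconds.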

 

Accept --progress=dot:... for backwards compatibility(https://gitlab.com/gnuwget/wget2/-/commit/e8f1e99c96a8303421e66b0feda1651a11c8b250)

 

The --progress option (progress display) has been improved. wget_strncasecmp_ascii now performs a prefix match on the option value, and the check val[3] == ':' || val[3] == 0 accepts either a colon or the end of the string after the type name.
*The dot type is not implemented at this time, so it only prints an informational message.
The logic that sets the config.force_progress flag was also updated.

 

	if (!wget_strcasecmp_ascii(val, "none"))
		*((char *)opt->var) = PROGRESS_TYPE_NONE;
		
	// else if (!wget_strncasecmp_ascii(val, "bar", 3)) { // before
	else if (!wget_strncasecmp_ascii(val, "bar", 3) && (val[3] == ':' || val[3] == 0)) { // after

		*((char *)opt->var) = PROGRESS_TYPE_BAR;

		// if (!wget_strncasecmp_ascii(val+3, ":force", 6) || !wget_strncasecmp_ascii(val+3, ":noscroll:force", 15)) { // before
		if (!wget_strncasecmp_ascii(val+4, "force", 5) || !wget_strncasecmp_ascii(val+4, "noscroll:force", 14)) { // after
			config.force_progress = true;
		}
	// } else if (!wget_strcasecmp_ascii(val, "dot")) { // before
	} else if (!wget_strncasecmp_ascii(val, "dot", 3) && (val[3] == ':' || val[3] == 0)) { // after
		// Wget compatibility, whether want to support 'dot' depends on user feedback.
		info_printf(_("Progress type '%s' ignored. It is not implemented yet\n"), val);
	} else {

 

This now supports different formats of progress display options.
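The prefix rule (the type name followed by either ':' or the end of the string) can be sketched as a small classifier. The enum and function names below are hypothetical and only model the matching; they do not reproduce how wget2 stores the option.

```c
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>

enum progress_type { PROGRESS_NONE, PROGRESS_BAR, PROGRESS_DOT, PROGRESS_INVALID };

/* True if `val` is exactly `name` or starts with `name` followed by ':'.
 * `n` is the length of `name`. */
static bool has_type_prefix(const char *val, const char *name, size_t n)
{
    /* val[n] is only read when the first n characters matched. */
    return strncasecmp(val, name, n) == 0 && (val[n] == ':' || val[n] == '\0');
}

/* Hypothetical classifier for the value of --progress. */
enum progress_type classify_progress(const char *val)
{
    if (strcasecmp(val, "none") == 0)
        return PROGRESS_NONE;
    if (has_type_prefix(val, "bar", 3))
        return PROGRESS_BAR;
    if (has_type_prefix(val, "dot", 3))
        return PROGRESS_DOT;

    return PROGRESS_INVALID;
}
```

With this rule, bar:force and dot:mega are recognized, while a value such as barn is no longer mistaken for bar.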

 

Fix possible deadlock when combining --no-clobber and --no-parent(https://gitlab.com/gnuwget/wget2/-/commit/8ebd0a25f068c34209dd42c6fea4db6e3b381626)

 

The URL duplication check and locking process in a multi-threaded environment and the conditions for recursive downloads have been improved.

 

if (wget_hashmap_put(known_urls, wget_strmemdup(buf.data, buf.length), NULL) == 0) { // before

// After
wget_thread_mutex_lock(known_urls_mutex);
int rc = wget_hashmap_put(known_urls, wget_strmemdup(buf.data, buf.length), NULL);
wget_thread_mutex_unlock(known_urls_mutex);

if (rc == 0) {

 

if (config.recursive && (!config.level || (job && job->level < config.level + config.page_requisites))){ // before
if (config.recursive && (!config.level || !job || (job && job->level < config.level + config.page_requisites))) { // after

 

This change reduces the risk of potential deadlock conditions.

 

Fix xattr reading of user.mime_type(https://gitlab.com/gnuwget/wget2/-/commit/ac9c84b3bab60a0cd1100ac6f189fc526694e95c)

 

The extended attribute name used when reading the MIME type was corrected from “user.mimetype” to “user.mime_type”.

 

		if (read_xattr_metadata("user.mimetype", _mimetype, sizeof(_mimetype), fd) < 0) // before
		if (read_xattr_metadata("user.mime_type", _mimetype, sizeof(_mimetype), fd) < 0) // after

 

Fix robots.txt parser(https://gitlab.com/gnuwget/wget2/-/commit/07b15e71f4d72c53fb10fdeb28a188b94b6c35ac)

 

The robots.txt parsing logic has been significantly improved. This makes it possible to handle robots.txt files in a wider variety of formats.

 

static bool parse_record_field(const char **data, const char *field, size_t field_length)
{
	advance_ws(data);

	if (wget_strncasecmp_ascii(*data, field, field_length))
		return false;

	*data += field_length;
	advance_ws(data);

	if (**data != ':')
		return false;

	*data += 1;
	advance_ws(data);

	return true;
}

 

Add fetchmail compatibility for user/password in .netrc(https://gitlab.com/gnuwget/wget2/-/commit/ae24f83fd06d8834188957830eeb043c5b7d5cc9)

 

The following code has been modified to improve compatibility with .netrc files.

 

// Before
else if (!strcmp(key, "login")) {
}
else if (!strcmp(key, "password")) {
}

// After
else if (!strcmp(key, "login") || !strcmp(key, "user")) {
    // "user" is for fetchmail compatibility
    if (!netrc.login)
        netrc.login = wget_strmemdup(p, linep - p);
}
else if (!strcmp(key, "password") || !strcmp(key, "passwd")) {
    // "passwd" is for fetchmail compatibility
    if (!netrc.password) {
        if (!escaped)
            netrc.password = wget_strmemdup(p, linep - p);
    }
}

 

In other words, additional key names are now accepted when reading the login and password.

 

Improve support for non-standard cookie timestamps(https://gitlab.com/gnuwget/wget2/-/commit/7bf93ff6c64520e2931b1c79663851b188ee2016)

 

It is now possible to support date formats that were previously non-standard.
Specifically, the format is Sun Nov 26 2023 21:24:47 .

 

else if (sscanf(s, " %*s %3s %2d %4d %2d:%2d:%2d", // non-standard: Sun Nov 26 2023 21:24:47
                mname, &day, &year, &hour, &min, &sec) == 6) {
}
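The format string from the excerpt can be tried in a stand-alone function. The struct simple_date and the name parse_nonstandard_date are hypothetical and exist only to hold the parsed fields.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical container for the parsed fields. */
struct simple_date {
    char month[4]; /* three-letter month name plus terminator */
    int day, year, hour, min, sec;
};

/* Parses the non-standard form "Sun Nov 26 2023 21:24:47".
 * The weekday is matched by %*s and discarded. */
bool parse_nonstandard_date(const char *s, struct simple_date *d)
{
    return sscanf(s, " %*s %3s %2d %4d %2d:%2d:%2d",
                  d->month, &d->day, &d->year,
                  &d->hour, &d->min, &d->sec) == 6;
}
```

A date in the usual cookie form, such as "Sun, 26 Nov 2023 21:24:47 GMT", does not match this pattern because the day number appears where the month name is expected, so the other branches of the parser still handle it.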

 

Add libproxy support(https://gitlab.com/gnuwget/wget2/-/commit/1a886595e69f54247c70a1e553676407fc8028c7)

 

libproxy support has been added.
Specifically, an --enable-libproxy option was added to configure.ac:

 

# libproxy support
with_libproxy=no
AC_ARG_ENABLE(libproxy,
  [  --enable-libproxy       libproxy support for system wide proxy configuration])
AS_IF([test "${enable_libproxy}" = "yes"], [
  with_libproxy=yes
  PKG_CHECK_MODULES([LIBPROXY], [libproxy-1.0], [
    LIBS="$LIBPROXY_LIBS $LIBS"
    CFLAGS="$LIBPROXY_CFLAGS $CFLAGS"
    AC_DEFINE([HAVE_LIBPROXY], [1], [Define if using libproxy.])
  ])
])

 

Code that retrieves the system-wide proxy settings was added to libwget/http.c:

 

{
	pxProxyFactory *pf = px_proxy_factory_new();
	if (pf) {
		char **proxies = px_proxy_factory_get_proxies(pf, iri->uri);

		if (proxies) {
			if (proxies[0]) {
				if (strcmp (proxies[0], "direct://") != 0) {
					wget_iri *proxy_iri = wget_iri_parse(proxies[0], "utf-8");
					host = strdup(proxy_iri->host);
					port = proxy_iri->port;

					if (proxy_iri->scheme == WGET_IRI_SCHEME_HTTP) {
						ssl = false;
						conn->proxied = 1;
					} else {
						ssl = true;
						need_connect = true;
					}
					wget_iri_free(&proxy_iri);
				}
			}

			px_proxy_factory_free_proxies(proxies);
		}

		px_proxy_factory_free (pf);
	}
}

 

You can now use libproxy to automatically select the best proxy depending on the destination.

 

Other

 

Add instruction on how to cross-build wget2.exe via docker(https://gitlab.com/gnuwget/wget2/-/commit/045976cf8f046477efea081d4ea9e336cb5ce15b)

 

Documentation and instructions for cross-compiling for Windows have been added.
Please refer to the link for details.
In summary:
1. Set up a cross-compilation environment using a Dockerfile
2. Build a static binary (wget2.exe) for Windows
3. Copy the generated binary to the host machine
*Optional steps are provided to remove debugging symbols and compress the executable.

 

Don’t request preferred mime type for single file downloads
Slightly improved compatibility with LibreSSL

 

I looked for the commits for these two items but could not find them. This may be updated in the future.

 

 

More than just technical improvements, this release strengthens security through userinfo protection, OCSP control, privacy-friendly TCP settings, and safer redirect handling.
Not logging userinfo from URIs reduces the risk of credential leakage, and disabling TCP Fast Open by default limits the possibility of network tracking. I also felt that the developers focused on practical security and reliability improvements, such as more stable multi-file downloads over HTTP/2 and better robots.txt parsing.

 
