Download File from the Internet (2024)

download.file {utils} R Documentation

Description

This function can be used to download a file from the Internet.

Usage

download.file(url, destfile, method, quiet = FALSE, mode = "w", cacheOK = TRUE, extra = getOption("download.file.extra"), headers = NULL, ...)

Arguments

url

a character string (or longer vector for the "libcurl" method) naming the URL of a resource to be downloaded.

destfile

a character string (or vector, see the url argument) with the file path where the downloaded file is to be saved. Tilde-expansion is performed.

method

Method to be used for downloading files. Current download methods are "internal", "libcurl", "wget", "curl" and "wininet" (Windows only), and there is a value "auto": see ‘Details’ and ‘Note’.

The method can also be set through the option "download.file.method": see options().

quiet

If TRUE, suppress status messages (if any), and the progress bar.

mode

character. The mode with which to write the file. Useful values are "w", "wb" (binary), "a" (append) and "ab". Not used for methods "wget" and "curl". See also ‘Details’, notably about using "wb" for Windows.

cacheOK

logical. Is a server-side cached value acceptable?

extra

character vector of additional command-line arguments for the "wget" and "curl" methods.

headers

named character vector of additional HTTP headers to use in HTTP[S] requests. It is ignored for non-HTTP[S] URLs. The User-Agent header taken from the HTTPUserAgent option (see options) is automatically used as the first header.

...

allow additional arguments to be passed, unused.
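
As an illustration of the headers argument, the following is a minimal sketch of assembling a named character vector of extra request headers. The helper make_headers(), the token value and the URL in the final comment are hypothetical and not part of utils; the request itself is left as a comment because the URL is made up.

```r
# Hypothetical helper: build extra HTTP headers as a named character vector,
# which is the form the 'headers' argument expects.
make_headers <- function(token, accept = "application/json") {
  c(Authorization = paste("Bearer", token), Accept = accept)
}

hdrs <- make_headers("PLACEHOLDER-TOKEN")
print(hdrs)

# With a real HTTP[S] URL the vector would be passed along like this
# (example.org is a placeholder, so the call is not run here):
# download.file("https://example.org/data.json", "data.json", headers = hdrs)
```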

Details

The function download.file can be used to download a single file as described by url from the internet and store it in destfile.

The url must start with a scheme such as ‘http://’, ‘https://’ or ‘file://’. Which methods support which schemes varies by R version, but method = "auto" will try to find a method which supports the scheme.

For method = "auto" (the default) currently the "internal" method is used for ‘file://’ URLs and "libcurl" for all others.
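
A minimal sketch of the basic call, using a ‘file://’ URL so that it runs without network access. The temporary file standing in for a remote resource is an assumption made for illustration; with the default method = "auto" this should be handled by the "internal" method as described above.

```r
# Create a small local file to stand in for a remote resource.
src <- tempfile(fileext = ".txt")
writeLines(c("alpha", "beta"), src)

# Build a file:// URL from the absolute path of that file.
src_url <- paste0("file://", normalizePath(src, winslash = "/"))

# Download it to a second temporary file; 'method' is left at its default.
dest <- tempfile(fileext = ".txt")
status <- download.file(src_url, dest, quiet = TRUE)

print(status)          # 0 indicates success
print(readLines(dest))
```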

Support for method "libcurl" was optional on Windows prior to R 4.2.0: use capabilities("libcurl") to see if it is supported on an earlier version. It uses an external library of that name (https://curl.se/libcurl/) against which R can be compiled.

When method "libcurl" is used, there is support for simultaneous downloads, so url and destfile can be character vectors of the same length greater than one (but the method has to be specified explicitly and not via "auto"). For a single URL and quiet = FALSE a progress bar is shown in interactive use.
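
The vectorized form might be sketched as follows. Local ‘file://’ URLs are used as stand-ins for remote resources (an assumption, so that the example runs offline; it has been written with Unix-alikes in mind), and the call is guarded by capabilities("libcurl").

```r
# Two small local files stand in for remote resources.
src <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))
writeLines("first", src[1])
writeLines("second", src[2])

urls  <- paste0("file://", normalizePath(src, winslash = "/"))
dests <- file.path(tempdir(), paste0("copy-", seq_along(urls), ".txt"))

# url and destfile must have the same length, and the method must be
# named explicitly rather than chosen via "auto".
stopifnot(length(urls) == length(dests))

if (capabilities("libcurl")) {
  status <- download.file(urls, dests, method = "libcurl", quiet = TRUE)
  print(status)
  print(file.exists(dests))
}
```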

Nowadays the "internal" method only supports the ‘file://’ scheme (for which it is the default). On Windows the "wininet" method currently supports ‘file://’ and (but deprecated with a warning) ‘http://’ and ‘https://’ schemes.

For methods "wget" and "curl" a system call is made to the tool given by method, and the respective program must be installed on your system and be in the search path for executables. They will block all other activity on the R process until they complete: this may make a GUI unresponsive.

cacheOK = FALSE is useful for ‘http://’ and ‘https://’ URLs: it will attempt to get a copy directly from the site rather than from an intermediate cache. It is used by available.packages.

The "libcurl" and "wget" methods follow ‘http://’ and ‘https://’ redirections to any scheme they support. (For method "curl" use argument extra = "-L". To disable redirection in wget, use extra = "--max-redirect=0".) The "wininet" method supports some redirections but not all. (For method "libcurl", messages will quote the endpoint of redirections.)

See url for how ‘file://’ URLs are interpreted, especially on Windows. The "internal" and "wininet" methods do not percent-decode, but the "libcurl" and "curl" methods do: method "wget" does not support them.

Most methods do not percent-encode special characters such as spaces in URLs (see URLencode), but it seems the "wininet" method does.
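
Since most methods leave special characters alone, a URL containing spaces can be encoded before it is passed on. A short sketch using URLencode from utils; the URL itself is a made-up placeholder.

```r
# A placeholder URL whose path contains spaces.
raw_url <- "https://example.org/files/annual report 2024.csv"

# URLencode() percent-encodes the spaces and, by default, leaves
# reserved characters such as ':' and '/' untouched.
encoded <- URLencode(raw_url)
print(encoded)

# URLdecode() reverses the encoding.
print(identical(URLdecode(encoded), raw_url))
```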

The remaining details apply to the "wininet" and "libcurl" methods only.

The timeout for many parts of the transfer can be set by the option timeout which defaults to 60 seconds. This is often insufficient for downloads of large files (50MB or more) and so should be increased when download.file is used in packages to do so. Note that the user can set the default timeout by the environment variable R_DEFAULT_INTERNET_TIMEOUT in recent versions of R, so to ensure that this is not decreased packages should use something like

 options(timeout = max(300, getOption("timeout"))) 

(It is unrealistic to require download times of less than 1s/MB.)
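
In package code it may be preferable to raise the timeout only for the duration of a download and restore the user's setting afterwards. One possible sketch follows; with_min_timeout() is a hypothetical helper, not part of utils.

```r
# Hypothetical helper: evaluate 'expr' with the timeout raised to at least
# 'seconds', then restore whatever value the user had set.
with_min_timeout <- function(expr, seconds = 300) {
  old <- options(timeout = max(seconds, getOption("timeout")))
  on.exit(options(old), add = TRUE)
  force(expr)  # 'expr' is evaluated lazily, i.e. after the option is raised
}

before <- getOption("timeout")
inside <- with_min_timeout(getOption("timeout"))
after  <- getOption("timeout")

print(c(before = before, inside = inside, after = after))
```

A call such as with_min_timeout(download.file(url, destfile)) would then run the transfer under the longer timeout.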

The level of detail provided during transfer can be set by the quiet argument and the internet.info option: the details depend on the platform and scheme. For the "libcurl" method values of the option less than 2 give verbose output.

A progress bar tracks the transfer platform-specifically:

On Windows

If the file length is known, the full width of the bar is the known length. Otherwise the initial width represents 100 Kbytes and is doubled whenever the current width is exceeded. (In non-interactive use this uses a text version. If the file length is known, an equals sign represents 2% of the transfer completed: otherwise a dot represents 10Kb.)

On a Unix-alike

If the file length is known, an equals sign represents 2% of the transfer completed: otherwise a dot represents 10Kb.

The choice of binary transfer (mode = "wb" or "ab") is important on Windows, since unlike Unix-alikes it does distinguish between text and binary files and for text transfers changes ‘\n’ line endings to ‘\r\n’ (aka ‘CRLF’).

On Windows, if mode is not supplied (missing()) and url ends in one of ‘.gz’, ‘.bz2’, ‘.xz’, ‘.tgz’, ‘.zip’, ‘.jar’, ‘.rda’, ‘.rds’, ‘.RData’ or ‘.pdf’, mode = "wb" is set so that a binary transfer is done to help unwary users.

Code written to download binary files must use mode = "wb" (or "ab"), but the problems incurred by a text transfer will only be seen on Windows.
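
A small sketch of a binary transfer, checking that the bytes survive unchanged. The payload and the local ‘file://’ source are assumptions made for illustration so that the example runs offline.

```r
# A few arbitrary bytes, including line-feed and carriage-return values
# that a text-mode transfer on Windows could alter.
payload <- as.raw(c(0x50, 0x4b, 0x0a, 0x00, 0x0d, 0x0a, 0xff))

src <- tempfile(fileext = ".bin")
writeBin(payload, src)
src_url <- paste0("file://", normalizePath(src, winslash = "/"))

# Request a binary transfer explicitly with mode = "wb".
dest <- tempfile(fileext = ".bin")
download.file(src_url, dest, mode = "wb", quiet = TRUE)

# Read the copy back and compare it byte for byte with the original.
copied <- readBin(dest, what = "raw", n = file.size(dest))
print(identical(copied, payload))
```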

Value

An (invisible) integer code, 0 for success and non-zero for failure. For the "wget" and "curl" methods this is the status code returned by the external program. The "internal" method can return 1, but will in most cases throw an error.

What happens to the destination file(s) in the case of error depends on the method and R version. Currently the "internal", "wininet" and "libcurl" methods will remove the file if the URL is unavailable except when mode specifies appending when the file should be unchanged.

Setting Proxies

For the Windows-only method "wininet", the ‘Internet Options’ of the system are used to choose proxies and so on; these are set in the Control Panel and are those used for system browsers.

For the "libcurl" and "curl" methods, proxies can be set via the environment variables http_proxy or ftp_proxy. See https://curl.se/libcurl/c/libcurl-tutorial.html for further details.

Secure URLs

Methods which access ‘https://’ and (where supported) ‘ftps://’ URLs should try to verify the site certificates. This is usually done using the CA root certificates installed by the OS (although we have seen instances in which these got removed rather than updated). For further information see https://curl.se/docs/sslcerts.html.

On Windows with method = "libcurl", the CA root certificates are provided by the OS when R was linked with libcurl with Schannel enabled, which is the current default in Rtools. This can be verified by checking that libcurlVersion() returns a version string containing ‘"Schannel"’. If it does not, for verification to be on the environment variable CURL_CA_BUNDLE must be set to a path to a certificate bundle file, usually named ‘ca-bundle.crt’ or ‘curl-ca-bundle.crt’. (This is normally done automatically for a binary installation of R, which installs ‘R_HOME/etc/curl-ca-bundle.crt’ and sets CURL_CA_BUNDLE to point to it if that environment variable is not already set.) For an updated certificate bundle, see https://curl.se/docs/sslcerts.html. Currently one can download a copy from https://raw.githubusercontent.com/bagder/ca-bundle/master/ca-bundle.crt and set CURL_CA_BUNDLE to the full path to the downloaded file.

On Windows with method = "libcurl", when R was linked with libcurl with Schannel enabled, the connection fails if it cannot be established that the certificate has not been revoked. Some MITM proxies present particularly in corporate environments do not work with this behavior. It can be changed by setting environment variable R_LIBCURL_SSL_REVOKE_BEST_EFFORT to TRUE, with the consequence of reducing security.

Note that the root certificates used by R may or may not be the same as used in a browser, and indeed different browsers may use different certificate bundles (there is typically a build option to choose either their own or the system ones).
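
The Schannel check described above could be sketched as follows. Inspecting the "ssl_version" attribute in addition to the version string itself is an assumption about where the SSL backend name is reported; the sketch only reads the current settings and changes nothing.

```r
# Query the libcurl build that R is using.
v <- libcurlVersion()

# Look for "Schannel" in the version string and, as an assumption,
# also in the "ssl_version" attribute if one is present.
ssl <- attr(v, "ssl_version")
uses_schannel <- any(grepl("Schannel", c(as.character(v), ssl), fixed = TRUE))
print(uses_schannel)

# If Schannel is not in use, CURL_CA_BUNDLE is expected to point at a
# certificate bundle file; NA here means the variable is not set.
ca_bundle <- Sys.getenv("CURL_CA_BUNDLE", unset = NA)
print(ca_bundle)
```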

Good practice

Setting the method should be left to the end user. Neither of the wget nor curl commands is widely available: you can check if one is available via Sys.which, and should do so in a package or script.
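
A brief sketch of such an availability check; the variable names are illustrative only.

```r
# Sys.which() returns the full path of each program found on the search
# path, or an empty string for those that are missing.
tools <- Sys.which(c("wget", "curl"))

# Keep the names of the tools that were actually found.
installed <- names(tools)[nzchar(tools)]
print(installed)

if (length(installed) == 0L) {
  message("Neither wget nor curl was found on the search path.")
}
```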

If you use download.file in a package or script, you must check the return value, since it is possible that the download will fail with a non-zero status but not an R error.
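
One way to cover both failure modes (an R error and a non-zero status) is a small wrapper. safe_download() is a hypothetical helper, not part of utils, and the local ‘file://’ URLs are stand-ins so that the sketch runs offline.

```r
# Hypothetical wrapper: TRUE only if download.file() neither throws an
# error nor returns a non-zero status.
safe_download <- function(url, destfile, ...) {
  status <- tryCatch(
    download.file(url, destfile, quiet = TRUE, ...),
    error = function(e) {
      message("download failed: ", conditionMessage(e))
      1L  # treat an R error like a non-zero status
    }
  )
  isTRUE(status == 0L)
}

# A local file that exists, and a path that does not.
src <- tempfile(fileext = ".txt")
writeLines("ok", src)
good_url <- paste0("file://", normalizePath(src, winslash = "/"))
bad_url  <- paste0(good_url, ".missing")

ok  <- safe_download(good_url, tempfile())
bad <- suppressWarnings(safe_download(bad_url, tempfile()))
print(c(ok = ok, bad = bad))
```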

The supported methods do change: method libcurl was introduced in R 3.2.0 and was optional on Windows until R 4.2.0; use capabilities("libcurl") in a program to see if it is available.

‘ftp://’ URLs

Most modern browsers do not support such URLs, and ‘https://’ ones are much preferred for use in R. ‘ftps://’ URLs have always been rare, and are nowadays even less supported.

It is intended that R will continue to allow such URLs for as long as libcurl does, but as they become rarer this is increasingly untested. What ‘protocols’ the version of libcurl being used supports can be seen by calling libcurlVersion().
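
For instance, the protocol list can be inspected programmatically. This sketch assumes the protocols are reported in the "protocols" attribute of the value returned by libcurlVersion().

```r
# The "protocols" attribute (assumed here) lists the schemes that the
# libcurl build in use supports; it may be absent if libcurl is unavailable.
protocols <- attr(libcurlVersion(), "protocols")
print(protocols)

# Check for FTP support before relying on an ftp:// URL.
supports_ftp <- "ftp" %in% protocols
print(supports_ftp)
```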

These URLs are accessed using the FTP protocol which has a number of variants. One distinction is between ‘active’ and ‘(extended) passive’ modes: which is used is chosen by the client. The "libcurl" method uses passive mode which was almost universally used by browsers before they dropped support altogether.

Note

Files of more than 2GB are supported on 64-bit builds of R; they may be truncated on some 32-bit builds.

Methods "wget" and "curl" are mainly for historical compatibility but may provide capabilities not supported by the "libcurl" or "wininet" methods.

Method "wget" can be used with proxy firewalls which require user/password authentication if proper values are stored in the configuration file for wget.

wget (https://www.gnu.org/software/wget/) is commonly installed on Unix-alikes (but not macOS). Windows binaries are available from MSYS2 and elsewhere.

curl (https://curl.se/) is installed on macOS and increasingly commonly on Unix-alikes. Windows binaries are available at that URL.

See Also

options to set the HTTPUserAgent, timeout and internet.info options used by some of the methods.

url for a finer-grained way to read data from URLs.

url.show, available.packages, download.packages for applications.

Contributed packages RCurl and curl provide more comprehensive facilities to download from URLs.

[Package utils version 4.4.1 Index]
