Python's urllib.request for HTTP Requests
If you need to make HTTP requests with Python, then you may find yourself directed to the brilliant requests library. Though it's a great library, you may have noticed that it's not a built-in part of Python. If you prefer, for whatever reason, to limit your dependencies and stick to standard-library Python, then you can reach for urllib.request!
In this tutorial, you'll:
- Learn how to make basic HTTP requests with urllib.request
- Dive into the nuts and bolts of an HTTP message and how urllib.request represents it
- Understand how to deal with character encodings of HTTP messages
- Explore some common errors when using urllib.request and learn how to resolve them
- Dip your toes into the world of authenticated requests with urllib.request
- Understand why both urllib and the requests library exist and when to use one or the other
If you've heard of HTTP requests, including GET and POST, then you're probably ready for this tutorial. Also, you should've already used Python to read and write to files, ideally with a context manager, at least once.
Ultimately, you'll find that making a request doesn't have to be a frustrating experience, although it does tend to have that reputation. Many of the issues that you tend to run into are due to the inherent complexity of this marvelous thing called the Internet. The good news is that the urllib.request module can help to demystify much of this complexity.
Basic HTTP GET Requests With urllib.request
Before diving into the deep end of what an HTTP request is and how it works, you're going to get your feet wet by making a basic GET request to a sample URL. You'll also make a GET request to a mock REST API for some JSON data. In case you're wondering about POST requests, you'll be covering them later in the tutorial, once you have some more knowledge of urllib.request.
To get started, you'll make a request to www.example.com, and the server will return an HTTP message. Ensure that you're using Python 3 or above, and then use the urlopen() function from urllib.request:
>>> from urllib.request import urlopen
>>> with urlopen("https://www.example.com") as response:
...     body = response.read()
...
>>> body[:15]
b'<!doctype html>'

In this example, you import urlopen() from urllib.request. Using the context manager with, you make a request and receive a response with urlopen(). Then you read the body of the response and close the response object. With that, you display the first fifteen positions of the body, noting that it looks like an HTML document.
There you are! You've successfully made a request, and you received a response. By inspecting the content, you can tell that it's likely an HTML document. Note that the printed output of the body is preceded by b. This indicates a bytes literal, which you may need to decode. Later in the tutorial, you'll learn how to turn bytes into a string, write them to a file, or parse them into a dictionary.
The process is only slightly different if you want to make calls to REST APIs to get JSON data. In the following example, you'll make a request to {JSON}Placeholder for some fake to-do data:
>>> from urllib.request import urlopen
>>> import json
>>> url = "https://jsonplaceholder.typicode.com/todos/1"
>>> with urlopen(url) as response:
...     body = response.read()
...
>>> todo_item = json.loads(body)
>>> todo_item
{'userId': 1, 'id': 1, 'title': 'delectus aut autem', 'completed': False}

In this example, you're doing pretty much the same as in the previous example. But in this one, you import urllib.request and json, using the json.loads() function with body to decode and parse the returned JSON bytes into a Python dictionary. Voila!
If you're lucky enough to be using error-free endpoints, such as the ones in these examples, then maybe the above is all that you need from urllib.request. Then again, you may find that it's not enough.
Now, before doing some urllib.request troubleshooting, you'll first gain an understanding of the underlying structure of HTTP messages and learn how urllib.request handles them. This understanding will provide a solid foundation for troubleshooting many different kinds of issues.
The Nuts and Bolts of HTTP Messages
To understand some of the issues that you may encounter when using urllib.request, you'll need to examine how a response is represented by urllib.request. To do that, you'll benefit from a high-level overview of what an HTTP message is, which is what you'll get in this section.
Before the high-level overview, a quick note on reference sources. If you want to get into the technical weeds, the Internet Engineering Task Force (IETF) has an extensive set of Request for Comments (RFC) documents. These documents end up becoming the actual specifications for things like HTTP messages. RFC 7230, Part 1: Message Syntax and Routing, for instance, is all about the HTTP message.
If you're looking for some reference material that's a bit easier to digest than RFCs, then the Mozilla Developer Network (MDN) has a great range of reference articles. For example, their article on HTTP messages, while still technical, is a lot more digestible.
Now that you know about these essential sources of reference information, in the next section you'll get a beginner-friendly overview of HTTP messages.
Understanding What an HTTP Message Is
In a nutshell, an HTTP message can be understood as text, transmitted as a stream of bytes, structured to follow the guidelines specified by RFC 7230. A decoded HTTP message can be as simple as two lines:

GET / HTTP/1.1
Host: www.google.com

This specifies a GET request at the root (/) using the HTTP/1.1 protocol. The one and only header required is the host, www.google.com. The target server has enough information to make a response with this data.
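To make that structure concrete, here's a short sketch that assembles the same two-line message as Python text. This is purely illustrative: urllib.request builds an equivalent message for you, so you never need to do this by hand.

```python
# Assemble the minimal GET request above by hand, just to illustrate
# that an HTTP message is plain text with CRLF line endings.
request_line = "GET / HTTP/1.1"
headers = {"Host": "www.google.com"}

raw_message = "\r\n".join(
    [request_line]
    + [f"{name}: {value}" for name, value in headers.items()]
    + ["", ""]  # the empty strings produce the blank line that ends the headers
)

print(raw_message.encode("ascii"))
```

Sending those bytes over a TCP connection to the server is all it takes to make the request; everything else is bookkeeping that urllib.request handles for you.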
A response is similar in structure to a request. HTTP messages have two main parts, the metadata and the body. In the request example above, the message is all metadata with no body. The response, on the other hand, does have two parts:

HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1
Server: gws
(... other headers ...)

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" ...

The response starts with a status line that specifies the HTTP protocol HTTP/1.1 and the status 200 OK. After the status line, you get many key-value pairs, such as Server: gws, representing all the response headers. This is the metadata of the response.
After the metadata, there's a blank line, which serves as the divider between the headers and the body. Everything that follows the blank line makes up the body. This is the part that gets read when you're using urllib.request.
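Because the blank line is the divider, you can split a raw response into metadata and body with ordinary bytes operations. The response below is a made-up example, and urllib.request does this parsing for you, but the sketch shows how simple the rule is:

```python
# A hypothetical raw response: headers and body separated by a blank line.
raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=ISO-8859-1\r\n"
    b"Server: gws\r\n"
    b"\r\n"
    b"<!doctype html><html>...</html>"
)

# Partition on the first blank line: everything before it is metadata,
# everything after it is the body.
metadata, _, body = raw_response.partition(b"\r\n\r\n")

print(metadata.decode("ascii").splitlines()[0])  # the status line
print(body)
```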
You can assume that all HTTP messages follow these specifications, but it's possible that some may break these rules or follow an older specification. It's exceptionally rare for this to cause any problems, though. So, just keep it in the back of your mind in case you encounter a strange bug!
In the next section, you'll see how urllib.request deals with raw HTTP messages.
Understanding How urllib.request Represents an HTTP Message
The primary representation of an HTTP message that you'll be interacting with when using urllib.request is the HTTPResponse object. The urllib.request module itself depends on the low-level http module, which you don't need to interact with directly. You do end up using some of the data structures that http provides, though, such as HTTPResponse and HTTPMessage.
When you make a request with urllib.request.urlopen(), you get an HTTPResponse object in return. Spend some time exploring the HTTPResponse object with pprint() and dir() to see all the different methods and properties that belong to it:
>>> from urllib.request import urlopen
>>> from pprint import pprint
>>> with urlopen("https://www.example.com") as response:
...     pprint(dir(response))
...

The output of this code snippet is the following list:

['__abstractmethods__', '__class__', '__del__', '__delattr__',
 '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__',
 '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__',
 '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__',
 '__module__', '__ne__', '__new__', '__next__', '__reduce__',
 '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__',
 '__subclasshook__', '_abc_impl', '_checkClosed', '_checkReadable',
 '_checkSeekable', '_checkWritable', '_check_close', '_close_conn',
 '_get_chunk_left', '_method', '_peek_chunked', '_read1_chunked',
 '_read_and_discard_trailer', '_read_next_chunk_size', '_read_status',
 '_readall_chunked', '_readinto_chunked', '_safe_read',
 '_safe_readinto', 'begin', 'chunk_left', 'chunked', 'close',
 'closed', 'code', 'debuglevel', 'detach', 'fileno', 'flush', 'fp',
 'getcode', 'getheader', 'getheaders', 'geturl', 'headers', 'info',
 'isatty', 'isclosed', 'length', 'msg', 'peek', 'read', 'read1',
 'readable', 'readinto', 'readinto1', 'readline', 'readlines',
 'reason', 'seek', 'seekable', 'status', 'tell', 'truncate', 'url',
 'version', 'will_close', 'writable', 'write', 'writelines']

That's a lot of methods and properties, but you'll only end up using a handful of these. Apart from .read(), the important ones usually involve getting information about the headers.
One way to inspect all the headers is to access the .headers attribute of the HTTPResponse object. This will return an HTTPMessage object. Conveniently, you can treat an HTTPMessage like a dictionary by calling .items() on it to get all the headers as tuples:

>>> with urlopen("https://www.example.com") as response:
...     pass
...
>>> response.headers
<http.client.HTTPMessage object at 0x000001E029D9F4F0>
>>> pprint(response.headers.items())
[('Accept-Ranges', 'bytes'),
 ('Age', '398424'),
 ('Cache-Control', 'max-age=604800'),
 ('Content-Type', 'text/html; charset=UTF-8'),
 ('Date', 'Tue, 25 Jan 2022 12:18:53 GMT'),
 ('Etag', '"3147526947"'),
 ('Expires', 'Tue, 01 Feb 2022 12:18:53 GMT'),
 ('Last-Modified', 'Thu, 17 Oct 2019 07:18:26 GMT'),
 ('Server', 'ECS (nyb/1D16)'),
 ('Vary', 'Accept-Encoding'),
 ('X-Cache', 'HIT'),
 ('Content-Length', '1256'),
 ('Connection', 'close')]

Now you have access to all the response headers! You probably won't need most of this information, but rest assured that some applications do use it. For example, your browser might use the headers to read the response, set cookies, and determine an appropriate cache lifetime.
There are convenience methods to get the headers from an HTTPResponse object because it's quite a common operation. You can call .getheaders() directly on the HTTPResponse object, which will return exactly the same list of tuples as above. If you're only interested in one header, say the Server header, then you can use the singular .getheader("Server") on HTTPResponse or use the square bracket ([]) syntax on .headers from HTTPMessage:

>>> response.getheader("Server")
'ECS (nyb/1D16)'
>>> response.headers["Server"]
'ECS (nyb/1D16)'

Truth be told, you probably won't need to interact with the headers directly like this. The information that you're most likely to need will probably already have some built-in helper methods, but now you know, in case you ever need to dig deeper!
Closing an HTTPResponse
The HTTPResponse object has a lot in common with the file object. The HTTPResponse class inherits from the IOBase class, as do file objects, which means that you have to be mindful of opening and closing.
In simple programs, you're not likely to notice any issues if you forget to close HTTPResponse objects. For more complex projects, though, this can significantly slow execution and cause bugs that are difficult to pinpoint.
Problems arise because input/output (I/O) streams are limited. Each HTTPResponse requires a stream to be kept clear while it's being read. If you never close your streams, this will eventually prevent any other stream from being opened, and it might interfere with other programs or even your operating system.
So, make sure you close your HTTPResponse objects! For your convenience, you can use a context manager, as you've seen in the examples. You can also achieve the same result by explicitly calling .close() on the response object:
>>> from urllib.request import urlopen
>>> response = urlopen("https://www.example.com")
>>> body = response.read()
>>> response.close()

In this example, you don't use a context manager, but instead close the response stream explicitly. The above example still has an issue, though, because an exception may be raised before the call to .close(), preventing the proper teardown. To make this call unconditional, as it should be, you can use a try … except block with both an else and a finally clause:
>>> from urllib.request import urlopen
>>> response = None
>>> try:
...     response = urlopen("https://www.example.com")
... except Exception as ex:
...     print(ex)
... else:
...     body = response.read()
... finally:
...     if response is not None:
...         response.close()

In this example, you achieve an unconditional call to .close() by using the finally block, which will always run regardless of exceptions raised. The code in the finally block first checks if the response object exists with is not None, and then closes it.
That said, this is exactly what a context manager does, and the with syntax is generally preferred. Not only is the with syntax less verbose and more readable, but it also protects you from pesky errors of omission. Put another way, it's a far better guard against accidentally forgetting to close the object:

>>> from urllib.request import urlopen
>>> with urlopen("https://www.example.com") as response:
...     response.read(50)
...     response.read(50)
...
b'<!doctype html>\n<html>\n<head>\n    <title>Example D'
b'omain</title>\n\n    <meta charset="utf-8" />\n    <m'

In this example, you import urlopen() from the urllib.request module. You use the with keyword with urlopen() to assign the HTTPResponse object to the variable response. Then, you read the first fifty bytes of the response and then read the following fifty bytes, all within the with block. Finally, you close the with block, which executes the request and runs the lines of code within its block.
With this code, you cause two sets of fifty bytes each to be displayed. The HTTPResponse object will close once you exit the with block scope, meaning that the .read() method will only return empty bytes objects:

>>> import urllib.request
>>> with urllib.request.urlopen("https://www.example.com") as response:
...     response.read(50)
...
b'<!doctype html>\n<html>\n<head>\n    <title>Example D'
>>> response.read(50)
b''

In this example, the second call to read fifty bytes is outside the with scope. Being outside the with block means that HTTPResponse is closed, even though you can still access the variable. If you try to read from HTTPResponse when it's closed, it'll return an empty bytes object.
Another point to note is that you can't reread a response once you've read all the way to the end:

>>> import urllib.request
>>> with urllib.request.urlopen("https://www.example.com") as response:
...     first_read = response.read()
...     second_read = response.read()
...
>>> len(first_read)
1256
>>> len(second_read)
0

This example shows that once you've read a response, you can't read it again. If you've fully read the response, the subsequent attempt just returns an empty bytes object even though the response isn't closed. You'd have to make the request again.
In this regard, the response is different from a file object, because with a file object, you can read it multiple times by using the .seek() method, which HTTPResponse doesn't support. Even after closing a response, you can still access the headers and other metadata, though.
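You can see the contrast with an in-memory file object, which does support rewinding. This sketch uses io.BytesIO as a stand-in for a file on disk; an HTTPResponse reports itself as not seekable, so the same trick isn't available there:

```python
import io

# A file-like object can be rewound and reread with .seek():
buffer = io.BytesIO(b"<!doctype html>")
first_read = buffer.read()
buffer.seek(0)  # rewind to the start
second_read = buffer.read()

print(first_read == second_read)  # both reads return the full content
print(buffer.seekable())          # file-like objects support seeking
```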
Exploring Text, Octets, and Bits
In most of the examples so far, you read the response body from HTTPResponse, displayed the resulting data immediately, and noted that it was displayed as a bytes object. This is because text information in computers isn't stored or transmitted as letters, but as bytes!
A raw HTTP message sent over the wire is broken up into a sequence of bytes, sometimes referred to as octets. Bytes are 8-bit chunks. For example, 01010101 is a byte. To learn more about binary, bits, and bytes, check out Bitwise Operators in Python.
So how do you represent letters with bytes? A byte has 256 potential combinations, and you can assign a letter to each combination. You can assign 00000001 to A, 00000010 to B, and so on. ASCII character encoding, which is quite common, uses this type of system to encode 128 characters, which is enough for a language like English. This is particularly convenient because just one byte can represent all the characters, with space to spare.
All the standard English characters, including capitals, punctuation, and numerals, fit within ASCII. On the other hand, Japanese is thought to have around fifty thousand logographic characters, so 128 characters won't cut it! Even the 256 characters that are theoretically available within one byte wouldn't be nearly enough for Japanese. So, to accommodate all the world's languages, there are many different systems to encode characters.
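You can watch the one-byte-per-character mapping in action with Python's built-in encoding tools. A quick sketch:

```python
# Each ASCII character occupies exactly one byte.
text = "ABC"
encoded = text.encode("ascii")

print(len(encoded))  # one byte per character
print(ord("A"))      # the number assigned to "A" (65)
print([format(byte, "08b") for byte in encoded])  # the raw bit patterns
```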
Even though there are many systems, one thing you can rely on is the fact that they'll always be broken up into bytes. Most servers, if they can't resolve the MIME type and character encoding, default to application/octet-stream, which literally means a stream of bytes. Then whoever receives the message can work out the character encoding.
Dealing With Character Encodings
Problems often arise because, as you may have guessed, there are many, many different potential character encodings. The dominant character encoding today is UTF-8, which is an implementation of Unicode. Luckily, ninety-eight percent of web pages today are encoded in UTF-8!
UTF-8 is dominant because it can efficiently handle a mind-boggling number of characters. It handles all the 1,112,064 potential characters defined by Unicode, encompassing Chinese, Japanese, Arabic (with right-to-left scripts), Russian, and many more character sets, including emojis!
UTF-8 remains efficient because it uses a variable number of bytes to encode characters, which means that for many characters, it only requires one byte, while for others it can require up to four bytes.
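You can verify the variable width yourself by encoding characters from different scripts and counting the resulting bytes:

```python
# UTF-8 spends between one and four bytes per character, depending on
# the code point: ASCII stays at one byte, while emoji need four.
for character in ["a", "é", "€", "🐍"]:
    encoded = character.encode("utf-8")
    print(character, len(encoded))
```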
While UTF-8 is dominant, and you usually won't go wrong with assuming UTF-8 encodings, you'll still run into different encodings all the time. The good news is that you don't need to be an expert on encodings to handle them when using urllib.request.
Going From Bytes to Strings
When you use urllib.request.urlopen(), the body of the response is a bytes object. The first thing you may want to do is to convert the bytes object to a string. Perhaps you want to do some web scraping. To do this, you need to decode the bytes. To decode the bytes with Python, all you need to find out is the character encoding used. Encoding, especially when referring to character encoding, is often referred to as a character set.
As mentioned, ninety-eight percent of the time, you'll probably be safe defaulting to UTF-8:
>>> from urllib.request import urlopen
>>> with urlopen("https://www.example.com") as response:
...     body = response.read()
...
>>> decoded_body = body.decode("utf-8")
>>> print(decoded_body[:30])
<!doctype html>
<html>
<head>

In this example, you take the bytes object returned from response.read() and decode it with the bytes object's .decode() method, passing in utf-8 as an argument. When you print decoded_body, you can see that it's now a string.
That said, leaving it up to chance is rarely a good strategy. Fortunately, headers are a great place to get character set information:

>>> from urllib.request import urlopen
>>> with urlopen("https://www.example.com") as response:
...     body = response.read()
...
>>> character_set = response.headers.get_content_charset()
>>> character_set
'utf-8'
>>> decoded_body = body.decode(character_set)
>>> print(decoded_body[:30])
<!doctype html>
<html>
<head>

In this example, you call .get_content_charset() on the .headers object of response and use that to decode. This is a convenience method that parses the Content-Type header so that you can painlessly decode bytes into text.
Going From Bytes to File
If you want to decode bytes into text, now you're good to go. But what if you want to write the body of a response into a file? Well, you have two options:
- Write the bytes directly to the file
- Decode the bytes into a Python string, and then encode the string back into a file
The first method is the most straightforward, but the second method allows you to change the encoding if you want to. To learn about file manipulation in more detail, have a look at Real Python's Reading and Writing Files in Python (Guide).
To write the bytes directly to a file without having to decode, you'll need the built-in open() function, and you'll need to ensure that you use write binary mode:
>>> from urllib.request import urlopen
>>> with urlopen("https://www.example.com") as response:
...     body = response.read()
...
>>> with open("example.html", mode="wb") as html_file:
...     html_file.write(body)
...
1256

Using open() in wb mode bypasses the need to decode or encode and dumps the bytes of the HTTP message body into the example.html file. The number that's output after the writing operation indicates the number of bytes that have been written. That's it! You've written the bytes directly to a file without encoding or decoding anything.
Now say you have a URL that doesn't use UTF-8, but you want to write the contents to a file with UTF-8. For this, you'd first decode the bytes into a string and then encode the string into a file, specifying the character encoding.
Google's home page seems to use different encodings depending on your location. In much of Europe and the US, it uses the ISO-8859-1 encoding:

>>> from urllib.request import urlopen
>>> with urlopen("https://www.google.com") as response:
...     body = response.read()
...
>>> character_set = response.headers.get_content_charset()
>>> character_set
'ISO-8859-1'
>>> content = body.decode(character_set)
>>> with open("google.html", encoding="utf-8", mode="w") as file:
...     file.write(content)
...
14066

In this code, you got the response character set and used it to decode the bytes object into a string. Then you wrote the string to a file, encoding it using UTF-8.
Once you've written to a file, you should be able to open the resulting file in your browser or text editor. Most modern text processors can detect the character encoding automatically.
If there are encoding errors and you're using Python to read a file, then you'll likely get an error:

>>> with open("encoding-error.html", mode="r", encoding="utf-8") as file:
...     lines = file.readlines()
...
UnicodeDecodeError: 'utf-8' codec can't decode byte

Python explicitly stops the process and raises an exception, but in a program that displays text, such as the browser where you're viewing this page, you may find the infamous replacement characters:
The black rhombus with a white question mark (�), the square (□), and the rectangle (▯) are often used as replacements for characters which couldn't be decoded.
Sometimes, decoding seems to work but results in unintelligible sequences, such as 文å—化ã', which also suggests the wrong character set was used. In Japan, they even have a word for text that's garbled due to character-encoding issues, mojibake, because these issues plagued them at the start of the Internet age.
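You can reproduce both failure modes locally. This sketch takes UTF-8 bytes and decodes them with deliberately wrong character sets:

```python
text = "文字化け"  # the Japanese word for garbled text
encoded = text.encode("utf-8")

# Decoding UTF-8 bytes as Latin-1 never raises, because every byte is a
# valid Latin-1 character, but the result is mojibake:
mojibake = encoded.decode("latin-1")
print(mojibake)

# Decoding as ASCII with errors="replace" swaps in replacement characters
# (the black rhombus you see in browsers) instead of raising:
print(encoded.decode("ascii", errors="replace"))
```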
With that, you should now be equipped to write files with the raw bytes returned from urlopen(). In the next section, you'll learn how to parse bytes into a Python dictionary with the json module.
Going From Bytes to Dictionary
For application/json responses, you'll often find that they don't include any encoding information:
>>> from urllib.request import urlopen
>>> with urlopen("https://httpbin.org/json") as response:
...     body = response.read()
...
>>> character_set = response.headers.get_content_charset()
>>> print(character_set)
None

In this example, you use the json endpoint of httpbin, a service that allows you to experiment with different types of requests and responses. The json endpoint simulates a typical API that returns JSON data. Note that the .get_content_charset() method returns None in its response.
Even though there's no character encoding information, all is not lost. According to RFC 4627, the default encoding of UTF-8 is an absolute requirement of the application/json specification. That's not to say that every single server plays by the rules, but generally, you can assume that if JSON is being transmitted, it'll almost always be encoded using UTF-8.
Fortunately, json.loads() decodes byte objects under the hood and even has some leeway in terms of different encodings that it can deal with. So, json.loads() should be able to cope with most bytes objects that you throw at it, as long as they're valid JSON:
>>> import json
>>> json.loads(body)
{'slideshow': {'author': 'Yours Truly',
               'date': 'date of publication',
               'slides': [{'title': 'Wake up to WonderWidgets!', 'type': 'all'},
                          {'items': ['Why <em>WonderWidgets</em> are great',
                                     'Who <em>buys</em> WonderWidgets'],
                           'title': 'Overview',
                           'type': 'all'}],
               'title': 'Sample Slide Show'}}

As you can see, the json module handles the decoding automatically and produces a Python dictionary. Almost all APIs return key-value data as JSON, although you might encounter some older APIs that work with XML. For that, you might want to look into the Roadmap to XML Parsers in Python.
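That leeway extends to the other Unicode encodings anticipated by the JSON RFC. As a small offline sketch with a made-up payload, json.loads() detects UTF-8, UTF-16, and UTF-32 bytes automatically before parsing:

```python
import json

payload = '{"userId": 1, "id": 1, "completed": false}'  # hypothetical payload

# json.loads() sniffs the encoding of a bytes object before parsing,
# so all three of these round-trips produce the same dictionary.
for encoding in ("utf-8", "utf-16", "utf-32"):
    data = json.loads(payload.encode(encoding))
    print(encoding, data["id"], data["completed"])
```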
With that, you should know enough about bytes and encodings to be dangerous! In the next section, you'll learn how to troubleshoot and fix a couple of common errors that you might encounter when using urllib.request.
Common urllib.request Troubles
There are many kinds of issues you can encounter on the world wide web, whether you're using urllib.request or not. In this section, you'll learn how to deal with a couple of the most common errors when getting started out: 403 errors and TLS/SSL certificate errors. Before looking at these specific errors, though, you'll first learn how to implement error handling more generally when using urllib.request.
Implementing Error Handling
Before you turn your attention to specific errors, boosting your code's ability to gracefully deal with assorted errors will pay off. Web development is plagued with errors, and you can invest a lot of time in handling errors sensibly. Here, you'll learn to handle HTTP, URL, and timeout errors when using urllib.request.
HTTP status codes accompany every response in the status line. If you can read a status code in the response, then the request reached its target. While this is good, you can only consider the request a complete success if the response code starts with a 2. For example, 200 and 201 represent successful requests. If the status code is 404 or 500, for example, something went wrong, and urllib.request will raise an HTTPError.
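If you ever need to branch on those ranges yourself, a small helper can classify codes. The function below is an illustrative sketch, not part of urllib.request; Python's http.HTTPStatus enum also carries the standard reason phrases:

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Classify an HTTP status code by its hundreds range (illustrative)."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "non-standard"

print(describe_status(200), describe_status(404), describe_status(500))
print(HTTPStatus(404).phrase)  # the standard reason phrase for 404
```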
Sometimes mistakes happen, and the URL provided isn't correct, or a connection can't be made for another reason. In these cases, urllib.request will raise a URLError.
Finally, sometimes servers just don't respond. Maybe your network connection is slow, the server is down, or the server is programmed to ignore specific requests. To deal with this, you can pass a timeout argument to urlopen() to raise a TimeoutError after a certain amount of time.
The first step in handling these exceptions is to catch them. You can catch errors produced within urlopen() with a try … except block, making use of the HTTPError, URLError, and TimeoutError classes:
# request.py

from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def make_request(url):
    try:
        with urlopen(url, timeout=10) as response:
            print(response.status)
            return response.read(), response
    except HTTPError as error:
        print(error.status, error.reason)
    except URLError as error:
        print(error.reason)
    except TimeoutError:
        print("Request timed out")

The function make_request() takes a URL string as an argument, tries to get a response from that URL with urllib.request, and catches the HTTPError object that's raised if an error occurs. If the URL is bad, it'll catch a URLError. If it goes through without any errors, it'll just print the status and return a tuple containing the body and the response. The response will close after return.
The function also calls urlopen() with a timeout argument, which will cause a TimeoutError to be raised after the seconds specified. Ten seconds is generally a good amount of time to wait for a response, though as always, much depends on the server that you need to make the request to.
Now you're ready to gracefully handle a variety of errors, including but not limited to the errors that you'll cover next.
Dealing With 403 Errors
You'll now use the make_request() function to make some requests to httpstat.us, which is a mock server used for testing. This mock server will return responses that have the status code you request. If you make a request to https://httpstat.us/200, for example, you should expect a 200 response.
APIs like httpstat.us are used to ensure that your application can handle all the different status codes it might encounter. httpbin also has this functionality, but httpstat.us has a more comprehensive selection of status codes. It even has the infamous and semi-official 418 status code that returns the message I'm a teapot!
To interact with the make_request() function that you wrote in the previous section, run the script in interactive mode:

$ python -i request.py

With the -i flag, this command will run the script in interactive mode. This means that it'll execute the script and then open the Python REPL afterward, so you can now call the function that you just defined:
>>> make_request("https://httpstat.us/200")
200
(b'200 OK', <http.client.HTTPResponse object at 0x0000023D612660B0>)
>>> make_request("https://httpstat.us/403")
403 Forbidden

Here you tried the 200 and 403 endpoints of httpstat.us. The 200 endpoint goes through as expected and returns the body of the response and the response object. The 403 endpoint just printed the error message and didn't return anything, also as expected.
The 403 status means that the server understood the request but won't fulfill it. This is a common error that you can run into, especially while web scraping. In many cases, you can solve it by passing a User-Agent header.
One of the primary ways that servers identify who or what is making the request is by examining the User-Agent header. The raw default request sent by urllib.request is the following:
```
GET https://httpstat.us/403 HTTP/1.1
Accept-Encoding: identity
Host: httpstat.us
User-Agent: Python-urllib/3.10
Connection: close
```

Notice that User-Agent is listed as Python-urllib/3.10. You may find that some sites will try to block web scrapers, and this User-Agent is a dead giveaway. With that said, you can set your own User-Agent with urllib.request, though you'll need to modify your function a little:
```
 # request.py

 from urllib.error import HTTPError, URLError
-from urllib.request import urlopen
+from urllib.request import urlopen, Request

-def make_request(url):
+def make_request(url, headers=None):
+    request = Request(url, headers=headers or {})
     try:
-        with urlopen(url, timeout=10) as response:
+        with urlopen(request, timeout=10) as response:
             print(response.status)
             return response.read(), response
     except HTTPError as error:
         print(error.status, error.reason)
     except URLError as error:
         print(error.reason)
     except TimeoutError:
         print("Request timed out")
```

To customize the headers that you send out with your request, you first have to instantiate a Request object with the URL. Additionally, you can pass in a keyword argument of headers, which accepts a standard dictionary representing any headers you wish to include. Then, instead of passing the URL string directly into urlopen(), you pass this Request object that's been instantiated with the URL and headers.
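You can also inspect a Request object before sending anything, which is handy for checking what urlopen() will transmit. A small offline sketch — note that urllib stores header names with only the first letter capitalized, so the lookup key is User-agent:

```python
from urllib.request import Request

# Build the request without sending it:
request = Request(
    "https://httpstat.us/403",
    headers={"User-Agent": "Real Python"},
)

print(request.full_url)                  # https://httpstat.us/403
print(request.get_method())              # GET, since no data is attached
print(request.get_header("User-agent"))  # Real Python
```

Nothing goes over the wire until the Request object is handed to urlopen().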
To use this revamped function, restart the interactive session, then call make_request() with a dictionary representing the headers as an argument:
```
>>> body, response = make_request(
...     "https://www.httpbin.org/user-agent",
...     {"User-Agent": "Real Python"}
... )
200
>>> body
b'{\n  "user-agent": "Real Python"\n}\n'
```

In this example, you make a request to httpbin. Here you use the user-agent endpoint to return the request's User-Agent value. Because you made the request with a custom user agent of Real Python, this is what gets returned.
Some servers are strict, though, and will only accept requests from specific browsers. Luckily, it's possible to find standard User-Agent strings on the web, including through a user agent database. They're just strings, so all you need to do is copy the user agent string of the browser that you want to impersonate and use it as the value of the User-Agent header.
Fixing the SSL CERTIFICATE_VERIFY_FAILED Error
Another common error is due to Python not being able to access the required security certificate. To simulate this error, you can use some mock sites that have known bad SSL certificates, provided by badssl.com. You can make a request to one of them, such as superfish.badssl.com, and experience the error firsthand:
```
>>> from urllib.request import urlopen
>>> urlopen("https://superfish.badssl.com/")
Traceback (most recent call last):
  (...)
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED]
    certificate verify failed: unable to get local issuer certificate (_ssl.c:997)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  (...)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED]
    certificate verify failed: unable to get local issuer certificate (_ssl.c:997)>
```

Here, making a request to an address with a known bad SSL certificate will result in CERTIFICATE_VERIFY_FAILED, which is a type of URLError.
SSL stands for Secure Sockets Layer. This is something of a misnomer because SSL was deprecated in favor of TLS, Transport Layer Security. Sometimes old terminology just sticks! It's a way to encrypt network traffic so that a hypothetical listener can't eavesdrop on the information transmitted over the wire.
These days, most website addresses are preceded not by http:// but by https://, with the s standing for secure. HTTPS connections must be encrypted with TLS. urllib.request can handle both HTTP and HTTPS connections.
The details of HTTPS are far beyond the scope of this tutorial, but you can think of an HTTPS connection as involving two stages, the handshake and the transfer of data. The handshake ensures that the connection is secure. For more information about Python and HTTPS, check out Exploring HTTPS With Python.
To establish that a particular server is secure, programs that make requests rely on a store of trusted certificates. The server's certificate is verified during the handshake stage. Python uses the operating system's store of certificates. If Python can't find the system's store of certificates, or if the store is out of date, then you'll run into this error.
Sometimes the store of certificates that Python can access is out of date, or Python can't reach it, for whatever reason. This is frustrating because you can sometimes visit the URL from your browser, which thinks that it's secure, yet urllib.request still raises this error.
You may be tempted to opt out of verifying the certificate, but this will render your connection insecure and is definitely not recommended:
```
>>> import ssl
>>> from urllib.request import urlopen
>>> unverified_context = ssl._create_unverified_context()
>>> urlopen("https://superfish.badssl.com/", context=unverified_context)
<http.client.HTTPResponse object at 0x00000209CBE8F220>
```

Here you import the ssl module, which allows you to create an unverified context. You can then pass this context to urlopen() and visit a known bad SSL certificate. The connection successfully goes through because the SSL certificate isn't checked.
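You can see the difference between these contexts without making any requests at all. This quick offline sketch compares the default, verifying context with the unverified one:

```python
import ssl

# The default context requires a valid certificate and a matching hostname:
default_context = ssl.create_default_context()
print(default_context.verify_mode == ssl.CERT_REQUIRED)  # True
print(default_context.check_hostname)                    # True

# The private helper disables both checks, which is why it's insecure:
unverified_context = ssl._create_unverified_context()
print(unverified_context.verify_mode == ssl.CERT_NONE)   # True
print(unverified_context.check_hostname)                 # False
```

The leading underscore in _create_unverified_context() is Python's way of signaling that this helper isn't meant for casual use.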
Before resorting to these desperate measures, try updating your OS or updating your Python version. If that fails, then you can take a page from the requests library and install certifi:

```
$ python -m pip install certifi
```
certifi is a collection of certificates that you can use instead of your system's collection. You do this by creating an SSL context with the certifi bundle of certificates instead of the OS's bundle:
```
>>> import ssl
>>> from urllib.request import urlopen
>>> import certifi
>>> certifi_context = ssl.create_default_context(cafile=certifi.where())
>>> urlopen("https://sha384.badssl.com/", context=certifi_context)
<http.client.HTTPResponse object at 0x000001C7407C3490>
```

In this example, you used certifi to act as your SSL certificate store, and you used it to successfully connect to a site with a known good SSL certificate. Note that instead of ._create_unverified_context(), you use .create_default_context().
This way, you can stay secure without too much trouble! In the next section, you'll be dipping your toes into the world of authentication.
Authenticated Requests
Authentication is a vast subject, and if you're dealing with authentication much more complicated than what's covered here, this might be a good jumping-off point into the requests package.
In this tutorial, you'll only cover one authentication method, which serves as an example of the type of adjustments that you have to make to authenticate your requests. urllib.request does have a lot of other functionality that helps with authentication, but that won't be covered in this tutorial.
One of the most common authentication tools is the bearer token, specified by RFC 6750. It's often used as part of OAuth, but can also be used in isolation. It's also most common to see as a header, which you can use with your current make_request() function:
```
>>> token = "abcdefghijklmnopqrstuvwxyz"
>>> headers = {
...     "Authorization": f"Bearer {token}"
... }
>>> make_request("https://httpbin.org/bearer", headers)
200
(b'{\n  "authenticated": true, \n  "token": "abcdefghijklmnopqrstuvwxyz"\n}\n', <http.client.HTTPResponse object at 0x0000023D612642E0>)
```

In this example, you make a request to the httpbin /bearer endpoint, which simulates bearer authentication. It'll accept any string as a token. It just requires the proper format specified by RFC 6750. The name has to be Authorization, or sometimes the lowercase authorization, and the value has to be Bearer, with a single space between that and the token.
Congratulations, you've successfully authenticated, using a bearer token!
Another form of authentication is called Basic Access Authentication, which is a very simple method of authentication, only slightly better than sending a username and password in a header. It's very insecure!
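For reference, Basic auth is just the string username:password encoded with base64 in an Authorization header, per RFC 7617, which is why it's only safe over an encrypted connection. A minimal sketch, using made-up credentials:

```python
import base64

# RFC 7617 Basic auth: base64-encode "username:password". Note that base64
# is a reversible encoding, not encryption -- anyone can decode it.
credentials = base64.b64encode(b"user:passwd").decode("ascii")
headers = {"Authorization": f"Basic {credentials}"}
print(headers["Authorization"])  # Basic dXNlcjpwYXNzd2Q=
```

A dictionary like this could be passed straight to make_request() as its headers argument.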
One of the most common protocols in use today is OAuth (Open Authorization). If you've ever used Google, GitHub, or Facebook to sign into another website, then you've used OAuth. The OAuth flow generally involves a few requests between the service that you want to interact with and an identity server, resulting in a short-lived bearer token. This bearer token can then be used for a period of time with bearer authentication.
Much of authentication comes down to understanding the specific protocol that the target server uses and reading the documentation closely to get it working.
POST Requests With urllib.request
You've made a lot of GET requests, but sometimes you want to send data. That's where POST requests come in. To make POST requests with urllib.request, you don't have to explicitly change the method. You can just pass a data object to a new Request object or directly to urlopen(). The data object must be in a special format, though. You'll adapt your make_request() function slightly to support POST requests by adding the data parameter:
```
 # request.py

 from urllib.error import HTTPError, URLError
 from urllib.request import urlopen, Request

-def make_request(url, headers=None):
+def make_request(url, headers=None, data=None):
-    request = Request(url, headers=headers or {})
+    request = Request(url, headers=headers or {}, data=data)
     try:
         with urlopen(request, timeout=10) as response:
             print(response.status)
             return response.read(), response
     except HTTPError as error:
         print(error.status, error.reason)
     except URLError as error:
         print(error.reason)
     except TimeoutError:
         print("Request timed out")
```

Here you just modified the function to accept a data argument with a default value of None, and you passed that right into the Request instantiation. That's not all that needs to be done, though. You can use one of two different formats to execute a POST request:
- Form data: application/x-www-form-urlencoded
- JSON: application/json
The first format is the oldest format for POST requests and involves encoding the data with percent encoding, also known as URL encoding. You may have noticed key-value pairs URL encoded as a query string. Keys are separated from values with an equal sign (=), key-value pairs are separated with an ampersand (&), and spaces are generally suppressed but can be replaced with a plus sign (+).
If you're starting off with a Python dictionary, to use the form data format with your make_request() function, you'll need to encode twice:

- Once to URL encode the dictionary
- Then again to encode the resulting string into bytes
For the first stage of URL encoding, you'll use another urllib module, urllib.parse. Remember to start your script in interactive mode so that you can use the make_request() function and play with it on the REPL:
```
>>> from urllib.parse import urlencode
>>> post_dict = {"Title": "Hello World", "Name": "Real Python"}
>>> url_encoded_data = urlencode(post_dict)
>>> url_encoded_data
'Title=Hello+World&Name=Real+Python'
>>> post_data = url_encoded_data.encode("utf-8")
>>> body, response = make_request(
...     "https://httpbin.org/anything", data=post_data
... )
200
>>> print(body.decode("utf-8"))
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "Name": "Real Python",
    "Title": "Hello World"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "34",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.10",
    "X-Amzn-Trace-Id": "Root=1-61f25a81-03d2d4377f0abae95ff34096"
  },
  "json": null,
  "method": "POST",
  "origin": "86.159.145.119",
  "url": "https://httpbin.org/anything"
}
```

In this example, you:
- Import urlencode() from the urllib.parse module
- Initialize your POST data, starting with a dictionary
- Use the urlencode() function to encode the dictionary
- Encode the resulting string into bytes using UTF-8 encoding
- Make a request to the anything endpoint of httpbin.org
- Print the UTF-8 decoded response body
UTF-8 encoding is part of the specification for the application/x-www-form-urlencoded type. UTF-8 is used preemptively to decode the body because you already know that httpbin.org reliably uses UTF-8.
The anything endpoint from httpbin acts as a sort of echo, returning all the information it received so that you can inspect the details of the request you made. In this case, you can confirm that method is indeed POST, and you can see that the data you sent is listed under form.
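If you want to sanity-check the form-data format without a server, urllib.parse can both produce it and reverse it. This is a small offline sketch — parse_qs isn't part of the tutorial's function, it's just the inverse operation:

```python
from urllib.parse import parse_qs, urlencode

# Encode a dictionary into application/x-www-form-urlencoded format:
payload = {"Title": "Hello World", "Name": "Real Python"}
encoded = urlencode(payload)
print(encoded)  # Title=Hello+World&Name=Real+Python

# parse_qs reverses the process, mapping each key to a list of values,
# since a key can legally appear more than once in a query string:
decoded = parse_qs(encoded)
print(decoded)  # {'Title': ['Hello World'], 'Name': ['Real Python']}
```

Round-tripping like this makes the = and & separators and the + for spaces easy to see.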
To make the same request with JSON, you'll turn a Python dictionary into a JSON string with json.dumps(), encode it with UTF-8, pass it as the data argument, and finally add a special header to indicate that the data type is JSON:
```
>>> post_dict = {"Title": "Hello World", "Name": "Real Python"}
>>> import json
>>> json_string = json.dumps(post_dict)
>>> json_string
'{"Title": "Hello World", "Name": "Real Python"}'
>>> post_data = json_string.encode("utf-8")
>>> body, response = make_request(
...     "https://httpbin.org/anything",
...     data=post_data,
...     headers={"Content-Type": "application/json"},
... )
200
>>> print(body.decode("utf-8"))
{
  "args": {},
  "data": "{\"Title\": \"Hello World\", \"Name\": \"Real Python\"}",
  "files": {},
  "form": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "47",
    "Content-Type": "application/json",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.10",
    "X-Amzn-Trace-Id": "Root=1-61f25a81-3e35d1c219c6b5944e2d8a52"
  },
  "json": {
    "Name": "Real Python",
    "Title": "Hello World"
  },
  "method": "POST",
  "origin": "86.159.145.119",
  "url": "https://httpbin.org/anything"
}
```

To serialize the dictionary this time around, you use json.dumps() instead of urlencode(). You also explicitly add the Content-Type header with a value of application/json. With this information, the httpbin server can deserialize the JSON on the receiving end. In its response, you can see the data listed under the json key.
With that, you can now start making POST requests. This tutorial won't go into more detail about the other request methods, such as PUT. Suffice it to say that you can also explicitly set the method by passing a method keyword argument to the instantiation of the Request object.
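To illustrate that last point, here's a small offline sketch: Request objects can be inspected before sending, and get_method() shows how urllib picks the method from the data and method arguments (the URL here is just a placeholder):

```python
from urllib.request import Request

# With data attached and no explicit method, urllib defaults to POST:
post_request = Request("https://httpbin.org/anything", data=b"{}")
print(post_request.get_method())  # POST

# The method keyword argument overrides that default, e.g. for PUT:
put_request = Request("https://httpbin.org/anything", data=b"{}", method="PUT")
print(put_request.get_method())  # PUT

# Without data or an explicit method, the request stays a GET:
get_request = Request("https://httpbin.org/anything")
print(get_request.get_method())  # GET
```

None of these objects touch the network until they're passed to urlopen().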
The Request Package Ecosystem
To round things out, this last section of the tutorial is dedicated to clarifying the package ecosystem around HTTP requests with Python. Because there are many packages, with no clear standard, it can be confusing. That said, there are use cases for each package, which just means more choice for you!
What Are urllib2 and urllib3?
To answer this question, you need to go back to early Python, all the way back to version 1.2, when the original urllib was introduced. Around version 1.6, a revamped urllib2 was added, which lived alongside the original urllib. When Python 3 came along, the original urllib was deprecated, and urllib2 dropped the 2, taking on the original urllib name. It also split into parts:
- urllib.error
- urllib.parse
- urllib.request
- urllib.response
- urllib.robotparser
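As a quick sanity check, all five parts ship with modern Python and can be imported directly. A minimal sketch:

```python
# The old urllib/urllib2 functionality now lives in five submodules:
import urllib.error
import urllib.parse
import urllib.request
import urllib.response
import urllib.robotparser

# A few familiar names and where they ended up:
print(urllib.error.HTTPError.__name__)    # HTTPError
print(urllib.request.urlopen.__module__)  # urllib.request
```

Everything this tutorial used — urlopen(), Request, HTTPError, urlencode() — comes from these submodules.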
So what about urllib3? That's a third-party library developed while urllib2 was still around. It's not related to the standard library because it's an independently maintained library. Interestingly, the requests library actually uses urllib3 under the hood, and so does pip!
When Should I Use requests Over urllib.request?
The main answer is ease of use and security. urllib.request is considered a low-level library, which exposes a lot of the detail about the workings of HTTP requests. The Python documentation for urllib.request makes no bones about recommending requests as a higher-level HTTP client interface.
If you interact with many different REST APIs, day in and day out, then requests is highly recommended. The requests library bills itself as "built for human beings" and has successfully created an intuitive, secure, and straightforward API around HTTP. It's usually considered the go-to library! If you want to know more about the requests library, check out the Real Python guide to requests.
An example of how requests makes things easier is when it comes to character encoding. You'll recall that with urllib.request, you have to be aware of encodings and take a few steps to ensure an error-free experience. The requests package abstracts that away and will resolve the encoding by using chardet, a universal character encoding detector, just in case there's any funny business.
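That said, urllib.request isn't completely helpless here: a response's headers object is an http.client.HTTPMessage, a subclass of email.message.Message, so it can report a charset that the server declared. Here's a minimal offline sketch that simulates the lookup with a hand-built Message instead of a live response:

```python
from email.message import Message

# http.client.HTTPMessage subclasses email.message.Message, so the charset
# a server declares can be read off the headers the same way. Here the
# header is built by hand instead of coming from a live response:
headers = Message()
headers["Content-Type"] = "application/json; charset=utf-8"

charset = headers.get_content_charset()
print(charset)  # utf-8
```

On a real response you'd call response.headers.get_content_charset() and fall back to a guess (often UTF-8) when the server declares nothing — the gap that chardet fills for requests.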
If your goal is to learn more about standard Python and the details of how it deals with HTTP requests, then urllib.request is a great way to get into that. You could even go further and use the very low-level http modules. On the other hand, you may just want to keep dependencies to a minimum, which urllib.request is more than capable of.
Why Is requests Not Part of the Standard Library?
Perhaps you're wondering why requests isn't part of core Python by this point.
This is a complex issue, and there's no hard and fast answer to it. There are many speculations as to why, but two reasons seem to stand out:
- requests has other third-party dependencies that would need to be integrated too.
- requests needs to stay agile and can do this better outside the standard library.
The requests library has third-party dependencies. Integrating requests into the standard library would mean also integrating chardet, certifi, and urllib3, among others. The alternative would be to fundamentally change requests to use only Python's existing standard library. This is no trivial task!
Integrating requests would also mean that the existing team that develops this library would have to relinquish total control over the design and implementation, giving way to the PEP decision-making process.
HTTP specifications and recommendations change all the time, and a high-level library has to be agile enough to keep up. If there's a security exploit to be patched, or a new workflow to add, the requests team can build and release far more quickly than they could as part of the Python release process. There have supposedly been times when they've released a security fix twelve hours after a vulnerability was discovered!
For an interesting overview of these issues and more, check out Adding Requests to The Standard Library, which summarizes a discussion at the Python Language Summit with Kenneth Reitz, the creator and maintainer of requests.
Because this agility is so necessary to requests and its underlying urllib3, the paradoxical statement that requests is too important for the standard library is often used. This is because so much of the Python community depends on requests and its agility that integrating it into core Python would probably harm it and the Python community.
On the GitHub repository issues board for requests, an issue was posted, asking for the inclusion of requests in the standard library. The developers of requests and urllib3 chimed in, mainly saying they would likely lose interest in maintaining it themselves. Some even said they would fork the repositories and continue developing them for their own use cases.
With that said, note that the requests library GitHub repository is hosted under the Python Software Foundation's account. Just because something isn't part of the Python standard library doesn't mean that it's not an integral part of the ecosystem!
It seems that the current situation works for both the Python core team and the maintainers of requests. While it may be slightly confusing for newcomers, the existing structure gives the most stable experience for HTTP requests.
It's also important to note that HTTP requests are inherently complex. urllib.request doesn't try to sugarcoat that too much. It exposes a lot of the inner workings of HTTP requests, which is why it's billed as a low-level module. Your choice of requests versus urllib.request really depends on your particular use case, security concerns, and preference.
Conclusion
You're now equipped to use urllib.request to make HTTP requests. Now you can use this built-in module in your projects, keeping them dependency-free for longer. You've also gained the in-depth understanding of HTTP that comes from using a lower-level module, such as urllib.request.
In this tutorial, you've:
- Learned how to make basic HTTP requests with urllib.request
- Explored the nuts and bolts of an HTTP message and studied how it's represented by urllib.request
- Figured out how to deal with character encodings of HTTP messages
- Explored some common errors when using urllib.request and learned how to resolve them
- Dipped your toes into the world of authenticated requests with urllib.request
- Understood why both urllib and the requests library exist and when to use one or the other
You're now in a position to make basic HTTP requests with urllib.request, and you also have the tools to dive deeper into low-level HTTP terrain with the standard library. Finally, you can choose whether to use requests or urllib.request, depending on what you want or need. Have fun exploring the Web!
Source: https://realpython.com/urllib-request/