I'd like to create a method that returns the HTML for a URL that's passed as a parameter. I'm aware of how to do this using tools like "urllib2" or "requests". However, I am restricted to using sockets. So far I've tried this, and it's not working.

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((url, 80))
s.sendall("GET / HTTP/1.0\r\n\r\n")
return s.recv(4096)

I think the error is in the request itself; it seems to be formatted incorrectly.

I've tried some similar solutions from other users here, but none of them have worked. Any help would be appreciated, thanks.

  • What is the error? Commented Jan 22, 2018 at 2:23
  • It depends on the URL passed. For example, when I pass "www.stackoverflow.com" I get... HTTP/1.1 500 Domain Not Found <title>Fastly error: unknown domain </title> ... <p>Fastly error: unknown domain: . Please check that this domain has been added to a service.</p> ... Commented Jan 22, 2018 at 2:26
  • The errors are all of that style, with differing semantics. Commented Jan 22, 2018 at 2:28

1 Answer

Even though the Host header is mandatory only in HTTP/1.1, many sites need it even for an HTTP/1.0 request, especially if they host several domains on the same IP address. Without it, the server cannot tell which domain you are asking for. So what you need is at least the following:

  s.sendall("GET / HTTP/1.0\r\nHost: " + hostname + "\r\n\r\n")
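For readers on Python 3, where `sendall` requires bytes rather than a `str`, the same request could be sketched as follows. This is a minimal sketch, not a definitive implementation: the helper names `build_request` and `fetch`, the 10-second timeout, and the read-until-close loop are assumptions added for illustration and are not part of the original answer.

```python
import socket


def build_request(hostname, path="/"):
    """Build a minimal HTTP/1.0 GET request that includes a Host header.

    Hypothetical helper: the name and signature are illustrative only.
    """
    lines = [
        "GET {} HTTP/1.0".format(path),
        "Host: {}".format(hostname),
        "",  # blank line that terminates the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(hostname, port=80, path="/"):
    """Send the request and read until the server closes the connection.

    A single recv(4096) may return only part of the response, so the
    loop keeps reading until recv returns an empty bytes object.
    """
    s = socket.create_connection((hostname, port), timeout=10)
    try:
        s.sendall(build_request(hostname, path))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    finally:
        s.close()
    return b"".join(chunks)
```

Calling `fetch("www.stackoverflow.com")` would return the raw response bytes, status line and headers included; with HTTP/1.0 the server normally closes the connection once the response is complete, which is what ends the loop.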

Note that some sites also require specific User-Agent values or other headers, since they try to detect and block bots. Sites also often reply with an HTTP redirect, so to get to the HTML you need to parse the response, follow the redirect (including any cookie the server set in the new request), and probably also deal with HTTPS instead of plain HTTP, etc.
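To make the redirect point concrete, here is a rough Python 3 sketch of splitting a raw response into status code, headers, and body, and then checking for a redirect. The function names `parse_response` and `redirect_target` are hypothetical, and the sketch deliberately ignores cookies, repeated headers, and HTTPS.

```python
def parse_response(raw):
    """Split raw response bytes into (status, headers, body).

    Assumes the header section is complete and ends with a blank line.
    Repeated headers (such as Set-Cookie) overwrite each other here,
    so a real client would need something more careful.
    """
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line looks like: HTTP/1.1 301 Moved Permanently
    status = int(lines[0].split(" ", 2)[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers, body


def redirect_target(status, headers):
    """Return the Location header for redirect statuses, else None."""
    if status in (301, 302, 303, 307, 308):
        return headers.get("location")
    return None
```

If `redirect_target` returns a URL, the client would have to open a new connection to that host and repeat the request. A redirect to an `https://` URL cannot be followed over a plain socket; it needs TLS on top.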

2 Comments

Does the socket object have a way of getting the hostname?
@Xlqt: A socket is associated with a local IP address and the IP address it is connected to. It knows nothing about hostnames, only IP addresses. The hostname in my code is what you mistakenly called the url. A URL is something like http://hostname:port/path..., but the socket needs only the hostname or IP address (together with the port) when connecting, not the full URL.
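As an illustration of that distinction, the hostname, port, and path could be pulled out of a URL before connecting. This Python 3 sketch assumes the standard-library `urllib.parse` module is acceptable for parsing only (the network I/O would still go through the socket); `split_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return (hostname, port, path) for a URL or a bare hostname."""
    # A bare hostname such as "www.stackoverflow.com" has no scheme,
    # so assume plain HTTP to let urlsplit recognise the host part.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    # Fall back to the scheme's default port when none is given.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return parts.hostname, port, path
```

The hostname goes into both `s.connect((hostname, port))` and the Host header, while the path goes into the request line.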
