
I'm trying to fetch the HTML source of Google using only Python sockets, without other libraries such as urllib. I don't understand why my GET request isn't working; I've tried every variation I can think of. The code below is short, and I don't want to go into too much detail: I'm just looking for the protocol used to fetch a page's source. I assumed it would be the GET method, but that doesn't work. I need a response that resembles what urllib.request returns, but using Python sockets only.

  • If I pass "https://www.google.com" to socket.gethostbyname(), it fails in getaddrinfo.
  • Also, when I send the same GET request to python.org, the while loop never ends.


import socket;
s=socket.socket();
host=socket.gethostbyname("www.google.com");
port=80;
send_buf="GET / \r\n"\
        "Host: www.google.com\r\n";
s.connect((host, port));
s.sendall(bytes(send_buf, encoding="utf-8"));
data="";
part=None;
while( True ):
    part=s.recv(2048);
    data+=str(part, "utf-8");
    if( part==b'' ):
        break;
s.close();
  • https://www.google.com isn't a hostname (it's a URL), so of course gethostbyname fails. Commented Mar 26, 2016 at 2:18
  • You don't need semicolons in Python unless you're putting multiple statements on one line :) Commented Mar 26, 2016 at 2:47
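
To illustrate the first comment: gethostbyname() wants only the hostname, so the scheme, port, and path have to be stripped from a URL first. Here is a minimal sketch that does this by hand, without urllib. The helper name host_from_url is hypothetical (it is not part of the socket module), and it does not handle bracketed IPv6 literals.

```python
def host_from_url(url):
    """Return just the hostname part of a URL-like string (hypothetical helper)."""
    # Drop the scheme ("http://", "https://") if one is present.
    if "://" in url:
        url = url.split("://", 1)[1]
    # Drop the path, query string, or fragment, whichever comes first.
    for sep in ("/", "?", "#"):
        url = url.split(sep, 1)[0]
    # Drop any user-info prefix and an explicit port.
    url = url.rsplit("@", 1)[-1]
    return url.split(":", 1)[0]
```

With that in place, socket.gethostbyname(host_from_url("https://www.google.com")) is handed "www.google.com" rather than the whole URL.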

1 Answer


The following worked for me:

import socket

s = socket.socket()
host = socket.gethostbyname('www.google.com')
port = 80
s.connect((host, port))
# sendall() needs bytes on Python 3
s.sendall(b"GET /\r\n")
val = s.recv(10000)
s.close()
# Split off the HTTP headers
val = val.split(b'\r\n\r\n', 1)[1]
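
For a fuller picture, here is a rough sketch of a complete request/response cycle with sockets only, assuming Python 3. The request in the question has no HTTP version on the request line and no blank line ending the headers, which is probably why some servers keep waiting and the recv() loop never finishes. This version sends an HTTP/1.0 request with "Connection: close" so the server should close the socket once the response is complete. The function names build_request, split_response, and fetch are my own, not from any library.

```python
import socket


def build_request(host, path="/"):
    """Build a minimal HTTP/1.0 GET request as bytes."""
    lines = [
        "GET {} HTTP/1.0".format(path),
        "Host: {}".format(host),
        "Connection: close",
        "",  # the two empty strings produce the blank line
        "",  # that terminates the request headers
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(raw):
    """Split a raw response into (status_line, headers_dict, body_bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    head_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in head_lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return head_lines[0], headers, body


def fetch(host, path="/", port=80, timeout=10):
    """Send the request and read until the server closes the connection."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(build_request(host, path))
        chunks = []
        while True:
            part = s.recv(4096)
            if not part:  # empty bytes means the peer closed the socket
                break
            chunks.append(part)
    return split_response(b"".join(chunks))
```

Calling status, headers, body = fetch("www.google.com") should then give the status line, a dict of headers, and the body as bytes. The body is left undecoded on purpose, because the right charset depends on the Content-Type header.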

6 Comments

But this doesn't return the source code; it returns exactly the same thing my code returns :)
I tried this. I think what I'm really looking for is something that works like urllib.request and returns the full source of the website. I get the 302 Moved message from google.com, unlike when I use urllib, which gives the full source.
I get 200 OK (and the HTML for the Google homepage) with the exact code shown here, so I'm not sure why you would be getting 302 Moved.
This is what I get: b'<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">\n<TITLE>302 Moved</TITLE></HEAD><BODY>\n<H1>302 Moved</H1>\nThe document has moved\n<A HREF="google.fr/…>.\r\n</BODY></HTML>\r\n'
Looks like Google thinks (rightly or wrongly) that you are in France (see support.google.com/websearch/answer/873?hl=en). Changing the setting in a browser will likely resolve the issue (I believe it is just IP address based, which would be the same for browser or Python), or you can visit google.fr directly to get the source for that page
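
On the 302 itself: urllib follows redirects automatically, which is why it returns the full page while a raw socket only shows the "302 Moved" response. With sockets you have to read the Location header and issue a second request yourself. Below is a small sketch of the detection step, assuming the raw response bytes have already been read from the socket. The name redirect_target is made up for illustration.

```python
def redirect_target(raw):
    """Return the Location header value if the response is a 3xx redirect, else None."""
    head = raw.partition(b"\r\n\r\n")[0].decode("iso-8859-1")
    lines = head.split("\r\n")
    # The status line looks like "HTTP/1.0 302 Found".
    parts = lines[0].split(" ", 2)
    if len(parts) < 2 or not parts[1].isdigit():
        return None
    if not 300 <= int(parts[1]) < 400:
        return None
    # Header names are case-insensitive, so compare in lowercase.
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "location":
            return value.strip()
    return None
```

If this returns a URL, you would connect to that host and send a new GET for that path. Note that if the target is an https:// URL, a plain socket on port 80 won't be enough, since the connection then needs TLS.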