WebScraping with Python / Selenium

Question

I'm trying to pull some data from YouTube, but I'm struggling to capture the text. Here is my code:

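from selenium import webdriver
from bs4 import BeautifulSoup as bs

username = "unboxtherapy"
driver = webdriver.Chrome('C:/Users/Chrome Web Driver/chromedriver.exe')
api_url = "https://www.youtube.com/user/" + username + "/about"
driver.get(api_url)
# html.text is the visible text of the whole page, not its markup
html = driver.find_element_by_tag_name('html')
soup = bs(html.text, 'html.parser')
text = str(soup)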

In the example above, I'm trying to capture the description shown on the page.

soup

returns all the text on the page, i.e. the description that I want plus a ton of other things that I don't want.

text

returns all the following text:

"GB\nSIGN IN\nUnbox Therapy\n13,802,667 subscribers\nJOIN\nSUBSCRIBE\nTwitter\nHOME\nVIDEOS\nPLAYLISTS\nCOMMUNITY\nCHANNELS\nABOUT\nDescription\nWhere products get naked.\n\nHere you will find a variety of videos showcasing the coolest products on the planet. From the newest smartphone to surprising gadgets and technology you never knew existed. It's all here on Unbox Therapy.\n\nBusiness / professional inquiries ONLY - business [at] unboxtherapy.com\n(please don't use YouTube inbox)\nLinks\nTwitter Facebook Instagram The Official Website\nStats\nJoined Dec 21, 2010\n2,698,921,226 views\nOTHER COOL CHANNELS.\nLew Later\nSUBSCRIBE\nMarques Brownlee\nSUBSCRIBE\nJonathan Morrison\nSUBSCRIBE\nAustin Evans\nSUBSCRIBE\nDetroitBORG\nSUBSCRIBE\nLooneyTek\nSUBSCRIBE\nSoldier Knows Best\nSUBSCRIBE\nUrAvgConsumer\nSUBSCRIBE\nRELATED CHANNELS\nLinus Tech Tips\nSUBSCRIBE\nJerryRigEverything\nSUBSCRIBE\nMrwhosetheboss\nSUBSCRIBE\nTechSmartt\nSUBSCRIBE"

Is there a way to capture just the description? Is that possible at all?

Thank you in advance to whoever can help me.

Best Wishes

You can get the element by ID; a quick F12 on YouTube shows that the ID you are looking for is description.
– Nullman, Commented Feb 25, 2019 at 13:32
Thank you. When I try a = driver.find_element_by_id('description'), it returns the text but also a lot of "\n". Is there a way to remove them? Here is the text that is returned: "Where products get naked.\n\nHere you will find a variety of videos showcasing the coolest products on the planet. From the newest smartphone to surprising gadgets and technology you never knew existed. It's all here on Unbox Therapy.\n\nBusiness / professional inquiries ONLY - business [at] unboxtherapy.com\n(please don't use YouTube inbox)"
– tezzaaa, Commented Feb 25, 2019 at 13:50
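
The "\n" characters in the comment above are ordinary line breaks inside the element's text. A minimal sketch of two ways to handle them with plain string methods follows; the `raw` value is a shortened copy of the text quoted above, and the variable names are illustrative only.

```python
# Sample text as returned by the element's .text (shortened from the comment above)
raw = (
    "Where products get naked.\n\n"
    "Here you will find a variety of videos showcasing the coolest products "
    "on the planet.\n\n"
    "Business / professional inquiries ONLY - business [at] unboxtherapy.com\n"
    "(please don't use YouTube inbox)"
)

# Option 1: collapse every run of whitespace (including "\n") into one space
one_line = " ".join(raw.split())

# Option 2: keep one entry per non-empty line, dropping the blank lines
lines = [line.strip() for line in raw.splitlines() if line.strip()]

print(one_line)
print(lines)
```

Option 1 suits a single-line field (for example a CSV cell); option 2 keeps the paragraph structure if that matters.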

KunduK · Accepted Answer · 2019-02-25 13:57:35Z

1

Try the code below. Let me know if it works.

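import re

import bs4 as bs
from selenium import webdriver

username = "unboxtherapy"
driver = webdriver.Chrome('C:/Users/Chrome Web Driver/chromedriver.exe')
api_url = "https://www.youtube.com/user/" + username + "/about"
driver.get(api_url)
# page_source is the rendered HTML, so BeautifulSoup can search it by tag and id
html = driver.page_source
soup = bs.BeautifulSoup(html, 'html.parser')
# match <yt-formatted-string> elements whose id contains "description"
findtext = soup.find_all('yt-formatted-string', id=re.compile("description"))
for txt in findtext:
    print(txt.text)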

Output:

Where products get naked.

Here you will find a variety of videos showcasing the coolest products on the planet. From the newest smartphone to surprising gadgets and technology you never knew existed. It's all here on Unbox Therapy.

Business / professional inquiries ONLY - business [at] unboxtherapy.com
(please don't use YouTube inbox)
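
One detail in the code above: when a compiled pattern is passed as a filter, Beautiful Soup matches it with the pattern's `search()` method, so `re.compile("description")` accepts any id that merely contains that word. The sketch below uses only the standard `re` module to show the difference; the id strings in `candidate_ids` are made up for illustration and are not taken from the YouTube page.

```python
import re

# Hypothetical id values, for illustration only
candidate_ids = ["description", "description-container", "channel-description", "stats"]

loose = re.compile("description")       # what the answer uses: substring match
exact = re.compile(r"^description$")    # anchored: the whole id must match

loose_hits = [i for i in candidate_ids if loose.search(i)]
exact_hits = [i for i in candidate_ids if exact.search(i)]

print(loose_hits)
print(exact_hits)
```

If only the exact id is wanted, passing the plain string `id="description"` to `find_all` should behave like the anchored pattern.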

answered Feb 25, 2019 at 13:57

KunduK


5 Comments

Nullman Over a year ago

Is there something special about the use of webdriver? When I get the HTML using requests, I am unable to find the id even though it's in the received HTML. Never mind, I found the answer.

tezzaaa Over a year ago

Out of interest, how did you find 'yt-formatted-string'? (Sorry, I'm not very knowledgeable about web stuff in general.)

Nullman Over a year ago

@tezzaaa I literally hit F12 and selected the object in my web browser.

KunduK Over a year ago

@Nullman: I just used Chrome's Inspect to find the element and then provided the tag name and id value.

tezzaaa Over a year ago

@Nullman thanks I did that also. In Chrome, inspect works well. I'm trying to pull views etc from the same page
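
For the views and subscriber counts mentioned in the last comment, one possible approach is to pull the numbers out of the page text with a regular expression. This is a sketch under assumptions: `page_text` is a shortened copy of the text quoted in the question, and `parse_count` is a hypothetical helper, not part of Selenium or Beautiful Soup.

```python
import re

# Shortened copy of the page text shown in the question
page_text = (
    "Unbox Therapy\n13,802,667 subscribers\nJOIN\nSUBSCRIBE\n"
    "Stats\nJoined Dec 21, 2010\n2,698,921,226 views\nOTHER COOL CHANNELS."
)


def parse_count(text, label):
    """Return the integer that precedes `label` (e.g. 'views'), or None."""
    match = re.search(r"(\d[\d,]*)\s+" + re.escape(label), text)
    if match is None:
        return None
    # Drop the thousands separators before converting
    return int(match.group(1).replace(",", ""))


views = parse_count(page_text, "views")
subscribers = parse_count(page_text, "subscribers")
print(views, subscribers)
```

This relies on the page showing full numbers with comma separators, as in the question; abbreviated counts such as "13.8M" would need different handling.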

linizio (edited by trincot) · Accepted Answer · 2019-02-25 18:23:11Z

1

Simple parsing can be done using only Selenium.

driver.get(api_url)
description = driver.find_element_by_id('description')
print(description.text)

If you use Chrome, the Inspect tool shows the tag name, id, or class name of any element:

Right-click the description text (the element you want to find) and select 'Inspect'.

The tooltip over the highlighted element then shows its values:

pink text: tag name
'#' followed by orange text: id
'.' followed by blue text: class name

Now use the matching driver method:

driver.find_element_by_tag_name()
driver.find_elements_by_tag_name()
driver.find_element_by_id()
driver.find_elements_by_id()
driver.find_element_by_class_name()
driver.find_elements_by_class_name()
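
The `find_element_by_*` helpers are the Selenium 3 spelling; Selenium 4 deprecated them and later releases removed them in favor of `find_element(by, value)`. A hedged sketch of the same lookup in that style follows. `get_description` and `run` are hypothetical names, `run` assumes a recent Selenium 4 install that can locate chromedriver on its own, and the `description` id comes from the answers here and may no longer match YouTube's current markup.

```python
def get_description(driver, by="id", value="description"):
    """Return the text of the first element matching the locator.

    "id" is the string behind selenium.webdriver.common.by.By.ID.
    """
    return driver.find_element(by, value).text


def run(username="unboxtherapy"):
    """Open the channel's about page and print its description (needs Chrome)."""
    # Imported here so get_description can be used with any driver-like object
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get("https://www.youtube.com/user/" + username + "/about")
        print(get_description(driver, By.ID, "description"))
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `run()` starts a browser session. `find_elements(by, value)` is the plural form and returns a list of matches.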

edited Feb 25, 2019 at 18:23

trincot


answered Feb 25, 2019 at 17:34

linizio


1 Comment

tezzaaa Over a year ago

thank you very much, that's useful to me and also to other people who are new to parsing :-)
