I'm trying to pull some data from YouTube, but I'm struggling to capture just the text I need. Here is my code:

from selenium import webdriver
from bs4 import BeautifulSoup as bs

username = "unboxtherapy"
driver = webdriver.Chrome('C:/Users/Chrome Web Driver/chromedriver.exe')
api_url = "https://www.youtube.com/user/"+username+"/about"
driver.get(api_url)
html = driver.find_element_by_tag_name('html')
soup=bs(html.text,'html.parser')
text=str(soup)

In the example above, I'm trying to capture the description shown on the page.

soup

returns all the text on the page i.e. the description that I want + a ton of other things which I don't want.

text

returns all the following text:

"GB\nSIGN IN\nUnbox Therapy\n13,802,667 subscribers\nJOIN\nSUBSCRIBE\nTwitter\nHOME\nVIDEOS\nPLAYLISTS\nCOMMUNITY\nCHANNELS\nABOUT\nDescription\nWhere products get naked.\n\nHere you will find a variety of videos showcasing the coolest products on the planet. From the newest smartphone to surprising gadgets and technology you never knew existed. It's all here on Unbox Therapy.\n\nBusiness / professional inquiries ONLY - business [at] unboxtherapy.com\n(please don't use YouTube inbox)\nLinks\nTwitter Facebook Instagram The Official Website\nStats\nJoined Dec 21, 2010\n2,698,921,226 views\nOTHER COOL CHANNELS.\nLew Later\nSUBSCRIBE\nMarques Brownlee\nSUBSCRIBE\nJonathan Morrison\nSUBSCRIBE\nAustin Evans\nSUBSCRIBE\nDetroitBORG\nSUBSCRIBE\nLooneyTek\nSUBSCRIBE\nSoldier Knows Best\nSUBSCRIBE\nUrAvgConsumer\nSUBSCRIBE\nRELATED CHANNELS\nLinus Tech Tips\nSUBSCRIBE\nJerryRigEverything\nSUBSCRIBE\nMrwhosetheboss\nSUBSCRIBE\nTechSmartt\nSUBSCRIBE"

Is there a way to capture just the description? Is that possible at all?

Thank you in advance to whoever can help me.

Best Wishes

  • You can get the element by ID, and a quick F12 on YouTube shows that the ID you are looking for is description. Commented Feb 25, 2019 at 13:32
  • Thank you. When I try a = driver.find_element_by_id('description'), it returns the text but also a lot of "\n". Is there a way to remove them? Here is the text that is returned: "Where products get naked.\n\nHere you will find a variety of videos showcasing the coolest products on the planet. From the newest smartphone to surprising gadgets and technology you never knew existed. It's all here on Unbox Therapy.\n\nBusiness / professional inquiries ONLY - business [at] unboxtherapy.com\n(please don't use YouTube inbox)" Commented Feb 25, 2019 at 13:50
  • Do you want to replace "\n" with new lines, or with spaces? Commented Feb 25, 2019 at 13:55
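
Both readings of that last question can be handled with plain string methods once the element text is in hand. The "\n" sequences in the returned string are ordinary newline characters, so print() already shows them as line breaks. What follows is only a minimal sketch: the raw string is a shortened excerpt of the text quoted above, and collapse_whitespace is a hypothetical helper name, not part of Selenium.

```python
def collapse_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines), so joining with " " leaves single spaces.
    return " ".join(text.split())


# Shortened excerpt of the description text quoted in the comment above.
raw = ("Where products get naked.\n\n"
       "Business / professional inquiries ONLY - business [at] unboxtherapy.com\n"
       "(please don't use YouTube inbox)")

# Option 1: everything on one line, newlines replaced by spaces.
one_line = collapse_whitespace(raw)

# Option 2: keep the line structure but drop the empty lines.
lines = [line for line in raw.splitlines() if line.strip()]

print(one_line)
print(lines)
```

With Selenium, the same helper would be applied to the element's .text value.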

2 Answers

Try the code below and let me know if it works.

import re

import bs4 as bs
from selenium import webdriver

username = "unboxtherapy"
driver = webdriver.Chrome('C:/Users/Chrome Web Driver/chromedriver.exe')
api_url = "https://www.youtube.com/user/"+username+"/about"
driver.get(api_url)
# page_source is the rendered HTML, not just the visible text
html = driver.page_source
soup=bs.BeautifulSoup(html,'html.parser')
# match <yt-formatted-string> tags whose id contains "description"
findtext=soup.find_all('yt-formatted-string',id=re.compile("description"))
for txt in findtext:
    print(txt.text)

Output:

Where products get naked.

Here you will find a variety of videos showcasing the coolest products on the planet. From the newest smartphone to surprising gadgets and technology you never knew existed. It's all here on Unbox Therapy.

Business / professional inquiries ONLY - business [at] unboxtherapy.com
(please don't use YouTube inbox)

5 Comments

Is there something special about using webdriver? When I get the HTML using requests, I am unable to find the id even though it's in the received HTML. Never mind, I found the answer.
Out of interest, how did you find 'yt-formatted-string'? (Sorry, I'm not very knowledgeable about web stuff in general.)
@tezzaaa I literally hit F12 and selected the object in my web browser.
@Nullman: I just used Chrome's Inspect to find the element, then used its tag name and id value.
@Nullman Thanks, I did that as well. In Chrome, Inspect works well. I'm now trying to pull the views etc. from the same page.
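
For the view and subscriber counts mentioned in the last comment, one option is to search the visible page text with a regular expression. This is only a sketch under assumptions: parse_stats is a hypothetical helper, the sample string is cut from the page text quoted in the question, and it assumes the counts appear as full comma-separated numbers followed by "subscribers" or "views". If the page abbreviates the numbers (for example "13.8M"), the pattern will not match and the key is simply left out.

```python
import re


def parse_stats(page_text):
    """Return the subscriber and view counts found in the visible page text."""
    stats = {}
    for key in ("subscribers", "views"):
        # Digits and commas, one space, then the label, e.g. "13,802,667 subscribers".
        match = re.search(r"([\d,]+) " + key, page_text)
        if match:
            stats[key] = int(match.group(1).replace(",", ""))
    return stats


# Excerpt of the page text shown in the question.
sample = ("Unbox Therapy\n13,802,667 subscribers\nJOIN\nSUBSCRIBE\n"
          "Stats\nJoined Dec 21, 2010\n2,698,921,226 views\n")

stats = parse_stats(sample)
print(stats)
```

In a live session the input would be the text of the page's html element, as read in the question's code.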

Simple parsing like this can be done with Selenium alone.

driver.get(api_url)
# look the element up by its id and read its visible text
description = driver.find_element_by_id('description')
print(description.text)

If you use Chrome, the Inspect tool shows the tag name, id, or class name of an element:

  1. Right-click the description text (the element you want to find).
  2. Select 'Inspect'.

Chrome then shows the element's details, which you can read like this:

  • pink text: tag name
  • '#' and orange text: id
  • '.' and blue text: class name

Now use the matching driver method:

driver.find_element_by_tag_name()
driver.find_elements_by_tag_name()
driver.find_element_by_id()
driver.find_elements_by_id()
driver.find_element_by_class_name()
driver.find_elements_by_class_name()

1 Comment

Thank you very much, that's useful to me and also to other people who are new to parsing :-)
