I am parsing the metrics (likes, views, comments, dates) of articles and megaposts from a forum.

I am using Selenium, and I am trying to get each post's publication datetime.

dates = []

# Note: find_elements_by_class_name searches the whole page, so each pass of
# the outer loop repeats the same lookup.
page_items = len(drv.find_elements_by_class_name("tm-articles-list"))
for i in range(page_items):
    date_of_post = drv.find_elements_by_class_name("tm-article-snippet__datetime-published")
    for d in date_of_post:
        date_text = d.find_element_by_tag_name("time").text
        dates.append(date_text)

The problem is that regular articles and megaposts use different HTML class names. For articles the datetime class is tm-article-snippet__datetime-published, and for megaposts it is tm-megapost-snippet__datetime-published. Is there a way to parse the datetime regardless of which class is used?

I tried to do it with a logical expression: date_of_post = drv.find_elements_by_class_name("tm-article-snippet__datetime-published" or "tm-megapost-snippet__datetime-published"), but obviously it does not work.
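A minimal sketch of why this attempt only ever looks up the first class: Python evaluates the or between the two strings before Selenium receives anything, and a non-empty string is truthy, so the expression collapses to its left operand. The variable names below are illustrative only.

```python
article_cls = "tm-article-snippet__datetime-published"
megapost_cls = "tm-megapost-snippet__datetime-published"

# `or` returns its first truthy operand; any non-empty string is truthy,
# so the megapost class name is never passed on to the lookup.
selected = article_cls or megapost_cls
print(selected)  # tm-article-snippet__datetime-published
```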

Important note: all megaposts on the forum sit inside the element with the tm-articles-list class.

HTML for megaposts:

    <article id="424221" data-navigatable="" tabindex="0" class="tm-articles-list__item">
<div class="tm-megapost-snippet">
<div class="tm-megapost-snippet__wrapper" style="background: url(&quot;https://habrastorage.org/r/w780/getpro/tmtm/megapost/928/9f7/ad0/9289f7ad0d8e76bf87471d2dbf71401a.jpg&quot;) center center / cover no-repeat;">
<div class="tm-megapost-snippet__tint">
<header class="tm-megapost-snippet__header">
<a href="/ru/company/dins/" class="tm-megapost-snippet__link tm-megapost-snippet__company-blog router-link-active">
<span>Блог компании DINS</span>
</a>
<a href="/ru/article/424221/" class="tm-megapost-snippet__link tm-megapost-snippet__date">
<time datetime="2018-11-09T14:58:14.000Z" title="2018-11-09, 17:58" class="tm-megapost-snippet__datetime-published">9  ноября  2018</time>
</a>
</header>
<a href="/ru/article/424221/" class="tm-megapost-snippet__link tm-megapost-snippet__card">
<h2 class="tm-megapost-snippet__title">Жизнь С++</h2>
</a>
<ul class="tm-megapost-snippet__hubs">
<li class="tm-megapost-snippet__hub"><a href="/ru/hub/programming/" class="tm-megapost-snippet__link"><span>Программирование</span></a></li><li class="tm-megapost-snippet__hub"><a href="/ru/hub/read/" class="tm-megapost-snippet__link"><span>Читальный зал</span></a></li><li class="tm-megapost-snippet__hub"><a href="/ru/hub/history/" class="tm-megapost-snippet__link"><span>История IT</span></a></li><li class="tm-megapost-snippet__hub"><a href="/ru/hub/itcompanies/" class="tm-megapost-snippet__link"><span>IT-компании</span></a></li></ul></div></div><div class="tm-megapost-snippet__body"><div class="article-formatted-body article-formatted-body_version-1">IT-эволюция - шутка парадоксальная. Например, сначала на компьютерах моделировали нагрузку на АТС, затем программно управляли вызовами, а теперь телефония - это облачное решение, которое разворачивается за несколько минут и объединяет все корпоративные коммуникации. 
    
    Кажется, между этими изменениями мало общего. На самом деле они стали возможными благодаря принципам программирования, заложенным полвека назад. И чтобы лучше увидеть эту связь, мы решили вспомнить историю С++ - одного из самых “взрослых” языков программирования. Он может быть и удобным инструментом разработки, и ночным кошмаром, и частью корпоративной истории. std::begin( )
    </div><a href="/ru/article/424221/" class="tm-megapost-snippet__readmore"><span>Подробности — под катом</span></a></div></div><div class="tm-data-icons"><!----><div class="tm-votes-meter tm-data-icons__item"><svg height="16" width="16" class="tm-svg-img tm-votes-meter__icon tm-votes-meter__icon_small"><title>Всего голосов 72: ↑65 и ↓7</title><use xlink:href="/img/megazord-v24.cee85629.svg#counter-rating"></use></svg><span title="Всего голосов 72: ↑65 и ↓7" class="tm-votes-meter__value tm-votes-meter__value_positive tm-votes-meter__value_small">+58</span></div><span class="tm-icon-counter tm-data-icons__item" title="Количество просмотров"><svg height="16" width="16" class="tm-svg-img tm-icon-counter__icon"><title>Просмотры</title><use xlink:href="/img/megazord-v24.cee85629.svg#counter-views"></use></svg><span class="tm-icon-counter__value">43K</span></span><button title="Добавить в закладки" type="button" class="bookmarks-button tm-data-icons__item"><span title="Добавить в закладки" class="tm-svg-icon__wrapper bookmarks-button__icon"><svg height="16" width="16" class="tm-svg-img tm-svg-icon"><title>Добавить в закладки</title><use xlink:href="/img/megazord-v24.cee85629.svg#counter-favorite"></use></svg></span><span title="Количество пользователей, добавивших публикацию в закладки" class="bookmarks-button__counter">
        119
      </span></button><div class="tm-article-comments-counter-link tm-data-icons__item" title="Читать комментарии"><a href="/ru/company/dins/blog/424221/comments/" class="tm-article-comments-counter-link__link"><svg height="16" width="16" class="tm-svg-img tm-article-comments-counter-link__icon"><title>Комментарии</title><use xlink:href="/img/megazord-v24.cee85629.svg#counter-comments"></use></svg><span class="tm-article-comments-counter-link__value">
          189
        </span></a><a href="/ru/company/dins/blog/424221/comments/" class="tm-article-comments-counter-link__link"><span title="Читать новые комментарии" class="tm-article-comments-counter-link__unread-counter">
          +189
        </span></a></div><!----><div class="v-portal" style="display: none;"></div></div></article>

HTML for regular articles:

<article id="433166" data-navigatable="" tabindex="0" class="tm-articles-list__item">
<div class="tm-article-snippet">
<div class="tm-article-snippet__meta-container">
<div class="tm-article-snippet__meta">
<span class="tm-user-info tm-article-snippet__author"><a href="/ru/users/640509-040147/" class="tm-user-info__userpic" title="640509-040147">
<div class="tm-entity-image">
<svg height="24" width="24" class="tm-svg-img tm-image-placeholder tm-image-placeholder_pink"><!----><use xlink:href="/img/megazord-v24.cee85629.svg#placeholder-user"></use></svg></div></a><span class="tm-user-info__user"><a href="/ru/users/640509-040147/" class="tm-user-info__username">
      640509-040147
    </a>
</span></span>
<span class="tm-article-snippet__datetime-published">
<time datetime="2018-12-25T11:36:26.000Z" title="2018-12-25, 14:36">25  декабря  2018 в 14:36</time></span></div><!---->
</div>
<h2 class="tm-article-snippet__title tm-article-snippet__title_h2"><a href="/ru/company/dins/blog/433166/" class="tm-article-snippet__title-link" data-article-link=""><span>Предсказываем время решения тикета с помощью машинного обучения</span></a></h2><div class="tm-article-snippet__hubs"><span class="tm-article-snippet__hubs-item"><a href="/ru/company/dins/blog/" class="tm-article-snippet__hubs-item-link router-link-active"><span>Блог компании DINS</span><!----></a></span><span class="tm-article-snippet__hubs-item"><a href="/ru/hub/python/" class="tm-article-snippet__hubs-item-link"><span>Python</span><span title="Профильный хаб" class="tm-article-snippet__profiled-hub">*</span></a></span><span class="tm-article-snippet__hubs-item"><a href="/ru/hub/data_mining/" class="tm-article-snippet__hubs-item-link"><span>Data Mining</span><span title="Профильный хаб" class="tm-article-snippet__profiled-hub">*</span></a></span><span class="tm-article-snippet__hubs-item"><a href="/ru/hub/machine_learning/" class="tm-article-snippet__hubs-item-link"><span>Машинное обучение</span><span title="Профильный хаб" class="tm-article-snippet__profiled-hub">*</span></a></span></div><div class="tm-article-snippet__labels"><!----></div><!----><div class="tm-article-body tm-article-snippet__lead">

Sorry if this is a very simple, silly, or even duplicate question (I have not found anything related to the topic). I feel more confident in R than in Python, but when I need to parse something, I go for Python :)

  • Is the page URL public? Commented Sep 4, 2021 at 11:20
  • @cruisepandey Yes, here it is: habr.com/ru/company/dododev/blog/page6 Commented Sep 4, 2021 at 11:21
  • Update: habr.com/ru/company/dododev/blog/page3 Commented Sep 4, 2021 at 11:37

1 Answer

To parse the datetime regardless of the class, consider using XPath.

XPath has an or operator:

//*[contains(@class, 'tm-article-snippet__datetime-published') or contains(@class, 'tm-megapost-snippet__datetime-published')]

Here * matches any element. In the HTML above, the article class is on a span, while the megapost class is on the time element itself, so replacing * with span would match only the regular articles. Keep * to cover both.

Try the following:

date_of_post = drv.find_elements_by_xpath("//*[contains(@class, 'tm-article-snippet__datetime-published') or contains(@class, 'tm-megapost-snippet__datetime-published')]")
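If more snippet types turn up later, the same expression can be assembled from a list of class names. This is only a sketch: build_class_xpath is a hypothetical helper, not part of Selenium, and it assumes the class names contain no single quotes.

```python
def build_class_xpath(class_names, tag="*"):
    """Return an XPath matching `tag` elements whose class contains any given name."""
    if not class_names:
        raise ValueError("at least one class name is required")
    # One contains() test per class name, joined with the XPath `or` operator.
    conditions = " or ".join(f"contains(@class, '{name}')" for name in class_names)
    return f"//{tag}[{conditions}]"


xpath = build_class_xpath([
    "tm-article-snippet__datetime-published",
    "tm-megapost-snippet__datetime-published",
])
print(xpath)
```

The resulting string can be passed to drv.find_elements_by_xpath(xpath) in place of the literal above.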

Update 1:

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

date_time = []
xpath = (
    "//*[contains(@class, 'tm-article-snippet__datetime-published')"
    " or contains(@class, 'tm-megapost-snippet__datetime-published')]"
)
for item in drv.find_elements(By.XPATH, xpath):
    # Megaposts: the class is on the <time> element, so the attribute is here.
    value = item.get_attribute("datetime")
    if not value:
        # Regular articles: the class is on a wrapper, so read the child <time>.
        try:
            value = item.find_element(By.XPATH, ".//time").get_attribute("datetime")
        except NoSuchElementException:
            continue  # no <time> child; skip this element
    date_time.append(value)
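The datetime attribute values in the HTML above look like 2018-11-09T14:58:14.000Z. If real datetime objects are needed rather than strings, a sketch along these lines may help; parse_published is a hypothetical helper name, and it assumes every value has exactly this shape, with fractional seconds and a trailing Z meaning UTC.

```python
from datetime import datetime, timezone


def parse_published(value):
    """Parse a value like '2018-11-09T14:58:14.000Z' into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    # The trailing 'Z' marks UTC, so attach that timezone explicitly.
    return parsed.replace(tzinfo=timezone.utc)


# Sample values copied from the two HTML snippets in the question.
samples = ["2018-11-09T14:58:14.000Z", "2018-12-25T11:36:26.000Z"]
published = [parse_published(s) for s in samples]
print(published[0].isoformat())  # 2018-11-09T14:58:14+00:00
```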

14 Comments

Looks great, thank you! But it still does not find the megapost's date, even after replacing * with span.
Are you sure the class name tm-megapost-snippet__datetime-published is correct? I could not find it on the page you linked.
Yes, absolutely! It is on page 3, the very last post.
Try date_text = d.find_element_by_xpath(".//time").text. This only works if d is an element matched by //*[contains(@class, 'tm-article-snippet__datetime-published') or contains(@class, 'tm-megapost-snippet__datetime-published')] and the time element is nested inside it in the HTML DOM.
@rg4s: I have tried to implement this as well; see Update 1 above, which may help you get past this issue.