
The script below parses the content of some markdown files in a directory. It extracts the separate components of each file and places them into a dictionary, and then converts the dictionary to JSON.

from datetime import datetime
import glob
import json
import os
import re


def entries():
    entries = glob.glob("/home/user/temp/wiki/" + "*.md")
    regexp = r"^\s*(?:-{3})(.*?)(?:-{3})\s*(.+)$"

    for entry in entries:
        with open(entry, "r", encoding="utf-8") as file:
            file_content = file.read()

            try:
                # Regular expression to use:
                match = re.compile(regexp, re.S | re.M)

                # Find matches:
                result = match.search(file_content)

                # Convert frontmatter into dictionary:
                frontmatter = dict(re.findall(r"(.*): (.*)", result.group(1)))

                # Convert individual tags to list items:
                frontmatter["tags"] = frontmatter["tags"][1:-1].split(",")

                # Add content to dict:
                frontmatter["content"] = result.group(2)

                # Create JSON object:
                search_index = json.dumps(frontmatter, indent=4, default=str)

                print(search_index)

            except:
                print(f"Error: No YAML frontmatter found in '{entry}'")


entries()

When the script is run, it returns the below output:

{
    "id": "20210131141200",
    "title": "Nulla id feugiat mauris.",
    "date": "2021-01-31 14:12:00",
    "tags": [
        "nulla",
        " id"
    ],
    "content": "Fusce eu pulvinar velit. Praesent vel velit quis risus euismod pulvinar. Vestibulum nisl sapien, scelerisque vitae ornare ut, feugiat at tellus. Sed scelerisque tellus molestie, rhoncus neque eu, condimentum eros. Quisque sapien tellus, volutpat a gravida quis, iaculis et erat. \n\nQuisque porttitor euismod odio ut eleifend. In semper sagittis cursus. Donec iaculis blandit fringilla. Donec lobortis lectus orci, gravida blandit risus fermentum vitae. \n"
}
{
    "id": "20210202144523",
    "title": "Lorem ipsum dolor sit amet",
    "date": "2021-02-02 14:45:23",
    "tags": [
        "lorem",
        " ipsum"
    ],
    "content": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi posuere turpis et mattis vehicula. Quisque consectetur purus auctor varius sagittis. Mauris cursus turpis ac massa luctus bibendum. \n\nDonec eu varius justo. Aliquam vel rutrum urna, in pellentesque mauris. Nulla eu mollis turpis. Duis lacinia laoreet tortor eget laoreet. Aenean vel velit lacus. Duis mollis eros sed ex cursus auctor. Quisque tristique metus nec ex sodales malesuada. In pharetra bibendum turpis vel auctor."
}
{
    "id": "20210201132608",
    "title": "Sed vulputate arcu eu iaculis auctor",
    "date": "2021-02-01 13:26:08",
    "tags": [
        "sed",
        " vulputate",
        " arcu",
        " eu"
    ],
    "content": "Proin ullamcorper massa enim, vel dignissim dui tempus at. Pellentesque nec metus quis massa sodales tempor. Fusce mauris lectus, hendrerit et rhoncus sit amet, aliquam non arcu. Aenean et velit sit amet neque malesuada consequat eu scelerisque magna. Aliquam varius maximus dolor non ullamcorper. Nullam interdum sed dolor eu iaculis.\n\nDuis vel cursus velit. Sed interdum massa nunc, vel aliquam magna placerat in. Vestibulum egestas magna ligula, ut fringilla erat faucibus eu. Phasellus luctus laoreet velit, et imperdiet magna tincidunt et. Nullam vitae diam at arcu faucibus consectetur commodo suscipit magna. Curabitur rhoncus in elit vitae vestibulum. Fusce luctus mattis fringilla. Curabitur feugiat tristique odio. \n"
}

This isn't quite in the format I need it to be. I'm trying to output the JSON exactly as you can see it below, but I'm not having much luck.

{
    "entries": [
        {
            "id": "20210131141200",
            "title": "Nulla id feugiat mauris.",
            "date": "2021-01-31 14:12:00",
            "tags": [
                "nulla",
                " id"
            ],
            "content": "Fusce eu pulvinar velit. Praesent vel velit quis risus euismod pulvinar. Vestibulum nisl sapien, scelerisque vitae ornare ut, feugiat at tellus. Sed scelerisque tellus molestie, rhoncus neque eu, condimentum eros. Quisque sapien tellus, volutpat a gravida quis, iaculis et erat. \n\nQuisque porttitor euismod odio ut eleifend. In semper sagittis cursus. Donec iaculis blandit fringilla. Donec lobortis lectus orci, gravida blandit risus fermentum vitae. \n"
        },
        {
            "id": "20210202144523",
            "title": "Lorem ipsum dolor sit amet",
            "date": "2021-02-02 14:45:23",
            "tags": [
                "lorem",
                " ipsum"
            ],
            "content": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi posuere turpis et mattis vehicula. Quisque consectetur purus auctor varius sagittis. Mauris cursus turpis ac massa luctus bibendum. \n\nDonec eu varius justo. Aliquam vel rutrum urna, in pellentesque mauris. Nulla eu mollis turpis. Duis lacinia laoreet tortor eget laoreet. Aenean vel velit lacus. Duis mollis eros sed ex cursus auctor. Quisque tristique metus nec ex sodales malesuada. In pharetra bibendum turpis vel auctor."
        },
        {
            "id": "20210201132608",
            "title": "Sed vulputate arcu eu iaculis auctor",
            "date": "2021-02-01 13:26:08",
            "tags": [
                "sed",
                " vulputate",
                " arcu",
                " eu"
            ],
            "content": "Proin ullamcorper massa enim, vel dignissim dui tempus at. Pellentesque nec metus quis massa sodales tempor. Fusce mauris lectus, hendrerit et rhoncus sit amet, aliquam non arcu. Aenean et velit sit amet neque malesuada consequat eu scelerisque magna. Aliquam varius maximus dolor non ullamcorper. Nullam interdum sed dolor eu iaculis.\n\nDuis vel cursus velit. Sed interdum massa nunc, vel aliquam magna placerat in. Vestibulum egestas magna ligula, ut fringilla erat faucibus eu. Phasellus luctus laoreet velit, et imperdiet magna tincidunt et. Nullam vitae diam at arcu faucibus consectetur commodo suscipit magna. Curabitur rhoncus in elit vitae vestibulum. Fusce luctus mattis fringilla. Curabitur feugiat tristique odio. \n"
        }
    ]
}

Everything I've tried (for example, search_index = json.dumps({"entries": frontmatter}, indent=4, default=str)) gets close, but because the call sits inside the loop, it outputs "entries": once per file instead of wrapping all the objects in a single list, as you can see below:

{
    "entries": {
        "id": "20210131141200",
        "title": "Nulla id feugiat mauris.",
        "date": "2021-01-31 14:12:00",
        "tags": [
            "nulla",
            " id"
        ],
        "content": "Fusce eu pulvinar velit. Praesent vel velit quis risus euismod pulvinar. Vestibulum nisl sapien, scelerisque vitae ornare ut, feugiat at tellus. Sed scelerisque tellus molestie, rhoncus neque eu, condimentum eros. Quisque sapien tellus, volutpat a gravida quis, iaculis et erat. \n\nQuisque porttitor euismod odio ut eleifend. In semper sagittis cursus. Donec iaculis blandit fringilla. Donec lobortis lectus orci, gravida blandit risus fermentum vitae. \n"
    }
}
{
    "entries": {
        "id": "20210202144523",
        "title": "Lorem ipsum dolor sit amet",
        "date": "2021-02-02 14:45:23",
        "tags": [
            "lorem",
            " ipsum"
        ],
        "content": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi posuere turpis et mattis vehicula. Quisque consectetur purus auctor varius sagittis. Mauris cursus turpis ac massa luctus bibendum. \n\nDonec eu varius justo. Aliquam vel rutrum urna, in pellentesque mauris. Nulla eu mollis turpis. Duis lacinia laoreet tortor eget laoreet. Aenean vel velit lacus. Duis mollis eros sed ex cursus auctor. Quisque tristique metus nec ex sodales malesuada. In pharetra bibendum turpis vel auctor."
    }
}
{
    "entries": {
        "id": "20210201132608",
        "title": "Sed vulputate arcu eu iaculis auctor",
        "date": "2021-02-01 13:26:08",
        "tags": [
            "sed",
            " vulputate",
            " arcu",
            " eu"
        ],
        "content": "Proin ullamcorper massa enim, vel dignissim dui tempus at. Pellentesque nec metus quis massa sodales tempor. Fusce mauris lectus, hendrerit et rhoncus sit amet, aliquam non arcu. Aenean et velit sit amet neque malesuada consequat eu scelerisque magna. Aliquam varius maximus dolor non ullamcorper. Nullam interdum sed dolor eu iaculis.\n\nDuis vel cursus velit. Sed interdum massa nunc, vel aliquam magna placerat in. Vestibulum egestas magna ligula, ut fringilla erat faucibus eu. Phasellus luctus laoreet velit, et imperdiet magna tincidunt et. Nullam vitae diam at arcu faucibus consectetur commodo suscipit magna. Curabitur rhoncus in elit vitae vestibulum. Fusce luctus mattis fringilla. Curabitur feugiat tristique odio. \n"
    }
}

For reference, the Markdown files are structured as below:

---
id: 20210131141200
title: Nulla id feugiat mauris.
date: 2021-01-31 14:12:00
tags: [nulla, id]
---

Fusce eu pulvinar velit. Praesent vel velit quis risus euismod pulvinar. Vestibulum nisl sapien, scelerisque vitae ornare ut, feugiat at tellus. Sed scelerisque tellus molestie, rhoncus neque eu, condimentum eros. Quisque sapien tellus, volutpat a gravida quis, iaculis et erat. 

Quisque porttitor euismod odio ut eleifend. In semper sagittis cursus. Donec iaculis blandit fringilla. Donec lobortis lectus orci, gravida blandit risus fermentum vitae. 

2 Answers


Rather than serializing each entry inside the loop, gather all the entries in a list first, then convert the whole list to JSON in a single call after the loop.

def entries():
    entries = glob.glob("/home/user/temp/wiki/" + "*.md")
    regexp = r"^\s*(?:-{3})(.*?)(?:-{3})\s*(.+)$"

    # Store entries to dump later
    entry_dicts = []

    for entry in entries:
        with open(entry, "r", encoding="utf-8") as file:
            file_content = file.read()

            try:
                ...  # your code as-is here

                # Do not create the JSON object, instead:
                entry_dicts.append(frontmatter)

            except:
                print(f"Error: No YAML frontmatter found in '{entry}'")

    entries_json = json.dumps({'entries': entry_dicts}, indent=4, default=str)
    print(entries_json)

entries()
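
For completeness, here is a minimal, self-contained sketch of the same idea that runs without the wiki directory. It reuses the regular expression and tag handling from the question; the names `parse_entry`, `build_index`, and `sample` are hypothetical and only serve the illustration, and the guard for a missing `tags` key is an assumption about how such files should be treated.

```python
import json
import re

# Same pattern as in the question: frontmatter between two '---' markers, then the body.
FRONTMATTER_RE = re.compile(r"^\s*(?:-{3})(.*?)(?:-{3})\s*(.+)$", re.S | re.M)


def parse_entry(text):
    """Return frontmatter plus content as a dict, or None if no frontmatter is found."""
    result = FRONTMATTER_RE.search(text)
    if result is None:
        return None
    frontmatter = dict(re.findall(r"(.*): (.*)", result.group(1)))
    # Same tag handling as the question (leading spaces are kept).
    # Assumption: files without a tags line are simply left without a "tags" key.
    if "tags" in frontmatter:
        frontmatter["tags"] = frontmatter["tags"][1:-1].split(",")
    frontmatter["content"] = result.group(2)
    return frontmatter


def build_index(texts):
    """Parse every text, then serialize all entries with one json.dumps call."""
    entry_dicts = []
    for text in texts:
        entry = parse_entry(text)
        if entry is not None:
            entry_dicts.append(entry)
    return json.dumps({"entries": entry_dicts}, indent=4, default=str)


# Hypothetical sample mirroring the file layout shown in the question.
sample = (
    "---\n"
    "id: 20210131141200\n"
    "title: Nulla id feugiat mauris.\n"
    "date: 2021-01-31 14:12:00\n"
    "tags: [nulla, id]\n"
    "---\n"
    "\n"
    "Fusce eu pulvinar velit.\n"
)

print(build_index([sample, "no frontmatter here"]))
```

The key point is that `json.dumps` is called once, after the loop, so the `"entries"` wrapper appears a single time. If the leading spaces in the tags are unwanted, replacing the split with `[tag.strip() for tag in ...split(",")]` should remove them.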

Rather than printing each search_index inside the loop, collect all the results in a single object. Something like:

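def entries():
    results = []

    # Iterate over the files, not over the function itself
    for entry in glob.glob("/home/user/temp/wiki/" + "*.md"):
        result = dict()

        # do work: parse the file at `entry` and fill in result

        results.append(result)

    return results


# Wrap the list once, outside the loop
print(json.dumps({"entries": entries()}, indent=4, default=str))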
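
Returning the list also makes it easy to write the index to a file instead of printing it. The sketch below is only an illustration under assumptions: `collect_entries`, `write_index`, and the file name `search_index.json` are hypothetical, the parsing step is replaced by a placeholder, and a temporary directory stands in for the real wiki folder.

```python
import glob
import json
import os
import tempfile


def collect_entries(directory):
    """Return one dict per Markdown file; real frontmatter parsing is left out."""
    results = []
    for path in sorted(glob.glob(os.path.join(directory, "*.md"))):
        with open(path, "r", encoding="utf-8") as file:
            # Placeholder for the parsing logic from the question.
            results.append({"file": os.path.basename(path), "content": file.read()})
    return results


def write_index(directory, index_path):
    """Wrap the collected entries once and write them as a single JSON document."""
    with open(index_path, "w", encoding="utf-8") as file:
        json.dump({"entries": collect_entries(directory)}, file, indent=4, default=str)


# Demonstration on a throwaway directory instead of the real wiki folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.md", "b.md"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as file:
            file.write(f"body of {name}")

    index_path = os.path.join(tmp, "search_index.json")
    write_index(tmp, index_path)

    with open(index_path, encoding="utf-8") as file:
        loaded = json.load(file)

print(json.dumps(loaded, indent=4))
```

Keeping the collection step separate from the serialization step means the same list can be printed, written to disk, or inspected in tests without changing the loop.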
