Parse XML data and update values of each node's children

Question

The code below iterates over the data nodes of an XML file, selected with an XPath expression, and updates each node's output value by applying the regular expression stored in its rule child node. The XML is included at the bottom.

Are there better alternatives to this approach? Would using LINQ be a good approach?

using System;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.XPath;

namespace XMLParser
{
    class Program
    {
        static void Main()
        {
            string ocrString = "";
            string rule = "";
            string output = "";
            string dataNodeIDValue = "";
            string dataNodeIDName = "";
            string xpathStr = "";
            Match match;
            int groupInt = 0;

            string filename = "C:\\Users\\name\\train\\dev\\offer\\TestParsing.xml";
            XmlDocument xmlDoc = new XmlDocument();
            xmlDoc.Load(filename);
            XmlElement root = xmlDoc.DocumentElement;
            XmlNodeList nodes = root.SelectNodes("//offer/data");
            XPathNavigator xnav = xmlDoc.CreateNavigator(); 
            
            // Read in all 'data' nodes and perform functions
            foreach (XmlNode node in nodes)
            {
                // Set to 0 so regex matches first match unless otherwise specified
                groupInt = 0;
                // Cycle through inner nodes of main node and pull in values
                foreach (XmlNode xmlNode in node.ChildNodes)
                {
                    switch (xmlNode.Name)
                    {
                        case "ocrstring":
                            ocrString = xmlNode.InnerText;
                            break;
                        case "rule":
                            rule = xmlNode.InnerText;
                            break;
                        case "group":
                            //groupInt = xmlNode.InnerText;
                            if (Int32.TryParse(xmlNode.InnerText, out groupInt)) { groupInt = Int32.Parse(xmlNode.InnerText); }
                            break;
                    }
                }

                // No rule given because ocr works effectively
                if (rule.Length < 2) { continue; }
               
                // If ocrstring is empty try finding text in pdf
                if (String.IsNullOrEmpty(ocrString) | String.IsNullOrWhiteSpace(ocrString)) // This is to iterate through pdf
                {
                    // TODO: Implement over full text doc <- ignore for now
                }
                else // This is to use XML string
                {
                    var regex = new Regex(rule);
                    match = regex.Match(ocrString);
                }

                //if (match.Groups.Count > 0) { };
                if (groupInt > 0 & match.Groups.Count > 0)
                {
                    output = match.Groups[groupInt].Value.ToString();
                }
                else
                {
                    output = match.Value.ToString().Trim();
                }

                dataNodeIDValue =  node.Attributes[0].Value;
                dataNodeIDName = node.Attributes[0].Name;
                xpathStr = "//offer/data[@" + dataNodeIDName + "='" + dataNodeIDValue + "']/output";

                if (String.IsNullOrEmpty(output))
                {
                    root.SelectSingleNode(xpathStr).InnerText = "NA";
                }
                else
                {
                    root.SelectSingleNode(xpathStr).InnerText = output;
                }
                
                xmlDoc.Save(filename);  // Save XML session back to file
            }
            Console.WriteLine("Exiting...");
        }
    }
}

XML Data

<?xml version="1.0" encoding="utf-8"?>
<offer>
  <data id="Salary">
    <ocrstring>which is equal to $40,000.00 if working 40 hours per week</ocrstring>
    <rule>.*(([+-]?\$[0-9]{1,3}(?:,?[0-9]{3})*\.[0-9]{2}))</rule>
    <group>1</group>
    <output></output>
  </data>
  <data id="DefaultWeeklyHours">
    <ocrstring></ocrstring>
    <rule><![CDATA["(?<=working).*?(?=hours)"]]></rule>
    <output></output>
  </data>
  <data id="RelocationAttachment">
    <ocrstring>LongWindingRoad222</ocrstring>
    <rule>Regex2</rule>
    <output></output>
  </data>
</offer>

Linq2XML (or XLINQ for short) would shift your code from imperative to more declarative. Is it a better approach? It depends. It might reduce the lines of code by being more concise. This can increase (or decrease) readability depending on the reader's skills and the LINQ expression's complexity. Will it be more performant? It depends. It might be faster, and it might be easier to move it into the world of parallelism. What are you really looking for?
– Peter Csala, Commented Jun 23, 2020 at 6:16
@PeterCsala Just suggestions, and to hear what is more performant. Your response answers my inquiry.
– William Humphries, Commented Jun 23, 2020 at 11:32

Peter Csala · Accepted Answer · 2020-06-23 13:39:52Z

If you defined a model like this:

public class Data
{
  public string Id { get; set; }
  public string OCR { get; set; }
  public string Rule { get; set; }
  public string Output { get; set; }
}

then you could easily separate your ETL job's different stages.

For example, the Extract phase would look like this:

// 'xml' is assumed to hold the file's content as a string
XDocument doc = XDocument.Parse(xml);
// XML names are case-sensitive, so the element name must be "data"
var parsedData = from data in doc.Descendants("data")
                 select new Data()
                 {
                      Id = (string)data.Attribute("id"),
                      OCR = (string)data.Element("ocrstring"),
                      Rule = (string)data.Element("rule")
                 };

In your Transform phase you could perform the regex based transformations. The biggest gain here is that it is free from any input or output format. It is just pure business logic.

And finally in your Load phase you could simply serialize the whole (modified) data collection. Or if it is too large, then create logic to find the appropriate element (based on the Id property) and overwrite only the output child element.
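
The three phases could be sketched roughly as follows. This is only an illustration under some assumptions: the class name OfferPipeline, the method names, and the extra Group property (mirroring the optional group element in the question's XML) are hypothetical and not part of the model above. The "NA" fallback follows the question's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public class Data
{
    public string Id { get; set; }
    public string OCR { get; set; }
    public string Rule { get; set; }
    public int Group { get; set; }     // assumption: added to carry the optional <group> value
    public string Output { get; set; }
}

public static class OfferPipeline      // hypothetical name
{
    // Extract: the only place that knows about the input format.
    public static List<Data> Extract(XDocument doc) =>
        doc.Descendants("data")
           .Select(d => new Data
           {
               Id = (string)d.Attribute("id"),
               OCR = (string)d.Element("ocrstring"),
               Rule = (string)d.Element("rule"),
               Group = (int?)d.Element("group") ?? 0   // a missing <group> means the whole match
           })
           .ToList();

    // Transform: pure business logic, no XML involved.
    public static void Transform(Data item)
    {
        // Nothing to do without both a rule and a text to apply it to.
        if (string.IsNullOrWhiteSpace(item.Rule) || string.IsNullOrWhiteSpace(item.OCR))
            return;

        var match = Regex.Match(item.OCR, item.Rule);
        item.Output = match.Success ? match.Groups[item.Group].Value.Trim() : "NA";
    }

    // Load: find each element by its id and overwrite only the <output> child.
    public static void Load(XDocument doc, IEnumerable<Data> items)
    {
        foreach (var item in items.Where(i => i.Output != null))
        {
            var element = doc.Descendants("data")
                             .FirstOrDefault(d => (string)d.Attribute("id") == item.Id);
            element?.SetElementValue("output", item.Output);
        }
    }
}
```

A caller would then run Extract, call Transform on each item, pass the items to Load, and save the document. Because Transform only touches Data objects, it can be unit tested without any XML at all.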

What you have gained here is a pretty nice separation of concerns.

Your read logic is not mixed with the processing logic.
Because of the separation, it is easier to spot where the bottleneck of the application is (if any).
The input format can be changed without affecting the processing logic.
Pipeline-like processing can be introduced to improve performance by invoking processing right after a Data object has been populated from the source.
Many other advantages. :)
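
As a rough illustration of the pipeline-like processing mentioned in the list above, C# iterator methods (yield return) hand each item to the next stage as soon as it has been read. The names StreamingSketch, Extract and Transform, and the log parameter (there only to make the interleaving visible), are assumptions for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class StreamingSketch    // hypothetical name
{
    // Lazily yields one (Id, Ocr, Rule) tuple per <data> element.
    public static IEnumerable<(string Id, string Ocr, string Rule)> Extract(
        XDocument doc, List<string> log)
    {
        foreach (var d in doc.Descendants("data"))
        {
            var id = (string)d.Attribute("id");
            log.Add("extract " + id);
            yield return (id, (string)d.Element("ocrstring"), (string)d.Element("rule"));
        }
    }

    // Consumes items one at a time, so each item is transformed
    // right after it was extracted rather than after the whole file was read.
    public static IEnumerable<(string Id, string Output)> Transform(
        IEnumerable<(string Id, string Ocr, string Rule)> items, List<string> log)
    {
        foreach (var item in items)
        {
            log.Add("transform " + item.Id);
            var usable = !string.IsNullOrEmpty(item.Rule) && !string.IsNullOrEmpty(item.Ocr);
            var match = usable ? Regex.Match(item.Ocr, item.Rule) : Match.Empty;
            yield return (item.Id, match.Success ? match.Value : "NA");
        }
    }
}
```

Enumerating Transform(Extract(doc, log), log) with a foreach loop drives both stages together, so the log alternates between extract and transform entries for each item instead of listing all extractions first.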

Johnbot · Accepted Answer · 2020-06-23 13:51:19Z

I find using XDocument to be a lot simpler:

var fileName = @"C:\Users\name\train\dev\offer\TestParsing.xml";
var document = XDocument.Load(fileName);
var offerData = document.Descendants("offer").Descendants("data");

foreach (var d in offerData)
{   
    var rule = (string)d.Element("rule");
    // A missing <rule> element casts to null, so check that before reading Length
    if (rule == null || rule.Length < 2)
    {
        continue;
    }

    var ocrString = (string)d.Element("ocrstring");
    if(string.IsNullOrWhiteSpace(ocrString))
    {
        continue;
    }
    
    var match = Regex.Match(ocrString, rule);
    var result = "NA";
    if (match.Success)
    {
        var group = (int?)d.Element("group");
        result = match.Groups[group.GetValueOrDefault(0)].Value;
    }
    
    d.SetElementValue("output", result);
}

document.Save(fileName);

The logic is no longer obscured by the XML parsing and can be discerned more easily. All the parsing is done by just casting the elements to the desired type.
