Given an arbitrary String, how can I extract the XML document(s) from it?

The String might hold no complete document (only a fragment), exactly one complete document, or several documents.

Is there a template or tool that solves this problem?

Later edit: consider the case where the XML data is received over TCP/IP.

3 Answers


Multiple documents are a challenge. I'd wrap the String in an additional root element; this at least turns a sequence of complete documents into a single well-formed XML document:

 String xml = "<my-own-root-element>" + getString() + "</my-own-root-element>";

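This is just a start. Of course, forget about XML schemas and DOCTYPE declarations. Different character encodings may be a challenge, and you will probably have to filter out the <?xml ... ?> declarations, because a parser rejects them anywhere except at the very beginning of the document.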
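A minimal Java sketch of the wrapping idea, assuming every document in the String is complete and none of them carries a DOCTYPE. The names `WrappedXmlSplitter` and `split` are made up for illustration; only the JAXP/DOM calls are standard library.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class WrappedXmlSplitter {

    // Synthetic root name taken from the answer; any name not used by the payload works.
    private static final String ROOT = "my-own-root-element";

    /** Returns the top-level elements of the raw string, one per original document. */
    static List<Element> split(String raw) {
        // Strip XML declarations: they are only legal at the very start of a document.
        String body = raw.replaceAll("<\\?xml\\s[^?]*\\?>", "");
        String wrapped = "<" + ROOT + ">" + body + "</" + ROOT + ">";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(wrapped)));

            List<Element> roots = new ArrayList<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip whitespace and other text that sat between the documents.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    roots.add((Element) child);
                }
            }
            return roots;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // An incomplete or broken document makes the whole wrapped string unparseable.
            throw new IllegalArgumentException("Input is not a sequence of complete XML documents", e);
        }
    }
}
```

Note the limitation: a single truncated document anywhere in the String makes the whole parse fail, so this only covers the "one or multiple complete documents" cases.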

Sign up to request clarification or add additional context in comments.

1 Comment

I had done something similar (adding an additional root) and then used StAX to parse what I had hoped would be a valid XML document

I know of no existing solution that can handle broken XML documents automatically. XML is a very strict standard: any well-formedness error is fatal, and a conforming parser has to stop at the first one. You are on your own.

What you can try is looking at the code of XML editors. They must be able to handle corrupt documents, but I doubt that any of them can handle things like missing start tags.
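To illustrate the strictness, here is a small Java sketch that asks the standard JAXP parser whether a candidate String is one well-formed document. `WellFormedCheck` and `isWellFormed` are hypothetical names; the point is only that a truncated document and a String with two roots are both rejected outright.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {

    /** Returns true only if the candidate is exactly one well-formed XML document. */
    static boolean isWellFormed(String candidate) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser from logging to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(candidate)));
            return true;
        } catch (SAXException e) {
            // Any well-formedness violation ends up here.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A check like this can tell you that a String is not a single document, but not where one document ends and the next begins.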


This is my C# version of it; I hope it gives some direction. I'm using it for TCP/IP communication. T stands for a generic type whose name matches the root element of each document.

using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public List<T> ParseMultipleDocumentsByType<T>(string documents)
    {
        var cleanParsedDocuments = new List<T>();
        // Assumes each document starts with an XML declaration and ends with
        // the closing tag of a root element named after T.
        var endingString = "</" + typeof(T).Name + ">";

        while (true)
        {
            var startingPoint = documents.IndexOf("<?xml", StringComparison.Ordinal);
            if (startingPoint < 0)
            {
                break; // no further document starts in the string
            }

            var endingIndex = documents.IndexOf(endingString, startingPoint, StringComparison.Ordinal);
            if (endingIndex < 0)
            {
                break; // the last document is incomplete; stop here
            }

            var endingPoint = endingIndex + endingString.Length;
            var document = documents.Substring(startingPoint, endingPoint - startingPoint);
            var singleDoc = (T)XmlDeserializeFromString(document, typeof(T));
            cleanParsedDocuments.Add(singleDoc);

            // Cut the parsed document out so the next iteration finds the following one.
            documents = documents.Remove(startingPoint, endingPoint - startingPoint);
        }

        return cleanParsedDocuments;
    }

    public static object XmlDeserializeFromString(string objectData, Type type)
    {
        var serializer = new XmlSerializer(type);

        using (TextReader reader = new StringReader(objectData))
        {
            return serializer.Deserialize(reader);
        }
    }
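For the TCP/IP case, the same idea can be turned into a small buffer that keeps the incomplete tail between reads. This is a rough Java sketch, not a general solution: `XmlFrameBuffer` and `feed` are invented names, and it assumes the root element name is known in advance, is never nested inside itself, is not a prefix of another element name, and is always closed with an explicit end tag. Any XML declaration in front of the root element is simply dropped.

```java
import java.util.ArrayList;
import java.util.List;

final class XmlFrameBuffer {

    private final StringBuilder pending = new StringBuilder();
    private final String openTag;   // e.g. "<order"
    private final String closeTag;  // e.g. "</order>"

    XmlFrameBuffer(String rootName) {
        this.openTag = "<" + rootName;
        this.closeTag = "</" + rootName + ">";
    }

    /** Appends a received chunk and returns every document that is now complete. */
    List<String> feed(String chunk) {
        pending.append(chunk);
        List<String> complete = new ArrayList<>();
        while (true) {
            int start = pending.indexOf(openTag);
            if (start < 0) {
                break; // no document has started yet
            }
            int end = pending.indexOf(closeTag, start);
            if (end < 0) {
                break; // document is still incomplete; wait for the next chunk
            }
            end += closeTag.length();
            complete.add(pending.substring(start, end));
            // Discard the extracted document and anything that preceded it.
            pending.delete(0, end);
        }
        return complete;
    }
}
```

Each call to `feed` returns zero, one, or several documents, which maps onto the three cases in the question; each returned String can then be handed to a real parser.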
