I am using Jsoup to parse data from a website. It is a fairly large database with about 75,000 entries spread over 19 categories, so 19 pages, each linking to thousands of entries to parse. The problem is that Jsoup's speed is very inconsistent: sometimes it cannot finish a single page within seconds, while at other times it easily handles multiple pages per second. What exactly causes this inconsistency, which also leads to the infamous "java.net.SocketTimeoutException: Read timed out"? My code:
public class Parser {
    private int counter = 0;
    private FileWriter fw;
    private Elements elements;

    void parse(String category) throws IOException {
        try {
            Document doc = Jsoup.connect("https://fddb.info/db/de/produktgruppen/" + category + "/index.html").get();
            elements = doc.select("a[href^='https://fddb.info/db/de/lebensmittel']");
            File file = new File("Data/" + category + ".txt");
            fw = new FileWriter(file, true);
            writeToFile();
            fw.close();
        } catch (Exception e) {
            System.out.println("Timed out at " + counter);
            writeToFile();
        }
    }

    private void writeToFile() throws IOException {
        try {
            for (int i = counter; i < elements.size(); i++) {
                Element element = elements.get(i);
                Document elementDoc = Jsoup.connect(element.attr("href")).get();
                // Headline
                fw.write(elementDoc.select("#fddb-headline1").text() + "\n");
                // Tags
                Elements tags = elementDoc.select("a[href='https://fddb.info/db/de/lexikon/gesundheitsthemen/index.html']");
                for (Element tag : tags) {
                    if (!tag.text().equals("Hinweis zu Gesundheitsthemen")) {
                        fw.write(tag.text() + "\n");
                    }
                }
                // Nutrition
                Elements nutritions = elementDoc.select("div[style*='padding:2px 4px']");
                for (Element nutrition : nutritions) {
                    fw.write(nutrition.text() + "\n");
                }
                counter++;
            }
        } catch (Exception e) {
            System.out.println("Timed out at " + counter);
            writeToFile();
        }
    }
}
I have already tried to work around the exception by simply calling writeToFile() again from the catch block, resuming at counter. Very hacky, I know, and since it recurses on every timeout it could eventually blow the stack.
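For illustration, the retry-on-timeout idea from my catch block could be expressed as a bounded, iterative retry helper instead of unbounded recursion. This is only a sketch: the helper name `withRetries`, the attempt count, and the backoff values are my own choices, and the flaky task in `main` just simulates a request that times out twice before succeeding (in the real code the task body would be the `Jsoup.connect(...).get()` call, possibly with an explicit `.timeout(...)`).

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetrySketch {

    // Runs the task, retrying up to maxAttempts times with a linear backoff.
    // Rethrows the last exception if every attempt fails.
    static <T> T withRetries(Callable<T> task, int maxAttempts, long backoffMillis) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                Thread.sleep(backoffMillis * attempt); // wait a bit longer each retry
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated flaky fetch: fails twice with a timeout, then succeeds.
        int[] calls = {0};
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("Read timed out");
            }
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The advantage over recursing in the catch block is that the number of attempts is capped and each retry reuses the same stack frame, so a long streak of timeouts fails loudly instead of overflowing the stack.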