My application needs to fetch resources (images, CSS, fonts, etc.) from given URLs and cache them locally based on the Cache-Control/ETag headers returned with the resource.
I’m using Apache HttpClient 5 with the cache module:
<dependency>
    <groupId>org.apache.httpcomponents.client5</groupId>
    <artifactId>httpclient5</artifactId>
    <version>5.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents.client5</groupId>
    <artifactId>httpclient5-cache</artifactId>
    <version>5.3.1</version>
</dependency>
Apache HttpClient successfully caches resources locally and fetches a new version once the old one has expired. However, old versions of the resources are not removed automatically from the local cache.
Here’s my test implementation:
@Component
public class ResourceCache {

    private final CloseableHttpClient client;

    public ResourceCache() {
        CacheConfig cacheConfig = CacheConfig.custom()
                .setMaxCacheEntries(2)
                .setMaxObjectSize(10 * 1024 * 1024)
                .setSharedCache(false)
                .setHeuristicCachingEnabled(true)
                .setHeuristicDefaultLifetime(TimeValue.ofMinutes(2))
                .build();

        ManagedHttpCacheStorage storage = new ManagedHttpCacheStorage(cacheConfig);

        client = CachingHttpClients.custom()
                .setCacheDir(new File("/my_cache"))
                .setCacheConfig(cacheConfig)
                .setHttpCacheStorage(storage)
                .build();

        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(storage::cleanResources, 30, 30, TimeUnit.SECONDS);
    }

    public void fetchAndCache() {
        List<String> resources = List.of(
                "https://img.shields.io/npm/v/react.svg", // cache-control: max-age=300, s-maxage=300; no ETag
                "https://jpeg.org/images/jpeg-home.jpg", // ETag only
                "http://httpbin.org/image/png" // no cache-control, no ETag
        );

        for (String resource : resources) {
            HttpGet request = new HttpGet(resource);

            HttpClientResponseHandler<byte[]> handler = response -> {
                if (response.getEntity() != null) {
                    return response.getEntity().getContent().readAllBytes();
                }
                return new byte[0];
            };

            try {
                byte[] data = client.execute(request, handler);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
Observed behavior:
https://img.shields.io/npm/v/react.svg → Has Cache-Control: max-age=300. After it expires, a new version is fetched, but the old one still exists. (Also, for some reason, this specific resource seems to be fetched twice each time.)
https://jpeg.org/images/jpeg-home.jpg → Has an ETag. A new version is not fetched (since the ETag hasn’t changed). Expected.
http://httpbin.org/image/png → No Cache-Control/ETag. Apache HttpClient applies heuristic caching. After expiration, a new version is fetched, but again, the old version remains in the cache.
I’m creating my own ManagedHttpCacheStorage so that I can call cleanResources() on it periodically from a scheduled job. However, cleanResources() does not remove files from the cache folder.
The documentation says that this type of storage can deallocate resources. Does this mean it only removes references from memory but leaves the files on disk?
Is there a way to automatically clean expired resources from disk with Apache HttpClient 5? Or do I need to implement my own cleanup logic?
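If nothing built in exists, the fallback I have in mind is roughly the sketch below. It is not part of HttpClient: CacheDirSweeper and its sweep method are hypothetical names of my own, it uses only java.nio, and it assumes that a file's last-modified time is a good enough signal that the file is stale. It knows nothing about the cache's internal bookkeeping, so it could delete a file that a live cache entry still references.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.stream.Stream;

// Hypothetical helper, not an HttpClient class: removes old files from a cache directory.
final class CacheDirSweeper {

    private CacheDirSweeper() {
    }

    /**
     * Deletes regular files directly inside cacheDir whose last-modified time
     * is older than maxAge. Returns the number of files deleted.
     * Assumption: file age alone decides staleness; cache metadata is not consulted.
     */
    static int sweep(Path cacheDir, Duration maxAge) throws IOException {
        if (!Files.isDirectory(cacheDir)) {
            return 0; // nothing to clean if the directory is missing
        }
        Instant cutoff = Instant.now().minus(maxAge);
        int deleted = 0;
        try (Stream<Path> files = Files.list(cacheDir)) {
            for (Path file : (Iterable<Path>) files::iterator) {
                boolean stale = Files.isRegularFile(file)
                        && Files.getLastModifiedTime(file).toInstant().isBefore(cutoff);
                if (stale && Files.deleteIfExists(file)) {
                    deleted++;
                }
            }
        }
        return deleted;
    }
}
```

I would call CacheDirSweeper.sweep(Path.of("/my_cache"), Duration.ofMinutes(10)) from the same scheduled executor that currently calls cleanResources(). The 10-minute threshold is an arbitrary value chosen to exceed the longest lifetime in my test (max-age=300), and I'm not sure that deleting files underneath the storage like this is safe.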
Additionally, I noticed that if I manually delete the cache folder while the application is still running, resources are no longer cached and are fetched from the web each time.
