I'm trying to quickly and efficiently find every occurrence of a small byte pattern (4 bytes) in large binary files (several GB). My current method is as follows:
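Stream stream = File.OpenRead(filepath);
List<long> searchResults = new List<long>(); // The results as offsets within the file
int searchPosition = 0; // Tracks how much of the pattern has been matched so far
int[] searchPattern = { 0x00, 0x01, 0x02, 0x03 }; // The pattern to search for
while (true) // Loop until we reach the end of the file
{
    var latestbyte = stream.ReadByte();
    if (latestbyte == -1) break; // We have reached the end of the file
    if (latestbyte == searchPattern[searchPosition])
    {
        searchPosition++;
        if (searchPosition == searchPattern.Length)
        {
            searchResults.Add(stream.Position); // Position just past the end of the match
            searchPosition = 0; // Reset, otherwise the next byte indexes past the pattern
        }
    }
    else
    {
        // Restart, but the current byte may itself begin a new match.
        // This is enough as long as the first pattern byte does not reappear later in the pattern.
        searchPosition = (latestbyte == searchPattern[0]) ? 1 : 0;
    }
}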
It's slow, and seems quite inefficient (3-4 seconds for a small 174MB file, 35 seconds for a 3GB one).
How can I improve the performance?
I looked into Boyer-Moore, but is it really worth it, considering the pattern I'm looking for is only 4 bytes?
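As a point of comparison for the byte-at-a-time loop, here is a minimal sketch of a block-based scan. It has not been benchmarked here. The class `PatternScanner`, the method `FindAll`, and the 1 MB default buffer size are hypothetical names and values chosen for illustration. It assumes a runtime where `Span<T>` and `MemoryExtensions.IndexOf` are available (.NET Core 2.1 or later, or the System.Memory package). Unlike the loop above, which records `stream.Position` after a match, this sketch records the offset where each match starts.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class PatternScanner
{
    // Hypothetical helper (not from the original post): returns the start offset
    // of every occurrence of `pattern` in `stream`, reading in large blocks.
    public static List<long> FindAll(Stream stream, byte[] pattern, int bufferSize = 1 << 20)
    {
        if (pattern.Length == 0) throw new ArgumentException("Pattern must not be empty.");

        var results = new List<long>();
        int overlap = pattern.Length - 1;   // bytes carried over so matches spanning two blocks are seen
        var buffer = new byte[bufferSize + overlap];
        long bufferStart = 0;               // file offset of buffer[0]
        int carried = 0;                    // carried-over bytes currently at the front of the buffer

        while (true)
        {
            int read = stream.Read(buffer, carried, buffer.Length - carried);
            if (read == 0) break;           // end of stream
            int valid = carried + read;

            ReadOnlySpan<byte> window = buffer.AsSpan(0, valid);
            ReadOnlySpan<byte> needle = pattern;
            int from = 0;
            while (true)
            {
                int hit = window.Slice(from).IndexOf(needle);
                if (hit < 0) break;
                results.Add(bufferStart + from + hit);
                from += hit + 1;            // advance by one so overlapping matches are also reported
            }

            // Move the last `overlap` bytes to the front for the next iteration.
            carried = Math.Min(overlap, valid);
            Array.Copy(buffer, valid - carried, buffer, 0, carried);
            bufferStart += valid - carried;
        }
        return results;
    }
}
```

With these assumed names, the call would be `PatternScanner.FindAll(File.OpenRead(filepath), new byte[] { 0x00, 0x01, 0x02, 0x03 })`. Carrying over `pattern.Length - 1` bytes between reads is what lets a match that straddles two blocks be found without reporting any match twice. Whether this is actually faster than `ReadByte`, or than Boyer-Moore, for a 4-byte pattern would need to be measured on the real files.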