Author Topic: File hash  (Read 8971 times)

Offline robyx

  • Newcomer
  • Posts: 47
File hash
« on: 2020 February 19 13:15:42 »
Hi all,

I am investigating a way to optimize WBPP so that it skips a step when the source/destination files have not changed. For that I would rely on a fast-to-compute hash value that changes whenever the file content changes (I don't want to rely on the file's modification date, since a file can be moved or copied without any change to its content).

Is there anything available and efficient in PJSR that returns such a hash given a file path?
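
To make the skip-if-unchanged idea concrete, here is a minimal sketch in plain JavaScript. It assumes the hashes from the previous run were saved as an object mapping file path to hash string; `canSkipStep` and that layout are hypothetical names for illustration, not part of WBPP or PJSR.

```javascript
// Hypothetical helper: a step can be skipped only if the set of files and
// every content hash match what was recorded on the previous run.
// Both arguments are plain objects of the form { filePath: hashString }.
function canSkipStep(previousHashes, currentHashes) {
  const previousPaths = Object.keys(previousHashes);
  const currentPaths = Object.keys(currentHashes);

  // A file was added or removed since the last run.
  if (previousPaths.length !== currentPaths.length) return false;

  // Every current file must be known and carry an identical hash.
  return currentPaths.every(
    (path) => previousHashes[path] === currentHashes[path]
  );
}
```

Because the comparison looks only at content hashes, a copy that changes a file's modification date but not its bytes would still count as unchanged.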

Thanks,
Robyx

Offline Juan Conejero

  • PTeam Member
  • PixInsight Jedi Grand Master
  • Posts: 7111
    • http://pixinsight.com/
Re: File hash
« Reply #1 on: 2020 February 19 14:07:11 »
Indeed:

Code:
#include <pjsr/CryptographicHash.jsh>

function main()
{
   let SHA1 = new CryptographicHash( CryptographicHash_SHA1 );
   let data = File.readFile( "/path/to/foo.bar" );
   console.writeln( SHA1.hash( data ).toHex() );
}

main();
Juan Conejero
PixInsight Development Team
http://pixinsight.com/

Offline robyx

  • Newcomer
  • Posts: 47
Re: File hash
« Reply #2 on: 2020 February 19 14:49:11 »
Thanks Juan,

this works, but it means that I need to read all of the file data and regenerate the hashes every time I want to compare against the hashes previously generated for the same set of files. That's time-consuming, and I guess it will not be as fast as I would like it to be.

I was hoping for a more efficient solution that leverages file system features, or in any case one that avoids reading the whole file each time I need to generate the hash (maybe by storing the hash in the files?).

But I guess there is no reliable cross-platform solution other than that, right?

Offline Juan Conejero

  • PTeam Member
  • PixInsight Jedi Grand Master
  • Posts: 7111
    • http://pixinsight.com/
Re: File hash
« Reply #3 on: 2020 February 19 15:06:47 »
Hi Roberto,

I can share the method I use for PixInsight's File Explorer, which has proven to be very efficient and secure. This is the member function that computes a cache hash (in C++ but pretty easy to implement in JavaScript):

Code:
IsoString FileExplorerCache::Hash( const String& filePath )
{
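   const fsize_type blockSize = 256*1024;            // size of each sampled block: 256 KiB
   const fsize_type halfBlockSize = blockSize >> 1;  // used to center the two middle blocks
   const fsize_type dataSize = 4*blockSize;          // total sampled data: 1 MiB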

   try
   {
      File file = File::OpenFileForReading( filePath );
      fsize_type n = file.Size();
      ByteArray data;
      if ( n > dataSize )
      {
         data = ByteArray( dataSize + sizeof( fsize_type ) );
         file.Read( data.Begin(), blockSize );
         file.Seek( n/3 - halfBlockSize, SeekMode::FromBegin );
         file.Read( data.At( blockSize ), blockSize );
         file.Seek( 2*n/3 - halfBlockSize, SeekMode::FromBegin );
         file.Read( data.At( 2*blockSize ), blockSize );
         file.Seek( n - blockSize, SeekMode::FromBegin );
         file.Read( data.At( 3*blockSize ), blockSize );
         memcpy( data.At( dataSize ), &n, sizeof( fsize_type ) );
      }
      else
      {
         data = ByteArray( n );
         file.Read( data.Begin(), n );
      }

      file.Close();
      return IsoString::ToHex( SHA1().Hash( data ) );
   }
   catch ( ... )
   {
      // Do not propagate filesystem exceptions from here; return an empty hash instead.
      return IsoString();
   }
}

For files of 1 MiB or less, the function computes the SHA-1 digest of the entire file. For larger files, it computes the SHA-1 digest of 1 MiB of file data read from 4 blocks of 256 KiB each, distributed uniformly: one at the start of the file, one centered at one third of its length, one centered at two thirds, and one at the end. The last 8 bytes of the hashed buffer are set equal to the 64-bit file size, which makes the computed hash also depend on the exact file size. The probability that two different image files generate the same hash is virtually zero for all practical cases. That has not happened so far, AFAIK.
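
As a rough illustration of the same sampling scheme in JavaScript, here is a sketch written against Node.js (`fs` and `crypto`), not against PJSR, whose file API is not shown in this thread. The names `sampledFileHash` and `readFully` are made up for the example, and the file size is appended as a little-endian 64-bit integer, which is an assumption about the byte order that `memcpy` produces on typical desktop hardware. Treat it as a sketch that is not guaranteed to reproduce the digests computed by the C++ function.

```javascript
// Node.js sketch (not PJSR) of the sampled SHA-1 scheme described above.
const crypto = require("crypto");
const fs = require("fs");

const BLOCK_SIZE = 256 * 1024;      // 256 KiB per sampled block
const HALF_BLOCK = BLOCK_SIZE >> 1; // centers the two middle blocks
const DATA_SIZE = 4 * BLOCK_SIZE;   // 1 MiB of sampled data in total

// Read exactly `length` bytes at `filePos`, looping over short reads.
function readFully(fd, buffer, bufferOffset, length, filePos) {
  let done = 0;
  while (done < length) {
    const got = fs.readSync(fd, buffer, bufferOffset + done, length - done, filePos + done);
    if (got === 0) throw new Error("unexpected end of file");
    done += got;
  }
}

// Returns a 40-character hex digest, or "" if the file cannot be read.
function sampledFileHash(filePath) {
  let fd;
  try {
    fd = fs.openSync(filePath, "r");
    const n = fs.fstatSync(fd).size;
    let data;
    if (n > DATA_SIZE) {
      // Four blocks: start, around 1/3, around 2/3, and end of the file.
      const positions = [
        0,
        Math.floor(n / 3) - HALF_BLOCK,
        Math.floor((2 * n) / 3) - HALF_BLOCK,
        n - BLOCK_SIZE,
      ];
      data = Buffer.alloc(DATA_SIZE + 8);
      positions.forEach((pos, i) => readFully(fd, data, i * BLOCK_SIZE, BLOCK_SIZE, pos));
      // Assumption: file size stored as a little-endian 64-bit integer.
      data.writeBigInt64LE(BigInt(n), DATA_SIZE);
    } else {
      // Small file: hash the whole content.
      data = Buffer.alloc(n);
      readFully(fd, data, 0, n, 0);
    }
    return crypto.createHash("sha1").update(data).digest("hex");
  } catch (e) {
    // Mirror the C++ version: report failure as an empty hash.
    return "";
  } finally {
    if (fd !== undefined) fs.closeSync(fd);
  }
}
```

Note the trade-off that comes with sampling: for a file larger than 1 MiB, a modification that keeps the size and touches only bytes outside the four sampled blocks leaves the hash unchanged.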

Let me know if you need help to implement this in JavaScript, in case you decide to use the same method.
Juan Conejero
PixInsight Development Team
http://pixinsight.com/

Offline robyx

  • Newcomer
  • Posts: 47
Re: File hash
« Reply #4 on: 2020 February 19 15:30:51 »
Hi Juan,

That makes sense. I will implement this solution, thanks a lot.
