
Dunbaratu t1_ixx0n9f wrote

There are two ways programs try to identify the format of a file.

One is to just naively trust the extension. This can sometimes lead to security problems because it causes the OS to open a file in the "wrong" program if the extension doesn't match what the file really is.

Another is to utterly ignore the extension and just read the starting bytes of the file's content. Most binary file formats put an ID code in the first few bytes of content that identifies the file type. The main exception is plain old ASCII text files (or Unicode UTF-8 text files, which for this purpose end up kind of being the same thing as plain old ASCII). In the UNIX world the first two bytes of file content were referred to as the file's "Magic Number", and you could figure out the type just from those two bytes. Later the algorithm to identify a file by its first few bytes got more complex as some formats started using longer byte patterns, so it's not just the first two bytes now. Sometimes it's the first 4 or the first 8.
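To make the "read the starting bytes" idea concrete, here's a minimal sketch in Python. The signature table is just a handful of well-known examples I picked, and the names `SIGNATURES`, `sniff_bytes` and `sniff_file` are made up for illustration. Real tools such as the UNIX `file` command use a far larger rule set.

```python
from typing import Optional

# A few well-known signatures (illustrative, nowhere near complete).
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),  # 8 bytes
    (b"\xff\xd8\xff", "jpeg"),      # 3 bytes
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),
    (b"\x1f\x8b", "gzip"),          # an old-style 2-byte magic number
]


def sniff_bytes(head: bytes) -> Optional[str]:
    """Guess a format from the first bytes of a file, or return None."""
    # WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "webp"
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return None  # plain text and unknown formats end up here


def sniff_file(path: str) -> Optional[str]:
    """Open the file and sniff its first 16 bytes, ignoring the file name."""
    with open(path, "rb") as f:
        return sniff_bytes(f.read(16))
```

Note that the WebP check has to look at bytes 8 to 11, not only the very start, which is one example of why "just the first two bytes" stopped being enough.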

In the Windows world, it wasn't so common to use the "magic number" idea originally, but it tends to use it more nowadays than it used to.

The problem with using the "magic number" method is that, while it may be much more reliable than a file extension that any user can just name wrongly, you cannot see this "magic number" until you open the file and start reading it. When the file is just a directory entry in a folder and you haven't looked inside it yet, you don't have anything to go on other than the extension. It's impractical and slow for a program that's showing a list of files in a folder to open every file one at a time to read the magic number from all of them. So the file lister relies on the file extension to decide what to show you on the screen for the file's icon, and to guess what program should open the file.
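Here's a rough sketch of that cheap path, again in Python: it lists a folder and guesses each file's type from the name alone, without opening anything. The function name `list_with_guessed_types` is made up, and a real file manager uses its own extension table rather than Python's `mimetypes` module, so treat this as an illustration only.

```python
import mimetypes
import os


def list_with_guessed_types(folder):
    """Return (name, guessed MIME type) pairs using only directory entries."""
    rows = []
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file():
                # guess_type looks only at the name; the file is never opened.
                guessed, _encoding = mimetypes.guess_type(entry.name)
                rows.append((entry.name, guessed))
    return sorted(rows)
```

A PNG image that somebody renamed to `a.jpg` would still be reported as `image/jpeg` here, because nothing in this code ever reads the file's content.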

The reason .webp and .jpg both work is that often the same program that can read a .jpg file can also read a .webp file, so once the program is opening the file and reading the content, it (should) no longer care what the filename extension was and just believes the magic ID number in the file's content. When it reads that it goes, "oh, this is webp. Well I know how to display those so I'll read this file as webp and show it" and it completely ignores that the filename claims it's a jpeg file.

The place where the file extension being wrong for the format causes big problems is when the program that can read a file of one type cannot read a file of the other type. Then the extension being wrong causes the system to open it in the wrong program, which can't deal with it.
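Both cases can be sketched with a pretend image viewer that only understands JPEG and WebP. Everything here (`detect_image_format`, `check_file`, the messages) is hypothetical and only meant to show the decision being made, not how any real viewer is written.

```python
import os


def detect_image_format(head: bytes):
    """Identify JPEG or WebP from the leading bytes, else return None."""
    if head.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "webp"
    return None


def check_file(name: str, head: bytes) -> str:
    """Describe what the pretend viewer would do with this file."""
    ext = os.path.splitext(name)[1].lower().lstrip(".")
    # What the file name claims the format is.
    claimed = {"jpg": "jpeg", "jpeg": "jpeg", "webp": "webp"}.get(ext, ext)
    # What the content actually says.
    actual = detect_image_format(head)
    if actual is None:
        return f"{name}: cannot open, content is not a format this viewer knows"
    if actual != claimed:
        return f"{name}: opening as {actual} (extension claims {claimed})"
    return f"{name}: opening as {actual}"
```

A WebP file named `cat.jpg` still opens, because the viewer trusts the content. A PDF named `doc.jpg` gets routed to the viewer by its extension and then fails, since the viewer has no idea what to do with PDF content.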
