A file format is a particular way to encode information for storage in a computer file.
Since a disk drive, or indeed any computer storage, can store only bits, the computer must have some way of converting information to 0s and 1s and vice-versa. There are different kinds of formats for different kinds of information. Within any format type, e.g., word processor documents, there will typically be several different formats. Sometimes these formats compete with each other.
It is sometimes possible to cause a program to read a file encoded in one format as if it were encoded in another format. For example, one can play a Microsoft Word document as if it were a song by using a music-playing program that deals in "headerless" audio files. The result does not sound very musical, however. This is so because a sensible arrangement of bits in one format is almost always nonsensical in another.
Using file formats without a publicly available specification can be costly. Learning how the format works will require either reverse engineering it from a reference implementation or acquiring the specification document for a fee from the format developers. This second approach is possible only when there is a specification document, and typically requires the signing of a non-disclosure agreement. Both strategies require significant time, money, or both. Therefore, as a general rule, file formats with publicly available specifications are supported by a large number of programs, while non-public formats are supported by only a few programs.
Patent law, rather than copyright, is more often used to protect a file format. Although patents for file formats are not directly permitted under US law, some formats require the encoding of data with patented algorithms. For example, the GIF file format requires the use of a patented algorithm, and although initially the patent owner did not enforce it, they later began collecting fees for use of the algorithm. This has resulted in a significant decrease in the use of GIFs, and is partly responsible for the development of the alternative PNG format. However, the patent expired in the US in mid-2003, worldwide in mid-2004; algorithms are themselves not currently patentable under European law.
Of course, most modern operating systems, and individual applications, need to use all of these approaches to process various files, at least to be able to read 'foreign' file formats, if not work with them completely.
One advantage of this approach is that the system can easily be tricked into treating a file as a different format simply by renaming it—an HTML file can, for instance, be easily treated as plain text by renaming it from filename.html to filename.txt. Although this strategy was useful to expert users who could easily understand and manipulate this information, it was frequently confusing to less technical users, who might accidentally make a file unusable (or 'lose' it) by renaming it incorrectly. This led more recent operating system shells, such as Windows 95 and Mac OS X, to hide the extension when displaying lists of recognized files. This separates the user from the complete filename, preventing the accidental changing of a file type, while allowing expert users to still retain the original functionality through enabling the displaying of file extensions.
An alternative method, often associated with Unix and its derivatives, is to store a "magic number" inside the file itself. Originally, this term was used for a specific set of 2-byte identifiers at the beginning of a file, but since any undecoded binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification. GIF images, for instance, always begin with the ASCII representation of either GIF87a or GIF89a, depending upon the standard to which they adhere. Many file types, most especially plain-text files, are harder to spot by this method. HTML files, for example, might begin with the string <html> (which is not case sensitive), or an appropriate document type definition that starts with <!DOCTYPE, or, for XHTML, the XML identifier which begins with <?xml. The files could also begin with any random text or several empty lines, but still be usable HTML.
This approach offers better guarantees that the format will be identified correctly, and can often determine more precise information about the file. This is only useful, however, if the interface used to access the files allows the user to easily manipulate any file in a variety of ways—as opposed to double clicking automatically doing the "right" thing; it is therefore more often associated with command line interfaces than graphical ones. Since reliable "magic number" tests can be fairly complex, and each file must be tested against every possibility in the "magic file", this approach is also relatively inefficient, especially for displaying large lists of files (in contrast, filename and metadata based methods need check only one piece of data, and match it against a sorted index). And, as with the example of HTML, some filetypes just don't lend themselves to recognition in this way. It is, however, the best way for a program to check if a file it has been told to process is of the correct format: while the file's name or metadata may be altered indepently of its content, failing a well-designed magic number test is a pretty sure sign that the file is either corrupt or of the wrong type.
So-called shebang lines in script files are a special case of magic numbers. Here, the magic number is human-readable text that identifies a specific command interpreter and options to be passed to the command interpreter.
This approach keeps the metadata separate from both the main data and the name, but is also less portable than either file extensions or "magic numbers", since the format has to be converted from filesystem to filesystem. While this is also true to an extent with filename extensions — for instance, for compatibility with MS-DOS's three character limit — most forms of storage have a roughly equivalent definition of a file's data and name, but may have varying or no representation of further metadata.
Note that zip files or archive files solve the problem of handling metadata. A utiliy program collects multiple files together along with metadata about each file and the folders/directories they came from all within one new file (e.g. a zip file with extension .zip). The new file is also compressed and possibly encrypted, but now is transmissible as a single ascii/text file across operating systems by ftp systems or attached to email. At the destination, it must be unzipped by a compatible utility to be useful, but the problems of transmission are solved this way.
Format de fitxer | Formát souboru | Dateiformat | Fitxategi formatu | Format de données | Formato di file | Fájlformátum | Bestandsformaat | ファイルフォーマット | Format | Формат файла | Tiedostomuodot | 文件格式
This article is licensed under the GNU Free Documentation License.
It uses material from the
"File format".
Home Page • arts • business • computers • games • health • hospitals • home • kids & teens • news • physicians • recreation• reference • regional • science • shopping • society • sports • world