As a digital archivist, I need to keep my ear to the ground for new file
formats, particularly those billed as being suitable
for long-term preservation. This is why I attended a DPC event today on the new
version of the pdf/a standard (version 3). With pdf/a the clue is in the name:
the ‘a’ stands for ‘archive’.
The original pdf/a file format was the source of endless debate in my
previous job at the Archaeology Data Service (see summary blog post). We eventually embraced it as
an acceptable preservation format for documents deposited with us in standard pdf format. The self-contained nature of pdf/a also makes it an excellent
format for providing online access to reports, with far greater longevity
than standard pdf files, some of which were starting to produce error messages
ten years after deposit. There is a related blog post on this issue, a problem
I was grappling with in my last couple of months working at the ADS (not the cause of my leaving, I might add!).
Today’s
event was very useful, giving me enough background information about the new
format to feel I could now hold my own in a discussion of its pros and cons.
The main difference between pdf/a 3 and previous versions of the standard is the ability
to include embedded objects. You can, for example, include the raw data that sits behind one of the graphs in your report, the original MS Word document that
you created your pdf/a file from, or an alternative version of the report (an
audio file, for example). The relationship of the embedded object to the pdf/a
file is recorded in the associated metadata (whether it is data, source or alternative).
It is easy to see the benefits of this. However, the objects that you embed can be in any
format and may therefore not be suitable for preservation. This
creates a headache for digital archivists, as any file deposited in an
archive in pdf/a 3 format would have to be assessed for the presence of
embedded files, and a separate check on both their value and longevity would need to
be made. It was stated in the briefing today that material with long-term
archival value should not be embedded in a pdf/a file. This would be a
difficult concept to express to our donors and depositors when negotiating
submission of their data into our archives.
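
To give a feel for the kind of check described above, here is a minimal Python sketch that looks for signs of embedded files in a pdf. It is a heuristic under stated assumptions, not a validator: it searches the raw bytes for the `/EmbeddedFiles` and `/AFRelationship` names that the PDF specification uses for attachments and their relationships, so it will miss anything stored inside compressed object streams. The function name `scan_for_embedded_files` is my own, purely for illustration.

```python
import re

# Name objects that signal attachments in a pdf file.
# /EmbeddedFiles is the name-tree key for attached files;
# /AFRelationship records how an associated file relates to the document.
EMBEDDED_MARKER = b"/EmbeddedFiles"
RELATIONSHIP_RE = re.compile(rb"/AFRelationship\s*/(\w+)")


def scan_for_embedded_files(data: bytes) -> dict:
    """Heuristically report attachment markers found in raw pdf bytes.

    Assumption: the relevant dictionaries are stored uncompressed.
    A negative result means "nothing found in plain view", not
    "no embedded files".
    """
    relationships = sorted(
        {match.decode("ascii") for match in RELATIONSHIP_RE.findall(data)}
    )
    return {
        "has_embedded_files": EMBEDDED_MARKER in data,
        "relationships": relationships,
    }
```

In practice you would pass it the contents of a deposited file, for example the result of `open(path, "rb").read()`, and treat a positive result as a prompt to extract and assess each embedded object separately with a proper pdf tool.
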
I cannot currently imagine a situation where I would want to embed data within a pdf
file in a preservation context. Keeping each element as a separate file, with
metadata that explains the relationships between them, would always be my
preferred option.
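
As a rough illustration of that preferred option, the sketch below writes a small JSON manifest that sits alongside a report and records how each separate file relates to it. The relationship labels mirror the three mentioned above for pdf/a 3 (data, source, alternative); the class name, function name and file names are hypothetical, and a real archive would more likely use an established metadata schema.

```python
import json
from dataclasses import dataclass, asdict

# Relationship labels borrowed from those described for pdf/a 3.
ALLOWED_RELATIONSHIPS = {"data", "source", "alternative"}


@dataclass
class RelatedFile:
    path: str          # location of the separate file
    relationship: str  # how it relates to the report

    def __post_init__(self):
        # Reject labels outside the agreed vocabulary.
        if self.relationship not in ALLOWED_RELATIONSHIPS:
            raise ValueError(f"unknown relationship: {self.relationship}")


def build_manifest(report: str, related: list) -> str:
    """Return a JSON manifest linking a report to its related files."""
    manifest = {
        "report": report,
        "related_files": [asdict(item) for item in related],
    }
    return json.dumps(manifest, indent=2)


if __name__ == "__main__":
    print(build_manifest(
        "report.pdf",
        [
            RelatedFile("graph_data.csv", "data"),
            RelatedFile("report.docx", "source"),
        ],
    ))
```

Each file can then be assessed, migrated and preserved on its own terms, while the manifest keeps the relationships explicit.
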
The only use case I could envisage right now for pdf/a version 3 would be as a future
dissemination option, allowing a user to download a report with associated data
(as embedded files) as a single bundle. Whether this would have any major
benefits over the use of zip files I am not sure. Before this could happen, tools for
creating, reading and editing files of this nature would need to be widely and
freely available, allowing pdf/a 3 to become mainstream. I know from my previous job that educating depositors about
the benefits of creating pdf/a files over standard pdf files was a long process,
and my concern is that this new standard might give us even more explaining to
do. Rather than advising depositors that pdf/a is simply ‘A Good Thing’, we may
need to add caveats about which version they should use, or advice on the
format of embedded objects. Confusion may well ensue!