From Print to Digital: PDF, HTML, or XML – What Should Academic Journals Publish Today?
A practical guide for journal editors navigating the format transition in scholarly publishing
Over the past year, our team at FullTextCreator has been building a system that converts academic article PDFs and Word documents into full-text HTML and JATS XML. We built it because the demand was there — and it keeps growing.
Journals want to publish in multiple formats. Indexing databases increasingly require structured content. Authors expect their work to be discoverable, readable, and citable across platforms. And yet, when we process hundreds of articles from dozens of journals, we keep seeing the same reality: most academic content is still locked inside PDFs designed for a printer that doesn’t exist anymore.
This post is about why that matters — and what to do about it.
If you’re looking for practical guidelines on how to prepare a well-structured academic PDF, we’ve already written a detailed guide on creating machine-readable academic PDFs. Consider that a companion piece to this one.
PDF: A Tool Designed for Print
The PDF format was created in the early 1990s to solve a very specific problem: how do you ensure a document looks exactly the same on every screen and every printer, regardless of operating system or software? The answer was to freeze the layout — to treat the page like a photograph.
For print publishing, this was revolutionary. For digital-first academic publishing, it’s increasingly a limitation.
A PDF is a visual container. It knows where every word is on the page. It does not know that a bold centered line is a section heading, that a superscript number is a citation, or that a block of indented text is an abstract. Machines can see a PDF, but they cannot easily understand it.
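To make the "see but not understand" point concrete, here is a minimal sketch of the guesswork a machine has to do when all it has is positioned text. The `Span` record, the `guess_headings` helper, and the sample page are hypothetical and not taken from any real PDF library; they only stand in for what a typical text extractor returns.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A run of text as a PDF exposes it: characters plus position and font."""
    text: str
    x: float
    y: float
    font_size: float
    bold: bool


def guess_headings(spans, body_size=10.0):
    """Heuristic: bold text set larger than the body font is probably a heading."""
    return [s.text for s in spans if s.bold and s.font_size > body_size]


# Hypothetical page content. Nothing in the data says which span is a heading.
page = [
    Span("Introduction", 72, 700, 12.0, True),
    Span("Prior work has shown that ...", 72, 680, 10.0, False),
    Span("Warning:", 72, 660, 12.0, True),  # bold emphasis, not a heading
]

print(guess_headings(page))
```

The heuristic flags both "Introduction" and "Warning:", because font weight and size are the only clues available. A structured format states the role of each element explicitly, so no guessing is needed.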
This matters more than ever. The academic ecosystem now includes:
- Search engines that need to parse and rank your content
- Indexing databases (PubMed, Scopus, Web of Science) that need to extract metadata
- AI-powered research tools (Semantic Scholar, ResearchRabbit, Perplexity, Elicit) that need to read and summarize full text
- Reference managers (Zotero, Mendeley) that need to extract citations
- Accessibility tools that need to serve content to readers with visual impairments
- Translation pipelines that need clean, structured text
PDF can serve some of these purposes — but only if the PDF itself is carefully structured. In most cases, it serves none of them particularly well.
We Are in a Transition Period
Let’s be honest: PDFs are not going away. And for good reasons.
Many indexing requirements still mandate a PDF full-text submission. TR Dizin, for example, requires a PDF. Several EBSCO and DOAJ policies are built around PDF availability. Authors are accustomed to downloading and sharing PDFs. Peer reviewers expect them. Citation managers import them. Institutions archive them.
More importantly: the habits of a generation of academics are built around the PDF paradigm. You cannot simply replace it overnight, even if you wanted to.
What we are seeing — across PKP’s Open Journal Systems community, among Scopus-indexed journals, and among publishers preparing for PubMed Central submission — is a gradual shift toward multi-format publishing. The PDF stays, but it is no longer alone. Alongside it:
- A full-text HTML version optimized for browser reading
- A JATS XML file for databases and long-term interoperability
- Sometimes an ePub for mobile reading apps
This is not a radical proposal. It is what Nature, PLOS, eLife, and virtually every other major academic publisher already do. The question for smaller and mid-sized journals is: when do we start, and how?
The Case for Full-Text HTML
HTML is how the web was built. It is what browsers read natively. And it has specific advantages for academic content that are easy to underestimate.
Readability without a download. A reader on a mobile phone can open an HTML article directly in the browser. No app needed, no 5 MB file to download, no PDF viewer struggling to reflow text for a small screen. On mobile, which now accounts for a significant portion of academic reading, HTML has a clear advantage.
Better indexing by search engines. Google Scholar, Bing, and general web crawlers can extract headings, keywords, abstracts, author names, and citations more reliably from structured HTML than from a PDF, where the same information has to be inferred from the layout. Bibliographic meta tags in the page head remove much of the remaining guesswork.
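One common way to give crawlers unambiguous metadata is the Highwire Press style `citation_*` meta tags described in Google Scholar's inclusion guidelines. The sketch below generates them from an article record. The record layout, the `scholar_meta_tags` helper, and the example URL are assumptions made for illustration.

```python
from html import escape


def scholar_meta_tags(article: dict) -> str:
    """Return <meta> tags for the <head> of an article landing page."""
    tags = [("citation_title", article["title"])]
    # One citation_author tag per author, in byline order.
    tags += [("citation_author", name) for name in article["authors"]]
    tags.append(("citation_publication_date", article["date"]))
    tags.append(("citation_journal_title", article["journal"]))
    if article.get("pdf_url"):
        tags.append(("citation_pdf_url", article["pdf_url"]))
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in tags
    )


# Hypothetical article record.
example = {
    "title": "PDF & HTML in Scholarly Publishing",
    "authors": ["Ada Author", "Ben Writer"],
    "date": "2024/03/15",
    "journal": "Journal of Example Studies",
    "pdf_url": "https://journal.example.org/article/42/article.pdf",
}

print(scholar_meta_tags(example))
```

The tags sit in the HTML head and are invisible to readers, but they tell an indexer exactly which string is the title and which are the authors.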
Accessibility. Screen readers, browser zoom functions, contrast settings, and dyslexia-friendly fonts all work seamlessly with HTML. PDF accessibility requires additional tagging work that most journals skip entirely.
Linking and interactivity. HTML enables clickable reference lists, internal section links, embedded figures with captions, supplementary material expansion, and DOI links that work inline. A PDF is static; HTML is alive.
Compatibility with AI research tools. Tools like Elicit, Consensus, Semantic Scholar, and even general-purpose LLMs process text more accurately when it is clean HTML than when it has to be extracted from a PDF. As AI-assisted literature review becomes standard in academic workflows, HTML-published content is likely to have an advantage in discoverability and citation.
The Case for JATS XML
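JATS (Journal Article Tag Suite) is the NISO standard (ANSI/NISO Z39.96) XML vocabulary for journal articles. PubMed Central uses it for full-text deposits, and publishers, aggregators, and archives use it for structured article exchange. If HTML is the format for readers, JATS XML is the format for machines and systems.
PubMed Central requires it. If your journal is applying for or maintaining PMC inclusion, full-text XML in JATS (or one of the older NLM DTDs it grew out of) is mandatory. There is no workaround. PubMed citation records are a separate matter: they come either from the PMC deposit or from a citation file in PubMed's own XML format.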
Metadata richness that PDF cannot match. A JATS XML file can encode author ORCID IDs, funding information with Funder Registry identifiers, contributor roles (CRediT taxonomy), structured reference lists with DOIs, figure descriptions, license terms, and dozens of other metadata fields — all in a machine-readable, standardized way. This is the difference between having metadata and communicating it to every system that touches your content.
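As a rough illustration of how this metadata is encoded, the sketch below builds a small JATS front-matter fragment with Python's standard library. The element and attribute names (`article-id`, `contrib-id`, `history`, `date`) are JATS names, but the fragment is deliberately incomplete and has not been validated against the JATS DTD. The `build_front` helper and the sample values are hypothetical, and `ET.indent` needs Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_front(meta: dict) -> ET.Element:
    """Build a partial JATS <article> containing only front-matter metadata."""
    article = ET.Element("article", {"article-type": "research-article"})
    front = ET.SubElement(article, "front")
    am = ET.SubElement(front, "article-meta")

    ET.SubElement(am, "article-id", {"pub-id-type": "doi"}).text = meta["doi"]
    title_group = ET.SubElement(am, "title-group")
    ET.SubElement(title_group, "article-title").text = meta["title"]

    contrib_group = ET.SubElement(am, "contrib-group")
    for author in meta["authors"]:
        contrib = ET.SubElement(contrib_group, "contrib", {"contrib-type": "author"})
        if author.get("orcid"):
            orcid = ET.SubElement(contrib, "contrib-id", {"contrib-id-type": "orcid"})
            orcid.text = "https://orcid.org/" + author["orcid"]
        name = ET.SubElement(contrib, "name")
        ET.SubElement(name, "surname").text = author["surname"]
        ET.SubElement(name, "given-names").text = author["given"]

    history = ET.SubElement(am, "history")
    for date_type in ("received", "accepted"):
        year, month, day = meta[date_type].split("-")  # expects YYYY-MM-DD
        date = ET.SubElement(history, "date", {"date-type": date_type})
        ET.SubElement(date, "day").text = day
        ET.SubElement(date, "month").text = month
        ET.SubElement(date, "year").text = year
    return article


# Sample values: a test-prefix DOI and ORCID's documented example iD.
sample = {
    "doi": "10.5555/12345678",
    "title": "A Sample Article",
    "authors": [
        {"surname": "Carberry", "given": "Josiah", "orcid": "0000-0002-1825-0097"},
        {"surname": "Writer", "given": "Bea"},
    ],
    "received": "2024-01-10",
    "accepted": "2024-03-02",
}

article = build_front(sample)
ET.indent(article)  # pretty-print; available from Python 3.9
print(ET.tostring(article, encoding="unicode"))
```

Every value carries a label that says what it is, which is what lets a database pick out the ORCID iD or the acceptance date without parsing the page layout.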
Portability and long-term interoperability. JATS XML describes what each part of an article is, not how it should look. From a single valid XML file, you can generate HTML, PDF, ePub, or any future format. It is the master record from which everything else can be derived. This is why major publishers maintain JATS XML as their canonical content format.
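The "one source, many outputs" idea can be sketched in a few lines. The example below turns a tiny JATS-like document into HTML using only the standard library. It is a simplification under stated assumptions: the sample document is hypothetical, only top-level sections, titles, and paragraphs are handled, and inline markup such as italics is flattened to plain text. Production pipelines usually rely on XSLT stylesheets or a dedicated converter instead.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, heavily reduced JATS document.
JATS_SAMPLE = """<article>
  <front><article-meta><title-group>
    <article-title>A Sample Article</article-title>
  </title-group></article-meta></front>
  <body>
    <sec id="s1"><title>Introduction</title>
      <p>Structured text can be <italic>re-rendered</italic> freely.</p>
    </sec>
    <sec id="s2"><title>Methods</title>
      <p>One source, many outputs.</p>
    </sec>
  </body>
</article>"""


def plain_text(element) -> str:
    """Collect all text inside an element and normalise whitespace."""
    return " ".join("".join(element.itertext()).split())


def jats_to_html(jats_xml: str) -> str:
    """Render the title, top-level sections, and paragraphs as HTML."""
    root = ET.fromstring(jats_xml)
    title = plain_text(root.find(".//article-title"))
    out = [f"<h1>{escape(title)}</h1>"]
    for sec in root.findall("./body/sec"):
        sec_id = escape(sec.get("id", ""), quote=True)
        heading = plain_text(sec.find("title"))
        out.append(f'<section id="{sec_id}">')
        out.append(f"<h2>{escape(heading)}</h2>")
        for p in sec.findall("p"):
            out.append(f"<p>{escape(plain_text(p))}</p>")
        out.append("</section>")
    return "\n".join(out)


print(jats_to_html(JATS_SAMPLE))
```

Because the section ids survive the conversion, the HTML can offer internal links to each section with no extra work. The same source could feed a PDF or ePub renderer in the same way.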
Future-proofing for AI and data mining. Academic data mining (the large-scale analysis of scientific literature for trends, systematic reviews, and AI training) works most reliably on structured full text such as XML. Journals that publish JATS XML contribute to and benefit from this ecosystem. Journals that publish only PDF are much harder for it to use.
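Crossref metadata deposit. When you register a DOI with Crossref, the metadata you deposit determines how well your article is matched in citation indexes. Crossref has its own deposit schema, but a complete JATS file already contains most of what it asks for: authors, ORCID iDs, funding, license terms, references, and abstracts, which Crossref accepts in JATS markup. The deposit can therefore be generated from the XML instead of being typed in by hand, which makes it richer and less error-prone.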
What About Metadata? The Invisible Foundation
One pattern we see repeatedly in our conversion work: journals that are eager to publish HTML or XML, but whose source PDFs are missing the very metadata that makes structured publishing possible.
No ORCID IDs for authors. No clear received/accepted dates. No DOI on the first page. References without DOIs. Keywords buried in the abstract text without a clear label.
Structured formats are only as good as the data that goes into them. Before investing in HTML or XML publishing, it is worth auditing your current PDF production workflow against a basic metadata checklist. We wrote that checklist here: Creating Machine-Readable Academic PDFs.
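A first-pass audit of this kind is easy to automate once the metadata sits in any structured form. The sketch below checks one article record for the gaps listed above. The record layout and the `audit_metadata` helper are hypothetical; they show the shape of the check and do not reproduce the checklist itself.

```python
def audit_metadata(record: dict) -> list[str]:
    """Return a list of human-readable problems found in one article record."""
    problems = []
    if not record.get("doi"):
        problems.append("article DOI missing")
    for author in record.get("authors", []):
        if not author.get("orcid"):
            problems.append(f"no ORCID for {author['name']}")
    for field in ("received", "accepted"):
        if not record.get(field):
            problems.append(f"{field} date missing")
    if not record.get("keywords"):
        problems.append("no labelled keyword list")
    refs = record.get("references", [])
    without_doi = sum(1 for ref in refs if not ref.get("doi"))
    if without_doi:
        problems.append(f"{without_doi} of {len(refs)} references lack a DOI")
    return problems


# Hypothetical record, as it might come out of a production spreadsheet.
record = {
    "doi": "10.5555/12345678",
    "authors": [
        {"name": "A. Author", "orcid": "0000-0002-1825-0097"},
        {"name": "B. Writer"},
    ],
    "received": "2024-01-10",
    "accepted": None,
    "keywords": [],
    "references": [{"doi": "10.5555/abc"}, {}, {}],
}

for problem in audit_metadata(record):
    print("-", problem)
```

Running a check like this over a whole issue before conversion shows which gaps have to be fixed at the source, since no converter can recover an ORCID iD or an acceptance date that was never recorded.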
A Practical Path Forward
For most journal editors reading this, the question is not philosophical — it is operational. You have a small team, a limited budget, and a backlog of articles to process. Here is a realistic framework:
Step 1: Fix the PDF first.
A well-structured PDF with complete metadata is the foundation. If your PDF production is inconsistent, HTML and XML outputs will inherit those problems. Use the checklist linked above.
Step 2: Add HTML for current issues.
Full-text HTML for new articles is achievable with existing tools. OJS supports HTML galley uploads natively. Even a basic, clean HTML version is a significant upgrade over PDF-only publishing.
Step 3: Pursue JATS XML for indexing goals.
If your journal is targeting PubMed, pursuing higher Scopus standing, or preparing a Plan S compliance statement, JATS XML is the path. This is more complex, but the infrastructure exists to support it.
Step 4: Don’t do it manually.
Manual conversion of PDFs to HTML or JATS XML is slow, expensive, and error-prone. Automation — whether through your own typesetting workflow, a conversion service, or a platform like FullTextCreator — is the only scalable approach.
Why We Built FullTextCreator
This is not a theoretical discussion for us. We built FullTextCreator precisely because we saw how much friction existed between a journal’s content and its full digital potential.
The service accepts PDF or Word document uploads and produces clean full-text HTML and JATS XML — structured, metadata-rich, and ready for OJS galley upload or database submission. The demand since launch has been consistent and growing, particularly from journals in Turkey, Eastern Europe, and Southeast Asia that are navigating indexing requirements for the first time or preparing PMC applications.
The problem we keep solving is the same: a journal has years of well-written, peer-reviewed content, but it is locked in PDFs that no database can properly read. Format conversion is the bridge between the content that exists and the visibility it deserves.
Conclusion
The academic publishing format question is not PDF versus HTML versus XML. It is PDF plus HTML plus XML — a multi-format strategy that serves readers, machines, indexing databases, and AI tools simultaneously.
PDF remains essential for the foreseeable future. But journals that publish only in PDF are leaving discoverability, accessibility, and indexing potential on the table. The transition is already underway at every level of the scholarly publishing ecosystem. The question is not whether to make the shift, but how to do it efficiently.
For journals serious about visibility, indexing, and future-proofing their content: the format upgrade is not optional. It is infrastructure.
Looking for hands-on help with format conversion? Visit fulltextcreator.com or explore our OJS hosting and services for academic journal support.
The post From Print to Digital: PDF, HTML, or XML – What Should Academic Journals Publish Today? first appeared on OPEN JOURNAL SYSTEM SERVICES.














