Thursday, April 20, 2006

Turning the Bookshelf into a Library

The Premise: Changing Hundreds of Word Documents into Thousands of Webpages

One of the steps in turning the Ananda Bookshelf into a website is converting all the books, articles, etc. into a format suitable for online use. The "old-style" Bookshelf that Satyaki created uses Microsoft Word documents, which look like this:


"*+ ! ! ! $ #" is not foul language – those are the markers that tell the Bookshelf where a new chapter or section starts. They, and other markers, are used in every Bookshelf chapter, article, etc. For the web, though, we can't use them, so we have to clean them up.

We also have to split up the Word documents. Yogananda's Autobiography of a Yogi is 48 chapters long, and the old Bookshelf, behind the scenes, used just one Word file. For the Ananda Library we'll need at least 48 webpage files instead.

Introducing Mamata (Pronounced "Mamta")

This is where Mamata comes in. Mamata, from Nepal via Washington D.C., is one of the newest members of IT Services. She's extremely sharp, and for months now has been working on generating financial reports for Ananda's fundraising department and retreat center. Mamata also played a big part in getting the new website for The Expanding Light meditation and yoga retreat online.


Mamata uses a program called Word Cleaner to convert Word documents into webpages. It does a good job, but even after using it there are still hundreds of thousands of little bits of text to clean up.

So, How to Split and Clean Up Thousands of Webpages?

Automatically, using a bit of computer esoterica called "regular expressions." With regular expressions you can say, for example, "Clean up a link or a title, but only if it has *+ ! ! ! $ # inside of it," or, "Remove all links that point to footnotes if the names of those footnotes are '_ftn00' through '_ftn99.'"

These are all leftovers from the Word documents that we won't need in the webpages. (Real footnotes have different names.)
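To give a feel for what those two instructions might look like in practice, here is a minimal sketch in Python. It is not the actual set of expressions used on the Bookshelf files: the pattern names, the function names, and the guess that the marker shows up inside a <title> tag and that footnote links look like <a href="#_ftn07"> are all assumptions made for illustration.

```python
import re

# Assumption: leftover footnote links look like <a href="#_ftn07">...</a>,
# with exactly two digits after "_ftn" ("_ftn00" through "_ftn99").
FOOTNOTE_LINK = re.compile(
    r'<a\b[^>]*\bhref="#_ftn\d{2}"[^>]*>.*?</a>',
    re.IGNORECASE | re.DOTALL,
)

# The Bookshelf marker, escaped so its symbols are matched literally.
MARKER = re.escape("*+ ! ! ! $ #")

# Assumption: the marker can appear inside a <title> element.
MARKED_TITLE = re.compile(r"<title>([^<]*?)\s*" + MARKER + r"\s*([^<]*)</title>")


def remove_footnote_links(html):
    """Delete links that point to "_ftn00" through "_ftn99"; leave other links alone."""
    return FOOTNOTE_LINK.sub("", html)


def clean_title(html):
    """Strip the marker from a title, but only if the title contains it."""
    def without_marker(match):
        text = (match.group(1) + " " + match.group(2)).strip()
        return "<title>" + text + "</title>"
    return MARKED_TITLE.sub(without_marker, html)


sample = '<p>Some text<a href="#_ftn07">[7]</a> continues. <a href="#note1">Real note</a></p>'
cleaned = remove_footnote_links(sample)

marked = "<title>*+ ! ! ! $ # Chapter 1</title>"
plain = "<title>Chapter 2</title>"
```

In this sketch the link to "#note1" survives because its name does not fit the "_ftn" plus two digits pattern, and a title without the marker passes through clean_title unchanged.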

Regular expressions that clean up Bookshelf webpages look like this:


Mamata, already knowledgeable about HTML, the language of webpages, is learning regular expressions. She'll use them not only to clean the webpages but to split them up, with an "expression" that essentially says: "Make a new page every time you find a new chapter heading."
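The "new page at every chapter heading" idea can be sketched roughly like this in Python. It assumes, purely for illustration, that each chapter heading is an <h1> tag in the converted webpage; the real files may mark chapters differently, and the function name is made up.

```python
import re

# Assumption: every chapter begins with an <h1> heading.
# The lookahead (?=...) finds the spot just before each heading
# without consuming it, so the heading stays with its chapter.
CHAPTER_START = re.compile(r"(?=<h1\b)", re.IGNORECASE)


def split_into_chapters(html):
    """Return one chunk of HTML per chapter heading found."""
    pieces = CHAPTER_START.split(html)
    # Drop empty pieces, such as the one before the very first heading.
    return [piece.strip() for piece in pieces if piece.strip()]


book = "<h1>Chapter 1</h1><p>First.</p><h1>Chapter 2</h1><p>Second.</p>"
chapters = split_into_chapters(book)
```

Each item in chapters could then be saved as its own webpage file. Note that any text before the first heading would come back as its own piece, so a real script would need to decide what to do with front matter.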

Final Steps, then It's Online

After that, three things happen:
  • We put the webpage files of the books, articles, etc. into folders
  • We let Scott know
  • Scott diligently updates a few "sitemap" files
Then we can have a working, passworded, "beta" version of the bookshelf part of the site online for testing purposes.
