Saturday, December 6, 2008

[10] Sanitize XOG XML file before pushing to Clarity

On several occasions while XOG'ing between environments, XOG returns an error because of special characters inside the XML file being pushed. These are mostly characters that XOG doesn't like and that are hard to detect by visual inspection. For quite some time, I have been using a simple but effective method to remove those special characters from XOG file(s).

Step 1 : Open the XOG file in TextPad (or any text editor that supports regular-expression search/replace)
Step 2 : Press F8 to open the Replace window and do a Find/Replace as follows, leaving the replacement blank :-

Find : [^&;:()a-zA-Z0-9" =./><_-~-]

Note : If the search finds a character that should stay in the file, add it to the above expression after &. Be careful with the "Replace All" option, as you may blank out character(s) that are actually needed in the XOG file. Secondly, if you have many files to 'sanitize', load all of those documents in TextPad and set the 'Scope' to 'All documents' in the Find/Replace dialog box.
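
For those who prefer a script over an editor, here is a minimal Python sketch of the same allow-list idea. It is an approximation, not an exact port of the TextPad expression: it assumes '_', '~' and '-' are meant as literal characters, and it adds carriage return and line feed to the allowed set so that line breaks survive. The names build_pattern, sanitize_xog and extra_allowed are made up for this example.

```python
import re

# Allowed characters, taken from the Find expression above. Assumptions: '_', '~'
# and '-' are meant literally, and CR/LF are added so that line breaks survive
# (a negated character class in Python would otherwise match them too).
BASE_ALLOWED = r'&;:()a-zA-Z0-9" =./><_~\r\n'


def build_pattern(extra_allowed=""):
    """Compile a negated class matching every character NOT on the allow-list."""
    # re.escape keeps characters such as ']' or '^' from breaking the class.
    return re.compile("[^" + BASE_ALLOWED + re.escape(extra_allowed) + "-]")


def sanitize_xog(text, extra_allowed=""):
    """Return text with every character outside the allow-list removed."""
    return build_pattern(extra_allowed).sub("", text)


if __name__ == "__main__":
    # The BEL control character (\x07) stands in for an invisible bad character.
    dirty = '<Project name="Alpha\x07 ~1" id="P-1"/>\n'
    print(sanitize_xog(dirty))
```

To keep additional characters, pass them in, for example sanitize_xog(text, extra_allowed="@,"). As with "Replace All", review the result before pushing it, since anything not on the list is silently dropped.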

As per Martti's question/suggestion, I am sharing a method for removing non-English language codes from a XOG file.

Step 1 : In TextPad > Configure > Preferences .. > Editor > Check "Use POSIX Regular Expression Syntax"
Step 2 : Search for .*<nls.*languageCode="([^e].|.[^n]).*\n and replace with blank (this removes entries where the nls tag fits on a single line)
Step 3 : Search for .*<nls.*\n.*languageCode="([^e].|.[^n]).*\n and replace with blank (this removes entries where the nls tag spans 2 lines)
Step 4 : Search for .*<nls.*\n.*\n.*languageCode="([^e].|.[^n]).*\n and replace with blank (this removes entries where the nls tag spans 3 lines)

To test, you can try this sample XOG file. The regular expressions above keep the "en" languageCode; to include/exclude other language codes, change the expressions accordingly. Explaining these expressions would be lengthy, so just leave a comment if you have a specific question or need and I will reply.
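
The three search/replace passes above can be approximated in a single pass with a short Python sketch. It rests on assumptions that may not hold for every XOG output: that each nls entry is a self-closing <nls .../> tag, that attribute values are double-quoted, and that the tag sits on its own line(s). The names keep_languages, NLS_TAG and LANG_ATTR are made up for this example.

```python
import re

# A self-closing <nls .../> tag with double-quoted attributes, possibly spread
# over several lines, plus the indentation before it and the line break after it.
NLS_TAG = re.compile(
    r'[ \t]*<nls\b(?:\s+[\w:.-]+\s*=\s*"[^"]*")*\s*/>[ \t]*\r?\n?'
)
LANG_ATTR = re.compile(r'\blanguageCode\s*=\s*"([^"]*)"')


def keep_languages(text, keep=("en",)):
    """Drop every <nls .../> tag whose languageCode is not listed in keep."""

    def replace(match):
        lang = LANG_ATTR.search(match.group(0))
        if lang and lang.group(1) not in keep:
            return ""  # remove the whole tag, including its line break
        return match.group(0)  # leave kept languages (and odd tags) untouched

    return NLS_TAG.sub(replace, text)


if __name__ == "__main__":
    sample = (
        "<Group>\n"
        '  <nls languageCode="de"\n'
        '       name="Gruppe"/>\n'
        '  <nls languageCode="en" name="Group"/>\n'
        '  <nls description="x" languageCode="fr" name="Groupe"/>\n'
        "</Group>\n"
    )
    print(keep_languages(sample))
```

To keep a national language next to English, pass both codes, for example keep_languages(text, keep=("en", "fi")). Tags without a languageCode attribute are left alone.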

10 comments:

  1. In our setup, the character "~" is used in the project name and we use this for XOG.

  2. Hello Sangeet,

    Would you please indicate me: How would you define the regular expression to allow the nls characters that are not in standard ascii a-z like üåäö

    With regards

    Martti K.

  3. So that should read educate not indicate

    Martti "Typo" K.

  4. Hi Martti,

    You can also try the following (check if it works for you):
    [^&;:()[:alnum:]" =./><_-~-]

    Further, if you want to include the Tab character, you can add \t to the above expression, like
    [^&;:()[:alnum:]\t" =./><_-~-]

    If you just want to allow a few more characters, you can also add them individually, like
    [^&;:()a-zA-Z0-9üåäö" =./><_-~-]

    Likewise, you can allow/disallow any character using regular expressions.

    - Sangeet

  5. Hello Sangeet,

    Tried both with TextPad and they seem to work.
    Thanks.
    Seems to be more consistent than with Notepad2.
    The advantage of Notepad2 is that it comes with some of the Windows and does not require install, just copy.
    Notepad2 seems to have a problem with [:alnum:] and maybe it does not make a difference between a and ä.

    For user and resource data you have to add @ and , to the allowed list.

    With regards

    Martti K.

  6. Hello Sangeet (again)

    Many people have complained that they do not need the 16 supported languages in the XML file.
    The situation occurs when you read something out and want to modify it before writing it back to the same or a different system.

    If you are using just one (=US English) or two (plus a national language), can you simply strip the others? Like the samples, which only have English.

    It used to be easy to do in Excel when the <nls attributes were all on one line. Then you just made a filter for the first characters accordingly, filtered the view and deleted the ones which you did not need, though leaving the en was enough work.

    Now the <nls attributes are many times on two lines and filtering the same way does not work.

    Would you have a tip for that?

    With regards

    Martti K.

  7. Hello again,

    One cleaning problem I have come across is having quotes (") in the attribute values.

    For example, the sample data projects in PMO accelerator have quotes in the description fields in German.
    If you are reading all projects from one system which has the sample data to another and do not clean out the nls attributes, you get an error, and my recollection is that in the worst case that stops the whole write and no records are written until the problem is fixed. I find you can only do it the hard way, because quotes are allowed elsewhere.

    With regards

    Martti K.

  8. Hi Martti,
    I have written some regular expressions that will help in removing unnecessary languageCode lines from the XOG file, and added them to the post. There are 3 of them, which you need to run one after another to remove all undesired languageCode entries.
    I took a typical XOG output file and found that the nls tag carrying the language code spans 1, 2 or 3 lines (attached one sample XOG file). I wrote 3 respective regular expressions to remove each of the non-English NLS entries from the XOG output file.

    Thanks
    Sangeet

  9. Martti,

    Can you please send me a sample XOG file that has the quotes you are referring to, the ones that are a pain to remove?
    I will check if the quotes can be removed/escaped by regular expressions.

    Thanks
    Sangeet Chourey

  10. Hello Sangeet,

    Sorry it took so long.
    I'll email the XML file to this blog's address, because I am not sure it would post here OK.

    The German description of the group csk.ProgramManager had quotes in the PMO accelerator 8.1 sample data. That is the only place I've seen it.
    I tried to raise a support issue, but that did not fly. Looking at the sample data in later versions, it seems to have been fixed.

    This brings up another type of sanitizing issue.

    Even if you have the encoding set to UTF-8, which allows you to use nls characters öäåü in XML data, there are reserved characters which you should replace with the respective escape sequence, like & \. People do have those every now and then in names of objects, and especially in the names of OBS units.

    The backslash \ in the name of the OBS unit is even more problematic, because it is used mainly as the separator between OBS levels in the OBS path. In some versions (possibly from 8.1 on) you can enter that through the GUI, but not with XOG.

    With regards

    Martti K.

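
On the reserved-character point raised in the last comment: when attribute values are generated by a script rather than edited by hand, Python's standard library can do the escaping. The sketch below is only an illustration, and the helper name xml_attr_value is made up for it. It covers the XML reserved characters (&, <, > and the double quote); the backslash is not reserved in XML, so this does nothing for the OBS path separator problem.

```python
from xml.sax.saxutils import escape


def xml_attr_value(raw):
    """Escape &, <, > and double quotes for use inside a double-quoted attribute."""
    # escape() handles &, < and > itself; the extra mapping adds the double quote.
    return escape(raw, {'"': "&quot;"})


if __name__ == "__main__":
    # Hypothetical OBS unit name containing reserved characters.
    name = 'R&D "Core" <EMEA>'
    print('<nls languageCode="en" name="' + xml_attr_value(name) + '"/>')
```

This only helps for values you control before they are written into the file. Quotes that are already unescaped inside an existing XOG file cannot be told apart from the attribute delimiters this way.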