How to tokenize html tags with spacy?

I need to tokenize HTML text with spaCy, or merge the tags after tokenization. They can be any HTML tags, e.g.:
<br> <br/> <br > <n class="ggg">
There is an example of tag merging in the documentation, but it doesn't work with all types of tags. If I write a rule like:
[{'ORTH': '<'}, {}, {'ORTH': '>'}]
It will join some tags:
<br><p>
Or separate like:
<
n
class="ggg
"
>
I have also tried writing a custom tokenizer, but I had problems with spaces.
I want every html tag to be a separate token, e.g.:
<br>
<br >
<n class="ggg">

IMHO, removing the HTML tags and converting to plain text is the correct way to go, rather than making html tags 'stop words', because some of those tags are actually valid words that can appear in text and should NOT be ignored (e.g., <body> vs body).
If you have a construct like
<span>word</span><span>word</span>
It renders as wordword in a user agent and should in fact be interpreted as a single word. For example, one might give you an HTML page containing something like:
<p><strong>S</strong>oup .... </p>
This obviously renders as 'Soup' and should be taken as the word soup and not as the words s and oup.
Now, if for whatever reason you must assume that any HTML tag boundary is a word separator (wrong, in most cases), you should do the following: use an HTML stream tokenizer, e.g., libxml2 and write handlers for startElement and characters only. The former should output a single space and the latter should output the characters as it gets them. This will convert your HTML input to plain text (just like an HTML tag remover would do), but also add a space after each element tag, so <span>word</span><span>word</span> would get converted to: "(space)word(space)word". This might add multiple spaces when nested tags are present, but you can easily deal with this when you split the cleaned-up text into words for further processing.
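The handler approach above can be sketched roughly as follows. This is only an illustration: it uses Python's standard html.parser instead of libxml2, handle_starttag and handle_data stand in for startElement and characters, and the names TagsToSpaces, html_to_text and html_to_words are made up for this example.

```python
from html.parser import HTMLParser


class TagsToSpaces(HTMLParser):
    """Collects character data and emits one space per start tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Counterpart of startElement: every opening tag becomes a space.
        # Self-closing tags such as <br/> also end up here by default.
        self.parts.append(" ")

    def handle_data(self, data):
        # Counterpart of characters: pass the text through unchanged.
        self.parts.append(data)


def html_to_text(html):
    parser = TagsToSpaces()
    parser.feed(html)
    parser.close()  # flush any text buffered after the last tag
    return "".join(parser.parts)


def html_to_words(html):
    # split() without arguments collapses the runs of spaces
    # that nested tags produce.
    return html_to_text(html).split()


print(html_to_words("<span>word</span><span>word</span>"))
```

Because only start tags emit a space, closing tags never split a word in this sketch.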

Zebra Stored Format - QR with [TAB] character from XML

We are in the midst of an Oracle ERP implementation and I have been tasked with redesigning our labels to work with the new system. Our product labeling needs to contain a QR code encoded with the Item and Lot number separated by a [TAB] character. Oracle will be sending XML files containing variable data to the printers for use with formats stored on the printers.
Hard coding works fine, I get a QR coded with 'FOO[TAB]BAR':
^XA
^FO50,50
^BQN,2,10^FH^FDQA,FOO_09BAR^FS
^PQ1,0,1,Y
^XZ
Strange things happen when I create a stored format on the printer:
^XA
^DFE:QR_TEST.ZPL
^MCY
^FO50,50
^BQN,2,10^FN1^FH^FDQA,$$QR_CODE$$^FS
^FO50,350
^AON,30,30^FN1^FD$$QR_CODE$$^FS
^XZ
And send the following XML file:
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE labels SYSTEM "label.dtd">
<labels _FORMAT="QR_TEST.ZPL" _QUANTITY="1" _PRINTERNAME="" _JOBNAME="TEST">
<label>
<variable name="$$QR_CODE$$">FOO_09BAR</variable>
</label>
</labels>
Interestingly without the text field definition the label does not print. After adding the text field the label prints but the QR code produced drops 'FOO' and scans as only '_09BAR' while the text field prints 'FOO_09BAR'.
Confusing matters further, if I drop the '_09' out of the $$QR_CODE$$ variable definition, the text field prints 'FOOBAR' as expected but the QR code scans as 'BAR', so for whatever reason the QR is dropping the first three characters of data.
I've been chasing my tail for a day on this with no progress so am wondering if anybody here with more experience than I can shed some light on the problem.
My two questions are:
Why is the QR code dropping the first three characters?
How do I embed the [TAB] character in the QR code using variable data from XML?

For the benefit of anyone else trying to solve the same problem here's the solution that I arrived at. Unfortunately Zebra support was of no help as all they could do was to keep sending me pages from the manual I already had. They were unable to send me a working example.
The solution involved two things:
Select what Zebra calls "8-bit byte mode". The mode is set and a parameter indicating the number of characters to be printed is set within the data field. The manual states that it is set as Bxxxx where "xxxx = number of data characters is represented by two bytes of BCD code". In practice what works is an integer formatted to 4 digits, i.e. for 26 characters use B0026.
The second thing to know is that the variable passed must encompass all data that falls between "^FD" and "^FS" including parameters.
The format stored on the printer looks like this:
^XA
^DFE:QR_TEST.ZPL
^MCY
^FO50,50
^BQN,2,10^FN1^FH^FD$$QR_CODE$$^FS
^XZ
Send the following XML file to the printer:
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE labels SYSTEM "label.dtd">
<labels _FORMAT="QR_TEST.ZPL" _QUANTITY="1" _PRINTERNAME="" _JOBNAME="TEST">
<label>
<variable name="$$QR_CODE$$">QM,B0007FOO[TAB]BAR</variable>
</label>
</labels>
The variable $$QR_CODE$$ breaks down as:
Q = Error correction level High
M = Manual character mode
B = 8-bit byte mode
0007 = Seven data characters to follow
FOO[TAB]BAR = Characters to encode in QR code. Note that [TAB] is a
literal tab character, ASCII char 09
The SQL that provides the variable value has to provide all of the above, but that is pretty straightforward. So far this has worked flawlessly with the combinations of item and lot numbers that I've tested, ranging from 10 to 25 characters in length. I'll update this post if I run into any difficulty or learn anything new.
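For anyone generating the value outside of SQL, here is a rough Python sketch of the same assembly. The function names build_qr_field and build_label_xml are made up for illustration, and the sketch assumes the data is plain ASCII so that the character count equals the byte count. Escaping XML special characters in the value is also an assumption; I have only tested values without them.

```python
from xml.sax.saxutils import escape


def build_qr_field(item, lot):
    """Return everything that goes between ^FD and ^FS for the QR code."""
    data = f"{item}\t{lot}"  # literal tab, ASCII char 09
    # Raises UnicodeEncodeError on non-ASCII input (assumption: ASCII only).
    count = len(data.encode("ascii"))
    if count > 9999:
        raise ValueError("data does not fit in a 4-digit count")
    # Q = error correction level, M = manual mode, B = 8-bit byte mode,
    # followed by the count as a zero-padded 4-digit integer.
    return f"QM,B{count:04d}{data}"


def build_label_xml(field_value, format_name="QR_TEST.ZPL"):
    """Wrap the field value in the XML layout shown above."""
    return (
        '<?xml version="1.0" standalone="no" ?>\n'
        '<!DOCTYPE labels SYSTEM "label.dtd">\n'
        f'<labels _FORMAT="{format_name}" _QUANTITY="1" '
        '_PRINTERNAME="" _JOBNAME="TEST">\n'
        "<label>\n"
        f'<variable name="$$QR_CODE$$">{escape(field_value)}</variable>\n'
        "</label>\n"
        "</labels>\n"
    )


print(build_label_xml(build_qr_field("FOO", "BAR")))
```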

Can I use <br> tag in markdown?

Can I use <br> tag in markdown?
I want to separate two paragraphs.

When you do want to insert a <br /> break tag using Markdown, you end a line with two or more spaces, then type return.
Source: http://daringfireball.net/projects/markdown/syntax#p

To render a break tag, from the documentation:
A paragraph is simply one or more consecutive lines of text, separated
by one or more blank lines. (A blank line is any line that looks like
a blank line — a line containing nothing but spaces or tabs is
considered blank.) Normal paragraphs should not be indented with
spaces or tabs.
The implication of the “one or more consecutive lines of text” rule is
that Markdown supports “hard-wrapped” text paragraphs. This differs
significantly from most other text-to-HTML formatters (including
Movable Type’s “Convert Line Breaks” option) which translate every
line break character in a paragraph into a <br /> tag.
When you do want to insert a <br /> break tag using Markdown, you end
a line with two or more spaces, then type return.
Yes, this takes a tad more effort to create a <br />, but a simplistic
“every line break is a <br />” rule wouldn’t work for Markdown.
Markdown’s email-style blockquoting and multi-paragraph list items
work best — and look better — when you format them with hard breaks.
However, you should separate paragraphs with <p></p> tags, which let the user agent render them correctly.
If you add a <br /> between them, as in <p></p><br /><p></p>, you get an "extra" break between the paragraphs.
Note that Markdown syntax alone cannot produce this, because any blank line is automatically treated as a paragraph separator.
If you want the break inside an existing paragraph, you have to type some text (the line cannot be blank) and then end the line with two or more spaces.

pagebreak in markdown while creating pdf

I am creating a PDF from a Markdown text file using Doxygen (1.8.6). I want a page break at a specific point in the Markdown file, and I have been following this link.
The linked page says to use '>' for a line break, but it doesn't mention page breaks. I could use '>' for page breaks too, but I would have to write the symbol five or more times, which makes the Markdown file look weird. Is there any other way to get page breaks in Markdown files?

I ran into the same problem and solved it by inserting
\pagebreak
It's actually a LaTeX command, rather than a Markdown one.

Since Markdown accepts plain HTML and CSS, simply add this line wherever you want to force a page break.
<div style="page-break-after: always;"></div>
If your Markdown editor has trouble exporting the PDF correctly, first export as HTML, then open the file in your browser and print it as PDF.

Blockquotes vs. line breaks
> is not used for line breaks in Markdown. It's used for blockquotes. For example, the following Markdown code
> A man provided with paper, pencil, and rubber, and
> subject to strict discipline, is in effect a universal machine.
becomes
A man provided with paper, pencil, and rubber, and
subject to strict discipline, is in effect a universal machine.
(This is a quote from Alan Turing.)
Note how it is rendered with a yellow background by Stack Overflow. If you examine the generated markup you will find <blockquote> tags being used.
Line breaks can usually be inserted by ending a line with two or more spaces (Markdown syntax) or by using raw <br> tags (most Markdown processors allow inline HTML). For example (using ␣ to represent a space), this Markdown
123 Fake Street␣␣
Springfield, USA
becomes
123 Fake Street
Springfield, USA
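As a rough illustration of the two-space rule, the following Python sketch turns such line endings into <br /> tags. It is a simplification for demonstration only, not how any particular Markdown processor is implemented, and the function name hard_breaks is made up.

```python
import re


def hard_breaks(paragraph):
    # Two or more spaces directly before a newline become a <br /> tag.
    # A single trailing space is left alone.
    return re.sub(r" {2,}\n", "<br />\n", paragraph)


print(hard_breaks("123 Fake Street  \nSpringfield, USA"))
```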
Page breaks
Since Markdown was originally designed for HTML output (which has no concept of pages), it has no support for page breaks.
Doxygen uses LaTeX to generate its PDFs. It doesn't seem to support inline LaTeX¹, but you should be able to modify the intermediate .tex file and then use pdflatex to generate your PDF:
1. Use Doxygen to generate a .tex file
2. Edit the .tex file manually, adding \newpage wherever you want page breaks
3. Run pdflatex documentation.tex
4. Examine documentation.pdf
This question may be a useful reference for step 3 above.
¹Doxygen does support inline LaTeX formulas, but I wasn't able to find any mention of arbitrary inline LaTeX commands like \newpage that operate in the text environment.
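If you have to regenerate the PDF often, the manual edit of the .tex file could be scripted. The following Python sketch inserts \newpage before every line containing a given piece of text. The function name add_newpage_before is made up, and which text to search for depends on what Doxygen actually writes into your .tex file, which I have not verified.

```python
def add_newpage_before(tex, needle):
    """Insert a \\newpage line before every line that contains needle."""
    out = []
    for line in tex.splitlines(keepends=True):
        if needle in line:
            out.append("\\newpage\n")
        out.append(line)
    return "".join(out)


sample = "Intro text\n\\section{Details}\nMore text\n"
print(add_newpage_before(sample, "\\section{Details}"))
```

You would read the generated .tex file, pass its contents through this function, write the result back, and then run pdflatex documentation.tex as before.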

A last-resort workaround is to manually edit the HTML output so that it renders to PDF the way you want.

Pandoc Markdown: ignore text in PDF (but include in HTML)

I am writing a Markdown report with both PDF and HTML output using Pandoc. I'd like some text to appear in HTML but not in PDF. The solution proposed here isn't working for me.
What I tried so far:
My text for both HTML and PDF <div>My HTML code for HTML only</div>
and I make PDF with
~/.cabal/bin/pandoc -s -N --toc --template=default.latex -f markdown-markdown_in_html_blocks -o reproducibleResearch.pdf reproducibleResearch.md
~/.cabal/bin/pandoc is the latest version of Pandoc, while Ubuntu still ships 1.9.

Original markdown HTML blocks (the kind pandoc parses when you turn off the markdown_in_html_blocks extension, as you are doing) must start at the beginning of the line. From the markdown syntax description:
The only restrictions are that block-level HTML elements — e.g. <div>, <table>, <pre>, <p>, etc. — must be separated from surrounding content by blank lines, and the start and end tags of the block should not be indented with tabs or spaces. Markdown is smart enough not to add extra (unwanted) <p> tags around HTML block-level tags.
So this is what you need:
My text.
<div>
This will be ignored in non-HTML output
</div>

There was another question on Stack Overflow asking about the markdown_in_html_blocks extension. According to that thread, markdown isn't parsed in LaTeX.
However, it links to a thread on pandoc's Google Groups page that recommends using something like '\begin{landscape}' and a matching '\end{landscape}'.