InnerText identification and the HTML spec

krstcs
Posts: 2683
Joined: Tue Feb 07, 2012 4:14 pm
Location: Austin, Texas, USA

Post by krstcs » Thu Mar 06, 2014 8:44 pm

I'm having a bit of an issue with identifying some objects using their InnerText attribute. The objects in question have an InnerText that has line-breaks and carriage-returns in the middle of sentences.

For example, the InnerText is
"Sign in to view your
designs »"
(as in "Sign in to view your\r\n", then a long run of spaces, then "designs »"), but the HTML rendering engine displays it as "Sign in to view your designs »" because the HTML spec requires the rendering engine to collapse runs of spaces and other whitespace characters (\r, \n, \t) into " " (a single space).

In practice, the developers add line-breaks to the markup to keep it readable and get it through the code review system, and the browser displays the text without the line-breaks, but Ranorex sees all of the characters (except in IE8, where it sees just what the browser displays). Before anyone says that we should just remove the special characters: this isn't going to happen, and it shouldn't be necessary given the HTML spec.

My concern is how to identify elements using the InnerText (sometimes the only way to ID) without regard to the special characters.

Is there a way to do this without having to go through each and every repository object or new object and specifically change the regex to include "\s+" in place of all of the white-space characters? This would be a huge undertaking, it could make the tests more error-prone, and the text in the XPath would no longer be easily human-readable.
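
For illustration, the manual "\s+" substitution described above can at least be generated rather than typed by hand. This is a minimal sketch in Python, not Ranorex code: the helper name `whitespace_tolerant_pattern` is hypothetical, and the 28-space indentation in the sample string is only an assumed length. It escapes each word and joins the words with `\s+`, so the readable text stays the single source of truth.

```python
import re

def whitespace_tolerant_pattern(text: str) -> str:
    """Return a regex matching `text` with any run of whitespace between words."""
    # Split on whitespace, escape each word, and rejoin with \s+.
    return r"\s+".join(re.escape(word) for word in text.split())

# Raw InnerText as it appears in the markup (indent length is illustrative).
raw_inner_text = "Sign in to view your\r\n" + " " * 28 + "designs »"

# The human-readable text as the browser renders it.
rendered_text = "Sign in to view your designs »"

pattern = whitespace_tolerant_pattern(rendered_text)
matches_raw = re.fullmatch(pattern, raw_inner_text) is not None
matches_rendered = re.fullmatch(pattern, rendered_text) is not None
```

The generated pattern matches both the raw and the rendered form, but it is still harder to read than the plain text, which is the drawback raised above.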

I suspect this may be a problem with how Ranorex presents the InnerText attribute, because the HTML renderer strips those special characters but Ranorex doesn't.

Does anyone else have this issue or has anyone found a clean way around it?
Shortcuts usually aren't...

Support Team
Site Admin
Posts: 12145
Joined: Fri Jul 07, 2006 4:30 pm
Location: Houston, Texas, USA

Post by Support Team » Mon Mar 10, 2014 5:25 pm

Hello krstcs,

Unfortunately we are not able to reproduce your issue. Is it possible to provide us access to a sample website?

Thank you,
Robert

Post by krstcs » Mon Mar 10, 2014 6:42 pm

Copy and paste the following HTML into an HTML file and open it in each version of IE and you will see the issue. Use developer mode in the browser and look at the difference between the rendered text (which will have all extra whitespace removed) and the actual innertext.

All browsers RENDER it without the line-break and extra spaces so it will be on one line with only 1 space between the lines, but the ACTUAL HTML innertext has the line-break and extra spaces (except in IE8 which re-formats the innertext itself) so the innertext doesn't match what is actually displayed.

NOTE: This HTML is correct and accurate according to the HTML specification and all of the browsers handle it correctly by stripping out the extra whitespace for rendering (although IE8 should NOT reformat the innertext itself, only the rendering).

<html>
  <body>
    <div>Some InnerText
            And More Text</div>
  </body>
</html>

Edit to add: Would it be possible to add another operator that allows us to make Ranorex recognize the text as HTML, so it would remove all of the extra whitespace from the comparison (\t, \r, \n, etc.)?

Something like ":" so the XPath could be "/div[@innertext:'Some InnerText And More Text']" and would match the above HTML by stripping out extra whitespace during the comparison?

(Edit for consistency in symbols.)
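
To make the requested semantics concrete, here is a rough sketch in Python (not Ranorex, and not an existing operator) of what a whitespace-insensitive comparison would do against the sample HTML above. The `DivText` class and `collapse` helper are hypothetical names; the parser is the standard library's `html.parser`, which returns the text as written in the markup, much like the un-rendered InnerText.

```python
from html.parser import HTMLParser

SAMPLE = """<html>
  <body>
    <div>Some InnerText
            And More Text</div>
  </body>
</html>"""

TARGET = "Some InnerText And More Text"

class DivText(HTMLParser):
    """Collect the raw, un-rendered text found inside <div> elements."""

    def __init__(self):
        super().__init__()
        self.in_div = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.in_div = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_div = False

    def handle_data(self, data):
        if self.in_div:
            self.chunks.append(data)

def collapse(text: str) -> str:
    """Collapse every whitespace run to one space and trim the ends."""
    return " ".join(text.split())

parser = DivText()
parser.feed(SAMPLE)
parser.close()
raw = "".join(parser.chunks)

exact_match = raw == TARGET                         # whitespace-sensitive, like "="
tolerant_match = collapse(raw) == collapse(TARGET)  # the proposed behaviour
```

The exact comparison fails because `raw` still contains the line-break and indentation, while the comparison on collapsed text succeeds.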

Post by Support Team » Mon Mar 17, 2014 4:04 pm

Hello krstcs,

Please accept my apologies for the late reply. I had to cross check this issue with our development department.
Originally, Ranorex recognized the inner text in the same manner as the browser does. We changed that behavior because many customers wanted to retrieve the inner text in its original (un-rendered) form.

In order to overcome this issue I suggest replacing additional formatting information (\t, \n, …) using a regular expression:
Regex.Replace(Element.InnerText,@"\s{2,}"," ")
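
One caveat worth checking with the pattern above: `\s{2,}` only matches runs of two or more whitespace characters, so a lone line-break or tab between two words is left untouched. The sketch below uses Python's `re` module purely for illustration (the quantifiers `{2,}` and `+` behave the same way in .NET regular expressions); the function names and sample strings are assumptions, not part of Ranorex.

```python
import re

def collapse_runs(text: str) -> str:
    """Same pattern as the suggestion above: replace runs of 2+ whitespace."""
    return re.sub(r"\s{2,}", " ", text)

def collapse_all(text: str) -> str:
    """Replace every whitespace run, including a single \\n or \\t."""
    return re.sub(r"\s+", " ", text)

indented = "Sign in to view your\r\n    designs »"    # line-break plus indent
single_newline = "Sign in to view your\ndesigns »"    # line-break only

runs_on_indented = collapse_runs(indented)
runs_on_single = collapse_runs(single_newline)
all_on_single = collapse_all(single_newline)
```

With the indented sample both patterns give the rendered text, but with a single line-break only the `\s+` variant does, so `\s+` may be the safer choice when the markup is not always indented.
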
Regards,
Robert

Post by krstcs » Mon Mar 17, 2014 4:46 pm

Thanks Robert, I've already worked around it on my side.

I was just wondering if we could add another operator to the current symbology that would allow for this without coding, so we could choose to have it match disregarding whitespace or regarding it, so both parties would be satisfied.

I was just thinking that "=" and "~" would keep matching with regard to whitespace, and something else (like ":=" or ":~") would match regardless of whitespace. "==" or "~~" would work too.

But I understand if you guys don't want to mess with adding another symbol just for this.

Thanks again!

Post by Support Team » Tue Mar 18, 2014 4:21 pm

Hi krstcs,

In order to discuss feature requests I would prefer to continue the communication via email.

Regards,
Robert