Changes In Branch invalid_utf8_improvements Excluding Merge-Ins
This is equivalent to a diff from e1034c4c35 to 8bdd0abc7a
2016-06-26
17:05 | micro-optimizing invalid_utf8 function, should be as fast as possible now ... (check-in: 7c08a68503 user: jan.nijtmans tags: trunk)
17:04 | Improve comments ... (Closed-Leaf check-in: 8bdd0abc7a user: jan.nijtmans tags: invalid_utf8_improvements)
2016-06-25
03:56 | Full-text search for check-in diffs. This works, but it creates a huge index (2x the size of the BLOB table) in spite of being a contentless index. The index is slow to build because of all the diffs that must be computed. Because the index is contentless, the snippet generator runs very slowly on queries - a typical query with a couple hundred hits takes several minutes. ... (Closed-Leaf check-in: 68194175fb user: drh tags: diff-search)
2016-06-24
03:36 | If the FOSSIL_SECURITY_LEVEL environment variable is 2 or more, then present a simple substitution matrix when entering passwords, as a defense against key loggers. For FOSSIL_SECURITY_LEVEL of 1 or more, do not remember the remote-url password. ... (check-in: e1034c4c35 user: drh tags: trunk)
2016-06-23
07:43 | Replace some usage of <center> tags with align="center" attributes. ... (check-in: fcfaae37dc user: jan.nijtmans tags: trunk)
2016-06-18
16:50 | If the table is encoded as start-value/size, a variable and a comparison can be saved. Should be even faster. ... (check-in: 758e3d3188 user: jan.nijtmans tags: invalid_utf8_improvements)
Changes to src/lookslike.c.
︙
#define LOOK_ODD     ((int)0x00000080) /* An odd number of bytes was found. */
#define LOOK_SHORT   ((int)0x00000100) /* Unable to perform full check. */
#define LOOK_INVALID ((int)0x00000200) /* Invalid sequence was found. */
#define LOOK_BINARY  (LOOK_NUL | LOOK_LONG | LOOK_SHORT) /* May be binary. */
#define LOOK_EOL     (LOOK_LONE_CR | LOOK_LONE_LF | LOOK_CRLF) /* Line seps. */
#endif /* INTERFACE */

/* definitions for various UTF-8 sequence lengths, encoded as start value
 * and size of each valid range belonging to some lead byte */
#define US2A  0x80, 0x01   /* for lead byte 0xC0 */
#define US2B  0x80, 0x40   /* for lead bytes 0xC2-0xDF */
#define US3A  0xA0, 0x20   /* for lead byte 0xE0 */
#define US3B  0x80, 0x40   /* for lead bytes 0xE1-0xEF */
#define US4A  0x90, 0x30   /* for lead byte 0xF0 */
#define US4B  0x80, 0x40   /* for lead bytes 0xF1-0xF3 */
#define US4C  0x80, 0x10   /* for lead byte 0xF4 */
#define US0A  0x00, 0x00   /* for any other lead byte */

/* a table used for quick lookup of the definition that goes with a
 * particular lead byte */
static const unsigned char lb_tab[] = {
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US2A, US0A, US2B, US2B, US2B, US2B, US2B, US2B,
  US2B, US2B, US2B, US2B, US2B, US2B, US2B, US2B,
  US2B, US2B, US2B, US2B, US2B, US2B, US2B, US2B,
  US2B, US2B, US2B, US2B, US2B, US2B, US2B, US2B,
  US3A, US3B, US3B, US3B, US3B, US3B, US3B, US3B,
  US3B, US3B, US3B, US3B, US3B, US3B, US3B, US3B,
  US4A, US4B, US4B, US4B, US4C, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A
};

/*
** This function attempts to scan each logical line within the blob to
** determine the type of content it appears to contain. The return value
** is a combination of one or more of the LOOK_XXX flags (see above):
**
** !LOOK_BINARY -- The content appears to consist entirely of text; however,
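The start-value/size encoding of the USxx definitions above lets one unsigned comparison stand in for a two-sided bounds test. The following is a minimal sketch of that idea, not code from Fossil: the helper names in_range, in_range_two_compares and check_forms_agree are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Hypothetical helper (not in Fossil): nonzero when b lies in the range
** that starts at `start` and spans `size` byte values. */
static int in_range(unsigned char b, unsigned char start, unsigned char size){
  /* When b < start the difference is negative, and the cast to unsigned
  ** turns it into a huge value, so one comparison covers both bounds. */
  return (unsigned int)(b - start) < size;
}

/* The same test written with two comparisons, for reference. */
static int in_range_two_compares(unsigned char b, unsigned char start,
                                 unsigned char size){
  return b>=start && b<start+size;
}

/* Exhaustively confirm that both forms agree for the (start, size) pairs
** that appear in the USxx definitions. */
static void check_forms_agree(void){
  static const unsigned char ranges[][2] = {
    {0x80,0x01}, {0x80,0x40}, {0xA0,0x20},
    {0x90,0x30}, {0x80,0x10}, {0x00,0x00}
  };
  unsigned int i, b;
  for(i=0; i<sizeof(ranges)/sizeof(ranges[0]); i++){
    for(b=0; b<256; b++){
      assert( in_range((unsigned char)b, ranges[i][0], ranges[i][1])
           == in_range_two_compares((unsigned char)b, ranges[i][0],
                                    ranges[i][1]) );
    }
  }
}
```

A size of 0, as in US0A, makes the range empty, so every following byte is rejected without a separate "invalid lead byte" branch.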
︙
  }
  return flags;
}

/*
** Checks for proper UTF-8. It uses the method described in:
**   http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
** except for the "overlong form" of \u0000 which is not considered
** invalid here: Some languages like Java and Tcl use it. This function
** also considers valid the derivatives CESU-8 & WTF-8 (as described in
** the same wikipedia article referenced previously). For UTF-8 characters
** > 0x7f, the variable 'c' does not necessarily hold the real lead byte.
** Its number of leading 1-bits indicates the number of continuation
** bytes that are expected to follow. E.g. when 'c' has a value
** in the range 0xc0..0xdf it means that after 'c' a single continuation
** byte is expected. A value 0xe0..0xef means that after 'c' two more
** continuation bytes are expected.
*/

int invalid_utf8(
  const Blob *pContent
){
  const unsigned char *z = (unsigned char *) blob_buffer(pContent);
  unsigned int n = blob_size(pContent);
  unsigned char c; /* lead byte to be handled. */

  if( n==0 ) return 0;  /* Empty file -> OK */
  c = *z;
  while( --n>0 ){
    if( c>=0x80 ){
      const unsigned char *def; /* pointer to range table */

      c <<= 1; /* multiply by 2 and get rid of highest bit */
      def = &lb_tab[c]; /* look up the lead byte's valid range in the table */
      if( (unsigned int)(*++z-def[0])>=def[1] ){
        return LOOK_INVALID; /* Invalid UTF-8 */
      }
      c = (c>=0xC0) ? (c|3) : ' '; /* determine next lead byte */
    } else {
      c = *++z;
    }
  }
  return (c>=0x80) ? LOOK_INVALID : 0; /* Final lead byte must be ASCII. */
}

/*
** Define the type needed to represent a Unicode (UTF-16) character.
*/
#ifndef WCHAR_T
#  ifdef _WIN32
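The loop in invalid_utf8 can be tried outside of Fossil with a plain buffer in place of a Blob. The sketch below is an illustration under stated assumptions, not the checked-in code: utf8_is_invalid and next_byte_range are hypothetical names, and next_byte_range is an if-chain standing in for the lb_tab lookup, with the same start/size values as the USxx definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the lb_tab lookup: yields the start value and
** size of the valid range for the byte that follows `lead`. A size of 0
** marks a byte that cannot start a sequence. */
static void next_byte_range(unsigned char lead, unsigned char *start,
                            unsigned char *size){
  *start = 0x80; *size = 0x40;          /* default: any continuation byte */
  if( lead<0xC0 || lead==0xC1 || lead>0xF4 ){
    *start = 0x00; *size = 0x00;        /* not a usable lead byte */
  }else if( lead==0xC0 ){
    *size = 0x01;                       /* only C0 80, the overlong NUL */
  }else if( lead==0xE0 ){
    *start = 0xA0; *size = 0x20;        /* rejects overlong 3-byte forms */
  }else if( lead==0xF0 ){
    *start = 0x90; *size = 0x30;        /* rejects overlong 4-byte forms */
  }else if( lead==0xF4 ){
    *size = 0x10;                       /* nothing above U+10FFFF */
  }
}

/* Returns 1 if the buffer contains an invalid sequence, 0 otherwise. */
static int utf8_is_invalid(const unsigned char *z, size_t n){
  unsigned char c;                      /* (pseudo) lead byte being handled */
  assert( z!=NULL || n==0 );
  if( n==0 ) return 0;                  /* empty input is fine */
  c = *z;
  while( --n>0 ){
    if( c>=0x80 ){
      unsigned char start, size;
      next_byte_range(c, &start, &size);
      if( (unsigned int)(*++z - start) >= size ) return 1;
      /* Shifting left drops one leading 1-bit, so the result behaves like
      ** the lead byte of a sequence one byte shorter. OR-ing with 3 keeps
      ** it away from the special cases 0xC0, 0xC1, 0xE0 and 0xF0. */
      c = (unsigned char)(c<<1);
      c = (c>=0xC0) ? (unsigned char)(c|3) : ' ';
    }else{
      c = *++z;
    }
  }
  return c>=0x80;                       /* input must not end mid-sequence */
}
```

As in the function above, the surrogate encoding ED A0 80 passes, which matches the stated intent of accepting CESU-8 and WTF-8.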
︙